Getting Started

Ollama Intern MCP is a local-only MCP server. Claude Code calls it; it routes work onto a local Ollama model; the result lands in a durable artifact on disk.

This page takes you from zero to one real tool call in about five minutes.

You'll need:

  • Node.js 18 or newer (20 LTS recommended — matches CI).
  • Ollama installed and running at http://127.0.0.1:11434.
  • Claude Code (or any MCP-capable client).
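
A quick preflight sketch for the three prerequisites. `/api/version` is Ollama's standard version endpoint; the warning wording is this guide's own, and the checks degrade to warnings rather than aborting:

```shell
#!/bin/sh
# Preflight sketch: verify Node is installed and Ollama is answering locally.
if command -v node >/dev/null 2>&1; then
  node_status="ok: $(node --version)"
else
  node_status="warn: node not found"
fi
if curl -fsS http://127.0.0.1:11434/api/version >/dev/null 2>&1; then
  ollama_status="ok: Ollama answering on 127.0.0.1:11434"
else
  ollama_status="warn: Ollama not reachable on 127.0.0.1:11434"
fi
echo "$node_status"
echo "$ollama_status"
```

If either line starts with `warn:`, fix that prerequisite before continuing.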

The models this server drives run on your machine — you need enough VRAM (or system RAM for CPU inference) to load them. Ballpark figures, honest:

Model              Role                                             VRAM (GPU)   RAM (CPU fallback)
hermes3:8b         Default workhorse (Instant / Workhorse / Deep)   ~6 GB        ~16 GB
nomic-embed-text   Embed tier                                       ~500 MB      ~1 GB

See Ollama’s hardware notes for the full picture — quantization, context length, and concurrent loaded models all move the number.

Profile hints:

  • dev-rtx5080 — tested on RTX 5080 (16 GB VRAM). Default.
  • dev-rtx5080-qwen3 — same hardware, Qwen 3 alternate rail.
  • m5-max — tuned for M5 Max MacBook Pro (128 GB unified memory); swaps the ladder to heavier Qwen 3 tiers.

Alternate tiers live in each profile file under src/profiles/. If hermes3:8b can’t fit, drop to a smaller Ollama model or use a lighter profile — ollama ps will show what’s actually resident.

Most users do not install globally. The recommended path is the Claude Code MCP config block below, which runs the server on demand via npx.

If you want the binary on your PATH for ad-hoc use, you can still install it globally:

npm install -g ollama-intern-mcp

Add this block to your Claude Code MCP server config:

{
  "mcpServers": {
    "ollama-intern": {
      "command": "npx",
      "args": ["-y", "ollama-intern-mcp"],
      "env": {
        "OLLAMA_HOST": "http://127.0.0.1:11434",
        "INTERN_PROFILE": "dev-rtx5080"
      }
    }
  }
}

The INTERN_PROFILE variable selects a hardware profile — see Envelope & tiers for the full table. dev-rtx5080 is the default developer profile and runs the validated hermes3:8b ladder.

Claude Desktop takes the same block, written to its config file:

  • macOS — ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows — %APPDATA%\Claude\claude_desktop_config.json
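
Before pasting, it can save a restart cycle to confirm the block is valid JSON. A sketch (python3 is used here purely as a JSON validator; any linter works, and `/tmp/mcp-snippet.json` is just a scratch path):

```shell
# Sanity-check the MCP block as JSON before merging it into the config.
cat > /tmp/mcp-snippet.json <<'EOF'
{
  "mcpServers": {
    "ollama-intern": {
      "command": "npx",
      "args": ["-y", "ollama-intern-mcp"],
      "env": {
        "OLLAMA_HOST": "http://127.0.0.1:11434",
        "INTERN_PROFILE": "dev-rtx5080"
      }
    }
  }
}
EOF
python3 -m json.tool /tmp/mcp-snippet.json >/dev/null && echo "valid JSON"
```

If your config already has other servers under "mcpServers", merge the "ollama-intern" entry in by hand rather than overwriting the file.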

The default dev-rtx5080 profile collapses all three work tiers (Instant / Workhorse / Deep) onto hermes3:8b, plus nomic-embed-text for the Embed tier. One pull covers everything:

ollama pull hermes3:8b
ollama pull nomic-embed-text
export OLLAMA_MAX_LOADED_MODELS=2   # keep the chat and embed models resident together
export OLLAMA_KEEP_ALIVE=-1         # never unload models on idle
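
A quick check that both pulls landed, sketched so it warns instead of failing when the ollama CLI or the models are missing:

```shell
#!/bin/sh
# Confirm both models from the pull step are present locally.
if command -v ollama >/dev/null 2>&1; then
  model_check=$(ollama list | grep -E 'hermes3:8b|nomic-embed-text' \
    || echo "warn: models not pulled yet")
else
  model_check="warn: ollama CLI not found"
fi
echo "$model_check"
```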

Four tiers, top to bottom:

Tier        Default model      Used by
Instant     hermes3:8b         classify, extract, triage_logs
Workhorse   hermes3:8b         summarize_fast, draft, briefs
Deep        hermes3:8b         summarize_deep, research, packs
Embed       nomic-embed-text   embed, embed_search, corpus tools

Other profiles (dev-rtx5080-qwen3, m5-max) swap the ladder — see Envelope & tiers.

Restart Claude Code so it re-reads the MCP config, then ask it to run the smallest possible call:

Use the ollama-intern ollama_classify tool to classify
"the build failed because the API returned 502" into one of
["infra", "code", "user-error"]. Show me the envelope.

You should get back a uniform envelope like this:

{
  "result": { "label": "infra", "confidence": 0.9 },
  "tier_used": "instant",
  "model": "hermes3:8b",
  "hardware_profile": "dev-rtx5080",
  "tokens_in": 42,
  "tokens_out": 8,
  "elapsed_ms": 380,
  "residency": { "in_vram": true, "evicted": false }
}

That is the intern working. Every tool in the server returns this same envelope shape. Every call is appended as one NDJSON line to ~/.ollama-intern/log.ndjson.
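
Because the log is NDJSON (one JSON object per line), plain line tools can mine it. A sketch using a synthetic two-line sample whose field names follow the envelope above; point LOG at ~/.ollama-intern/log.ndjson for real data:

```shell
#!/bin/sh
# Per-tier call tally over an NDJSON log, using a synthetic sample file.
LOG=/tmp/sample-log.ndjson
cat > "$LOG" <<'EOF'
{"result":{"label":"infra"},"tier_used":"instant","model":"hermes3:8b","elapsed_ms":380}
{"result":{"summary":"..."},"tier_used":"deep","model":"hermes3:8b","elapsed_ms":5200}
EOF
# Crude tally that avoids a jq dependency: extract the tier_used field and count.
grep -o '"tier_used":"[a-z]*"' "$LOG" | sort | uniq -c
```

With jq installed, `jq -r .tier_used "$LOG" | sort | uniq -c` does the same more robustly.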