Skip to content

Comparison

Every local-LLM MCP server leads with token savings. We lead with what the intern produces. This page lays out where that framing pays off, where it doesn’t, and where other tools are straight-up better.

No hyperbole, no “competitive landscape” slide. If a row is a blank for us, it’s a blank.

Featureollama-intern-mcphoutini-lmmcp-local-llmraw Ollama HTTPClaude-direct
Tier routing (Instant / Workhorse / Deep / Embed)
Evidence-backed briefs with server-side citation strippingpartial (prompt-level)
Durable markdown + JSON artifacts on disk
Living corpora (manifest v2, incremental refresh, :latest drift detection)
Uniform envelope across every tool
Structured errors (code / message / hint / retryable)
Batch atoms (items: [{id, text}], one envelope per batch)
Residency visibility (evicted, size_vram_bytes)partial (via /api/ps)
NDJSON call log at a known path
source_path mode — server reads the file, caller never pre-reads
Path-safety (allowed_roots, ..-rejection, protected-path writes)
Streaming tool responsespartialpartial
Pure vector database (HNSW / pgvector)
Full code-execution sandbox
Cloud model access

Claude calls ollama_classify; the server picks Instant. Claude calls ollama_summarize_deep; the server picks Deep. The caller never writes “use model X.” This is job-shaped on purpose — a caller who reaches for ollama_chat has skipped the job-shaped surface and is saying “I don’t know what I want.”

Other local-LLM MCPs ask the caller to pick the model. Fine for experimentation, brittle for delegation.

incident_brief, repo_brief, change_brief each require evidence the server can validate. Claims that cite unknown evidence ids are stripped before return. Thin evidence flags weak: true with coverage notes instead of smoothing into fake narrative.

Claude-direct can approximate this in prompt — the gap is that ours is server-enforced, not prompt-enforced. A model that ignores the instruction still can’t sneak an unvalidated citation through.

incident_pack / repo_pack / change_pack each write <slug>.md + <slug>.json under ~/.ollama-intern/artifacts/. You open them tomorrow, diff them next week, export them into a handbook with artifact_export_to_path. The markdown is rendered by code, not prompt — shape is deterministic.

No other tool in this category writes a file and hands you the path. Every competitor returns a blob of model output and trusts you to save it.

corpus_index builds a searchable chunk store under ~/.ollama-intern/corpora/<name>/. corpus_refresh rebuilds only what’s changed (file content hash). corpus_search does BM25 + RRF fusion. corpus_answer cites chunk ids server-side-validated against what’s actually stored. Manifest v2 tracks the resolved embed model — if Ollama silently promotes nomic-embed-text:latest under you, ollama_corpus_refresh surfaces embed_model_resolved_drift.

A pure vector DB is better for 100M-vector workloads. Ours is better for “I have 300 files I want cited answers against, and I don’t want to run pgvector.”

Every tool returns { result, tier_used, model, hardware_profile, tokens_in, tokens_out, elapsed_ms, residency }. No tool invents a new shape. The envelope is the audit trail.

Other servers return “whatever the model said.” Parsing that at scale means hand-written glue per tool.

On failure every tool returns { error: true, code, message, hint, retryable }. code names are stable across releases — treat them like an API. Never a raw stack trace.

Raw Ollama returns HTTP errors. Prompt-based tools return prose failures you have to parse.

classify, extract, triage_logs accept items: [{ id, text }] and return one envelope per batch. Duplicate ids refused. This is where bulk work happens — 10-item classify in one round-trip instead of 10 round-trips with 10 prewarms and 10 envelopes.

Every envelope carries residency from Ollama’s /api/ps. When evicted: true or size_vram_bytes < size_bytes, the caller knows inference dropped 5–10× and why. Raw Ollama exposes the same data but you have to correlate it by hand.

classify and extract accept source_path — a file the server reads itself. Claude never pre-reads the file, which means Claude’s context budget isn’t spent on bytes the intern is about to read anyway.

No other delegation tool in the MCP ecosystem does this. It’s small, and it’s the right optimization.

We don’t stream. Packs emit pack_step events to the NDJSON log so a tailing operator can see pipeline progress, but the MCP response is single-shot. If you need mid-call token streams, use raw Ollama or Claude-direct.

corpus_* is not pgvector. It’s BM25 + RRF fusion over chunks with metadata, sized for repo-scale corpora (thousands to tens of thousands of chunks). Use pgvector / Qdrant / Weaviate for hundred-millions of vectors.

ollama_draft runs a compile check on code drafts when language is known (tsc, eslint, pytest on the Tools-ops agent’s batch_proof_check). It’s a surface, not a sandbox — it will not execute arbitrary code against arbitrary inputs. Use a dedicated sandbox (e2b, Docker) if you need that.

Local-only. Every call goes to http://127.0.0.1:11434 or nowhere. If you need GPT-4 / Claude / Gemini, drive them directly or use a cloud-MCP — this server is not that shape.

  • Bulk classification / extraction / triage over many items — ollama-intern-mcp. Batch atoms.
  • “What just happened” incident writeup with evidence — ollama-intern-mcp. incident_pack.
  • Repo onboarding brief — ollama-intern-mcp. repo_pack or repo_brief.
  • Mid-call token streaming — raw Ollama, or drive Claude directly.
  • Tens of millions of vectors — pgvector / Qdrant / Weaviate. Not us.
  • One-off question to a local model — raw Ollama HTTP. Don’t reach for ollama-intern-mcp for a single ad-hoc chat.
  • Cloud model access — Claude-direct or a cloud-MCP.