Skip to content

Ollama Cloud (optional)

Local 8B models are the hardware bottleneck most people hit. Ollama Cloud serves 600B-class models behind the same /api/* surface, so you can route the heavy tools to a far stronger model and free up local VRAM — while keeping local as an always-on fallback.

Set the two vars in your MCP client’s env block (Claude Code shown):

{
"mcpServers": {
"ollama-intern": {
"command": "npx",
"args": ["-y", "ollama-intern-mcp"],
"env": {
"OLLAMA_CLOUD_PRIMARY": "1",
"OLLAMA_API_KEY": "sk-...your-key...",
"INTERN_PROFILE": "dev-rtx5080"
}
}
}
}

When cloud is on, the generative tiers (instant / workhorse / deep) go to the cloud model; embeddings always stay local (Ollama Cloud serves no embedding models, so the corpus/embed tools are unaffected). A circuit breaker decides the backend per call:

  • Healthy → cloud serves the call.
  • Transient cloud failure (timeout / 5xx / 429 / network) → fall back to the local profile, breaker counts the failure. After 3 consecutive failures it opens for 20s, then admits a single probe.
  • Bad key (401/403) → a sticky “misconfigured” breaker surfaces the failure loudly instead of degrading silently forever.
  • Retired/typo’d model id (404) → surfaced, not silently swapped for a local model.

The local profile (INTERN_PROFILE) is the fallback ladder, so keep its models pulled. Backend resolution happens first (a near-instant breaker check), then the existing tier-degradation runs within the chosen backend — the two never chain into a slow timeout ladder.

Every envelope reports which backend served the call:

{
...envelope,
backend: "cloud" | "local",
degraded?: true,
degrade_reason?: "cloud_timeout" | "cloud_5xx" | "cloud_rate_limited"
| "cloud_unreachable" | "cloud_auth_failed" | "circuit_open"
}

residency is null for cloud-served calls (the stateless cloud has no local-VRAM residency). A backend_fallback line lands in ~/.ollama-intern/log.ndjson on every cloud→local fallback — watch the rate, not just per-call state:

Terminal window
ollama_log_tail --filter_kind backend_fallback

ollama-intern-mcp doctor shows a Cloud (primary) block with reachability + auth status. Note: cloud /api/tags lists public models without gating on the key, so doctor reports auth: unverified (checked on first call) until a real call validates the key (a bad key then trips the sticky breaker).

VarDefaultPurpose
OLLAMA_CLOUD_PRIMARY(unset)The opt-in switch. 1/true/yes/on enables cloud-primary. Unset = local-only, zero egress.
OLLAMA_API_KEY(unset)Bearer key for Ollama Cloud. Required when cloud is enabled (fail-fast at startup if missing).
OLLAMA_CLOUD_HOSThttps://ollama.comCloud base host.
INTERN_CLOUD_MODELminimax-m3:cloudCloud model for instant + workhorse + deep.
INTERN_CLOUD_DEEP_MODEL(= INTERN_CLOUD_MODEL)Optional deep-tier-only override, e.g. deepseek-v3.1:671b.
INTERN_CLOUD_TIMEOUT_{INSTANT,WORKHORSE,DEEP}_MS30000 / 120000 / 300000Per-tier cloud-attempt timeouts.
INTERN_CLOUD_NUM_CTX32768Context-window cap for cloud calls (cloud bills by GPU-time; the cap controls cost).

Big cloud models run far slower per token than a local 8B (seconds, not milliseconds) — a quality upgrade, not a speed one. That’s why the cloud tiers use a generous timeout ladder (instant 30s / workhorse 120s / deep 300s). If short classify/extract calls feel sluggish, set INTERN_CLOUD_MODEL to a smaller-but-fast cloud model, or keep cloud for the heavy tiers only.

Routing to Ollama Cloud sends prompts to a third party. Ollama’s privacy policy states cloud prompts are processed transiently, not retained beyond the request, and not used for training — but it is still egress, which is why it’s opt-in and disclosed. Local-only mode (the default) sends nothing off the box. See SECURITY.md §11 for the full threat-model entry.