gpu-container

A GPU-enabled container exposes the device. A model-aware runtime decides what lives in VRAM, pinned RAM, and NVMe. gpu-container is that runtime’s planner: it profiles your rig and model, emits an explicit placement plan, proves it with a measured receipt, and refuses when the plan would thrash.

Not “Docker VRAM overflow.” CUDA Unified-Memory oversubscription is unavailable on Windows/WSL2 (NVIDIA-confirmed) and catastrophically slow even on Linux for autoregressive decode. The product is explicit, declared placement — that’s the moat.

profile ─▶ plan ─▶ (launch under) watchdog ─▶ receipt
│ ▲
concentration ───────┘ (de-risk the per-expert lane, before you build for it)

Five small commands, each doing one thing and returning a verdict-coded exit status:

| Command | Does |
| --- | --- |
| gpu-container-profile | Measure the rig (VRAM, PCIe, NVMe, pinnable RAM, CPU bandwidth) + the model |
| gpu-container-plan | Compute explicit placement + a calibrated throughput forecast; ship or refuse |
| gpu-container-receipt | Verify a plan against a real llama-bench run; write a calibration point back |
| gpu-container-concentration | De-risk the per-expert cache: measure routing concentration first |
| gpu-container-watchdog | Supervise a GPU job; abort on a host-memory/power/VRAM breach |

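
Because each command returns a verdict-coded exit status, a wrapper script can chain them and stop at the first stage that does not pass. The sketch below shows one way that could look in Python. The exit-code values and verdict names in `VERDICTS` are assumptions made for illustration (this page does not list the real codes), and `fake()` is a hypothetical stand-in child process used so the example runs without the tools installed.

```python
import subprocess
import sys

# ASSUMPTION: hypothetical verdict table. The real exit codes used by the
# gpu-container-* commands are not documented on this page.
VERDICTS = {0: "ship", 1: "refuse", 2: "error"}


def run_stage(argv):
    """Run one command and translate its exit status into a verdict string."""
    code = subprocess.run(argv).returncode
    return VERDICTS.get(code, f"unknown({code})")


def run_pipeline(stages):
    """Run stages in order; stop after the first verdict that is not 'ship'."""
    results = []
    for name, argv in stages:
        verdict = run_stage(argv)
        results.append((name, verdict))
        if verdict != "ship":
            break
    return results


def fake(code):
    """Hypothetical stand-in: a child process that exits with `code`."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]


if __name__ == "__main__":
    stages = [("profile", fake(0)), ("plan", fake(1)), ("receipt", fake(0))]
    # The plan stage "refuses", so the receipt stage is never launched.
    print(run_pipeline(stages))
```

In real use each `fake(...)` entry would be replaced by the actual command line for that stage, and the verdict table by whatever codes the tools define.
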
  • Explicit placement — every byte has a declared home; no runtime demand-paging magic.
  • MoE expert tiering (flagship) — shared/attention layers in VRAM, experts in CPU RAM via llama.cpp --n-cpu-moe.
  • Measured receipts — a real run verifies the forecast against a roofline ceiling and a calibrated band; the receipt feeds the next plan (the recalibration loop).
  • Honest refusal — no plan clears the >1 tok/s floor? It refuses, and explains why.
  • Routing de-risk — before building per-expert caching, measure whether the routing is even skewed enough to cache.
  • Rig-safety watchdog — supervise every GPU job so a bad plan can’t take the machine down.
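
To make the placement, refusal, and receipt ideas above concrete, here is a minimal sketch of the kind of arithmetic involved. It is not gpu-container's actual planner: the `Model` and `Rig` fields, the 0.6 `efficiency` factor, and the 25% band are all illustrative assumptions. It assumes decode is memory-bandwidth bound, that shared/attention weights must sit in VRAM, and that `n_cpu_moe` counts the layers whose experts stay in CPU RAM, which is how llama.cpp's `--n-cpu-moe` is documented to behave.

```python
from dataclasses import dataclass

FLOOR_TOK_S = 1.0  # the ">1 tok/s" floor mentioned above


@dataclass
class Model:  # hypothetical model description
    n_layers: int
    shared_bytes_per_layer: float  # attention and other always-read weights
    expert_bytes_per_layer: float  # all experts of one layer
    active_fraction: float         # share of expert bytes read per token


@dataclass
class Rig:  # hypothetical rig profile
    vram_bytes: float
    vram_bw: float       # bytes/s
    cpu_ram_bytes: float
    cpu_bw: float        # bytes/s


def plan(model, rig, efficiency=0.6):
    """Pick the smallest n_cpu_moe that fits, then forecast or refuse."""
    shared = model.n_layers * model.shared_bytes_per_layer
    if shared > rig.vram_bytes:
        return {"verdict": "refuse", "reason": "shared layers exceed VRAM"}
    # Move experts to CPU one layer at a time until the rest fits in VRAM.
    for n_cpu_moe in range(model.n_layers + 1):
        gpu_layers = model.n_layers - n_cpu_moe
        if shared + gpu_layers * model.expert_bytes_per_layer <= rig.vram_bytes:
            break
    if n_cpu_moe * model.expert_bytes_per_layer > rig.cpu_ram_bytes:
        return {"verdict": "refuse", "reason": "CPU experts exceed host RAM"}
    # Roofline: time per token = bytes read from each tier / tier bandwidth.
    per_layer_active = model.expert_bytes_per_layer * model.active_fraction
    gpu_read = shared + gpu_layers * per_layer_active
    cpu_read = n_cpu_moe * per_layer_active
    ceiling = 1.0 / (gpu_read / rig.vram_bw + cpu_read / rig.cpu_bw)
    forecast = efficiency * ceiling  # ASSUMED calibration factor
    verdict = "ship" if forecast > FLOOR_TOK_S else "refuse"
    return {"verdict": verdict, "n_cpu_moe": n_cpu_moe,
            "ceiling": ceiling, "forecast": forecast,
            "reason": "" if verdict == "ship" else "forecast below floor"}


def check_receipt(measured, ceiling, forecast, band=0.25):
    """Compare a measured tok/s against the ceiling and an assumed band."""
    if measured > ceiling:
        return "suspect"  # faster than the roofline allows: inputs are wrong
    lo, hi = forecast * (1 - band), forecast * (1 + band)
    return "verified" if lo <= measured <= hi else "recalibrate"


if __name__ == "__main__":
    model = Model(40, 0.1e9, 1.5e9, 0.125)   # made-up sizes
    rig = Rig(30e9, 1.5e12, 96e9, 60e9)      # made-up bandwidths
    print(plan(model, rig))
```

With these made-up numbers the sketch keeps the experts of 23 layers on the CPU and reports a ceiling of roughly 13 tok/s. Dropping `cpu_bw` to 3e9 pushes the forecast under the floor and the plan is refused. A "recalibrate" result from `check_receipt` is the point where a real tool would feed the measurement back into the efficiency factor.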

Status: beta. The profiler, planner, recalibration loop, routing de-risk gate, and watchdog are built and proven on an RTX 5090 / WSL2 rig. llama.cpp is the integrated backend; the placement math is backend-agnostic.