gpu-container
A GPU-enabled container exposes the device. A model-aware runtime decides what lives in VRAM, pinned RAM, and NVMe. gpu-container is that runtime’s planner: it profiles your rig and model, emits an explicit placement plan, proves it with a measured receipt, and refuses when the plan would thrash.
Not “Docker VRAM overflow.” CUDA Unified-Memory oversubscription is unavailable on Windows/WSL2 (NVIDIA-confirmed) and catastrophically slow even on Linux for autoregressive decode. The product is explicit, declared placement — that’s the moat.
The pipeline
profile ─▶ plan ─▶ (launch under) watchdog ─▶ receipt
   │                 ▲
concentration ───────┘   (de-risk the per-expert lane, before you build for it)

Five small commands, each doing one thing and returning a verdict-coded exit status:
| Command | Does |
|---|---|
| gpu-container-profile | Measure the rig (VRAM, PCIe, NVMe, pinnable RAM, CPU bandwidth) + the model |
| gpu-container-plan | Compute explicit placement + a calibrated throughput forecast; ship or refuse |
| gpu-container-receipt | Verify a plan against a real llama-bench run; write a calibration point back |
| gpu-container-concentration | De-risk the per-expert cache: measure routing concentration first |
| gpu-container-watchdog | Supervise a GPU job; abort on a host-memory/power/VRAM breach |
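Because every command reports its verdict through its exit status, the stages can be chained so that a refusal halts the run before anything is launched. The following is a minimal sketch, not the project's own tooling. It assumes that exit status 0 means "proceed" and any other value is a refusal or failure (the actual codes are listed in the CLI Reference), it omits all command arguments because none are given here, and `run_pipeline` and `run_command` are hypothetical helper names.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

# Stage order follows the pipeline diagram. Arguments are deliberately
# omitted: the real flags are documented in the CLI Reference.
STAGES: List[List[str]] = [
    ["gpu-container-profile"],
    ["gpu-container-plan"],
    ["gpu-container-receipt"],
]


def run_command(argv: Sequence[str]) -> int:
    """Run one stage as a subprocess and return its exit status."""
    return subprocess.run(list(argv)).returncode


def run_pipeline(
    stages: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> Tuple[Optional[str], int]:
    """Run stages in order and stop at the first non-zero exit status.

    Returns (name of the stage that stopped the run, its exit status),
    or (None, 0) when every stage passed.
    ASSUMPTION: status 0 means "proceed"; anything else halts the run.
    """
    for argv in stages:
        status = runner(argv)
        if status != 0:
            return argv[0], status
    return None, 0
```

Passing the runner as a parameter keeps the control flow testable without a GPU: a stub that returns canned exit statuses can stand in for the real commands.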
What makes a plan trustworthy
- Explicit placement: every byte has a declared home; no runtime demand-paging magic.
- MoE expert tiering (flagship): shared/attention layers in VRAM, experts in CPU RAM via llama.cpp --n-cpu-moe.
- Measured receipts: a real run verifies the forecast against a roofline ceiling and a calibrated band; the receipt feeds the next plan (the recalibration loop).
- Honest refusal: if no plan clears the >1 tok/s floor, the planner refuses and explains why.
- Routing de-risk: before building per-expert caching, measure whether the routing is even skewed enough to cache.
- Rig-safety watchdog: supervise every GPU job so a bad plan can’t take the machine down.
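Two of the checks above reduce to short arithmetic, sketched here under stated assumptions rather than as the planner's actual math. The first is a memory-bandwidth roofline for decode: each token reads some weight bytes from each tier, and the sketch assumes the tiers are read one after another with no overlap. The second is a routing-concentration measure: the share of routed tokens handled by the k most-used experts. The names `Tier`, `roofline_tok_s`, `verdict`, `top_k_share`, and the `efficiency` factor are hypothetical, and every number in the usage lines is illustrative, not a measurement.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

FLOOR_TOK_S = 1.0  # refusal floor from the text: a plan must clear >1 tok/s


@dataclass
class Tier:
    name: str
    bytes_per_token: float    # weight bytes read from this tier per decoded token
    bandwidth_bytes_s: float  # read bandwidth of this tier, in bytes per second


def roofline_tok_s(tiers: Sequence[Tier]) -> float:
    """Bandwidth-bound ceiling. ASSUMPTION: tiers are read sequentially."""
    seconds_per_token = sum(t.bytes_per_token / t.bandwidth_bytes_s for t in tiers)
    return 1.0 / seconds_per_token


def verdict(tiers: Sequence[Tier], efficiency: float = 1.0) -> Tuple[str, float]:
    """Return ("ship" or "refuse", forecast in tok/s).

    `efficiency` is a hypothetical calibration factor in (0, 1] that scales
    the ceiling down to a forecast; it stands in for the calibrated band.
    """
    forecast = roofline_tok_s(tiers) * efficiency
    return ("ship" if forecast > FLOOR_TOK_S else "refuse", forecast)


def top_k_share(expert_hits: Sequence[int], k: int) -> float:
    """Fraction of routed tokens handled by the k most-used experts."""
    total = sum(expert_hits)
    if total == 0:
        raise ValueError("no routing events recorded")
    return sum(sorted(expert_hits, reverse=True)[:k]) / total


# Illustrative placements only: experts in CPU RAM versus experts on NVMe.
experts_in_ram = [Tier("vram", 2e9, 1e12), Tier("cpu_ram", 4e9, 5e10)]
experts_on_nvme = [Tier("vram", 2e9, 1e12), Tier("nvme", 4e9, 2e9)]
print(verdict(experts_in_ram))
print(verdict(experts_on_nvme))
print(top_k_share([50, 30, 10, 10], k=2))
```

With these made-up numbers, the CPU RAM placement clears the floor and the NVMe placement does not, which is the kind of case where a refusal is the useful answer. A top-k share close to k divided by the expert count means routing is near uniform, so a per-expert cache would have little to exploit.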
Where to go
- Getting Started: install + the end-to-end walkthrough.
- CLI Reference: every command, flag, and exit code.
- The MoE Lane: the flagship expert-tiering architecture.
- Routing De-risk: the concentration gate and the Qwen3 result.
- Rig Safety: the watchdog and the incident that created it.
- Reference: exit codes, the safety envelope, and FAQs.
Status: beta. The profiler, planner, recalibration loop, routing de-risk gate, and watchdog are built and proven on an RTX 5090 / WSL2 rig. llama.cpp is the integrated backend; the placement math is backend-agnostic.