Run the largest useful model your GPU can honestly support.
A GPU-enabled container exposes the device; gpu-container decides what lives in VRAM, pinned RAM, and NVMe. It profiles the rig + model, emits an explicit placement plan, proves it with a measured receipt — and refuses when the plan would thrash. Not 'Docker VRAM overflow': explicit, declared placement is the moat.
Install (PyPI)
pip install "gpu-container[host]"
Install (npx)
npx gpu-container --help
Plan
gpu-container-plan --profile profile.json --model-config qwen3.json
Explicit placement, measured proof, honest refusal
Why a gpu-container plan is trustworthy on a single personal rig.
Explicit placement
Every byte has a declared home — VRAM, pinned RAM, or NVMe. CUDA Unified-Memory oversubscription is unavailable on Windows/WSL2 (NVIDIA-confirmed); declared placement is the path.
MoE expert tiering
The flagship lane: shared/attention layers in VRAM, experts routed to CPU RAM via llama.cpp --n-cpu-moe. Proven live on Qwen3-30B-A3B.
Measured receipts
A real llama-bench run verifies the plan against a roofline ceiling + calibrated band — and writes a calibration point back so the next plan is sharper.
Honest refusal
No plan clears >1 tok/s? It refuses and explains why. NVMe is the cold-expert lane, not a dense-weight-streaming lane — sub-1 tok/s there is physics.
Routing de-risk gate
Before building per-expert caching, measure whether a model's routing is even skewed enough to cache. Qwen3 routes near-uniform — so the cache is on hold, with evidence.
Rig-safety watchdog
Supervise a GPU job: poll GPU power/temp/VRAM + host memory, and abort on a hard breach (kill the job, or the whole VM). Born from a real incident.
Usage
Install
pip install "gpu-container[host]"
# or, zero Python:
npx gpu-container --help Profile → plan
# in the target container (honest, measured inputs):
gpu-container-profile --bench-dir /bench -o profile.json
gpu-container-plan --profile profile.json --model-config qwen3.json --quant gguf-q4_k_m Launch under the watchdog
gpu-container-watchdog run --on-breach kill-job --peaks-out peaks.json -- \
docker run --rm --gpus all -v "E:/AI-Models:/models" llama.cpp:full-cuda \
llama-bench -m /models/model.gguf --n-cpu-moe 0 -o json > bench.json Prove it with a receipt
gpu-container-receipt --plan plan.json --bench bench.json --peaks peaks.json \
--model-name Qwen3-30B-A3B --calibration-dir ./calib -o receipt.json