Memory-placement planner · beta

Run the largest useful model your GPU can honestly support.

A GPU-enabled container exposes the device; gpu-container decides what lives in VRAM, pinned RAM, and NVMe. It profiles the rig + model, emits an explicit placement plan, proves it with a measured receipt — and refuses when the plan would thrash. Not 'Docker VRAM overflow': explicit, declared placement is the moat.

Install (PyPI)

pip install "gpu-container[host]"

Install (npx)

npx gpu-container --help

Plan

gpu-container-plan --profile profile.json --model-config qwen3.json

Explicit placement, measured proof, honest refusal

Why a gpu-container plan is trustworthy on a single personal rig.

Explicit placement

Every byte has a declared home: VRAM, pinned RAM, or NVMe. CUDA Unified-Memory oversubscription is unavailable on Windows/WSL2 (NVIDIA-confirmed), so declared placement is the path.

MoE expert tiering

The flagship lane: shared/attention layers in VRAM, experts routed to CPU RAM via llama.cpp --n-cpu-moe. Proven live on Qwen3-30B-A3B.
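How a value for --n-cpu-moe might be chosen can be sketched in a few lines. This is an illustrative sketch, not gpu-container's planner: the function name, its parameters, and the simplifying assumption that every layer's experts are the same size are all hypothetical. It relies only on the llama.cpp behavior that --n-cpu-moe N keeps the expert weights of the first N layers on the CPU.

```python
def min_n_cpu_moe(shared_bytes, expert_bytes_per_layer, n_layers,
                  vram_budget_bytes, reserve_bytes=0):
    """Smallest N for which shared weights plus the experts of the
    remaining (n_layers - N) layers fit in the VRAM budget.

    reserve_bytes is a hypothetical allowance for KV cache and activations.
    Returns None when even N == n_layers does not fit.
    """
    for n in range(n_layers + 1):
        in_vram = (shared_bytes + reserve_bytes
                   + (n_layers - n) * expert_bytes_per_layer)
        if in_vram <= vram_budget_bytes:
            return n
    return None


if __name__ == "__main__":
    # Made-up sizes in arbitrary units, for illustration only.
    print(min_n_cpu_moe(shared_bytes=2000, expert_bytes_per_layer=300,
                        n_layers=48, vram_budget_bytes=8000))
```

A result of None corresponds to the case where the shared layers alone overflow VRAM, which this sketch treats as "no MoE-tiering plan exists".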

Measured receipts

A real llama-bench run verifies the plan against a roofline ceiling + calibrated band — and writes a calibration point back so the next plan is sharper.
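The check can be pictured with a generic memory-bandwidth roofline: decode speed is bounded by how fast each tier can deliver the bytes read per token. The sketch below is an assumption about the shape of such a check, not the tool's actual model; the function names, the tier labels, and the 0.5 to 1.0 band are hypothetical placeholders.

```python
def decode_ceiling_tok_s(bytes_per_token_by_tier, bandwidth_by_tier):
    """Upper bound on decode tokens/s if each tier's bytes are read once
    per token at that tier's bandwidth (bytes/s), with no overlap."""
    seconds_per_token = sum(
        bytes_per_token_by_tier[tier] / bandwidth_by_tier[tier]
        for tier in bytes_per_token_by_tier
    )
    return 1.0 / seconds_per_token


def check_against_band(measured_tok_s, ceiling_tok_s, band=(0.5, 1.0)):
    """Classify a measured rate by its ratio to the ceiling.

    The band is a hypothetical calibrated range of measured/ceiling ratios.
    """
    ratio = measured_tok_s / ceiling_tok_s
    if ratio > band[1]:
        return "above-ceiling"   # the inputs to the ceiling are probably wrong
    if ratio < band[0]:
        return "below-band"      # the plan underperforms its prediction
    return "within-band"


if __name__ == "__main__":
    # Made-up numbers: 1 GB/token from VRAM, 2 GB/token from host RAM.
    ceiling = decode_ceiling_tok_s(
        {"vram": 1e9, "ram": 2e9},
        {"vram": 500e9, "ram": 50e9},
    )
    print(round(ceiling, 1), check_against_band(15.0, ceiling))
```

In this picture, writing a calibration point back would amount to tightening the band from observed measured-to-ceiling ratios.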

Honest refusal

No plan clears 1 tok/s? gpu-container refuses and explains why. NVMe is the cold-expert lane, not a dense-weight-streaming lane: sub-1 tok/s there is physics.

Routing de-risk gate

Before building per-expert caching, measure whether a model's routing is skewed enough to be worth caching. Qwen3 routes near-uniformly, so the cache is on hold, with evidence.
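One way to quantify "skewed enough" is to log which experts the router picks and summarize the distribution. The sketch below is hypothetical and is not the gate's actual metric: it assumes a flat list of selected expert ids is available and reports normalized entropy (1.0 means perfectly uniform) plus the share of selections covered by the hottest quarter of experts.

```python
import math
from collections import Counter


def routing_skew(expert_ids, n_experts, top_fraction=0.25):
    """Return (normalized_entropy, top_coverage) for logged expert selections.

    normalized_entropy: 1.0 for uniform routing, near 0.0 for one hot expert.
    top_coverage: fraction of selections handled by the hottest
                  top_fraction of experts (what a cache of that size could hit).
    Requires n_experts >= 2 and a non-empty expert_ids.
    """
    counts = Counter(expert_ids)
    total = sum(counts.values())
    probs = [counts.get(e, 0) / total for e in range(n_experts)]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    normalized_entropy = entropy / math.log(n_experts)
    k = max(1, int(n_experts * top_fraction))
    top_coverage = sum(sorted(probs, reverse=True)[:k])
    return normalized_entropy, top_coverage


if __name__ == "__main__":
    uniform = list(range(8)) * 10                     # every expert equally often
    skewed = [0] * 90 + [1, 2, 3] * 2 + [4, 5, 6, 7]  # one dominant expert
    print(routing_skew(uniform, 8))
    print(routing_skew(skewed, 8))
```

With near-uniform routing the top quarter of experts covers only about a quarter of selections, so a cache that size would miss most of the time, which is the kind of evidence that would justify putting the cache on hold.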

Rig-safety watchdog

Supervise a GPU job: poll GPU power/temp/VRAM + host memory, and abort on a hard breach (kill the job, or the whole VM). Born from a real incident.
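The supervise loop can be sketched as poll, compare, kill. This is a minimal illustration under stated assumptions, not the watchdog's implementation: the function names and limit keys are hypothetical, host-memory checks and the whole-VM abort are omitted, and only the nvidia-smi --query-gpu interface is taken as given.

```python
import subprocess
import time

QUERY = "power.draw,temperature.gpu,memory.used"


def parse_sample(csv_line):
    """Parse one 'power, temp, vram' line from nvidia-smi csv,noheader,nounits.
    Assumes all three fields are numeric (some GPUs report [N/A] for power)."""
    power_w, temp_c, vram_mib = (float(x.strip()) for x in csv_line.split(","))
    return {"power_w": power_w, "temp_c": temp_c, "vram_mib": vram_mib}


def read_gpu_sample():
    """Read the first GPU's current power, temperature, and VRAM use."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def breaches(sample, limits):
    """Names of metrics whose sampled value exceeds its hard limit."""
    return [name for name, limit in limits.items() if sample[name] > limit]


def supervise(cmd, limits, read_sample=read_gpu_sample, interval_s=1.0):
    """Run cmd; kill it on the first hard breach. Returns the breached metrics,
    or an empty list if the job finished on its own."""
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:
        hit = breaches(read_sample(), limits)
        if hit:
            proc.kill()
            proc.wait()
            return hit
        time.sleep(interval_s)
    return []
```

Passing the sampler in as read_sample keeps the loop testable on a machine without a GPU; a call such as supervise(["llama-bench", "-m", "model.gguf"], {"temp_c": 85.0}) would use the real nvidia-smi reader, with the limit value here chosen purely for illustration.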

Usage

Install

pip install "gpu-container[host]"
# or, zero Python:
npx gpu-container --help

Profile → plan

# in the target container (honest, measured inputs):
gpu-container-profile --bench-dir /bench -o profile.json
gpu-container-plan --profile profile.json --model-config qwen3.json --quant gguf-q4_k_m

Launch under the watchdog

gpu-container-watchdog run --on-breach kill-job --peaks-out peaks.json -- \
  docker run --rm --gpus all -v "E:/AI-Models:/models" llama.cpp:full-cuda \
    llama-bench -m /models/model.gguf --n-cpu-moe 0 -o json > bench.json

Prove it with a receipt

gpu-container-receipt --plan plan.json --bench bench.json --peaks peaks.json \
  --model-name Qwen3-30B-A3B --calibration-dir ./calib -o receipt.json