Skip to content

The MoE Lane

Mixture-of-Experts models are the high-leverage case: Mixtral-8×7B is 47B params but only 13B active per token; DeepSeek-V3 is 671B with 37B active. Most experts sit cold. Without placement you OOM (all experts in VRAM) or crawl (all in RAM). With placement: hot in VRAM, warm in RAM, cold on NVMe — fits and fast.

TierHoldsBound by
VRAMalways-active shared/attention layers, hot experts, KV working setphysical VRAM (hard ceiling)
Pinned RAMwarm experts, KV spill, prefetch stagingPCIe (~50 GB/s sustained, measured)
NVMecold experts (gated), cold KVrandom-QD1 (≪ sequential — the number that matters)

No magic overflow — every byte has a declared home. CUDA Unified-Memory oversubscription is unavailable on Windows/WSL2 (NVIDIA-confirmed) and the wrong tool for decode even on Linux.

llama.cpp --n-cpu-moe N keeps attention + shared layers in VRAM and routes the first N MoE layers’ experts to CPU RAM (computed on CPU, KTransformers-style — not PCIe streaming). Raise N to fit, lower for speed. The planner computes the minimal N that fits VRAM, and refuses if even all-experts-on-CPU won’t clear the floor.

Proven live on Qwen3-30B-A3B-Q4_K_M (RTX 5090, in-container):

Nceiling tok/smeasured decoderealizedfloor
0738302.441% (overhead-bound)cleared
246941.961% (CPU-bw-bound)cleared
483620.456%cleared

The closed-form decode estimate is a roofline ceiling — peak bandwidth, zero overhead, a true upper bound. Real decode is a fraction of it. So the planner emits a calibrated forecast with a band:

ceiling (closed form) × realized efficiency (measured) = calibrated forecast ± band

A receipt records measured ÷ ceiling as a calibration point; the next plan scales its ceiling by the fitted efficiency. The verifier is a real llama-bench run — a different mechanism than the planner’s math.

Per-expert tiering is gated on a measurement

Section titled “Per-expert tiering is gated on a measurement”

True hot/warm/cold tiering within a layer needs a runtime expert-slot cache, because stock llama.cpp stores a layer’s experts as one fused tensor (-ot is per-layer only). Before building that cache for a model, the concentration gate measures whether the routing is even skewed enough to be worth it. On Qwen3-30B-A3B it isn’t (near-uniform) — so the per-layer hot tier ships now, and the per-expert cache waits for a model that passes the gate.