Routing De-risk
The per-expert cache is the highest-leverage idea in the product if the routing is skewed enough to exploit. If routing is near-uniform, a hot-expert cache buys almost nothing. So before building anything, measure. The concentration gate turns “should we build the per-expert cache?” into a number.

    gpu-container-concentration --imatrix imatrix.gguf --model-name Qwen3-30B-A3B

Two measures
Given an activation trace (which experts fired, per layer, over a representative workload):
- hot_frac_for_coverage: the fraction of a layer’s experts that must be resident (hottest-first) to capture 90% of its routing mass. This is the actionable number, because it maps directly to a cache size. Low = concentrated (build it); ≈ 0.90 = uniform (a cache wins nothing).
- concentration_score = 1 − normalized_entropy: a threshold-free skew measure in [0,1] (0 = uniform, 1 = one expert). A robust cross-check on hot_frac.
cache_helps is a convenience gate (median hot_frac < threshold); the numbers are the output.
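The two measures and the gate can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the tool's actual source: the function names mirror the metric names, entropy is assumed to be normalized by the log of the expert count, and the 0.25 default threshold is an assumption borrowed from the rough figure under "When to revisit".

```python
import math
from statistics import median


def hot_frac_for_coverage(counts, coverage=0.90):
    """Fraction of experts needed, hottest first, to reach `coverage` of the routing mass."""
    total = sum(counts)
    if total == 0:
        raise ValueError("layer has no recorded routing")
    running = 0
    for k, c in enumerate(sorted(counts, reverse=True), start=1):
        running += c
        if running >= coverage * total:
            return k / len(counts)
    return 1.0


def concentration_score(counts):
    """1 - normalized entropy: 0 for uniform routing, 1 when a single expert takes everything."""
    total = sum(counts)
    n = len(counts)
    if total == 0 or n < 2:
        raise ValueError("need at least two experts and a non-empty layer")
    # Experts that never fired contribute nothing to the entropy sum.
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1.0 - entropy / math.log(n)


def cache_helps(trace, threshold=0.25, coverage=0.90):
    """Convenience gate: median per-layer hot_frac below `threshold` (assumed default)."""
    return median(hot_frac_for_coverage(layer, coverage) for layer in trace) < threshold


# Toy trace: 2 layers x 8 experts, one sharply skewed and one perfectly uniform.
trace = [[93, 1, 1, 1, 1, 1, 1, 1], [5] * 8]
for layer in trace:
    print(hot_frac_for_coverage(layer), round(concentration_score(layer), 3))
print(cache_helps(trace))
```

Because hot_frac counts whole experts, it moves in steps of 1/E per layer, which is one reason to read it alongside the continuous concentration_score rather than on its own.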
The capture path

    # in the prebuilt llama.cpp container, N=0 (all experts in VRAM, the safe config):
    llama-imatrix -m model.gguf -f corpus.txt -ngl 99 --no-ppl -o imatrix.gguf

llama-imatrix records a per-MoE-layer ffn_down_exps.weight.counts array, which is exactly the per-expert selection counts the gate needs. --imatrix reads them directly (via the optional gguf package); --trace accepts a hand-built L×E counts JSON (dependency-free).
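For the hand-built route, a trace can be assembled with the standard library alone. The sketch below assumes the L×E counts JSON is a bare nested array (one row per MoE layer, one non-negative integer per expert); the source does not spell out the schema, so treat that layout and the helper name trace_to_json as hypothetical.

```python
import json


def trace_to_json(counts):
    """Validate an L x E matrix of selection counts and serialize it as JSON.

    Assumed layout (not confirmed by the tool): a list of L rows,
    each row a list of E non-negative integers.
    """
    if not counts or not counts[0]:
        raise ValueError("trace must have at least one layer and one expert")
    width = len(counts[0])
    for i, row in enumerate(counts):
        if len(row) != width:
            raise ValueError(f"layer {i} has {len(row)} experts, expected {width}")
        if any(not isinstance(c, int) or c < 0 for c in row):
            raise ValueError(f"layer {i} contains a non-integer or negative count")
    return json.dumps(counts)


# Toy example: 2 layers, 4 experts each.
payload = trace_to_json([[12, 3, 0, 1], [4, 4, 4, 4]])
print(payload)
```

Writing the returned string to a file gives something --trace could be pointed at, provided the tool's expected schema matches this assumption.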
The real result — Qwen3 routes near-uniform
| workload | tokens | experts for 90% coverage | concentration | top expert | cache_helps |
|---|---|---|---|---|---|
| diverse (prose/code/math) | ~1k | 65/128 (51%) | 0.111 | 4.3% | no |
| narrow (single-domain Python) | ~8k | 58/128 (45%) | 0.154 | 6.3% | barely |
Verdict: the per-expert cache is NOT worth building for Qwen3-30B-A3B. Even the narrow workload needs ~45% of experts resident, with no dominant expert. The likely cause is by design — modern MoEs train with load-balancing auxiliary losses that spread routing evenly, training away the very skew a cache would exploit.
The caveat that travels with every verdict
Concentration is workload-dependent. Request-level skew flattens toward uniform across diverse prompts (MoE-Infinity, arXiv:2401.14361), so a trace is only valid for the workload it was cut from. If your production workload is narrow and repetitive, cut your trace from that workload.
When to revisit
Build the per-expert tier only when both hold: (1) a model/workload passes the gate with a low hot_frac (roughly below 0.25), and (2) per-layer placement leaves decode below the calibrated band. Re-run the gate per target model; it’s cheap (one N=0 imatrix pass).