Rig Safety

gpu-container exists to run big models on a personal rig, which means it must never take that rig down. The watchdog is the safety net.

A too-large MoE benchmark drove host memory to 92–98% and throttled the machine for over a minute. The lesson, institutionalized: size and watch every GPU run, and abort the instant a hard threshold is crossed. On a WSL2 rig the abort of record is wsl --shutdown (takes effect immediately; frees all VM RAM in ~5s).

gpu-container-watchdog --json # one-shot, machine-readable
gpu-container-watchdog --watch --on-breach wsl-shutdown # autonomous abort (opt-in)

The watchdog reads GPU power/temp/VRAM (nvidia-smi, worst-case across all GPUs) and host memory (psutil). It emits ok / warn / abort, which map to exit codes 0 / 5 / 7. The default action is alert: it surfaces a breach but never auto-kills.
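Because the exit codes are meant to be scripted against, a caller can branch on them without parsing any output. The following is a minimal sketch, assuming only the command name, the --json flag, and the 0 / 5 / 7 mapping given above; the helper names state_from_exit_code and check_once are hypothetical, and the JSON payload is deliberately not parsed because its schema is not described here.

```python
import subprocess

# Exit-code contract taken from the text above: ok / warn / abort -> 0 / 5 / 7.
EXIT_STATES = {0: "ok", 5: "warn", 7: "abort"}


def state_from_exit_code(code: int) -> str:
    """Translate a watchdog exit code into its state name.

    Any code outside the documented contract is reported as "unknown"
    rather than being treated as safe.
    """
    return EXIT_STATES.get(code, "unknown")


def check_once(cmd=("gpu-container-watchdog", "--json")) -> str:
    """Run a one-shot check and return the state implied by the exit code."""
    try:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
    except FileNotFoundError:
        # The watchdog is not installed or not on PATH.
        return "unknown"
    return state_from_exit_code(result.returncode)
```

Treating unexpected codes as "unknown" instead of "ok" follows the same principle as the watchdog itself: an unreadable result should never look like a safe one.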

Supervisor — run a job under the watchdog

gpu-container-watchdog run --on-breach kill-job --peaks-out peaks.json -- <command...>

Launches the command as a child, polls metrics in parallel, and on a hard breach takes the action. kill-job terminates just the child (a soft abort); wsl-shutdown is the catastrophic case. This makes “run a GPU job safely” one self-monitoring command — the recommended way to run any GPU job.
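The supervise-and-kill loop described above can be sketched in a few lines. This is an illustration of the pattern, not the watchdog's actual implementation: run_supervised and its parameters are hypothetical names, and the breach check is passed in as a plain callable so the sketch does not depend on any particular metric source.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple


def run_supervised(
    command: Sequence[str],
    breached: Callable[[], bool],
    poll_seconds: float = 1.0,
    grace_seconds: float = 5.0,
) -> Tuple[int, bool]:
    """Run `command` as a child and stop it if `breached()` returns True.

    Returns (child_exit_code, was_killed). Only the child is stopped,
    which corresponds to the soft kill-job abort.
    """
    child = subprocess.Popen(list(command))
    killed = False
    while child.poll() is None:  # child is still running
        if breached():
            killed = True
            child.terminate()  # ask the child to stop
            try:
                child.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                child.kill()  # force it if the request is ignored
                child.wait()
            break
        time.sleep(poll_seconds)
    return child.wait(), killed
```

Terminating first and force-killing only after a grace period gives the job a chance to release GPU memory cleanly; the grace period here is an arbitrary illustrative value.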

--peaks-out records the run’s peak envelope (peak power / host-mem / VRAM, stayed_within_envelope), which gpu-container-receipt --peaks folds into the receipt — proof a run stayed inside the limits.
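A peak envelope of this kind is essentially a running maximum per metric plus a flag. The sketch below shows one way to accumulate it; apart from stayed_within_envelope, which is named above, every field and function name is hypothetical, and the real peaks.json layout may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


def running_peak(current: Optional[float], sample: Optional[float]) -> Optional[float]:
    """Running maximum that stays None until a real sample arrives."""
    if sample is None:
        return current
    return sample if current is None else max(current, sample)


@dataclass
class PeakEnvelope:
    # Field names other than stayed_within_envelope are hypothetical.
    peak_power_pct: Optional[float] = None
    peak_host_mem_pct: Optional[float] = None
    peak_vram_pct: Optional[float] = None
    stayed_within_envelope: bool = True

    def update(
        self,
        power_pct: Optional[float],
        host_mem_pct: Optional[float],
        vram_pct: Optional[float],
        breached: bool = False,
    ) -> None:
        """Fold one poll sample into the envelope."""
        self.peak_power_pct = running_peak(self.peak_power_pct, power_pct)
        self.peak_host_mem_pct = running_peak(self.peak_host_mem_pct, host_mem_pct)
        self.peak_vram_pct = running_peak(self.peak_vram_pct, vram_pct)
        if breached:
            # Once a hard threshold is crossed the flag stays False.
            self.stayed_within_envelope = False

    def to_json(self) -> str:
        """Serialize the envelope; unobserved metrics appear as null."""
        return json.dumps(asdict(self), indent=2)
```

A metric that was never observed serializes as null rather than 0, so a receipt built from this envelope cannot mistake "no data" for "no load".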

  • A missing metric is None, never 0 — the watchdog never invents a safe-looking reading.
  • mem_source tags the vantage: windows-host vs wsl2-vm/linux. The incident metric is the host; run the watchdog on the Windows host for true coverage (in a container, psutil only sees the VM, which can sit calm while the host starves).
  • Conservative VRAM source — the watchdog reads nvidia-smi memory.used (includes driver-reserved), so it over-counts rather than under-counts. The profiler uses pynvml v2 (separates reserved); they agree on total.
  • Stable exit codes: 0/5/7 are a scriptable contract.
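The "None, never 0" rule and the worst-case-across-GPUs rule can be illustrated together. The sketch below parses text in the shape produced by nvidia-smi's CSV query mode (for example --query-gpu=power.draw,temperature.gpu,memory.used --format=csv,noheader,nounits), where unsupported fields appear as bracketed placeholders such as [N/A]. The function names are hypothetical and this is not the watchdog's own parser.

```python
from typing import List, Optional


def parse_field(raw: str) -> Optional[float]:
    """Parse one CSV field; anything non-numeric (e.g. "[N/A]") becomes None."""
    try:
        return float(raw.strip())
    except ValueError:
        return None


def worst_case(values: List[Optional[float]]) -> Optional[float]:
    """Highest known value, or None if no GPU reported the metric."""
    known = [v for v in values if v is not None]
    return max(known) if known else None


def worst_case_from_csv(csv_text: str) -> List[Optional[float]]:
    """Reduce one row per GPU to a single worst-case value per column."""
    rows = [
        [parse_field(cell) for cell in line.split(",")]
        for line in csv_text.splitlines()
        if line.strip()
    ]
    # zip(*rows) regroups the per-GPU rows into per-metric columns.
    return [worst_case(list(column)) for column in zip(*rows)]
```

For two GPUs where one cannot report power, the power column still yields the other GPU's value; a column where every GPU is unreadable yields None, which downstream code has to handle explicitly instead of comparing it against a threshold.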

Defaults (tuned for an RTX 5090 / 64 GB / WSL2 28 GB rig), overridable via flags or --config watchdog.json:

| Threshold | Default | Aborts when |
| --- | --- | --- |
| --power-max | 95% | GPU power ≥ this % of the board limit |
| --temp-max | 87°C | GPU temperature ≥ this |
| --vram-max | 98% | VRAM ≥ this % |
| --host-mem-max | 90% | host memory ≥ this % (the incident metric) |
| --host-avail-min | 2000 MiB | free host/VM RAM below this |
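The abort rules in the table reduce to four "at or above" checks and one "below" check. The following sketch encodes the default values from the table; the class, function, and reading key names are hypothetical, and the real watchdog.json keys are not assumed here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Thresholds:
    # Defaults copied from the table above.
    power_max_pct: float = 95.0
    temp_max_c: float = 87.0
    vram_max_pct: float = 98.0
    host_mem_max_pct: float = 90.0
    host_avail_min_mib: float = 2000.0


def evaluate(
    reading: Dict[str, Optional[float]],
    limits: Thresholds = Thresholds(),
) -> Tuple[List[str], List[str]]:
    """Return (breached_metrics, missing_metrics) for one reading.

    A missing metric is reported separately; it is never counted as safe.
    """
    checks = {
        "power_pct": lambda v: v >= limits.power_max_pct,
        "temp_c": lambda v: v >= limits.temp_max_c,
        "vram_pct": lambda v: v >= limits.vram_max_pct,
        "host_mem_pct": lambda v: v >= limits.host_mem_max_pct,
        # The only "minimum" rule: abort when free RAM drops below the floor.
        "host_avail_mib": lambda v: v < limits.host_avail_min_mib,
    }
    breaches: List[str] = []
    missing: List[str] = []
    for name, is_breach in checks.items():
        value = reading.get(name)
        if value is None:
            missing.append(name)
        elif is_breach(value):
            breaches.append(name)
    return breaches, missing
```

Returning the missing metrics alongside the breaches lets the caller decide how strictly to treat a blind spot, rather than silently passing a run that could not be measured.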

A live offload run must fit VRAM-resident (~26 GB) + CPU experts (≤ ~15 GB) = ≤ ~40 GB total. N=0 (all-VRAM) is the proven-safe case. Bigger models (gpt-oss-120b, GLM-4.5-Air, Qwen3-235B) are paper-only: the planner validates them, but you don’t run them live on a 64 GB rig.
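The sizing rule can be written as a quick pre-flight check. This is a rough sketch that treats the approximate figures above as hard budgets, which is an assumption; the constant and function names are hypothetical and this is not the planner's logic.

```python
# Approximate budgets from the text above, treated here as hard limits.
VRAM_RESIDENT_BUDGET_GB = 26.0
CPU_EXPERTS_BUDGET_GB = 15.0


def fits_live(vram_resident_gb: float, cpu_experts_gb: float) -> bool:
    """True if both parts of an offload plan stay inside their budgets.

    cpu_experts_gb == 0 corresponds to the all-VRAM case.
    """
    return (
        0 <= vram_resident_gb <= VRAM_RESIDENT_BUDGET_GB
        and 0 <= cpu_experts_gb <= CPU_EXPERTS_BUDGET_GB
    )


def total_footprint_gb(vram_resident_gb: float, cpu_experts_gb: float) -> float:
    """Combined footprint, to compare against the roughly 40 GB total."""
    return vram_resident_gb + cpu_experts_gb
```

Checking each component separately matters because a plan can stay under the total while still overflowing one side, for example by pushing more expert weights into host RAM than the host budget allows.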