
Reliability Gauntlets

RunForge ships with a repeatable reliability suite — ten gauntlets that validate queueing, pause/resume, cancellation, crash recovery, fair scheduling, disk resilience, desktop reconnect, and GPU behavior. You can run them locally to verify that everything works correctly on your machine.

Before you start:

  • Python 3.10+
  • A workspace containing data/train.csv (a small CSV is sufficient)
  • Use --dry-run for deterministic, fast verification

G1: max_parallel enforcement

Goal: 4 jobs queued; at most 2 running concurrently.

Start the daemon with a parallelism limit of 2 and enqueue a sweep that produces 4 runs. Monitor with queue-status to confirm that no more than 2 jobs run at the same time.

Pass criteria:

  • Never more than 2 running jobs simultaneously
  • The group progresses to a terminal state (completed)
  • Each run produces logs.txt and result.json
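The first criterion can be checked mechanically by polling queue-status (or reading queue.json directly) while the gauntlet runs and tracking the peak number of simultaneously running jobs. A minimal sketch, assuming each queue.json snapshot exposes a "jobs" array whose entries carry "id" and "state" fields (those field names are assumptions, not a documented schema):

```python
def peak_running(snapshots):
    """Return the highest number of simultaneously running jobs
    seen across a series of queue.json-style snapshots."""
    return max(
        sum(1 for job in snap["jobs"] if job["state"] == "running")
        for snap in snapshots
    )

# Example: two snapshots captured while polling during the gauntlet.
snapshots = [
    {"jobs": [{"id": "r1", "state": "running"},   {"id": "r2", "state": "running"},
              {"id": "r3", "state": "queued"},    {"id": "r4", "state": "queued"}]},
    {"jobs": [{"id": "r1", "state": "completed"}, {"id": "r2", "state": "running"},
              {"id": "r3", "state": "running"},   {"id": "r4", "state": "queued"}]},
]
assert peak_running(snapshots) <= 2  # pass criterion: never more than 2 running
```

Sampling is inherently racy, so poll at a short interval; a limit violation that fits entirely between two samples would go unseen.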

G2: Pause / Resume

Goal: pausing stops new jobs from starting; resuming continues the queue.

Enqueue a sweep (6+ runs recommended), then pause the group. While paused, queued runs stay queued and no new jobs start. Running jobs may finish, but nothing new launches. After resuming, queued runs start again.

Pass criteria:

  • While paused: queued jobs remain queued, no new jobs start
  • After resume: queued runs begin executing
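The pause invariant (queued stays queued, nothing new launches) can be verified by diffing two snapshots: one taken at pause time and one taken while still paused. A sketch, assuming jobs are dicts with "id" and "state" fields (assumed names, not the documented schema):

```python
def pause_violations(at_pause, while_paused):
    """A job violates the pause invariant if it was queued when the group
    was paused but is running afterwards, i.e. the daemon launched it
    while paused. Jobs already running may legitimately finish."""
    queued_at_pause = {j["id"] for j in at_pause if j["state"] == "queued"}
    return [j["id"] for j in while_paused
            if j["id"] in queued_at_pause and j["state"] == "running"]

at_pause     = [{"id": "r1", "state": "running"},   {"id": "r2", "state": "queued"}]
while_paused = [{"id": "r1", "state": "completed"}, {"id": "r2", "state": "queued"}]
assert pause_violations(at_pause, while_paused) == []  # r1 finishing is fine
```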

G3: Cancel determinism

Goal: canceling a group marks queued jobs as canceled; running jobs complete or cancel cleanly.

Cancel a group mid-execution and verify that the group ends in a canceled state. No jobs should remain stuck in running indefinitely. State must be consistent across group.json and the queue.

Pass criteria:

  • Group ends in canceled
  • No jobs remain stuck in running
  • State is consistent in group.json and queue
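The consistency criterion can be checked by cross-referencing the two files named above. A sketch, assuming group.json carries a "status" plus a "runs" array and queue.json a "jobs" array, each entry with "id" and "state" (field names are assumptions based on the file list at the end of this page):

```python
def inconsistent_runs(group, queue):
    """Return run ids whose state in group.json disagrees with queue.json."""
    queue_state = {j["id"]: j["state"] for j in queue["jobs"]}
    return [r["id"] for r in group["runs"]
            if queue_state.get(r["id"]) != r["state"]]

def stuck_running(group):
    """After a cancel settles, nothing should still report running."""
    return [r["id"] for r in group["runs"] if r["state"] == "running"]

group = {"status": "canceled",
         "runs": [{"id": "r1", "state": "completed"},
                  {"id": "r2", "state": "canceled"}]}
queue = {"jobs": [{"id": "r1", "state": "completed"},
                  {"id": "r2", "state": "canceled"}]}
assert group["status"] == "canceled"
assert inconsistent_runs(group, queue) == [] and stuck_running(group) == []
```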

G4: Crash recovery

Goal: restarting after a daemon crash recovers the queue and resolves stale locks.

While jobs are active, kill the daemon process. Restart it and verify that it detects the stale lock/heartbeat, takes over, and resolves orphaned running jobs (marking them as failed, canceled, or requeued depending on policy). Remaining queued jobs should continue.

Pass criteria:

  • Daemon detects stale lock/heartbeat and takes over
  • Orphaned running jobs are resolved per policy
  • Remaining jobs continue to completion
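Stale-heartbeat detection boils down to comparing the last recorded heartbeat timestamp against a staleness threshold. A sketch, assuming daemon.json records a Unix timestamp under a "heartbeat_ts" key and assuming a 30-second threshold (both the field name and the threshold are illustrative assumptions; the real daemon's values may differ):

```python
import time

STALE_AFTER_S = 30  # assumed threshold, not the daemon's actual setting

def heartbeat_is_stale(daemon, now=None):
    """True when the recorded heartbeat is older than the threshold,
    which is when a restarted daemon should take over the lock."""
    now = time.time() if now is None else now
    return (now - daemon["heartbeat_ts"]) > STALE_AFTER_S

assert heartbeat_is_stale({"heartbeat_ts": 0}, now=100.0)        # long dead
assert not heartbeat_is_stale({"heartbeat_ts": 95.0}, now=100.0)  # fresh
```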

G5: Fair scheduling

Goal: a single run interleaves with a large sweep instead of waiting.

Enqueue a 10-run sweep, then immediately enqueue a single run. The single run should start early — not after all sweep runs complete. This validates round-robin fairness in the scheduler.

Pass criteria:

  • The single run starts early, consistent with round-robin fairness
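The expected ordering follows from round-robin dispatch across groups: the scheduler cycles between groups rather than draining one before the next. A minimal sketch of that dispatch order (an illustration of round-robin fairness in general, not RunForge's actual scheduler code):

```python
from collections import deque

def round_robin(groups):
    """Yield jobs by cycling across groups, so a late single-run group
    is served before an earlier 10-run sweep finishes."""
    queues = deque(deque(jobs) for jobs in groups if jobs)
    while queues:
        q = queues.popleft()
        yield q.popleft()
        if q:                     # rotate the group to the back if not empty
            queues.append(q)

sweep = [f"sweep-{i}" for i in range(10)]
order = list(round_robin([sweep, ["single"]]))
assert order.index("single") == 1  # dispatched second, not eleventh
```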

G6: Disk drift resilience

Goal: a missing run folder fails that job; the daemon continues with the rest.

Enqueue several jobs, then manually delete one queued run’s folder before it starts. That job should fail with a clear reason, while all other jobs proceed normally.

Pass criteria:

  • The affected job becomes failed with a clear reason
  • Other jobs proceed without disruption
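The behavior amounts to a pre-launch existence check: a job whose run folder has vanished fails with a reason, while the rest are untouched. A sketch using the .runforge/runs/&lt;run-id&gt;/ layout documented on this page (the job dict shape and the "failed"/"running" transitions are assumptions for illustration):

```python
import tempfile
from pathlib import Path

def resolve_startable(job, runs_root):
    """Before launching, verify the run folder still exists; a deleted
    folder fails just this job with a clear reason."""
    run_dir = Path(runs_root) / job["id"]
    if not run_dir.is_dir():
        return {**job, "state": "failed",
                "reason": f"run folder missing: {run_dir}"}
    return {**job, "state": "running"}

root = Path(tempfile.mkdtemp())
(root / "r1").mkdir()                 # r1's folder exists, r2's was deleted
ok   = resolve_startable({"id": "r1", "state": "queued"}, root)
gone = resolve_startable({"id": "r2", "state": "queued"}, root)
assert ok["state"] == "running" and gone["state"] == "failed"
```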

G7: Desktop reconnect

Goal: the desktop app reattaches to live state after reopening.

With the daemon running and jobs active, close RunForge Desktop. Reopen it and verify that the UI correctly renders queue status, group progress, and stale heartbeat warnings (if the daemon stopped while the app was closed).

Pass criteria:

  • Desktop renders queue status and group progress correctly
  • Stale heartbeat warning appears if the daemon stopped

G8: GPU fallback

Goal: GPU fallback is explicit and explained.

On a machine without a GPU (or with GPU detection disabled), create a run request with device.type = "gpu". The run should complete on CPU, and the result manifest should record:

  • effective_config.device.type = "cpu"
  • effective_config.device.gpu_reason = "no_gpu_detected"

Pass criteria:

  • Execution completes on CPU
  • Result manifest records the fallback reason
  • RF token includes [RF:DEVICE=CPU gpu_reason=no_gpu_detected]
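The manifest check is a straightforward JSON lookup using the effective_config.device fields named in the pass criteria above. A sketch that parses result.json and verifies both fields:

```python
import json

def check_gpu_fallback(result):
    """Verify the CPU fallback and its reason are recorded in the
    result manifest, using the field names from the pass criteria."""
    device = result["effective_config"]["device"]
    return device["type"] == "cpu" and device["gpu_reason"] == "no_gpu_detected"

# In practice, load this from .runforge/runs/<run-id>/result.json.
manifest = json.loads(
    '{"effective_config": {"device": '
    '{"type": "cpu", "gpu_reason": "no_gpu_detected"}}}'
)
assert check_gpu_fallback(manifest)
```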

G9: GPU exclusivity

Goal: GPU jobs respect the gpu_slots limit.

Start the daemon with --gpu-slots 1 and enqueue 2 GPU jobs. At most 1 GPU job should run at any time. The second GPU job waits until the first completes. CPU jobs are unaffected by the GPU slot constraint.

Pass criteria:

  • At most 1 GPU job running at any time
  • Second GPU job waits until first completes
  • CPU jobs unaffected by GPU slot limit
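The slot rule is a two-level admission check: every job must fit under the overall parallelism limit, and GPU jobs must additionally fit under gpu_slots. A sketch of that check (an illustration of the documented behavior, not the scheduler's actual code):

```python
def can_start(job, running, max_parallel, gpu_slots):
    """A job may start only if total capacity is free and, for GPU jobs,
    a GPU slot is also free. CPU jobs ignore the GPU slot limit."""
    if len(running) >= max_parallel:
        return False
    if job["device"] == "gpu":
        gpu_in_use = sum(1 for j in running if j["device"] == "gpu")
        return gpu_in_use < gpu_slots
    return True

running = [{"id": "g1", "device": "gpu"}]  # first GPU job holds the only slot
assert not can_start({"id": "g2", "device": "gpu"}, running,
                     max_parallel=2, gpu_slots=1)   # must wait for g1
assert can_start({"id": "c1", "device": "cpu"}, running,
                 max_parallel=2, gpu_slots=1)       # CPU job is unaffected
```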

G10: Mixed workload

Goal: CPU and GPU jobs progress concurrently without starvation.

Start the daemon with --max-parallel 4 --gpu-slots 1. Enqueue a 4-run GPU sweep and a 4-run CPU sweep. CPU jobs should start immediately (up to the parallel limit minus GPU jobs in use). GPU jobs run one at a time. Both sweeps should make progress concurrently — CPU jobs should not wait for all GPU jobs to finish.

Pass criteria:

  • CPU jobs start immediately (up to max_parallel - gpu_in_use)
  • GPU jobs run one at a time (gpu_slots=1)
  • Both sweeps make concurrent progress
  • No starvation in either direction
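The expected interleaving can be reasoned about with a tiny discrete simulation: each tick, running jobs finish and queued jobs are admitted under both limits. This is a sketch of the gauntlet's arithmetic, not the real scheduler; it shows that with max_parallel=4 and gpu_slots=1, all CPU jobs start within the first two ticks while GPU jobs drain one per tick:

```python
def simulate(jobs, max_parallel, gpu_slots):
    """Return the tick at which each job started, admitting jobs under
    both the total-parallelism and GPU-slot limits every tick."""
    queued, started, tick = list(jobs), {}, 0
    while queued:
        running = []
        for job in list(queued):
            total_ok = len(running) < max_parallel
            gpu_in_use = sum(1 for j in running if j["device"] == "gpu")
            gpu_ok = job["device"] != "gpu" or gpu_in_use < gpu_slots
            if total_ok and gpu_ok:
                running.append(job)
                queued.remove(job)
                started[job["id"]] = tick
        tick += 1
    return started

jobs = ([{"id": f"gpu-{i}", "device": "gpu"} for i in range(4)]
        + [{"id": f"cpu-{i}", "device": "cpu"} for i in range(4)])
starts = simulate(jobs, max_parallel=4, gpu_slots=1)
# All CPU jobs have started before the last GPU job: no starvation.
assert max(starts[f"cpu-{i}"] for i in range(4)) < starts["gpu-3"]
```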

When debugging gauntlet failures, these files contain the relevant state:

File | Purpose
.runforge/queue/queue.json | Job states and scheduling
.runforge/queue/daemon.json | Daemon heartbeat and status
.runforge/groups/<gid>/group.json | Group summary and run entries
.runforge/runs/<run-id>/logs.txt | Execution logs
.runforge/runs/<run-id>/result.json | Run outcome and effective config

Gauntlet coverage at a glance:

Gauntlet | Focus | Available since
G1 | max_parallel enforcement | v0.3.5+
G2 | Pause / Resume | v0.3.5+
G3 | Cancel determinism | v0.3.5+
G4 | Crash recovery | v0.3.5+
G5 | Fair scheduling | v0.3.5+
G6 | Disk drift resilience | v0.3.5+
G7 | Desktop reconnect | v0.3.5+
G8 | GPU fallback | v0.4.0+
G9 | GPU exclusivity | v0.4.0+
G10 | Mixed workload | v0.4.0+