Skip to content

Cancel and Recovery

Phase 4 introduces three lifecycle controls: cancel an active training run, recover a missing index, and a workspace-trust guard on Python subprocess spawn. This page covers each end-to-end.

A long-running training session can be cancelled at any time using the Cancel Active Training command. RunForge propagates VS Code’s native CancellationToken through to the Python subprocess, gives Python a 5-second window to flush partial state and write a durable cancel marker, and falls back to SIGKILL if the grace window expires.

  1. While a run is in progress, open the Command Palette (Ctrl+Shift+P)
  2. Run RunForge: Cancel Active Training
  3. Watch the progress notification — you’ll see a “Cancelling… N s” countdown as Python winds down

You can also dismiss the in-progress notification’s cancel affordance — both paths fire the same CancellationToken.

running → user fires cancel → TS arms 5s SIGKILL timer + sends SIGTERM
→ Python receives SIGTERM
→ emits cancelling{seconds_remaining: 5..0} per second
→ flushes partial state, atomically writes .cancelled marker
→ emits run_cancelled (graceful: true)
→ exits non-zero

The cancel state is determined by artifacts on disk and events observed during the run, never by process-exit timing. After Python exits — whether gracefully or by SIGKILL — RunForge classifies the terminal state:

StateDetector
Cancelled (graceful).cancelled marker present OR run_cancelled event was observed
Cancelled (forced)Cancel intent fired, neither marker nor event landed, non-zero exit
Completedartifacts_written event observed AND run.json exists (training finished before cancel could land)

When Python writes a run successfully but the index update fails, the run is “orphaned” — it exists on disk but is invisible to Browse Runs and other observability commands. The Recover Index command walks .ml/runs/, re-reads each run.json, and re-appends any missing run to .ml/outputs/index.json.

  1. Open the Command Palette
  2. Run RunForge: Recover Index
  3. Review the structured Recovery Report that opens — it lists scanned run dirs, already-indexed counts, recovered runs, skipped runs, and explicitly-excluded cancelled runs
  • Idempotent. Repeated calls do not duplicate entries (keyed on run_id).
  • Read-only. Recovery never modifies run.json or any other artifact — only index.json is rebuilt.
  • Cancelled runs stay out. Runs with a .cancelled marker but no run.json are NOT added to the index. They appear under RecoveryReport.cancelled_excluded and remain visible in the orphan picker.
  • Stale orphan markers cleaned up. A successful re-recovery deletes the .index-orphan marker (the run is now indexed; the orphan signal is stale).

The report is a canonical TS type (RecoveryReport in src/types.ts) consumed by both the command writer and the markdown render:

FieldDescription
scanned_run_dirsTotal run dirs walked
already_indexedCount of runs present in index.json before recovery
recoveredRuns newly added to the index this call
skippedRun dirs that could not be recovered (corrupt/missing run.json, read error) — each carries { run_dir, error: 'CORRUPT_RUN_JSON' | 'MISSING_RUN_JSON' | 'READ_ERROR', message }
cancelled_excludedCancelled runs explicitly NOT indexed
recovered_atISO 8601 UTC timestamp

Each recovered or cancelled_excluded entry carries a best-effort reason classification (index_orphan_marker, cancelled, or pre_existing_orphan).

Phase 4 extends the Stage C orphan banner across all 7 observability commands. Whenever an orphan run is detected, the banner offers Recover Index as a one-click remediation.

Python subprocess spawn (training, version check, GPU probe, dataset inspect, artifact inspect) requires VS Code workspace trust. This guard exists because RunForge executes user-controlled Python code from runforge.pythonPath and runforge.mlRunnerModule, which are themselves workspace-settable.

When you open a workspace for the first time, VS Code prompts you to grant or deny trust. Granting trust enables RunForge to spawn Python.

If you are running RunForge in an untrusted workspace, training and other commands that require Python will fail with a structured error pointing you to the Manage Workspace Trust UI. Grant trust there, then retry the command.

For the full security model, see TRUST_MODEL.md.

  • Reference — full command list and settings, including Cancel Active Training and Recover Index
  • Getting Started — first-run setup and the standard training workflow