Recovery Runbook

The long-running commands write to append-only ledgers and surface structured errors when a seam fails. This page is the partial-failure runbook: which seam failed, how to read what was written, and how to recover without re-running the whole chain.

Each section follows the same shape:

  • Symptom — what you see at the terminal or in artifacts.
  • Cause — which seam failed, and what is durable.
  • Recover — the command(s) to make forward progress.

Symptom. research-os review <section> exits with ReviewerCascadeFailedError (code REVIEWER_CASCADE_FAILED, retryable: true). The stderr line names each failed reviewer: Multi-pass review: every reviewer failed. <reviewer-name>: <reason> | ....

Cause. Every configured reviewer in the cascade failed on the same window (e.g., Ollama daemon down, model not pulled, or transient timeout). Previously written review.json records are append-only and durable; the failure is at the current window boundary, not retroactive.

Recover.

# 1. Confirm the reviewer backend is live.
curl http://localhost:11434/api/version
ollama list # confirm the configured model is present
# 2. Re-run the same command. Append-only records survive.
research-os review <section> --triaged-only --preset hermes-two-pass --profile hermes-two-pass

If the cascade names a specific reviewer (e.g., ollama-intern), inspect that reviewer’s reason field. The hint on ReviewerCascadeFailedError lists the failed reviewers so you can target the root cause.
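The stderr line packs every reviewer failure into a single string. A minimal sketch for splitting it into one reviewer per line, assuming the separators are exactly as shown in the symptom (the fixed prefix ending in "every reviewer failed." and " | " between entries). `split_cascade_failures` is a hypothetical helper, not part of research-os:

```shell
# Hypothetical helper: read the cascade stderr line on stdin and print
# one "<reviewer-name>: <reason>" entry per line.
# Assumes the "every reviewer failed. " prefix and " | " separators.
split_cascade_failures() {
  sed -e 's/^.*every reviewer failed\. *//' \
    | awk -F' [|] ' '{ for (i = 1; i <= NF; i++) print $i }'
}
```

Pipe the captured stderr line through it to read each reviewer's reason on its own line.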


Gather — one URL fails inside a multi-URL run

Symptom. research-os gather <section> reports fetchedFailed > 0 in its summary, or the run exits cleanly but the source-card directory is missing entries for some URLs.

Cause. Stage B B-A-001 made gather partial-write-resilient. A fetchOnce throw mid-loop now: (1) writes a synthetic failure receipt for the failing URL, (2) flushes the accumulated source-id batch immediately so prior successes become durable, and (3) continues with the next URL. The batch flush no longer drops in-flight source-ids on mid-loop failure.

Recover. Inspect evidence/fetch-log.jsonl to find which URLs failed:

# Find URLs with non-ok outcomes
grep -E '"fetch_outcome":"(network_error|http_error|extraction_failed)"' \
<pack>/evidence/fetch-log.jsonl

Then re-run gather with the failed URLs only — the same URL produces the same source-id deterministically, so the existing source-card directory is not duplicated:

research-os gather <section> --url <failed-url> --url <other-failed-url>
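Collecting the failed URLs by hand gets tedious on large runs. The sketch below reuses the grep pattern above and pulls out the URL of each failing record. The `"url"` field name is an assumption (the log schema is not shown here), and `failed_urls` is a hypothetical helper:

```shell
# Hypothetical helper: print the distinct URLs whose log record has a
# non-ok fetch_outcome. The "url" field name is an assumption; adjust it
# to whatever key fetch-log.jsonl actually uses.
failed_urls() {
  grep -E '"fetch_outcome":"(network_error|http_error|extraction_failed)"' "$1" \
    | sed -n 's/.*"url":"\([^"]*\)".*/\1/p' \
    | sort -u
}
```

Each printed URL can be passed to gather as its own `--url` argument. If the fetch log is append-only like the other ledgers, a URL that failed once and later succeeded will still be listed, so check its most recent record before retrying.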

Symptom. research-os pack publish --to <path> exits 2 with a verify-pack-style error after writing to the target. The error mentions hash mismatch, freeze-receipt, final-report, or a specific corrupted artifact.

Cause. The post-write verification pass re-hashes the canonical artifacts in the target and refuses if anything doesn’t reproduce. This catches a corrupted copy, a manifest that doesn’t match the receipt sha256, or an orphan-artifact violation. The target is left as-is so you can inspect.

Recover. Inspect the named file in the target. If the source pack is clean (re-run verify-pack.mjs against it), publish with --force to clear-and-replace:

# Verify the source pack first
node research-packs/scripts/verify-pack.mjs <source-pack>
# Re-publish with --force
research-os pack publish \
--from <source-pack> \
--to <target> \
--force

Do not hand-edit the published copy to clear the error; edit upstream artifacts (claims, sources, synthesis) or sibling files instead. See pack publish for the full admission contract.
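When the error does not make the corrupted file obvious, comparing content hashes between the source pack and the target can narrow it down. This is only a diagnostic sketch, not the logic verify-pack.mjs uses; `hash_tree` and `changed_files` are hypothetical helpers, and the sketch assumes GNU coreutils `sha256sum` is available:

```shell
# Hypothetical helper: print "<sha256>  <relative path>" for every file
# under a directory, sorted by path. Assumes GNU coreutils sha256sum.
hash_tree() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Print the relative paths that are missing on one side or whose
# contents differ between the two trees.
changed_files() {
  src_list=$(mktemp)
  dst_list=$(mktemp)
  hash_tree "$1" > "$src_list"
  hash_tree "$2" > "$dst_list"
  # diff exits 1 when the lists differ, so do not treat that as failure.
  { diff "$src_list" "$dst_list" || true; } \
    | sed -n 's/^[<>] [0-9a-f]\{64\}  //p' \
    | sort -u
  rm -f "$src_list" "$dst_list"
}
```

Call it as `changed_files <source-pack> <target>`. Treat the output as a starting point for inspection: a publish step may legitimately add or omit files, so a listed path is not automatically a corruption.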


Index build — malformed JSONL or source-card file

Symptom. research-os index build completes but stderr shows structured warnings: malformed_jsonl (path, 1-based line number, reason), malformed_source_card (path, reason), or section_index_failed (section id, reason).

Cause. Stage B B-A-002 made the indexer per-record-resilient. One malformed JSONL tail line or one bad evidence/source-cards/*.json no longer crashes the entire build. tryReadJsonl is wrapped per-line, readSourceCards is wrapped per-file, and indexSection is wrapped per-section. Healthy records still index.

Recover. Locate the named file + line, fix or truncate, then rebuild:

# Print the reported line (warnings include the 1-based line number <N>)
sed -n '<N>p' <pack>/<reported-path>
# If the malformed line is the last line, truncate it (keep lines 1..N-1)
head -n <N-1> <pack>/<reported-path> > <pack>/<reported-path>.fixed
mv <pack>/<reported-path>.fixed <pack>/<reported-path>
# Delete the index only after a SCHEMA_VERSION bump (see known-limitations)
rm .research-os/index.sqlite
# Rebuild (idempotent)
research-os index build --all
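The truncation step is only safe when the reported line really is the last one. A minimal sketch that enforces that check and keeps a backup; `truncate_tail_line` is a hypothetical helper, not a research-os command:

```shell
# Hypothetical helper: drop line N of a JSONL file only when N is the
# final line, keeping a .bak copy of the original next to it.
truncate_tail_line() {
  file=$1
  n=$2
  # awk counts a final line even when it lacks a trailing newline,
  # which is common for a partially written record.
  total=$(awk 'END { print NR }' "$file")
  if [ "$n" -ne "$total" ]; then
    echo "line $n is not the last line ($total); fix it by hand" >&2
    return 1
  fi
  cp "$file" "$file.bak"
  # Delete from line N through the end of the file.
  sed "${n},\$d" "$file.bak" > "$file"
}
```

A malformed line in the middle of a ledger is refused rather than deleted, since removing it would silently change the record order the indexer sees.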

Symptom. node scripts/reviewer-calibration.mjs --runs 3 ... writes per-run receipts to <profile>/runs/run-NNN.json but exits before the aggregate receipt is produced, or the aggregate disagrees with one specific run.

Cause. Each run is independent and writes its receipt before moving to the next. The aggregate seeded-v1.{json,md} is written only after all runs complete; partial-progress runs are durable on disk. Recurring-failure detection in the aggregate (median-based PASS/FAIL bars) intentionally demotes profiles that fail a majority of runs.

Recover. Inspect the per-run receipts to find the failing run, then either re-run only the failing run (advanced) or re-run the whole --runs N batch (canonical):

# Inspect per-run receipts
ls calibration/reviewer-profiles/<profile>/runs/
# Re-run the full batch — receipts overwrite the prior run-NNN.json
node scripts/reviewer-calibration.mjs \
--model hermes3:8b --two-pass --runs 3 \
--profile <profile>

A failed aggregate verdict is information, not an error — research-os refuses to trust a reviewer profile when repeated seeded failures do not support trust (Law 13). The hermes-two-pass-deterministic=failed canonical receipt is the mechanism working, not a bug.
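To see how far an interrupted batch got, it can help to list which per-run receipts are absent. A small sketch, assuming receipts are numbered from run-001.json up to the `--runs` value (the starting index is an assumption); `missing_runs` is a hypothetical helper:

```shell
# Hypothetical helper: print the run-NNN.json receipts that are absent
# from a runs directory. Assumes runs are numbered from 001 to N.
missing_runs() {
  dir=$1
  runs=$2
  i=1
  while [ "$i" -le "$runs" ]; do
    name=$(printf 'run-%03d.json' "$i")
    [ -f "$dir/$name" ] || echo "$name"
    i=$((i + 1))
  done
}
```

For example, `missing_runs calibration/reviewer-profiles/<profile>/runs 3` prints nothing when all three receipts exist.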


Freeze — refusal with stable reason_code

Symptom. research-os freeze exits 2 and writes audits/freeze-refusal.{json,md} (instead of freeze-receipt.{json,md}). The refusal carries reasons[] prose and a reason_records[] array with stable codes (Stage B B-C-003).

Cause. Freeze is the final integrity lock (Law 15). It refuses unless every condition is met: audit ready_for_synthesis, handoff synthesis_ready, all five synthesis files exist, final-report cites only accepted claim_ids, all active waivers disclosed, all canonical artifacts parse cleanly.

Recover. Read the reason_records[] array — each record has a stable reason_code and a reason_message. The CLI surfaces a generated next_actions list keyed off the codes (no substring matching). Common recovery paths:

| reason_code | Recovery |
| --- | --- |
| FREEZE_PACK_AUDIT_NOT_READY | Re-run research-os audit; address blockers it surfaces |
| FREEZE_HANDOFF_NOT_READY | Re-run research-os cowork handoff after audit clears |
| FREEZE_FINAL_REPORT_NO_CITATIONS | Add citations to synthesis/final-report.md |
| FREEZE_UNKNOWN_CLAIM_CITED | Remove or correct the cited claim id |
| FREEZE_UNACCEPTED_CITED | Cite only claims accepted by review (Law 7) |
| FREEZE_REPAIR_CLAIM_CITED | Re-review or remove citation to the repair-state claim |
| FREEZE_UNRESOLVED_CONTRADICTION_UNDISCLOSED | Resolve via contradict resolve or disclose in synthesis |
| FREEZE_WAIVER_UNDISCLOSED | Disclose active waivers in synthesis/decision-brief.md |
| FREEZE_MISSING_GATE | Run research-os gate <section> for each section |
| FREEZE_MISSING_REQUIRED_ARTIFACT | Re-run the upstream command that produces the missing artifact |
| FREEZE_MISSING_SYNTHESIS_ARTIFACT | Run research-os synth workspace (requires handoff synthesis_ready) |
| FREEZE_MALFORMED_ARTIFACT | Repair the named artifact and re-run freeze |

After recovery, re-run research-os freeze. The refusal artifact is overwritten by the next run; the receipt artifact is written only on PASS.
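If you want the codes from a refusal without reading the whole file, a text match is enough for a quick look. This sketch assumes only that each record carries a `"reason_code"` string as described above; it does not parse JSON, and `refusal_codes` is a hypothetical helper:

```shell
# Hypothetical helper: print each distinct reason_code found in a
# freeze refusal file. This is a text match, not a JSON parse, so it
# assumes codes are plain uppercase identifiers such as FREEZE_MISSING_GATE.
refusal_codes() {
  grep -o '"reason_code"[[:space:]]*:[[:space:]]*"[A-Z0-9_]*"' "$1" \
    | sed 's/.*"\([A-Z0-9_]*\)"$/\1/' \
    | sort -u
}
```

Run it against audits/freeze-refusal.json and look each printed code up in the recovery table.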