Recovery Runbook
The long-running commands write to append-only ledgers and surface structured errors when a seam fails. This page is the partial-failure runbook: which seam failed, how to read what was written, and how to recover without re-running the whole chain.
Each section follows the same shape:
- Symptom — what you see at the terminal or in artifacts.
- Cause — which seam failed, and what is durable.
- Recover — the command(s) to make forward progress.
Review — cascade failure mid-section
Symptom. research-os review <section> exits with
ReviewerCascadeFailedError (code REVIEWER_CASCADE_FAILED,
retryable: true). The stderr line names each failed reviewer:

    Multi-pass review: every reviewer failed. <reviewer-name>: <reason> | ....

Cause. Every configured reviewer in the cascade failed on the same window
(e.g., Ollama daemon down, model not pulled, or transient timeout). Previously
written review.json records are append-only and durable; the failure is at
the current window boundary, not retroactive.
Recover.

    # 1. Confirm the reviewer backend is live.
    curl http://localhost:11434/api/version
    ollama list   # confirm the configured model is present

    # 2. Re-run the same command. Append-only records survive.
    research-os review <section> --triaged-only --preset hermes-two-pass --profile hermes-two-pass

If the cascade names a specific reviewer (e.g., ollama-intern), inspect
that reviewer’s reason field. The hint on ReviewerCascadeFailedError lists
the failed reviewers so you can target the root cause.
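The two recovery steps can be wrapped in a small script. What follows is a minimal sketch, not part of research-os: the helper names model_is_listed, preflight, and retry are hypothetical, and OLLAMA_URL is an assumed override for the localhost address used in this runbook. Because the error is marked retryable and earlier records are durable, a bounded retry of the same command should be safe.

```shell
# Hypothetical helpers; none of these functions ship with research-os.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Succeeds if the model named in $1 appears in the first column of
# `ollama list` output read from stdin (the header row is skipped).
model_is_listed() {
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit found ? 0 : 1 }'
}

# Checks that the daemon answers and that the model is pulled.
preflight() {
  curl --silent --fail --max-time 5 "$OLLAMA_URL/api/version" > /dev/null \
    || { echo "Ollama is not answering at $OLLAMA_URL" >&2; return 1; }
  ollama list | model_is_listed "$1" \
    || { echo "model $1 is not pulled" >&2; return 1; }
}

# retry <attempts> <initial-delay-seconds> <command...>
# Re-runs the command until it succeeds, doubling the delay each time.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}
```

A possible invocation is `preflight <model> && retry 3 5 research-os review <section> --triaged-only --preset hermes-two-pass --profile hermes-two-pass`, with the placeholders filled in. Keeping the attempt count small avoids hammering a backend that is down for a non-transient reason.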
Gather — one URL fails inside a multi-URL run
Symptom. research-os gather <section> reports fetchedFailed > 0 in
its summary, or the run exits cleanly but the source-card directory is missing
entries for some URLs.
Cause. Stage B B-A-001 made gather partial-write-resilient. A
fetchOnce throw mid-loop now: (1) writes a synthetic failure receipt for
the failing URL, (2) flushes the accumulated source-id batch immediately so
prior successes become durable, and (3) continues with the next URL. The
batch flush no longer drops in-flight source-ids on mid-loop failure.
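The three-step behaviour can be illustrated with a small loop. This is only a sketch of the pattern, not the gather implementation: fetch_once is a stand-in that fails for any URL containing "bad", the "url" field in the receipt is an assumed name, and the cksum-based id merely imitates a deterministic source-id.

```shell
# Stand-in for the real fetcher (hypothetical): fails for URLs containing "bad".
fetch_once() {
  case "$1" in
    *bad*) return 1 ;;
    *) return 0 ;;
  esac
}

# Append any pending ids to the durable file and empty the batch.
flush_batch() {
  if [ -n "$batch" ]; then
    printf '%s' "$batch" >> "$ids_file"
    batch=""
  fi
}

# gather_urls <fetch-log> <ids-file> <url...>
gather_urls() {
  log_file="$1"; ids_file="$2"; shift 2
  batch=""
  for url in "$@"; do
    if fetch_once "$url"; then
      # Same URL always yields the same id (illustrative stand-in).
      id=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
      batch="${batch}${id}
"
    else
      # (1) record a synthetic failure receipt for this URL
      printf '{"url":"%s","fetch_outcome":"network_error"}\n' "$url" >> "$log_file"
      # (2) flush what has already succeeded so it is durable
      flush_batch
      # (3) move on to the next URL
      continue
    fi
  done
  flush_batch
}
```

Flushing at the moment of failure, rather than only at the end, is what keeps earlier successes from depending on later URLs.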
Recover. Inspect evidence/fetch-log.jsonl to find which URLs failed:

    # Find URLs with non-ok outcomes
    grep -E '"fetch_outcome":"(network_error|http_error|extraction_failed)"' \
      <pack>/evidence/fetch-log.jsonl

Then re-run gather with the failed URLs only. The same URL produces the same
source-id deterministically, so the existing source-card directory is not
duplicated:

    research-os gather <section> --url <failed-url> --url <other-failed-url>

Pack publish — verify-fail after write
Symptom. research-os pack publish --to <path> exits 2 with a
verify-pack-style error after writing to the target. The error mentions
hash mismatch, freeze-receipt, final-report, or a specific corrupted
artifact.
Cause. The post-write verification pass re-hashes the canonical artifacts in the target and refuses if anything doesn’t reproduce. This catches a corrupted copy, a manifest that doesn’t match the receipt sha256, or an orphan-artifact violation. The target is left as-is so you can inspect.
Recover. Inspect the named file in the target. If the source pack is
clean (re-run verify-pack.mjs against it), publish with --force to
clear-and-replace:

    # Verify the source pack first
    node research-packs/scripts/verify-pack.mjs <source-pack>

    # Re-publish with --force
    research-os pack publish \
      --from <source-pack> \
      --to <target> \
      --force

Edit upstream artifacts (claims, sources, synthesis) or sibling files instead. See pack publish for the full admission contract.
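Before reaching for --force, it can help to see exactly which files in the target diverge from the source. The helper below is a hypothetical sketch that uses only find and cmp; diff_packs is not a research-os command and does not reproduce the checks that verify-pack.mjs performs.

```shell
# diff_packs <source-pack> <target>
# Prints one line per file that is missing from, or differs in, the target.
diff_packs() {
  src="$1"; dst="$2"
  ( cd "$src" && find . -type f | sort ) | while IFS= read -r rel; do
    if [ ! -f "$dst/$rel" ]; then
      echo "missing: $rel"
    elif ! cmp -s "$src/$rel" "$dst/$rel"; then
      echo "differs: $rel"
    fi
  done
}
```

This only walks the source tree, so extra files in the target are not reported. Swapping the two arguments lists those as "missing", which is one way to spot a possible orphan artifact.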
Index build — malformed JSONL or source-card file
Symptom. research-os index build completes but stderr shows structured
warnings: malformed_jsonl (path, 1-based line number, reason),
malformed_source_card (path, reason), or section_index_failed (section
id, reason).
Cause. Stage B B-A-002 made the indexer per-record-resilient. One
malformed JSONL tail line or one bad evidence/source-cards/*.json no
longer crashes the entire build. tryReadJsonl is wrapped per-line,
readSourceCards is wrapped per-file, and indexSection is wrapped
per-section. Healthy records still index.
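The per-line idea can be sketched outside the indexer as a quick scan that reports suspicious lines without stopping at the first one. This is an illustration only: scan_jsonl is a hypothetical helper, its output format is invented here, and the check is a deliberately crude heuristic (a line must start with `{` and end with `}`) rather than a real JSON parse. It is aimed at the common case of a tail line cut off mid-write.

```shell
# scan_jsonl <file>
# Prints one warning per line that does not look like a complete JSON
# object, with its 1-based line number. Exits 1 if any line was flagged.
scan_jsonl() {
  awk -v path="$1" '
    /^[ \t\r]*$/ { next }                      # blank lines are ignored
    !/^[ \t]*[{].*[}][ \t\r]*$/ {
      printf "malformed_jsonl path=%s line=%d reason=not_a_complete_object\n", path, NR
      bad++
    }
    END { exit bad ? 1 : 0 }
  ' "$1"
}
```

Every line is examined independently, so one bad record never hides the ones after it. A line that passes this check can still be invalid JSON; the warnings from index build remain the authoritative report.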
Recover. Locate the named file + line, fix or truncate, then rebuild:

    # Find malformed tail line (warnings include 1-based line number)
    sed -n '<N>p' <pack>/<reported-path>

    # Truncate trailing malformed line if it is the last line
    head -n <N-1> <pack>/<reported-path> > <pack>/<reported-path>.fixed
    mv <pack>/<reported-path>.fixed <pack>/<reported-path>

    # Rebuild (idempotent)
    rm .research-os/index.sqlite   # only required on SCHEMA_VERSION bump (see known-limitations)
    research-os index build --all

Calibration — one run of --runs N fails
Symptom. node scripts/reviewer-calibration.mjs --runs 3 ... writes
per-run receipts to <profile>/runs/run-NNN.json but exits before the
aggregate receipt is produced, or the aggregate disagrees with one specific
run.
Cause. Each run is independent and writes its receipt before moving to
the next. The aggregate seeded-v1.{json,md} is written only after all
runs complete; partial-progress runs are durable on disk. Recurring-failure
detection in the aggregate (median-based PASS/FAIL bars) intentionally
demotes profiles that fail a majority of runs.
Recover. Inspect the per-run receipts to find the failing run, then
either re-run only the failing run (advanced) or re-run the whole --runs N
batch (canonical):

    # Inspect per-run receipts
    ls calibration/reviewer-profiles/<profile>/runs/

    # Re-run the full batch; receipts overwrite the prior run-NNN.json
    node scripts/reviewer-calibration.mjs \
      --model hermes3:8b --two-pass --runs 3 \
      --profile <profile>

A failed aggregate verdict is information, not an error: research-os
refuses to trust a reviewer profile when repeated seeded failures do not
support trust (Law 13). The hermes-two-pass-deterministic=failed canonical
receipt is the mechanism working, not a bug.
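Since each run writes its receipt before the next one starts, a gap in the runs directory marks where the batch stopped. The sketch below lists the receipts that are absent or empty. It assumes that NNN is a 1-based, zero-padded three-digit run number, which this page does not state explicitly, and missing_runs is a hypothetical helper rather than part of the calibration script.

```shell
# missing_runs <runs-dir> <expected-run-count>
# Prints the name of every expected receipt that is absent or empty.
missing_runs() {
  runs_dir="$1"; total="$2"
  i=1
  while [ "$i" -le "$total" ]; do
    name=$(printf 'run-%03d.json' "$i")
    [ -s "$runs_dir/$name" ] || echo "$name"
    i=$((i + 1))
  done
}
```

For a --runs 3 batch this would be called as `missing_runs calibration/reviewer-profiles/<profile>/runs 3`. No output means every receipt exists, in which case the aggregate is the next thing to inspect.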
Freeze — refusal with stable reason_code
Symptom. research-os freeze exits 2 and writes
audits/freeze-refusal.{json,md} (instead of freeze-receipt.{json,md}).
The refusal carries reasons[] prose and a reason_records[] array with
stable codes (Stage B B-C-003).
Cause. Freeze is the final integrity lock (Law 15). It refuses unless
every condition is met: audit ready_for_synthesis, handoff
synthesis_ready, all five synthesis files exist, final-report cites only
accepted claim_ids, all active waivers disclosed, all canonical artifacts
parse cleanly.
Recover. Read the reason_records[] array — each record has a stable
reason_code and a reason_message. The CLI surfaces a generated
next_actions list keyed off the codes (no substring matching). Common
recovery paths:
| reason_code | Recovery |
|---|---|
| FREEZE_PACK_AUDIT_NOT_READY | Re-run research-os audit; address blockers it surfaces |
| FREEZE_HANDOFF_NOT_READY | Re-run research-os cowork handoff after audit clears |
| FREEZE_FINAL_REPORT_NO_CITATIONS | Add citations to synthesis/final-report.md |
| FREEZE_UNKNOWN_CLAIM_CITED | Remove or correct the cited claim id |
| FREEZE_UNACCEPTED_CITED | Cite only claims accepted by review (Law 7) |
| FREEZE_REPAIR_CLAIM_CITED | Re-review or remove citation to the repair-state claim |
| FREEZE_UNRESOLVED_CONTRADICTION_UNDISCLOSED | Resolve via contradict resolve or disclose in synthesis |
| FREEZE_WAIVER_UNDISCLOSED | Disclose active waivers in synthesis/decision-brief.md |
| FREEZE_MISSING_GATE | Run research-os gate <section> for each section |
| FREEZE_MISSING_REQUIRED_ARTIFACT | Re-run the upstream command that produces the missing artifact |
| FREEZE_MISSING_SYNTHESIS_ARTIFACT | Run research-os synth workspace (requires handoff synthesis_ready) |
| FREEZE_MALFORMED_ARTIFACT | Repair the named artifact and re-run freeze |
After recovery, re-run research-os freeze. The refusal artifact is
overwritten by the next run; the receipt artifact is written only on PASS.
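For scripted recovery, the stable codes can be pulled out of the refusal file and matched exactly, in the same spirit as the CLI's own next_actions list. The following is a sketch under assumptions: list_reason_codes and next_action are hypothetical helpers, the extraction relies on grep and sed rather than a JSON parser, and only a few rows of the table are mapped.

```shell
# list_reason_codes <freeze-refusal.json>
# Prints each distinct reason_code value found in the file, sorted.
list_reason_codes() {
  grep -o '"reason_code"[[:space:]]*:[[:space:]]*"[A-Z_]*"' "$1" \
    | sed 's/.*"\([A-Z_]*\)"$/\1/' \
    | sort -u
}

# next_action <reason_code>
# Exact-match lookup; the message text is never inspected.
next_action() {
  case "$1" in
    FREEZE_PACK_AUDIT_NOT_READY) echo "Re-run research-os audit; address blockers it surfaces" ;;
    FREEZE_HANDOFF_NOT_READY)    echo "Re-run research-os cowork handoff after audit clears" ;;
    FREEZE_MISSING_GATE)         echo "Run research-os gate <section> for each section" ;;
    FREEZE_MALFORMED_ARTIFACT)   echo "Repair the named artifact and re-run freeze" ;;
    *)                           echo "See the recovery table for $1" ;;
  esac
}
```

A loop such as `list_reason_codes audits/freeze-refusal.json | while read -r code; do next_action "$code"; done` would print one suggested step per distinct code. Matching on the code rather than on the prose keeps the script stable if the wording of reason_message changes.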
Related pages
- CLI Reference — full command surface.
- Known limitations — v1.0 disclosed gaps.
- pack publish — admission contract + refusal cases.
- Reviewer calibration — multi-run receipts, status labels.
- Workflow chain — the 16-step chain from discover to freeze.