
Reviewer Calibration

v0.5.0 makes reviewer calibration durable. A reviewer profile is not trusted because it ran once; it earns a status through structured seeded-failure receipts and multi-run aggregation.

Product guardrail: research-os can now refuse to trust a reviewer profile when repeated seeded failures do not support trust.

No trusted baseline admitted (v0.5.0). The three canonical profiles shipped:

| Profile | Status |
| --- | --- |
| hermes-two-pass | failed (aggregate, 3 runs) |
| mistral-nemo-two-pass | conditional_pass (aggregate, 3 runs) |
| hermes-single-pass | comparison_only |

trusted_baseline is earned, not assumed. Single-run receipts exist for quick local checks; aggregate receipts (3+ runs, median-based bars) are the trust artifact.


A calibration receipt is a Zod-validated JSON file (seeded-v1.json) plus an operator-readable Markdown sibling (seeded-v1.md). They are written by the calibration harness and live at:

calibration/reviewer-profiles/<profile-name>/seeded-v1.{json,md}

The receipt records:

  • The model and architecture used
  • Per-category recall (any-flag + strict) across 5 failure categories
  • PASS/FAIL against 7 hard bars and 1 soft bar
  • A four-valued status label (see below)
  • Honest disclosure of which decisions the fixture cannot test

From the research-os repo root:

# Quick single-run (local check — backward-compat behavior)
node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass
# Canonical 3-run aggregate (production evidence — use this for admission decisions)
node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3
# Single-pass — architectural comparison only
node scripts/reviewer-calibration.mjs --model hermes3:8b --profile hermes-single-pass --mode comparison-only
# Different model, two-pass, 3-run aggregate
node scripts/reviewer-calibration.mjs --model mistral-nemo:12b --two-pass --profile mistral-nemo-two-pass --runs 3

When --runs <n> is specified (n ≥ 2), the harness:

  1. Rebuilds the fixture from scratch on each run (per-run isolation).
  2. Writes per-run receipts to <profile>/runs/run-001.json through run-00N.json.
  3. Aggregates across runs using median-based PASS/FAIL bars.
  4. Writes the aggregate receipt to <profile>/seeded-v1.{json,md}.

With --runs 1 (or with --runs omitted), the harness writes the single-run receipt directly to <profile>/seeded-v1.{json,md}; no runs/ subdirectory is created.
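The file layout described above can be sketched as a small path planner. This is only an illustration: `plannedReceiptPaths` is a hypothetical helper, not part of the harness, and the three-digit zero padding is inferred from the `run-001.json` naming.

```javascript
// Hypothetical helper: lists the receipt files a calibration run would write.
// Not the harness implementation; it only mirrors the layout described above.
function plannedReceiptPaths(profile, runs = 1) {
  const base = `calibration/reviewer-profiles/${profile}`;
  const perRun = [];
  if (runs >= 2) {
    for (let i = 1; i <= runs; i++) {
      // Assumption: run numbers are zero-padded to three digits (run-001.json).
      perRun.push(`${base}/runs/run-${String(i).padStart(3, "0")}.json`);
    }
  }
  return {
    perRun, // empty for a single run: no runs/ subdirectory is created
    receipt: [`${base}/seeded-v1.json`, `${base}/seeded-v1.md`],
  };
}

const aggregateLayout = plannedReceiptPaths("hermes-two-pass", 3);
const singleRunLayout = plannedReceiptPaths("hermes-two-pass");
```

In both cases the final receipt lands at the same `seeded-v1.{json,md}` location; only the multi-run case adds the `runs/` files.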

| Mode | When to use |
| --- | --- |
| --runs 1 (default) | Quick local check; iterating on a new profile; comparison experiments |
| --runs 3+ | Canonical admission evidence; any receipt you plan to ship or commit |

The per-category any-flag floor (≥50% per category) is unreliable at N=1: with only 1–2 seeds per category, a single missed claim can decide the outcome. Multi-run median aggregation absorbs this variance without lowering the bar (F-50 stabilization).
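To make the variance argument concrete, here is a minimal sketch with made-up numbers. It assumes the aggregate takes the median of the per-run metric and then compares it to the floor; the harness may aggregate differently, and `medianOf` is a hypothetical helper.

```javascript
// Median of a list of numbers (averages the two middle values for even N).
function medianOf(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Illustrative data only: any-flag recall for one 2-seed category over 3 runs.
const perRunRecall = [0.5, 0.0, 1.0];
const CATEGORY_FLOOR = 0.5;

// Judged run by run, the second run fails the floor on its own.
const singleRunVerdicts = perRunRecall.map((r) => r >= CATEGORY_FLOOR); // [true, false, true]

// Judged on the median (0.5), the same floor is met without being lowered.
const aggregateVerdict = medianOf(perRunRecall) >= CATEGORY_FLOOR; // true
```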


| Label | Meaning |
| --- | --- |
| trusted_baseline | Canonical Hermes two-pass + aggregate PASS + median FP = 0 + no recurring bar failures. The reference profile. |
| conditional_pass | Aggregate passes all bars but carries explicit caution (FP at ceiling, non-Hermes model, or recurring-failure demotion). |
| failed | Any hard aggregate bar fails. Not admitted. |
| comparison_only | Explicit --mode comparison-only, or single-pass Hermes (auto-assigned). Architectural comparison only; does NOT vouch for production. |

trusted_baseline ≠ conditional_pass. Do not treat them as interchangeable.

Mistral (mistral-nemo-two-pass) is NOT promoted to trusted_baseline regardless of aggregate outcome. It carries conditional_pass as the honest ceiling for a non-reference model.
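The label rules can be read as a small decision function. The sketch below is an interpretation, not the harness code: the input field names, the `hermes` prefix check, and the order in which the rules are tested (comparison_only first, then failed) are all assumptions.

```javascript
// Hypothetical status assignment following the label rules.
// Field names and rule precedence are assumptions, not the harness API.
function assignStatus(r) {
  const isHermes = r.model.startsWith("hermes");
  if (r.mode === "comparison-only" || (isHermes && r.architecture === "single-pass")) {
    return "comparison_only";
  }
  if (!r.aggregatePass) return "failed";
  const isReferenceProfile = isHermes && r.architecture === "two-pass";
  const clean = r.medianFalsePositives === 0 && r.recurringBarFailures.length === 0;
  // Non-Hermes models are capped at conditional_pass, whatever the aggregate says.
  return isReferenceProfile && clean ? "trusted_baseline" : "conditional_pass";
}

const mistralStatus = assignStatus({
  model: "mistral-nemo:12b",
  architecture: "two-pass",
  aggregatePass: true,
  medianFalsePositives: 0,
  recurringBarFailures: [],
}); // "conditional_pass"
```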

An aggregate receipt may carry a recurring_bar_failures list. If any hard bar FAILed in ≥ ⌈N/2⌉ individual runs, that bar appears in this list — even if the median across all runs happened to pass.

A non-empty recurring_bar_failures demotes a Hermes two-pass result from trusted_baseline to conditional_pass. The intent: one lucky median cannot mask a bar that was systematically unreliable across most runs.

Example: 3 runs where decision_vocab_completeness FAILed in runs 1 and 2 but PASSed in run 3 (median = PASS). The recurring-failure check catches this as 2/3 ≥ ⌈3/2⌉ = 2 — the bar goes into recurring_bar_failures and the profile is conditional_pass, not trusted_baseline.

When recurring_bar_failures is non-empty: inspect the per-run receipts in runs/ to see which runs caused the failures, and whether it is a fixture-scale sampling issue or a genuine model reliability problem.
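The recurring-failure rule is a simple count against ⌈N/2⌉. A minimal sketch follows, assuming each per-run receipt can be reduced to a map of bar name to "PASS" or "FAIL"; that shape, the `fp_ceiling` key, and `findRecurringBarFailures` itself are hypothetical.

```javascript
// Returns the bars that FAILed in at least ceil(N/2) of the N runs.
function findRecurringBarFailures(runs) {
  const threshold = Math.ceil(runs.length / 2);
  const failCounts = new Map();
  for (const run of runs) {
    for (const [bar, verdict] of Object.entries(run)) {
      if (verdict === "FAIL") failCounts.set(bar, (failCounts.get(bar) ?? 0) + 1);
    }
  }
  return [...failCounts]
    .filter(([, count]) => count >= threshold)
    .map(([bar]) => bar);
}

// Mirrors the 3-run example: two FAILs out of three meets ceil(3/2) = 2.
const threeRuns = [
  { decision_vocab_completeness: "FAIL", fp_ceiling: "PASS" },
  { decision_vocab_completeness: "FAIL", fp_ceiling: "PASS" },
  { decision_vocab_completeness: "PASS", fp_ceiling: "FAIL" },
];
const recurring = findRecurringBarFailures(threeRuns); // ["decision_vocab_completeness"]
```

A bar that fails only once in three runs (`fp_ceiling` here) stays out of the list, so an isolated miss does not trigger demotion.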


Hard bars (all must PASS for overall PASS)

| Bar | Threshold |
| --- | --- |
| FP ceiling | ≤ 1/5 good claims falsely flagged |
| Any-flag recall | ≥ 65% of bad claims receive any finding |
| Per-category any-flag | Each category with ≥ 2 seeds must have ≥ 50% any-flag recall |
| Strict recall | ≥ 20% of bad claims matched with expected category |
| Decision vocab | ≥ 4/6 (single-pass) or ≥ 3/6 (two-pass) unique decisions produced |
| Latency hard | ≤ 20 min total runtime |
| Empty/malformed | 0 malformed LLM responses |

Latency soft (≤ 10 min) is WARN-only — never blocks overall PASS.

The two-pass bar is lower (3/6 vs 4/6) because narrow_critic severity escalation collapses the needs_human_review path into harder decisions. Two-pass profiles structurally produce narrower decision vocabularies — the bar reflects that.
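As a rough illustration, the seven hard bars and the soft latency bar can be expressed as one check over a metrics object. The thresholds come from the hard-bar list; the metric field names, the bar keys, and `evaluateBars` are hypothetical and need not match the receipt schema.

```javascript
// Hypothetical bar evaluation; input field names are assumptions.
function evaluateBars(m) {
  const vocabFloor = m.architecture === "two-pass" ? 3 : 4; // out of 6 decisions
  const hard = {
    fp_ceiling: m.falsePositives * 5 <= m.goodClaims, // at most 1/5 falsely flagged
    any_flag_recall: m.anyFlagRecall >= 0.65,
    per_category_any_flag: m.categories
      .filter((c) => c.seeds >= 2) // the floor applies only to categories with 2+ seeds
      .every((c) => c.anyFlagRecall >= 0.5),
    strict_recall: m.strictRecall >= 0.2,
    decision_vocab: m.uniqueDecisions >= vocabFloor,
    latency_hard: m.runtimeMinutes <= 20,
    empty_malformed: m.malformedResponses === 0,
  };
  // The soft latency bar only warns; it never affects overallPass.
  const warnings = m.runtimeMinutes > 10 ? ["latency_soft"] : [];
  return { hard, warnings, overallPass: Object.values(hard).every(Boolean) };
}

// Illustrative metrics only, not taken from any shipped receipt.
const sampleResult = evaluateBars({
  architecture: "two-pass",
  falsePositives: 1,
  goodClaims: 5,
  anyFlagRecall: 0.7,
  categories: [{ seeds: 2, anyFlagRecall: 0.5 }, { seeds: 1, anyFlagRecall: 0 }],
  strictRecall: 0.25,
  uniqueDecisions: 2,
  runtimeMinutes: 12,
  malformedResponses: 0,
});
```

With these sample values every bar passes except `decision_vocab` (2 decisions against a two-pass floor of 3), so `overallPass` is false, and the 12-minute runtime adds a `latency_soft` warning without blocking anything on its own.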


Three receipts ship with v0.5.0 under calibration/reviewer-profiles/:

| Profile | Model | Architecture | Status |
| --- | --- | --- | --- |
| hermes-two-pass | hermes3:8b | two-pass | failed (aggregate, 3 runs); see CHANGELOG |
| mistral-nemo-two-pass | mistral-nemo:12b | two-pass | conditional_pass (aggregate, 3 runs) |
| hermes-single-pass | hermes3:8b | single-pass | comparison_only |

The hermes-single-pass receipt is comparison_only: its run passes --mode comparison-only, and single-pass Hermes is auto-assigned that label in any case. It isolates the marginal contribution of narrow_critic relative to the single-pass architecture. It does not vouch for production use.


Limit: needs_contradiction_mapping is unreachable

The seeded-v1 fixture does not seed unmapped_contradiction findings, so needs_contradiction_mapping can never appear in any calibration run output. Every receipt’s unreachable_decisions array discloses this honestly.

This means a profile’s decision-vocabulary coverage of needs_contradiction_mapping cannot be measured against this fixture. Fixture expansion is deferred to v0.6.


v0.6.0 adds reviewer_options to ReviewProfilePresetSchema, allowing operators to carry temperature, seed, and other Ollama sampling parameters into every OllamaInternReviewer construction via research.yaml profile config — no manual flag injection required.

review_profiles:
  hermes-two-pass-deterministic:
    mode: two_pass
    general_model: hermes3:8b
    critic_model: hermes3:8b
    review_window: 30
    reviewer_options:
      temperature: 0
      seed: 7

Using the profile:

research-os review <section> \
  --preset hermes-two-pass-deterministic \
  --profile hermes-two-pass-deterministic

The hermes-two-pass-deterministic preset ships as a built-in in DEFAULT_REVIEW_PROFILES (status: experimental). Add it to your pack’s research.yaml under review_profiles to activate it, or rely on the built-in default if your pack does not override review_profiles.

Deterministic settings reduce variance but do NOT guarantee trust. The canonical hermes-two-pass-deterministic aggregate receipt shows failed, because of a structural model-capability gap in decision vocabulary (2/6 decisions produced; 3/6 required). Deterministic settings make the per-run data stable and self-documenting; they do not improve the model’s decision-vocabulary coverage.

What the receipt discloses: review.json carries reviewer_options directly on the snapshot; review.md renders a ## Reviewer options section with stable key order (num_ctx, temperature, seed, top_p, top_k, repeat_penalty). The receipt is self-documenting without requiring a secondary profile lookup.
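The stable key order can be illustrated with a small renderer. Only the heading and the key order come from the description above; the bullet format, the choice to skip unset keys, and `renderReviewerOptions` are assumptions for this sketch.

```javascript
// Key order as listed for the "## Reviewer options" section.
const REVIEWER_OPTION_ORDER = ["num_ctx", "temperature", "seed", "top_p", "top_k", "repeat_penalty"];

// Hypothetical renderer: emits set options in the fixed order above,
// regardless of the order in which they appear on the input object.
function renderReviewerOptions(options) {
  const lines = ["## Reviewer options", ""];
  for (const key of REVIEWER_OPTION_ORDER) {
    if (options[key] !== undefined) lines.push(`- ${key}: ${options[key]}`);
  }
  return lines.join("\n");
}

// seed is given first here, but temperature is still rendered before it.
const renderedOptions = renderReviewerOptions({ seed: 7, temperature: 0 });
```

Iterating over a fixed key list, rather than over the object's own keys, is what keeps the rendered section identical across receipts that set the same options.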

Full evidence trail: docs/experiment-6-proof.md


When review-promote is called without explicit --calibration-* flags, it checks for a receipt at <pack>/calibration/reviewer-profiles/<profile>/seeded-v1.json and auto-populates calibration_summary in review-active.json.

The lookup is pack-relative — it uses the --pack <dir> argument, not the terminal’s current working directory.

Canonical vs pack-copy: the canonical receipts live in the research-os repo. Packs carry their own copy at <pack>/calibration/.... To enable auto-population in a pack, copy the relevant receipt into the pack directory.

Invalid-receipt fail behavior: if a receipt is present but fails JSON parse or Zod schema validation, review-promote fails with:

research-os: Invalid calibration receipt at <path>: <reason>

Do not delete the receipt to silence this — fix the receipt content.
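The lookup and fail behavior can be sketched as follows. This is a simplified stand-in: the real check uses the Zod schema, whereas this sketch hand-validates a single hypothetical `status` field, and both function names are invented for illustration.

```javascript
// Pack-relative receipt location, built from the --pack directory, not the CWD.
function receiptPathFor(packDir, profile) {
  return `${packDir}/calibration/reviewer-profiles/${profile}/seeded-v1.json`;
}

// Hypothetical validation: parses the receipt text and throws the documented
// error message on a JSON or shape failure. The `status` field is an assumption.
function parseCalibrationReceipt(receiptPath, rawText) {
  const fail = (reason) => {
    throw new Error(`research-os: Invalid calibration receipt at ${receiptPath}: ${reason}`);
  };
  let data;
  try {
    data = JSON.parse(rawText);
  } catch (err) {
    fail(err.message);
  }
  const statuses = ["trusted_baseline", "conditional_pass", "failed", "comparison_only"];
  if (data === null || typeof data !== "object" || !statuses.includes(data.status)) {
    fail(`status must be one of ${statuses.join(", ")}`);
  }
  return data;
}

const examplePath = receiptPathFor("packs/demo", "hermes-two-pass");
const exampleReceipt = parseCalibrationReceipt(examplePath, '{"status":"failed"}');
```

Here `packs/demo` is a placeholder pack directory. A missing receipt is a different case from an invalid one and is not handled by this sketch.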