Reviewer Calibration
v0.5.0 makes reviewer calibration durable. A reviewer profile is not trusted because it ran once; it earns a status through structured seeded-failure receipts and multi-run aggregation.
Product guardrail: research-os can now refuse to trust a reviewer profile when repeated seeded failures do not support trust.
No trusted baseline admitted (v0.5.0). The three canonical profiles shipped:
| Profile | Status |
|---|---|
| hermes-two-pass | failed (aggregate, 3 runs) |
| mistral-nemo-two-pass | conditional_pass (aggregate, 3 runs) |
| hermes-single-pass | comparison_only |
trusted_baseline is earned, not assumed. Single-run receipts exist for quick local
checks; aggregate receipts (3+ runs, median-based bars) are the trust artifact.
What a calibration receipt is
A calibration receipt is a Zod-validated JSON file (seeded-v1.json) plus an
operator-readable Markdown sibling (seeded-v1.md). They are written by the
calibration harness and live at:

    calibration/reviewer-profiles/<profile-name>/seeded-v1.{json,md}

The receipt records:
- The model and architecture used
- Per-category recall (any-flag + strict) across 5 failure categories
- PASS/FAIL against 7 hard bars and 1 soft bar
- A four-valued status label (see below)
- Honest disclosure of which decisions the fixture cannot test
Running the calibration harness
From the research-os repo root:

    # Quick single-run (local check; backward-compatible behavior)
    node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass

    # Canonical 3-run aggregate (production evidence; use this for admission decisions)
    node scripts/reviewer-calibration.mjs --model hermes3:8b --two-pass --profile hermes-two-pass --runs 3

    # Single-pass: architectural comparison only
    node scripts/reviewer-calibration.mjs --model hermes3:8b --profile hermes-single-pass --mode comparison-only

    # Different model, two-pass, 3-run aggregate
    node scripts/reviewer-calibration.mjs --model mistral-nemo:12b --two-pass --profile mistral-nemo-two-pass --runs 3

When --runs <n> is specified (n ≥ 2), the harness:
- Rebuilds the fixture from scratch on each run (per-run isolation).
- Writes per-run receipts to <profile>/runs/run-001.json…run-00N.json.
- Aggregates across runs using median-based PASS/FAIL bars.
- Writes the aggregate receipt to <profile>/seeded-v1.{json,md}.
When --runs 1 (or omitted), the harness writes the single-run receipt directly
to <profile>/seeded-v1.{json,md} — no runs/ subdirectory is created.
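The output layout described above can be sketched as a small function. This is only an illustration: `receiptPaths` is a hypothetical helper, not part of the harness, and it lists only the files this page names.

```typescript
// Hypothetical helper: lists the receipt files a calibration run is
// expected to write, following the layout described on this page.
function receiptPaths(profile: string, runs: number = 1): string[] {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError("runs must be a positive integer");
  }
  const base = `calibration/reviewer-profiles/${profile}`;
  const finalReceipt = [`${base}/seeded-v1.json`, `${base}/seeded-v1.md`];

  // Single run: the receipt is written directly, with no runs/ subdirectory.
  if (runs === 1) return finalReceipt;

  // Multi-run: per-run receipts run-001.json … run-00N.json, then the aggregate.
  const perRun = Array.from(
    { length: runs },
    (_, i) => `${base}/runs/run-${String(i + 1).padStart(3, "0")}.json`,
  );
  return [...perRun, ...finalReceipt];
}
```

Calling `receiptPaths("hermes-two-pass", 3)` yields three per-run JSON paths followed by the two aggregate files.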
When to use single-run vs multi-run
| Mode | When to use |
|---|---|
| --runs 1 (default) | Quick local check; iterating on a new profile; comparison experiments |
| --runs 3+ | Canonical admission evidence; any receipt you plan to ship or commit |
The per-category any-flag floor (≥50% per category) is unreliable at N=1: with only 1–2 seeds per category, a single missed claim decides the outcome. Multi-run median aggregation absorbs this variance without lowering the bar (F-50 stabilization).
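A minimal sketch of why the median helps, assuming the aggregate compares the median of per-run recall values against the unchanged 50% floor. The function names are hypothetical, and how the real harness handles an even number of runs is not stated here; this sketch averages the two middle values.

```typescript
// Median of a non-empty list; averages the two middle values for even length
// (an assumption, since this page only describes odd run counts such as 3).
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Per-category any-flag floor from the hard bars: at least 50% recall.
const ANY_FLAG_FLOOR = 0.5;

// A category passes when the median of its per-run any-flag recall meets the floor.
function categoryPasses(perRunRecall: number[]): boolean {
  return median(perRunRecall) >= ANY_FLAG_FLOOR;
}
```

With illustrative per-run recalls of `[1, 0, 1]`, one fully missed run does not sink the category because the median is 1, whereas a single run of `[0]` fails outright.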
Status labels
| Label | Meaning |
|---|---|
| trusted_baseline | Canonical Hermes two-pass + aggregate PASS + median FP = 0 + no recurring bar failures. The reference profile. |
| conditional_pass | Aggregate passes all bars but carries explicit caution (FP at ceiling, non-Hermes model, or recurring-failure demotion). |
| failed | Any hard aggregate bar fails. Not admitted. |
| comparison_only | Explicit --mode comparison-only, or single-pass Hermes (auto-assigned). Architectural comparison only; does NOT vouch for production. |
trusted_baseline ≠ conditional_pass. Do not treat them as interchangeable.
Mistral (mistral-nemo-two-pass) is NOT promoted to trusted_baseline regardless
of aggregate outcome. It carries conditional_pass as the honest ceiling for a
non-reference model.
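The label rules above can be read as a small decision function. The sketch below is one possible reading, not the harness's actual logic: the input shape, the `startsWith("hermes")` model check, and the order in which rules are tested (comparison_only first, then failed) are all assumptions.

```typescript
type Status = "trusted_baseline" | "conditional_pass" | "failed" | "comparison_only";

// Hypothetical summary of an aggregate outcome; field names are illustrative.
interface AggregateOutcome {
  model: string; // e.g. "hermes3:8b"
  architecture: "single-pass" | "two-pass";
  comparisonOnlyMode: boolean; // --mode comparison-only was passed
  allHardBarsPass: boolean;
  medianFalsePositives: number;
  recurringBarFailures: string[];
}

function assignStatus(o: AggregateOutcome): Status {
  const isHermes = o.model.startsWith("hermes"); // assumed model-family check
  if (o.comparisonOnlyMode || (isHermes && o.architecture === "single-pass")) {
    return "comparison_only";
  }
  if (!o.allHardBarsPass) return "failed";

  // Only the reference profile (Hermes two-pass) can reach trusted_baseline.
  const isReference = isHermes && o.architecture === "two-pass";
  const clean = o.medianFalsePositives === 0 && o.recurringBarFailures.length === 0;
  return isReference && clean ? "trusted_baseline" : "conditional_pass";
}
```

Under this reading, a Mistral two-pass aggregate that passes every bar still returns `conditional_pass`, which matches the ceiling described above.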
Recurring-failure demotion
An aggregate receipt may carry a recurring_bar_failures list. If any hard bar
FAILed in ≥ ⌈N/2⌉ individual runs, that bar appears in this list — even if the
median across all runs happened to pass.
A non-empty recurring_bar_failures demotes a Hermes two-pass result from
trusted_baseline to conditional_pass. The intent: one lucky median cannot mask
a bar that was systematically unreliable across most runs.
Example: 3 runs where decision_vocab_completeness FAILed in runs 1 and 2 but
PASSed in run 3 (median = PASS). The recurring-failure check catches this as
2/3 ≥ ⌈3/2⌉ = 2 — the bar goes into recurring_bar_failures and the profile is
conditional_pass, not trusted_baseline.
When recurring_bar_failures is non-empty: inspect the per-run receipts in
runs/ to see which runs caused the failures, and whether it is a fixture-scale
sampling issue or a genuine model reliability problem.
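The recurring-failure rule is simple enough to sketch directly. The per-run input shape below (a map from bar name to PASS/FAIL) is an assumption made for illustration; only the ⌈N/2⌉ threshold and the bar name decision_vocab_completeness come from this page.

```typescript
type BarResult = "PASS" | "FAIL";
type RunBars = Record<string, BarResult>; // hypothetical per-run shape: bar name -> result

// Returns the hard bars that FAILed in at least ceil(N/2) of the N runs,
// regardless of what the median-based aggregate says.
function recurringBarFailures(runs: RunBars[]): string[] {
  const threshold = Math.ceil(runs.length / 2);
  const failCounts = new Map<string, number>();
  for (const run of runs) {
    for (const [bar, result] of Object.entries(run)) {
      if (result === "FAIL") {
        failCounts.set(bar, (failCounts.get(bar) ?? 0) + 1);
      }
    }
  }
  return Array.from(failCounts.entries())
    .filter(([, count]) => count >= threshold)
    .map(([bar]) => bar)
    .sort();
}
```

For the three-run example above (FAIL, FAIL, PASS on decision_vocab_completeness), the threshold is 2 and the function returns that one bar.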
Hard bars (all must PASS for overall PASS)
| Bar | Threshold |
|---|---|
| FP ceiling | ≤ 1/5 good claims falsely flagged |
| Any-flag recall | ≥ 65% of bad claims receive any finding |
| Per-category any-flag | Each category with ≥ 2 seeds must have ≥ 50% any-flag recall |
| Strict recall | ≥ 20% of bad claims matched with expected category |
| Decision vocab | ≥ 4/6 (single-pass) or ≥ 3/6 (two-pass) unique decisions produced |
| Latency hard | ≤ 20 min total runtime |
| Empty/malformed | 0 malformed LLM responses |
Latency soft (≤ 10 min) is WARN-only — never blocks overall PASS.
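As a rough illustration, the thresholds in the table can be expressed as one evaluation function. The `RunMetrics` shape and the bar key names are hypothetical; only the numeric thresholds are taken from the table above.

```typescript
// Hypothetical metrics for one run (or one aggregate); field names are illustrative.
interface RunMetrics {
  architecture: "single-pass" | "two-pass";
  goodClaims: number; // number of seeded good claims
  goodClaimsFlagged: number; // good claims falsely flagged
  anyFlagRecall: number; // 0..1 across all bad claims
  perCategory: { seeds: number; anyFlagRecall: number }[];
  strictRecall: number; // 0..1
  uniqueDecisions: number; // out of 6
  runtimeMinutes: number;
  malformedResponses: number;
}

function evaluateHardBars(m: RunMetrics): Record<string, boolean> {
  const vocabFloor = m.architecture === "two-pass" ? 3 : 4;
  return {
    fp_ceiling: m.goodClaimsFlagged * 5 <= m.goodClaims, // at most 1 in 5
    any_flag_recall: m.anyFlagRecall >= 0.65,
    per_category_any_flag: m.perCategory
      .filter((c) => c.seeds >= 2) // categories with fewer seeds are exempt
      .every((c) => c.anyFlagRecall >= 0.5),
    strict_recall: m.strictRecall >= 0.2,
    decision_vocab: m.uniqueDecisions >= vocabFloor,
    latency_hard: m.runtimeMinutes <= 20,
    empty_malformed: m.malformedResponses === 0,
  };
}

// Overall PASS requires every hard bar; the soft latency bar only warns.
function overallPass(m: RunMetrics): boolean {
  return Object.values(evaluateHardBars(m)).every(Boolean);
}

function latencySoftWarn(m: RunMetrics): boolean {
  return m.runtimeMinutes > 10;
}
```

A run that takes 15 minutes but clears every hard bar still passes overall; `latencySoftWarn` returns true for it without affecting the result.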
Architecture-aware decision bar
The two-pass bar is lower (3/6 vs 4/6) because narrow_critic severity escalation
collapses the needs_human_review path into harder decisions. Two-pass profiles
structurally produce narrower decision vocabularies — the bar reflects that.
Canonical receipts (v0.5.0)
Three receipts ship with v0.5.0 under calibration/reviewer-profiles/:
| Profile | Model | Architecture | Status |
|---|---|---|---|
| hermes-two-pass | hermes3:8b | two-pass | failed (aggregate, 3 runs); see CHANGELOG |
| mistral-nemo-two-pass | mistral-nemo:12b | two-pass | conditional_pass (aggregate, 3 runs) |
| hermes-single-pass | hermes3:8b | single-pass | comparison_only |
The hermes-single-pass receipt is comparison_only (auto-assigned via --mode comparison-only).
It illuminates the marginal contribution of narrow_critic vs single-pass architecture.
It does not vouch for production use.
Limit: needs_contradiction_mapping is unreachable
The seeded-v1 fixture does not seed unmapped_contradiction findings, so
needs_contradiction_mapping can never appear in any calibration run output.
Every receipt’s unreachable_decisions array discloses this honestly.
This means a profile’s decision-vocabulary coverage of needs_contradiction_mapping
cannot be measured against this fixture. Fixture expansion is deferred to v0.6.
Deterministic reviewer profile
v0.6.0 adds reviewer_options to ReviewProfilePresetSchema, allowing operators to
carry temperature, seed, and other Ollama sampling parameters into every
OllamaInternReviewer construction via research.yaml profile config — no manual
flag injection required.

    review_profiles:
      hermes-two-pass-deterministic:
        mode: two_pass
        general_model: hermes3:8b
        critic_model: hermes3:8b
        review_window: 30
        reviewer_options:
          temperature: 0
          seed: 7

Using the profile:

    research-os review <section> \
      --preset hermes-two-pass-deterministic \
      --profile hermes-two-pass-deterministic

The hermes-two-pass-deterministic preset ships as a built-in in DEFAULT_REVIEW_PROFILES
(status: experimental). Add it to your pack’s research.yaml under review_profiles
to activate it, or rely on the built-in default if your pack does not override
review_profiles.
Deterministic settings reduce variance but do NOT guarantee trust. The canonical
hermes-two-pass-deterministic aggregate receipt shows failed — a structural model-
capability gap in decision vocabulary (2/6 decisions produced; requires 3/6). Deterministic
settings make the per-run data stable and self-documenting; they do not improve the
model’s decision-vocabulary coverage.
What the receipt discloses: review.json carries reviewer_options directly on the
snapshot; review.md renders a ## Reviewer options section with stable key order
(num_ctx, temperature, seed, top_p, top_k, repeat_penalty). The receipt is
self-documenting without requiring a secondary profile lookup.
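A minimal sketch of stable-key-order rendering, assuming each option is emitted as a Markdown bullet. The bullet format and the `renderReviewerOptions` name are assumptions; only the heading text and the key order come from the description above.

```typescript
// Key order as listed above; keys absent from the options are skipped.
const KEY_ORDER = ["num_ctx", "temperature", "seed", "top_p", "top_k", "repeat_penalty"] as const;

type ReviewerOptions = Partial<Record<(typeof KEY_ORDER)[number], number>>;

// Renders the options in KEY_ORDER regardless of the input object's key order,
// so two receipts with the same settings produce identical text.
function renderReviewerOptions(options: ReviewerOptions): string {
  const lines = KEY_ORDER
    .filter((key) => options[key] !== undefined)
    .map((key) => `- ${key}: ${options[key]}`);
  return ["## Reviewer options", "", ...lines].join("\n");
}
```

Passing `{ seed: 7, temperature: 0 }` renders temperature before seed, because the output follows the fixed order rather than insertion order.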
Full evidence trail: docs/experiment-6-proof.md
Auto-population in review-promote
When review-promote is called without explicit --calibration-* flags, it
checks for a receipt at <pack>/calibration/reviewer-profiles/<profile>/seeded-v1.json
and auto-populates calibration_summary in review-active.json.
The lookup is pack-relative — it uses the --pack <dir> argument, not the
terminal’s current working directory.
Canonical vs pack-copy: the canonical receipts live in the research-os repo.
Packs carry their own copy at <pack>/calibration/.... To enable auto-population
in a pack, copy the relevant receipt into the pack directory.
Invalid-receipt fail behavior: if a receipt is present but fails JSON parse
or Zod schema validation, review-promote fails with:

    research-os: Invalid calibration receipt at <path>: <reason>

Do not delete the receipt to silence this; fix the receipt content.
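The lookup and failure behavior can be sketched as two small functions. Both names are hypothetical, and the shape check is only a stand-in for the real Zod validation: it assumes the receipt is a JSON object with a string `status` field, which this page does not confirm.

```typescript
// Pack-relative receipt location: built from the --pack directory,
// never from the current working directory.
function receiptPath(packDir: string, profile: string): string {
  const root = packDir.replace(/\/+$/, ""); // drop trailing slashes
  return `${root}/calibration/reviewer-profiles/${profile}/seeded-v1.json`;
}

// Parses receipt text and fails loudly, using the error format shown above.
function parseReceipt(path: string, text: string): Record<string, unknown> {
  const fail = (reason: string): never => {
    throw new Error(`research-os: Invalid calibration receipt at ${path}: ${reason}`);
  };
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch (err) {
    return fail(err instanceof Error ? err.message : String(err));
  }
  // Stand-in for schema validation (assumption: object with a string `status`).
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return fail("receipt must be a JSON object");
  }
  const receipt = data as Record<string, unknown>;
  if (typeof receipt.status !== "string") {
    return fail("missing string field: status");
  }
  return receipt;
}
```

A missing receipt would simply skip auto-population, while a present but broken one throws, which is why deleting the file hides the problem instead of fixing it.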