Evaluation and Scaling

DeltaMind’s quality is measured empirically, not by vibes. Four fixture classes, seven scoreboard metrics, and systematic model sweeps provide the evidence.

Transcript fixtures

Each fixture is a realistic conversation transcript with gold labels — expected deltas and items that a correct extractor should produce.
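
As a rough illustration of what a fixture with gold labels might look like in code, here is a minimal sketch. The class and field names (`Fixture`, `ExpectedDelta`, `item_id`, `turn`) are hypothetical and not taken from the DeltaMind codebase; only the delta kind names come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class ExpectedDelta:
    """One gold-label delta (hypothetical shape)."""
    kind: str      # a delta kind such as "goal_set" or "decision_made"
    item_id: str   # ID of the item the delta should create or modify
    turn: int      # index of the turn where the change is stated


@dataclass
class Fixture:
    """A transcript plus the deltas a correct extractor should produce."""
    name: str
    turns: list[str]
    expected_deltas: list[ExpectedDelta] = field(default_factory=list)


# Tiny invented example, far shorter than the real fixtures.
demo = Fixture(
    name="demo",
    turns=["Let's build a CLI.", "It must run offline."],
    expected_deltas=[
        ExpectedDelta(kind="goal_set", item_id="g1", turn=0),
        ExpectedDelta(kind="constraint_added", item_id="c1", turn=1),
    ],
)
```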

Clean linear (9 turns)

A straightforward coding session. Decisions are explicit, constraints are stated clearly, tasks are concrete. This is the easy case — if an extractor fails here, something is fundamentally broken.

Messy real (12 turns)

A realistic brainstorming session with tangents, corrections, and implicit decisions. Tests whether the extractor can handle natural conversation patterns.

Pathological (14 turns)

Deliberately designed to trigger canonization failures: hedged language, conditional plans, reversed decisions, and questions that sound like decisions. This is the safety stress test.

Revision mini-pack (13 turns)

Seven revision scenarios: direct replacement, relaxation, tightening, reversal, partial rollback, wrong-target trap, implicit amendment. Tests the revision ontology specifically.

Long variants (56-62 turns)

Extended versions of the first three fixtures. Tests scaling behavior — whether item growth is sublinear, whether dedup holds, whether quality degrades with length.

Scoreboard metrics

The pipeline produces a scoreboard after every run:

Metric	What it measures
Precision	Accepted / total candidates. Are emitted deltas valid?
Recall	Matched expected / total expected. Were expected deltas found?
Premature canonization rate	Speculation promoted to decision. The safety metric.
Bad target rate	Revision/supersession pointing at wrong item.
Duplicate emission rate	Re-emitting equivalent deltas from chatter.
Reconciler rejection rate	Kernel refusing bad proposals.
Cost per accepted delta	Characters processed per useful state change.
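
The metrics in the table can be sketched as simple ratios over per-run counts. This is an illustrative sketch, not the pipeline's actual code: the `RunCounts` fields and the `scoreboard` function are hypothetical, and the denominators for the four rate metrics (total candidates) are an assumption, since the table only states denominators for precision and recall.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    """Raw counts from one pipeline run (all field names are hypothetical)."""
    candidates: int        # deltas the extractor proposed
    accepted: int          # candidates accepted as valid
    expected: int          # gold-label deltas in the fixture
    matched_expected: int  # gold-label deltas the run actually found
    premature_canon: int   # speculation promoted to a decision
    bad_targets: int       # revisions pointing at the wrong item
    duplicates: int        # re-emissions of an equivalent delta
    rejected: int          # proposals the reconciler refused
    chars_processed: int   # transcript characters read


def ratio(numerator: int, denominator: int) -> float:
    """Divide safely; an empty denominator yields 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0


def scoreboard(c: RunCounts) -> dict[str, float]:
    """Compute the seven scoreboard metrics from raw counts."""
    return {
        "precision": ratio(c.accepted, c.candidates),
        "recall": ratio(c.matched_expected, c.expected),
        # Assumption: the rates below are taken over all candidates.
        "premature_canonization_rate": ratio(c.premature_canon, c.candidates),
        "bad_target_rate": ratio(c.bad_targets, c.candidates),
        "duplicate_emission_rate": ratio(c.duplicates, c.candidates),
        "reconciler_rejection_rate": ratio(c.rejected, c.candidates),
        "cost_per_accepted_delta": ratio(c.chars_processed, c.accepted),
    }
```

Guarding every division keeps the scoreboard usable on degenerate runs, such as a model that emits no candidates at all.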

Per-kind breakdown

The scoreboard also reports precision and recall per delta kind. This reveals where extraction is strong and where it’s weak:

goal_set — 100% recall across all models. Solved.
constraint_added — 50-100% recall. Basically solved.
decision_made — 0-50% recall. The main recall drag.
decision_revised — 0% for most models. The real sinkhole.
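
A per-kind breakdown amounts to grouping the gold labels by delta kind before computing recall. The sketch below assumes each expected delta has already been marked as found or not; the function name and input shape are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def recall_by_kind(expected: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute recall per delta kind.

    `expected` yields (kind, was_found) pairs, one per gold-label delta.
    """
    total: dict[str, int] = defaultdict(int)
    found: dict[str, int] = defaultdict(int)
    for kind, was_found in expected:
        total[kind] += 1
        found[kind] += int(was_found)
    # Every kind in `total` has at least one entry, so division is safe.
    return {kind: found[kind] / total[kind] for kind in total}
```

Reporting recall this way makes a weak kind such as decision_revised visible even when the aggregate recall looks acceptable.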

Match class distribution

When evaluating against expected deltas, each match is classified:

Exact — same item ID (the extractor found the same item the gold label expected)
Semantic — different ID but same semantic ID (equivalent meaning, different label)
Fuzzy — word overlap above threshold but not semantic match
Missed — expected delta not found at all
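
The four match classes can be read as a cascade that tries the strictest test first. The following is a minimal sketch under stated assumptions: the dict keys (`id`, `semantic_id`, `text`), the Jaccard overlap measure, and the 0.5 threshold are all illustrative choices, not the evaluator's documented behavior.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def classify_match(expected: dict, candidates: list[dict],
                   threshold: float = 0.5) -> str:
    """Classify one expected delta against all emitted candidates."""
    # Exact: some candidate carries the same item ID as the gold label.
    if any(c["id"] == expected["id"] for c in candidates):
        return "exact"
    # Semantic: the ID differs, but the semantic ID is the same.
    if any(c["semantic_id"] == expected["semantic_id"] for c in candidates):
        return "semantic"
    # Fuzzy: word overlap above the threshold, without a semantic match.
    if any(word_overlap(c["text"], expected["text"]) > threshold
           for c in candidates):
        return "fuzzy"
    return "missed"
```

Ordering the checks from strictest to loosest means each expected delta receives the strongest class it qualifies for.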

Scaling results

The core scaling thesis: state grows sublinearly while transcript grows linearly.

Metric	Short (9-14 turns)	Long (56-62 turns)
Savings vs raw	29%	52%
Items vs turns	~linear	sublinear (2.9x items for 5x turns)
State-change density	92%	56% (sparser at scale)
Query score	6/6	6/6
Overhead vs gold	1.69x	2.36x (under 3x threshold)

The compression improvement with length is the key finding. Short transcripts can inflate (metadata overhead exceeds savings). But by 56+ turns, DeltaMind is compressing 4-8x while maintaining full query capability and provenance.
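
One way to quantify "sublinear" is to fit a power law to the two data points in the table: if items grow as turns to the power k, then k is the ratio of the logarithms of the two growth factors. This is an interpretive sketch, not a calculation the pipeline is known to perform, and two points are only enough for a rough estimate.

```python
import math


def growth_exponent(item_ratio: float, turn_ratio: float) -> float:
    """Return k such that item_ratio == turn_ratio ** k."""
    return math.log(item_ratio) / math.log(turn_ratio)


# Figures from the scaling table: 2.9x items for 5x turns.
k = growth_exponent(item_ratio=2.9, turn_ratio=5.0)
print(f"growth exponent: {k:.2f}")  # prints "growth exponent: 0.66"
```

An exponent of 1.0 would mean state grows in step with the transcript; a value below 1.0 is consistent with the sublinear claim.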

Why this happens

Most turns in a long conversation are elaboration, not mutation. The extractor identifies the sparse state changes and ignores the rest. The ratio of state-relevant turns to total turns drops as the conversation grows.

This is why summaries fail at scale — they compress everything equally, including the important parts. DeltaMind compresses by identifying what matters and discarding what doesn’t.

Model sweep results

Four models tested across four fixtures:

Model	Clean precision	Clean recall	Pathological canonization rate	Safety
gemma2:9b	100%	78%	0%	SAFE
phi4:14b	86%	67%	0%	SAFE
qwen2.5:14b	100%	56%	0%	SAFE
llama3.1:8b	100%	45%	14.3%	UNSAFE

gemma2:9b became the default based on these results: it has the best precision/recall balance, zero premature canonization on the pathological fixture, and it is smaller and faster than the 14B alternatives.

llama3.1:8b was blocked after it promoted a hedged “Use Redis” suggestion to decision_made on the pathological fixture. A model that canonizes in controlled testing will canonize in production.
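
The selection logic described above can be sketched as a hard safety gate followed by a ranking. The figures are copied from the sweep table; the dict layout and function names are hypothetical, and F1 is used here only as a stand-in for "precision/recall balance" (the actual decision also weighed model size and speed).

```python
# Rows copied from the sweep table; the dict layout is hypothetical.
SWEEP = {
    "gemma2:9b":   {"precision": 1.00, "recall": 0.78, "canon_rate": 0.0},
    "phi4:14b":    {"precision": 0.86, "recall": 0.67, "canon_rate": 0.0},
    "qwen2.5:14b": {"precision": 1.00, "recall": 0.56, "canon_rate": 0.0},
    "llama3.1:8b": {"precision": 1.00, "recall": 0.45, "canon_rate": 0.143},
}


def safety_label(canon_rate: float) -> str:
    """Any premature canonization on the pathological fixture blocks a model."""
    return "SAFE" if canon_rate == 0.0 else "UNSAFE"


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed balance measure)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pick_default(sweep: dict[str, dict[str, float]]) -> str:
    """Return the safe model with the highest F1 on the clean fixture."""
    safe = {name: row for name, row in sweep.items()
            if safety_label(row["canon_rate"]) == "SAFE"}
    return max(safe, key=lambda n: f1(safe[n]["precision"], safe[n]["recall"]))
```

With the table's figures, `pick_default(SWEEP)` returns `"gemma2:9b"`. Treating safety as a filter rather than a weighted score means no amount of precision or recall can compensate for a canonization failure.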

Dogfood results

Three realistic session types processed through the full pipeline:

Session	Turns	Items	Compression	Save/load stable
Coding (auth refactor)	35	7	22% of raw	Yes
Product (CLI planning)	28	7	25% of raw	Yes
Messy (brainstorming)	32	8	18% of raw	Yes

All gates pass: zero false canonization, zero hypothesis promotion, zero advisory boundary leaks, round-trip stable save/load.
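
A round-trip stability gate can be approximated by serializing the state, loading it back, and checking that nothing changed. This sketch assumes the state is a JSON-serializable dict, which may not match DeltaMind's actual storage format; the function name is hypothetical.

```python
import json


def is_round_trip_stable(state: dict) -> bool:
    """Check that save -> load -> save reproduces the same state."""
    first = json.dumps(state, sort_keys=True)
    reloaded = json.loads(first)
    second = json.dumps(reloaded, sort_keys=True)
    # Both the serialized form and the in-memory value must survive.
    return first == second and reloaded == state
```

Comparing the reloaded value against the original catches silent type changes, such as a tuple coming back as a list, that a comparison of the serialized strings alone would miss.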