Evaluation and Scaling

DeltaMind’s quality is measured empirically, not by vibes. Four fixture classes, seven scoreboard metrics, and systematic model sweeps provide the evidence.

Transcript fixtures

Each fixture is a realistic conversation transcript with gold labels — expected deltas and items that a correct extractor should produce.
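As a rough illustration of what a fixture carries, the sketch below pairs a transcript with its gold labels. The class names, field names, and the two example turns are hypothetical and chosen only for illustration; DeltaMind's actual fixture format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ExpectedDelta:
    """One gold label: a state change a correct extractor should emit."""
    kind: str     # delta kind, e.g. "goal_set" or "decision_made"
    item_id: str  # hypothetical identifier of the affected item
    turn: int     # index of the turn that should trigger the delta


@dataclass
class Fixture:
    """A transcript plus its gold labels (hypothetical layout)."""
    name: str
    turns: list[str]
    expected: list[ExpectedDelta] = field(default_factory=list)

    def expected_kinds(self) -> set[str]:
        """Return the distinct delta kinds the gold labels cover."""
        return {delta.kind for delta in self.expected}


# Illustrative two-turn fixture; real fixtures have 9 or more turns.
fixture = Fixture(
    name="clean-linear",
    turns=["Let's build a CLI.", "We will use SQLite for storage."],
    expected=[
        ExpectedDelta("goal_set", "goal-cli", turn=0),
        ExpectedDelta("decision_made", "dec-sqlite", turn=1),
    ],
)
```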

Clean linear (9 turns)

A straightforward coding session. Decisions are explicit, constraints are stated clearly, tasks are concrete. This is the easy case — if an extractor fails here, something is fundamentally broken.

Messy real (12 turns)

A realistic brainstorming session with tangents, corrections, and implicit decisions. Tests whether the extractor can handle natural conversation patterns.

Pathological (14 turns)

Deliberately designed to trigger canonization failures. Hedged language, conditional plans, reversed decisions, questions that sound like decisions. The safety stress test.

Revision mini-pack (13 turns)

Seven revision scenarios: direct replacement, relaxation, tightening, reversal, partial rollback, wrong-target trap, implicit amendment. Tests the revision ontology specifically.

Long variants (56-62 turns)

Extended versions of the first three fixtures. Tests scaling behavior — whether item growth is sublinear, whether dedup holds, whether quality degrades with length.

Scoreboard metrics

The pipeline produces a scoreboard after every run:

| Metric | What it measures |
| --- | --- |
| Precision | Accepted / total candidates. Are emitted deltas valid? |
| Recall | Matched expected / total expected. Were expected deltas found? |
| Premature canonization rate | Speculation promoted to decision. The safety metric. |
| Bad target rate | Revision/supersession pointing at wrong item. |
| Duplicate emission rate | Re-emitting equivalent deltas from chatter. |
| Reconciler rejection rate | Kernel refusing bad proposals. |
| Cost per accepted delta | Characters processed per useful state change. |
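The seven metrics can be derived from a handful of per-run counts. The following is a minimal sketch, not DeltaMind's actual scoreboard code: `RunCounts`, `scoreboard`, and the field names are hypothetical. Only the precision, recall, and cost formulas follow the definitions above; the denominators used for the other rates are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    candidates: int       # deltas proposed by the extractor
    accepted: int         # candidates the reconciler accepted
    expected: int         # gold-label deltas in the fixture
    matched: int          # gold-label deltas that were found
    premature: int        # speculation promoted to a decision
    bad_targets: int      # revisions pointing at the wrong item
    duplicates: int       # re-emitted equivalent deltas
    chars_processed: int  # transcript characters read


def ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def scoreboard(c: RunCounts) -> dict[str, float]:
    return {
        "precision": ratio(c.accepted, c.candidates),
        "recall": ratio(c.matched, c.expected),
        # Assumption: the three rates below are measured per candidate.
        "premature_canonization_rate": ratio(c.premature, c.candidates),
        "bad_target_rate": ratio(c.bad_targets, c.candidates),
        "duplicate_emission_rate": ratio(c.duplicates, c.candidates),
        # Assumption: every candidate that was not accepted was rejected.
        "reconciler_rejection_rate": ratio(c.candidates - c.accepted, c.candidates),
        "cost_per_accepted_delta": ratio(c.chars_processed, c.accepted),
    }


# Made-up counts, for illustration only.
board = scoreboard(RunCounts(candidates=10, accepted=9, expected=9, matched=7,
                             premature=0, bad_targets=0, duplicates=1,
                             chars_processed=4500))
```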

Per-kind breakdown

The scoreboard also reports precision and recall per delta kind. This reveals where extraction is strong and where it’s weak:

  • goal_set — 100% recall across all models. Solved.
  • constraint_added — 50-100% recall. Basically solved.
  • decision_made — 0-50% recall. The main recall drag.
  • decision_revised — 0% for most models. The real sinkhole.
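A per-kind breakdown amounts to grouping the gold labels by delta kind before computing recall. The sketch below assumes gold labels are `(kind, item_id)` pairs and that matching has already produced a set of found pairs. The function name and the sample data are hypothetical.

```python
from collections import Counter


def recall_by_kind(expected, matched):
    """Recall per delta kind.

    expected: list of (kind, item_id) gold labels
    matched:  set of (kind, item_id) gold labels the extractor found
    """
    total = Counter(kind for kind, _ in expected)
    found = Counter(kind for kind, item in expected if (kind, item) in matched)
    return {kind: found[kind] / total[kind] for kind in total}


# Made-up labels, for illustration only.
expected = [
    ("goal_set", "g1"),
    ("constraint_added", "c1"),
    ("constraint_added", "c2"),
    ("decision_made", "d1"),
    ("decision_made", "d2"),
    ("decision_revised", "r1"),
]
matched = {("goal_set", "g1"), ("constraint_added", "c1"),
           ("constraint_added", "c2"), ("decision_made", "d1")}

per_kind = recall_by_kind(expected, matched)
```

Kinds with no matches still appear in the result with a recall of 0.0, which keeps a total miss such as `decision_revised` visible instead of silently absent.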

Match class distribution

When evaluating against expected deltas, each match is classified:

  • Exact — same item ID (the extractor found the same item the gold label expected)
  • Semantic — different ID but same semantic ID (equivalent meaning, different label)
  • Fuzzy — word overlap above threshold but not semantic match
  • Missed — expected delta not found at all
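The four classes can be read as a cascade from the strictest test to the loosest. The sketch below is one way to express that cascade. The `Item` fields, the Jaccard word-overlap measure, and the 0.5 threshold are all assumptions made for illustration, not DeltaMind's actual matching rules.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str      # label assigned to the item
    semantic_id: str  # hypothetical normalized-meaning identifier
    text: str


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def classify(expected: Item, emitted: list[Item], threshold: float = 0.5) -> str:
    """Return the strongest match class found for one expected item."""
    if any(e.item_id == expected.item_id for e in emitted):
        return "exact"
    if any(e.semantic_id == expected.semantic_id for e in emitted):
        return "semantic"
    if any(word_overlap(e.text, expected.text) > threshold for e in emitted):
        return "fuzzy"
    return "missed"


gold = Item("dec-1", "use-sqlite", "use sqlite for storage")

exact = classify(gold, [Item("dec-1", "use-sqlite", "use sqlite for storage")])
semantic = classify(gold, [Item("dec-9", "use-sqlite", "sqlite is our store")])
fuzzy = classify(gold, [Item("dec-9", "pick-db", "use sqlite for local storage")])
missed = classify(gold, [Item("dec-9", "ship-date", "ship the release on friday")])
```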

Scaling results

The core scaling thesis: state grows sublinearly while transcript grows linearly.

| Metric | Short (9-14 turns) | Long (56-62 turns) |
| --- | --- | --- |
| Savings vs raw | 29% | 52% |
| Items vs turns | ~linear | sublinear (2.9x items for 5x turns) |
| State-change density | 92% | 56% (sparser at scale) |
| Query score | 6/6 | 6/6 |
| Overhead vs gold | 1.69x | 2.36x (under 3x threshold) |

The compression improvement with length is the key finding. Short transcripts can inflate (metadata overhead exceeds savings). But by 56+ turns, DeltaMind is compressing 4-8x while maintaining full query capability and provenance.
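One way to put a number on "sublinear" is to fit a power law, items ∝ turns^k, to the two measured points: k = 1 would be linear growth, and anything below 1 is sublinear. The sketch below applies that to the reported 2.9x items for 5x turns. The power-law model is an assumption made for illustration, not something the measurements establish.

```python
import math


def growth_exponent(item_ratio: float, turn_ratio: float) -> float:
    """Exponent k such that item_ratio == turn_ratio ** k."""
    return math.log(item_ratio) / math.log(turn_ratio)


# Reported scaling: 2.9x items for 5x turns.
k = growth_exponent(2.9, 5.0)
print(f"growth exponent: {k:.2f}")  # prints "growth exponent: 0.66"
```

Under this assumed model, doubling the transcript length would grow the item count by roughly 2^0.66, or about 1.6x.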

Why this happens

Most turns in a long conversation are elaboration, not mutation. The extractor identifies the sparse state changes and ignores the rest. The ratio of state-relevant turns to total turns drops as the conversation grows.

This is why summaries fail at scale — they compress everything equally, including the important parts. DeltaMind compresses by identifying what matters and discarding what doesn’t.

Model sweep results

Four models tested across four fixtures:

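| Model | Clean precision | Clean recall | Pathological canon | Safety |
| --- | --- | --- | --- | --- |
| gemma2:9b | 100% | 78% | 0% | SAFE |
| phi4:14b | 86% | 67% | 0% | SAFE |
| qwen2.5:14b | 100% | 56% | 0% | SAFE |
| llama3.1:8b | 100% | 45% | 14.3% | UNSAFE |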

gemma2:9b became the default based on these results: best precision/recall balance, zero canonization, smaller and faster than 14B alternatives.

llama3.1:8b was blocked after it promoted a hedged “Use Redis” suggestion to decision_made on the pathological fixture. A model that canonizes in controlled testing will canonize in production.
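The selection logic described above can be sketched as a hard safety gate followed by a ranking of the survivors. The numbers below are copied from the sweep table. Using F1 as the measure of "precision/recall balance" is an assumption made for this sketch, and the function names are hypothetical.

```python
# Sweep results from the table above, as fractions.
SWEEP = {
    "gemma2:9b":   {"precision": 1.00, "recall": 0.78, "pathological_canon": 0.0},
    "phi4:14b":    {"precision": 0.86, "recall": 0.67, "pathological_canon": 0.0},
    "qwen2.5:14b": {"precision": 1.00, "recall": 0.56, "pathological_canon": 0.0},
    "llama3.1:8b": {"precision": 1.00, "recall": 0.45, "pathological_canon": 0.143},
}


def is_safe(row: dict) -> bool:
    """Hard gate: any canonization on the pathological fixture blocks the model."""
    return row["pathological_canon"] == 0.0


def f1(row: dict) -> float:
    """Harmonic mean of precision and recall (assumed balance measure)."""
    p, r = row["precision"], row["recall"]
    return 2 * p * r / (p + r) if p + r else 0.0


safe_models = [name for name, row in SWEEP.items() if is_safe(row)]
blocked = [name for name, row in SWEEP.items() if not is_safe(row)]
best = max(safe_models, key=lambda name: f1(SWEEP[name]))
```

Applying the gate before ranking means a model's precision and recall cannot offset an unsafe result.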

Dogfood results

Three realistic session types processed through the full pipeline:

| Session | Turns | Items | Compression | Save/load stable |
| --- | --- | --- | --- | --- |
| Coding (auth refactor) | 35 | 7 | 22% of raw | Yes |
| Product (CLI planning) | 28 | 7 | 25% of raw | Yes |
| Messy (brainstorming) | 32 | 8 | 18% of raw | Yes |

All gates pass: zero false canonization, zero hypothesis promotion, zero advisory boundary leaks, round-trip stable save/load.