Beginners
If you build or operate AI assistants that talk to people, Synthesis helps you catch the moments where those assistants cause harm through careless language. This page explains the problem, the approach, and how to get started from zero.
What problem does Synthesis solve?
Section titled “What problem does Synthesis solve?”AI assistants regularly fail at emotional conversations. When a user says “I just got fired,” the assistant might respond with “You should just think positive!” or “Don’t worry, everything will work out!” These responses sound helpful but actually dismiss the user’s experience, override their autonomy, or make promises nobody can keep.
These failures are hard to catch with traditional testing because they are not factual errors. The assistant is not wrong about the weather or a math problem. It is wrong about how to treat a person. Synthesis catches three specific failure modes:
- Directive language that overrides user autonomy (“You should…”, “Just try to…”)
- Unverifiable reassurance that makes false promises (“Everything will be fine”, “I know how you feel”)
- Topic pivots that abandon emotional vulnerability (“That sounds hard. Anyway, have you tried pottery?”)
Who is Synthesis for?
Section titled “Who is Synthesis for?”Synthesis is for anyone building or evaluating AI systems that interact with people in sensitive contexts:
- AI developers who want to catch empathy regressions before they reach users
- QA teams who need deterministic, auditable evidence (not vibes) about response quality
- Researchers studying relational failure modes in language models
- Product teams who want to gate deployments on empathy quality alongside accuracy metrics
You do not need machine learning expertise to use Synthesis. It is a rule-based pattern matcher, not a model. If you can write JSON and run a Node.js CLI, you have everything you need.
How does it work?
Section titled “How does it work?”Synthesis takes a conversation (user message + assistant response) and runs it through three checkers. Each checker scans the assistant’s response for specific patterns using regular expressions.
There is no LLM judge, no probabilistic scoring, and no network calls. The same input always produces the same output. Every result includes the exact patterns that matched, so you can trace any verdict back to the specific words that triggered it.
A typical eval flow:
- You write test cases in JSONL format (one conversation per line)
- Synthesis validates each case against a JSON schema
- Each case runs through the checkers you specify
- A structured JSON report is generated with per-case results and aggregate metrics
Key concepts
Section titled “Key concepts”Before diving in, here are the terms you will encounter throughout the handbook:
| Term | Meaning |
|---|---|
| Checker | A rule-based function that scans text for specific patterns. Synthesis ships with three: agency_language, unverifiable_reassurance, and topic_pivot. |
| Eval case | A single test: one user message, one assistant response, and a list of checkers to run. Written as one line of JSON in a JSONL file. |
| Positive pattern | A regex that matches desirable language (e.g., “Would you like to…”). Used by the agency checker. |
| Negative pattern | A regex that matches harmful language (e.g., “You should…”). Used by the agency and reassurance checkers. |
| Negative example | A test case that is expected to fail. These confirm the checkers correctly catch known bad patterns. Tagged with negative_example or a -fail suffix. |
| Expected failure | A negative example that fails as intended. These never affect the CLI exit code. |
| Unexpected failure | A case that fails but was not marked as a negative example. These are regressions and drive the exit code. |
| Pass strength | A confidence signal on topic_pivot results: clear_pass, borderline_pass, clear_fail, or not_applicable. |
Your first eval in 5 minutes
Section titled “Your first eval in 5 minutes”Prerequisites: Node.js 18 or later.
# Clone the repositorygit clone https://github.com/mcp-tool-shop-org/synthesis.gitcd synthesis
# Install dependenciesnpm install
# Build from TypeScriptnpm run build
# Run the bundled eval casesnpm run evalIf you see “All checks passed!” with exit code 0, everything is working. The report is written to out/report.json.
To try your own test case, create a file called my_cases.jsonl:
{"id": "TEST-001", "user": "I just lost my job and I feel terrible.", "assistant": "That sounds really difficult. Would you like to talk about what happened?", "checks": ["agency_language", "unverifiable_reassurance", "topic_pivot"], "expected": {"agency_language": true, "unverifiable_reassurance": true, "topic_pivot": true}}Then run:
node dist/index.js --cases my_cases.jsonlCommon mistakes to avoid
Section titled “Common mistakes to avoid”Writing only positive cases. If all your test cases are good responses that should pass, you have no regression protection. Include negative examples (responses that should fail) so you know the checkers are actually catching problems. Tag them with -fail suffixes like reassurance-fail or pivot-fail.
Skipping the expected field. Without ground-truth labels, Synthesis cannot compute label accuracy or detect when a checker starts producing wrong results. Always include expected values for every check in every case.
Testing with too few checkers. A case with "checks": ["agency_language"] only tests one dimension. Real conversations can fail on multiple axes simultaneously. Run all three checkers when the scenario involves emotional vulnerability.
Ignoring borderline passes. A borderline_pass on the topic_pivot checker means the response acknowledged the emotion but engagement was weak. These are worth reviewing even though they technically pass.
Using --fail-on as a permanent setting. The failure threshold is a development ramp, not a production target. If you are shipping with --fail-on 5, you are accepting 5 empathy regressions in your product.
Next steps
Section titled “Next steps”- Getting Started — Detailed installation and first-run walkthrough
- Checkers — Deep dive into how each checker evaluates responses
- Writing Test Cases — JSONL format, negative examples, and best practices
- CI Integration — Automate empathy regression testing in your pipeline
- Reference — CLI options, report format, exit codes, and architecture