Writing Test Cases

Each line in your JSONL file is one eval case. Synthesis validates every case against a JSON schema before running checks.

Case format

{
  "id": "SYN-001",
  "user": "I just got fired from my job today.",
  "assistant": "That sounds really difficult. Would you like to talk about it?",
  "checks": ["agency_language", "unverifiable_reassurance", "topic_pivot"],
  "expected": { "agency_language": true, "unverifiable_reassurance": true, "topic_pivot": true },
  "tags": ["job-loss", "vulnerability"],
  "notes": "Good response: acknowledges, offers choice, stays on topic"
}

Required fields

Field	Type	Description
`id`	string	Unique identifier matching `^[A-Z]+-[0-9]+$` (e.g., `SYN-001`)
`user`	string	The user’s message
`assistant`	string	The assistant response to evaluate
`checks`	string[]	Which checkers to run

Optional fields

Field	Type	Description
`expected`	object	Ground-truth labels for validation
`tags`	string[]	Categorization and negative-example markers
`notes`	string	Why this case exists

Negative examples

Negative examples are responses that should fail — they serve as regression tests confirming the checkers catch known bad patterns.

Mark a case as negative with either approach:

{"tags": ["negative_example"]}

Or use a descriptive suffix:

{"tags": ["reassurance-fail"]}
{"tags": ["pivot-fail"]}
{"tags": ["ack-but-pivot-fail"]}

Any tag ending in -fail is treated as a negative example. Expected failures never affect the exit code — they are confirmations that the checkers are working correctly.

Best practices

Cover both positive and negative — Good responses that should pass and bad responses that should fail
Use descriptive IDs — Group by checker: AGY-001 for agency, LUV-003 for reassurance, PIV-005 for pivot
Provide expected labels — Enables label accuracy tracking and regression detection
Tag for filtering — Tags like vulnerability, job-loss, grief help you understand which scenarios are covered
Write notes — Explain why a case exists, especially edge cases