Dataset Workflow
This page walks through one complete dataset production path: from an existing project with curated, canon-bound records to a versioned export package and eval pack.
Prerequisites
Before starting, you need a project with:
- Curated records (at least some `approved` with per-dimension scores)
- Canon binding (`sdlab bind` has been run, producing pass/fail/partial assertions)
- A valid project config (`sdlab project doctor` passes)
If you don’t have curated records yet, see Getting Started first.
Step 1: Check eligibility
Before creating a snapshot, audit which records qualify for inclusion.
```
sdlab eligibility audit --project my-project
```

This evaluates every record against the default selection profile:
- Requires human judgment (no un-reviewed records)
- Requires `approved` status
- Requires canon binding (at least one assertion)
- Requires minimum 50% pass ratio
The audit shows your eligibility rate and breaks down exclusion reasons. Near-miss records (1 failing check) are highlighted as improvement opportunities.
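As a rough sketch, the default checks could be expressed like this. The record field names (`reviewed`, `status`, `assertions`) are illustrative assumptions, not sdlab's actual schema:

```python
# Illustrative sketch of the default eligibility checks.
# Field names are assumptions, not sdlab's real record schema.

def eligibility_failures(record: dict) -> list[str]:
    """Return the failed checks; an empty list means the record is eligible."""
    failures = []
    if not record.get("reviewed"):
        failures.append("no-human-judgment")     # must have human review
    if record.get("status") != "approved":
        failures.append("not-approved")          # must be approved
    assertions = record.get("assertions", [])
    if not assertions:
        failures.append("no-canon-binding")      # at least one assertion
    elif sum(1 for a in assertions if a == "pass") / len(assertions) < 0.5:
        failures.append("pass-ratio-below-50")   # minimum 50% pass ratio
    return failures

record = {"reviewed": True, "status": "approved",
          "assertions": ["pass", "pass", "fail"]}
print(eligibility_failures(record))  # [] -> eligible
```

A record with exactly one entry in the returned list is a "near-miss" in the audit's terms.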
Step 2: Create a snapshot
A snapshot freezes a deterministic selection of eligible records at a point in time.
```
sdlab snapshot create --project my-project
```

What this produces:
- `snapshots/<id>/snapshot.json` — manifest with config fingerprint, counts, selection profile
- `snapshots/<id>/included.jsonl` — one line per included record with reason trace
- `snapshots/<id>/excluded.jsonl` — one line per excluded record with exclusion reasons
- `snapshots/<id>/summary.json` — lane and faction distribution
Key property: the same project config and records always produce the same snapshot. The config fingerprint (SHA-256 of all 5 config files) proves this.
Use `sdlab snapshot list` and `sdlab snapshot diff <a> <b>` to track changes over time.
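The determinism claim can be illustrated with a sketch of one plausible fingerprint scheme. The hashing order and file set are sdlab internals; this minimal version only shows why such a fingerprint is order-independent and content-sensitive:

```python
# Sketch: a SHA-256 fingerprint over a set of config files, hashed in a
# fixed (sorted) order so the same configs always yield the same digest.
# The real sdlab scheme may differ; this is an assumption for illustration.
import hashlib

def config_fingerprint(config_blobs: dict[str, bytes]) -> str:
    h = hashlib.sha256()
    for name in sorted(config_blobs):   # fixed order -> deterministic result
        h.update(name.encode())
        h.update(config_blobs[name])
    return h.hexdigest()

a = config_fingerprint({"rubric.json": b"{}", "canon.json": b"{}"})
b = config_fingerprint({"canon.json": b"{}", "rubric.json": b"{}"})
print(a == b)  # True: insertion order doesn't matter, only content does
```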
Step 3: Build a split
A split assigns snapshot records to train/val/test partitions.
```
sdlab split build --project my-project
```

Two laws govern splitting:
- Subject isolation — records sharing a subject family always land in the same split. Family is determined by `identity.subject_name`, lineage chain, or ID suffix stripping.
- Lane balance — families are grouped by primary lane, shuffled with a seeded PRNG, and assigned to maintain target ratios per lane.
Default profile: 80/10/10 split, seed 42, subject-isolated strategy.
Verify the split:
```
sdlab split audit <split-id> --project my-project
```

The audit checks for subject leakage (must be zero) and reports lane balance across partitions.
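The subject-isolation law can be sketched as follows: shuffle whole families with a seeded PRNG and cut at the target ratios, so no family ever straddles partitions. This minimal version omits the per-lane balancing and assumes families are already resolved to IDs:

```python
# Sketch of subject-isolated splitting with the stated defaults
# (80/10/10 ratios, seed 42). Lane balancing is omitted for brevity.
import random

def assign_families(family_ids: list[str], seed: int = 42,
                    ratios=(0.8, 0.1, 0.1)) -> dict[str, list[str]]:
    """Assign whole families to partitions; records inherit their family's split."""
    rng = random.Random(seed)        # seeded PRNG -> reproducible assignment
    order = sorted(family_ids)       # stable base order before shuffling
    rng.shuffle(order)
    n = len(order)
    cut1 = round(n * ratios[0])
    cut2 = cut1 + round(n * ratios[1])
    # Cutting the shuffled family list means no family can leak across splits.
    return {"train": order[:cut1], "val": order[cut1:cut2], "test": order[cut2:]}

parts = assign_families([f"family-{i}" for i in range(10)])
print({name: len(ids) for name, ids in parts.items()})
# {'train': 8, 'val': 1, 'test': 1}
```

Because assignment happens at the family level, the audit's "zero subject leakage" property holds by construction in this scheme.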
Step 4: Generate a dataset card
```
sdlab card generate --project my-project
```

This produces `dataset-card.md` and `dataset-card.json` in the project root, documenting:
- Selection criteria and eligibility profile
- Split strategy and partition sizes
- Lane balance table
- Quality gates (constitution rules, rubric dimensions)
- Provenance chain
Step 5: Build an export package
An export package is a self-contained dataset directory.
```
sdlab export build --project my-project
```

Output structure:
```
exports/<id>/
  manifest.json       # snapshot ref, split ref, profile, checksums
  metadata.jsonl      # one record per line (full provenance + judgment + canon)
  images/             # symlinks to approved images (use --copy for real copies)
  splits/
    train.jsonl
    val.jsonl
    test.jsonl
  dataset-card.md
  dataset-card.json
  checksums.txt       # BSD format: SHA256 (<path>) = <hash>
  summary.json
```

The manifest stores everything needed to rebuild: snapshot ref, split ref, export profile, and config fingerprint. This is a reproducibility contract.
Use `sdlab export list` to see all packages.
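The checksums file makes package integrity verifiable without sdlab itself. A minimal sketch of a verifier for the BSD-style `SHA256 (<path>) = <hash>` lines (the exact parsing and relative-path layout are assumptions based on the format stated above):

```python
# Sketch: verify an export package against its BSD-style checksums.txt.
import hashlib, re, tempfile
from pathlib import Path

LINE = re.compile(r"^SHA256 \((?P<path>.+)\) = (?P<hash>[0-9a-f]{64})$")

def verify_package(root: Path) -> list[str]:
    """Return the paths whose current SHA-256 no longer matches checksums.txt."""
    mismatched = []
    for line in (root / "checksums.txt").read_text().splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip anything that isn't a BSD-style checksum line
        digest = hashlib.sha256((root / m["path"]).read_bytes()).hexdigest()
        if digest != m["hash"]:
            mismatched.append(m["path"])
    return mismatched

# Tiny demo package: one file plus a matching checksums.txt.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "summary.json").write_bytes(b"{}")
    digest = hashlib.sha256(b"{}").hexdigest()
    (root / "checksums.txt").write_text(f"SHA256 (summary.json) = {digest}\n")
    print(verify_package(root))  # [] -> every file matches its recorded hash
```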
Step 6: Build an eval pack
Eval packs are canon-aware test instruments for future model verification.
```
sdlab eval-pack build --project my-project
```

Four task types:
| Task | Purpose | Records |
|---|---|---|
| Lane coverage | Best approved records per lane (highest pass ratio) | Representative set |
| Forbidden drift | Rejected/borderline records with violated rules | What the model must NOT produce |
| Anchor/gold | Highest pass-ratio records per faction | Gold standard references |
| Subject continuity | Same-subject record groups | Identity consistency testing |
Use `sdlab eval-pack show <id>` to inspect the pack contents.
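The lane-coverage selection, for instance, amounts to a per-lane arg-max over approved records. A sketch with assumed field names (`lane`, `status`, `pass_ratio` are illustrative, not sdlab's schema):

```python
# Sketch: pick the approved record with the highest pass ratio in each lane.
# Field names are illustrative assumptions.

def lane_coverage(records: list[dict]) -> dict[str, dict]:
    best: dict[str, dict] = {}
    for r in records:
        if r["status"] != "approved":
            continue  # only approved records qualify for lane coverage
        lane = r["lane"]
        if lane not in best or r["pass_ratio"] > best[lane]["pass_ratio"]:
            best[lane] = r
    return best

records = [
    {"id": "r1", "lane": "portrait", "status": "approved", "pass_ratio": 0.9},
    {"id": "r2", "lane": "portrait", "status": "approved", "pass_ratio": 0.7},
    {"id": "r3", "lane": "scene",    "status": "rejected", "pass_ratio": 1.0},
    {"id": "r4", "lane": "scene",    "status": "approved", "pass_ratio": 0.8},
]
print({lane: r["id"] for lane, r in lane_coverage(records).items()})
# {'portrait': 'r1', 'scene': 'r4'}
```

Note that `r3` is skipped despite its perfect pass ratio: rejected records belong to the forbidden-drift task instead.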
Step 7: Hand off to repo-dataset (optional)
sdlab defines the dataset. repo-dataset renders it into specialized training formats.
```
# From the export package, produce format-specific outputs
repo-dataset visual generate ./projects/my-project --format trl
repo-dataset visual generate ./projects/my-project --format llava
```

This boundary matters: sdlab decides what is in the dataset and how it is split. repo-dataset never makes inclusion decisions — it only converts the canonical package into downstream formats.
Full example: Star Freight
```
# Clone the repo
git clone https://github.com/mcp-tool-shop-org/style-dataset-lab
cd style-dataset-lab

# Validate the project
sdlab project doctor --project star-freight

# Run the full dataset spine
sdlab snapshot create --project star-freight
sdlab split build --project star-freight
sdlab split audit <split-id> --project star-freight
sdlab export build --project star-freight
sdlab eval-pack build --project star-freight
sdlab card generate --project star-freight
```

Star Freight results: 839/1,182 eligible (71%), 667 train / 88 val / 84 test, zero subject leakage, 417 isolated families, 78 eval records across 4 task types.
Custom profiles
Override defaults by creating profile JSON files in your project:
- `selection-profiles/<name>.json` — eligibility criteria
- `split-profiles/<name>.json` — split ratios, seed, strategy
- `export-profiles/<name>.json` — metadata fields, image strategy
Then pass `--profile <name>` to the relevant command.
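As an illustration, a custom split profile might look like the following. The field names here are assumptions mirroring the documented defaults (ratios, seed, strategy), not a published schema:

```json
{
  "ratios": { "train": 0.8, "val": 0.1, "test": 0.1 },
  "seed": 42,
  "strategy": "subject-isolated"
}
```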