Dataset Workflow

This page walks through one complete dataset production path: from an existing project with curated, canon-bound records to a versioned export package and eval pack.

Before starting, you need a project with:

  • Curated records (at least some approved with per-dimension scores)
  • Canon binding (sdlab bind has been run, producing pass/fail/partial assertions)
  • A valid project config (sdlab project doctor passes)

If you don’t have curated records yet, see Getting Started first.

Before creating a snapshot, audit which records qualify for inclusion.

```sh
sdlab eligibility audit --project my-project
```

This evaluates every record against the default selection profile:

  • Requires human judgment (no unreviewed records)
  • Requires approved status
  • Requires canon binding (at least one assertion)
  • Requires minimum 50% pass ratio

The audit shows your eligibility rate and breaks down exclusion reasons. Near-miss records (1 failing check) are highlighted as improvement opportunities.
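The four checks above amount to a simple per-record predicate. The sketch below illustrates the logic; the field names (`judgment`, `status`, `assertions`) are illustrative assumptions, not sdlab's actual record schema.

```python
# Sketch of the default selection profile. Assumed (hypothetical) record
# fields: "judgment" (human review marker), "status", and "assertions"
# (canon-binding results: "pass" / "fail" / "partial").

def is_eligible(record: dict) -> tuple[bool, list[str]]:
    """Return (eligible, exclusion_reasons) for one record."""
    reasons = []
    if record.get("judgment") is None:          # requires human judgment
        reasons.append("unreviewed")
    if record.get("status") != "approved":      # requires approved status
        reasons.append("not_approved")
    assertions = record.get("assertions", [])
    if not assertions:                          # requires canon binding
        reasons.append("no_canon_binding")
    else:
        passes = sum(1 for a in assertions if a == "pass")
        if passes / len(assertions) < 0.5:      # minimum 50% pass ratio
            reasons.append("low_pass_ratio")
    return (not reasons, reasons)

record = {"judgment": "human", "status": "approved",
          "assertions": ["pass", "pass", "fail"]}
print(is_eligible(record))  # (True, [])
```

A "near-miss" record in this framing is simply one whose `reasons` list has length one.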

A snapshot freezes a deterministic selection of eligible records at a point in time.

```sh
sdlab snapshot create --project my-project
```

What this produces:

  • snapshots/<id>/snapshot.json — manifest with config fingerprint, counts, selection profile
  • snapshots/<id>/included.jsonl — one line per included record with reason trace
  • snapshots/<id>/excluded.jsonl — one line per excluded record with exclusion reasons
  • snapshots/<id>/summary.json — lane and faction distribution

Key property: the same project config and records always produce the same snapshot. The config fingerprint (SHA-256 of all 5 config files) proves this.
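A fingerprint like this can be sketched as a SHA-256 over the config files' contents, read in a fixed order so the result is deterministic. The file list and hashing order here are assumptions, not sdlab's exact implementation.

```python
import hashlib
from pathlib import Path

def config_fingerprint(config_dir: str, filenames: list[str]) -> str:
    """SHA-256 over the concatenated contents of the config files,
    read in sorted order so the same files always hash the same."""
    h = hashlib.sha256()
    for name in sorted(filenames):
        h.update(Path(config_dir, name).read_bytes())
    return h.hexdigest()
```

Because the inputs are sorted before hashing, two runs over identical config files produce identical fingerprints, which is what lets the manifest prove reproducibility.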

Use sdlab snapshot list and sdlab snapshot diff <a> <b> to track changes over time.

A split assigns snapshot records to train/val/test partitions.

```sh
sdlab split build --project my-project
```

Two laws govern splitting:

  1. Subject isolation — records sharing a subject family always land in the same split. Family is determined by identity.subject_name, lineage chain, or ID suffix stripping.
  2. Lane balance — families are grouped by primary lane, shuffled with a seeded PRNG, and assigned to maintain target ratios per lane.

Default profile: 80/10/10 split, seed 42, subject-isolated strategy.
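The two laws can be sketched as follows. The data shapes (family key → record IDs, family key → primary lane) are assumptions for illustration; sdlab's internal structures may differ.

```python
import random
from collections import defaultdict

def build_split(families: dict[str, list[str]], lanes: dict[str, str],
                ratios=(0.8, 0.1, 0.1), seed=42):
    """Assign whole subject families to train/val/test partitions.

    families: family key -> record IDs (law 1: a family never straddles splits)
    lanes:    family key -> primary lane (law 2: balance ratios per lane)
    """
    rng = random.Random(seed)               # seeded PRNG => reproducible order
    by_lane = defaultdict(list)
    for fam in families:
        by_lane[lanes[fam]].append(fam)
    splits = {"train": [], "val": [], "test": []}
    for lane_fams in by_lane.values():
        rng.shuffle(lane_fams)              # shuffle within each lane
        n = len(lane_fams)
        cut1 = round(n * ratios[0])
        cut2 = cut1 + round(n * ratios[1])
        for i, fam in enumerate(lane_fams):
            part = "train" if i < cut1 else "val" if i < cut2 else "test"
            splits[part].extend(families[fam])   # whole family, one partition
    return splits
```

Assigning at the family level, not the record level, is what makes subject leakage impossible by construction.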

Verify the split:

```sh
sdlab split audit <split-id> --project my-project
```

The audit checks for subject leakage (must be zero) and reports lane balance across partitions.
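The leakage check itself is straightforward to sketch: a family "leaks" if any of its records appear in more than one partition. The input shapes here are illustrative assumptions.

```python
def subject_leakage(splits: dict[str, list[str]],
                    family_of: dict[str, str]) -> set[str]:
    """Return families that appear in more than one partition (must be empty)."""
    seen: dict[str, str] = {}
    leaked = set()
    for part, record_ids in splits.items():
        for rid in record_ids:
            fam = family_of[rid]
            if seen.setdefault(fam, part) != part:  # fam already in another part
                leaked.add(fam)
    return leaked
```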

```sh
sdlab card generate --project my-project
```

This produces dataset-card.md and dataset-card.json in the project root, documenting:

  • Selection criteria and eligibility profile
  • Split strategy and partition sizes
  • Lane balance table
  • Quality gates (constitution rules, rubric dimensions)
  • Provenance chain

An export package is a self-contained dataset directory.

```sh
sdlab export build --project my-project
```

Output structure:

```
exports/<id>/
├── manifest.json       # snapshot ref, split ref, profile, checksums
├── metadata.jsonl      # one record per line (full provenance + judgment + canon)
├── images/             # symlinks to approved images (use --copy for real copies)
├── splits/
│   ├── train.jsonl
│   ├── val.jsonl
│   └── test.jsonl
├── dataset-card.md
├── dataset-card.json
├── checksums.txt       # BSD format: SHA256 (<path>) = <hash>
└── summary.json
```

The manifest stores everything needed to rebuild: snapshot ref, split ref, export profile, and config fingerprint. This is a reproducibility contract.
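One half of that contract is checkable by anyone holding the package: recompute each file's digest and compare against checksums.txt. This sketch parses the BSD format stated above; the parsing regex is an assumption based on that format, not sdlab's own verifier.

```python
import hashlib
import re
from pathlib import Path

# BSD-style checksum line: SHA256 (<path>) = <hash>
LINE = re.compile(r"^SHA256 \((?P<path>.+)\) = (?P<hash>[0-9a-f]{64})$")

def verify_checksums(export_dir: str) -> list[str]:
    """Return paths whose current SHA-256 no longer matches checksums.txt."""
    root = Path(export_dir)
    mismatched = []
    for line in (root / "checksums.txt").read_text().splitlines():
        m = LINE.match(line)
        if not m:
            continue
        digest = hashlib.sha256((root / m["path"]).read_bytes()).hexdigest()
        if digest != m["hash"]:
            mismatched.append(m["path"])
    return mismatched
```

An empty return value means the package's files are byte-identical to what was exported.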

Use sdlab export list to see all packages.

Eval packs are canon-aware test instruments for future model verification.

```sh
sdlab eval-pack build --project my-project
```

Four task types:

| Task | Purpose | Records |
| --- | --- | --- |
| Lane coverage | Best approved records per lane (highest pass ratio) | Representative set |
| Forbidden drift | Rejected/borderline records with violated rules | What the model must NOT produce |
| Anchor/gold | Highest pass-ratio records per faction | Gold standard references |
| Subject continuity | Same-subject record groups | Identity consistency testing |

Use sdlab eval-pack show <id> to inspect the pack contents.
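As one example of how a task's records might be selected, the anchor/gold task keeps the highest pass-ratio record per faction. This is a sketch under assumed field names (`faction`, `assertions`), not sdlab's actual selection code.

```python
def anchor_gold(records: list[dict]) -> dict[str, dict]:
    """Pick the highest pass-ratio record per faction (anchor/gold task)."""
    def pass_ratio(r: dict) -> float:
        a = r["assertions"]
        return sum(x == "pass" for x in a) / len(a)

    best: dict[str, dict] = {}
    for r in records:
        faction = r["faction"]
        if faction not in best or pass_ratio(r) > pass_ratio(best[faction]):
            best[faction] = r
    return best
```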

Step 7: Hand off to repo-dataset (optional)

sdlab defines the dataset. repo-dataset renders it into specialized training formats.

```sh
# From the export package, produce format-specific outputs
repo-dataset visual generate ./projects/my-project --format trl
repo-dataset visual generate ./projects/my-project --format llava
```

This boundary matters: sdlab decides what is in the dataset and how it is split. repo-dataset never makes inclusion decisions — it only converts the canonical package into downstream formats.

```sh
# Clone the repo
git clone https://github.com/mcp-tool-shop-org/style-dataset-lab
cd style-dataset-lab
# Validate the project
sdlab project doctor --project star-freight
# Run the full dataset spine
sdlab snapshot create --project star-freight
sdlab split build --project star-freight
sdlab split audit <split-id> --project star-freight
sdlab export build --project star-freight
sdlab eval-pack build --project star-freight
sdlab card generate --project star-freight
```

Star Freight results: 839/1,182 eligible (71%), 667 train / 88 val / 84 test, zero subject leakage, 417 isolated families, 78 eval records across 4 task types.

Override defaults by creating profile JSON files in your project:

  • selection-profiles/<name>.json — eligibility criteria
  • split-profiles/<name>.json — split ratios, seed, strategy
  • export-profiles/<name>.json — metadata fields, image strategy

Then pass --profile <name> to the relevant command.
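For instance, a custom split profile could be written like this. The field names below are hypothetical, chosen to mirror the defaults described above (80/10/10, seed 42, subject-isolated); check sdlab's profile schema for the real keys.

```python
import json
from pathlib import Path

# Hypothetical split-profile fields mirroring the documented defaults;
# the exact schema is defined by sdlab, not by this sketch.
profile = {
    "ratios": {"train": 0.8, "val": 0.1, "test": 0.1},
    "seed": 42,
    "strategy": "subject-isolated",
}

path = Path("projects/my-project/split-profiles/strict.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(profile, indent=2))
# then: sdlab split build --project my-project --profile strict
```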