
Score Format

A VocalScore is the canonical input to the render engine. It is plain JSON, validated at every trust boundary by src/types/scoreSchema.ts (Zod). The same shape is accepted by the REST API (POST /api/render), by the CLI tools (play-score, compare), and by the cockpit’s piano roll.

{
  "formatVersion": "1.0.0",
  "bpm": 120,
  "notes": [ ... ],
  "lyrics": { ... },
  "phonemes": [ ... ],
  "lanes": { ... }
}

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| formatVersion | semver string | optional | Defaults to "1.0.0". The engine rejects scores whose version is not in SUPPORTED_SCORE_VERSIONS. |
| bpm | finite number > 0 | required | Beats per minute. Used by quantization helpers and by phonemizer alignment; the engine does not warp time. |
| notes | array of VocalNote | required | The pitched voices to render. Empty array is legal (renders silence + tail). |
| lyrics | {text, language?} | optional | Source text for the phonemize endpoint. Stored but not synthesized directly; phonemes drive timbre. |
| phonemes | array of events | optional | Pre-computed phoneme timeline. When absent, the engine uses the timbre field on each note. |
| lanes | LanesObject | optional | Time-varying automation curves applied across the score (dynamics, breathiness, timbreMorph). |
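
Because note times are stored in seconds and the engine does not warp time, bpm only matters when musical positions are converted to seconds. The sketch below shows that conversion and a simple grid snap. The names beatsToSeconds and quantizeToGrid are hypothetical and are not the engine's actual quantization helpers.

```typescript
// Hypothetical helpers (not the engine's own): convert beat positions to
// seconds and snap a time to a beat grid.

/** Seconds occupied by `beats` beats at the given tempo. */
function beatsToSeconds(beats: number, bpm: number): number {
  if (!Number.isFinite(bpm) || bpm <= 0) {
    throw new RangeError("bpm must be a finite number > 0");
  }
  return (beats * 60) / bpm;
}

/** Snap a time in seconds to the nearest multiple of `gridBeats` beats. */
function quantizeToGrid(tSec: number, bpm: number, gridBeats: number): number {
  const gridSec = beatsToSeconds(gridBeats, bpm);
  return Math.round(tSec / gridSec) * gridSec;
}

console.log(beatsToSeconds(1, 120));          // 0.5 (one beat at 120 bpm)
console.log(quantizeToGrid(0.52, 120, 0.25)); // 0.5 (nearest sixteenth-note slot)
```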
Each entry in notes is a VocalNote:

{
  "id": "n1",
  "startSec": 0.0,
  "durationSec": 0.5,
  "midi": 60,
  "velocity": 0.8,
  "timbre": "ah",
  "vibrato": { "rateHz": 5.5, "depthCents": 50, "onsetSec": 0.2 },
  "portamentoSec": 0.05,
  "pan": 0.0
}

| Field | Type | Required | Constraints |
| --- | --- | --- | --- |
| id | non-empty string | required | Stable identifier. Used by the cockpit to track edits. |
| startSec | finite number ≥ 0 | required | Seconds from start of score. |
| durationSec | finite number > 0 | required | Length of the note in seconds. |
| midi | finite number, 0..127 | required | MIDI note number (60 = middle C). |
| velocity | finite number, 0..1 | optional | Default 0.8. |
| timbre | non-empty string | optional | Preset-specific timbre id (e.g. "ah", "oo", "ee"). |
| vibrato | object | optional | Fields rateHz, depthCents, onsetSec; all finite and ≥ 0. |
| portamentoSec | finite number ≥ 0 | optional | Pitch-glide duration into this note. |
| pan | finite number, -1..1 | optional | Stereo pan when rendering to 2 channels. Ignored on mono renders. |
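
The constraints in the table can also be checked by hand, for example in a quick script that does not depend on Zod. The sketch below is an illustrative stand-in for the schema in src/types/scoreSchema.ts, not a copy of it; the VocalNote interface, validateNote, and midiToHz are names assumed here. midiToHz uses the standard equal-temperament mapping (MIDI 69 = A4 = 440 Hz), which may or may not be the tuning the engine applies internally.

```typescript
// Illustrative stand-in for the real Zod schema; all names are assumptions.
interface Vibrato { rateHz: number; depthCents: number; onsetSec: number }

interface VocalNote {
  id: string;
  startSec: number;
  durationSec: number;
  midi: number;
  velocity?: number;
  timbre?: string;
  vibrato?: Vibrato;
  portamentoSec?: number;
  pan?: number;
}

const inRange = (x: number, lo: number, hi: number): boolean =>
  Number.isFinite(x) && x >= lo && x <= hi;

/** Returns the list of violated constraints; an empty list means the note is valid. */
function validateNote(n: VocalNote): string[] {
  const errs: string[] = [];
  if (n.id.length === 0) errs.push("id must be non-empty");
  if (!inRange(n.startSec, 0, Infinity)) errs.push("startSec must be finite and >= 0");
  if (!(Number.isFinite(n.durationSec) && n.durationSec > 0)) {
    errs.push("durationSec must be finite and > 0");
  }
  if (!inRange(n.midi, 0, 127)) errs.push("midi must be in 0..127");
  if (n.velocity !== undefined && !inRange(n.velocity, 0, 1)) errs.push("velocity must be in 0..1");
  if (n.timbre !== undefined && n.timbre.length === 0) errs.push("timbre must be non-empty");
  const vib = n.vibrato;
  if (vib !== undefined) {
    for (const key of ["rateHz", "depthCents", "onsetSec"] as const) {
      if (!inRange(vib[key], 0, Infinity)) errs.push(`vibrato.${key} must be finite and >= 0`);
    }
  }
  if (n.portamentoSec !== undefined && !inRange(n.portamentoSec, 0, Infinity)) {
    errs.push("portamentoSec must be finite and >= 0");
  }
  if (n.pan !== undefined && !inRange(n.pan, -1, 1)) errs.push("pan must be in -1..1");
  return errs;
}

/** Equal-temperament frequency for a MIDI note number (A4 = 69 = 440 Hz). */
function midiToHz(midi: number): number {
  return 440 * Math.pow(2, (midi - 69) / 12);
}

const example: VocalNote = { id: "n1", startSec: 0, durationSec: 0.5, midi: 60, pan: 0 };
console.log(validateNote(example));                // []
console.log(validateNote({ ...example, pan: 2 })); // [ 'pan must be in -1..1' ]
console.log(midiToHz(60).toFixed(2));              // 261.63 (middle C)
```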
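The lyrics object holds the source text:

{ "text": "hello world", "language": "en" }

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| text | string | required | Raw lyric text. |
| language | string | optional | ISO language tag. Used by the G2P backend; default "en". |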

Each entry in phonemes is a phoneme event aligned in time. Entries are generated by POST /api/phonemize and persisted into the score so that re-rendering is reproducible.

{
  "tSec": 0.0,
  "durSec": 0.18,
  "phoneme": "AH",
  "kind": "vowel",
  "timbreHint": "ah",
  "strength": 0.85
}

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| tSec | finite number ≥ 0 | required | Start time of this phoneme. |
| durSec | finite number > 0 | required | Phoneme duration. |
| phoneme | non-empty string | required | Phoneme label (engine vocabulary). |
| kind | "vowel" or "consonant" | required | Routing hint. |
| timbreHint | non-empty string | optional | Preferred timbre for vowels; engine may override. |
| strength | finite number, 0..1 | optional | Vowel strength for consonant-to-vowel transitions. |
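
To make the timeline concrete, the sketch below looks up which phoneme event covers a given time. It assumes half-open intervals [tSec, tSec + durSec) and non-overlapping events, neither of which is stated above; PhonemeEvent and phonemeAt are names made up for this illustration and are not part of the engine.

```typescript
// Illustrative lookup over a phoneme timeline; names and interval
// semantics are assumptions, not the engine's documented behaviour.
interface PhonemeEvent {
  tSec: number;
  durSec: number;
  phoneme: string;
  kind: "vowel" | "consonant";
  timbreHint?: string;
  strength?: number;
}

/** Returns the first event whose [tSec, tSec + durSec) interval contains t, if any. */
function phonemeAt(events: PhonemeEvent[], t: number): PhonemeEvent | undefined {
  return events.find((e) => t >= e.tSec && t < e.tSec + e.durSec);
}

const timeline: PhonemeEvent[] = [
  { tSec: 0.0, durSec: 0.05, phoneme: "L", kind: "consonant", strength: 0.7 },
  { tSec: 0.05, durSec: 0.45, phoneme: "AH", kind: "vowel", timbreHint: "ah" },
];

console.log(phonemeAt(timeline, 0.02)?.phoneme); // L
console.log(phonemeAt(timeline, 0.3)?.phoneme);  // AH
console.log(phonemeAt(timeline, 0.9));           // undefined (no event covers this time)
```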

Three automation lanes layer over the rendered audio:

{
  "dynamics": [ { "tSec": 0.0, "value": 0.6 }, { "tSec": 1.2, "value": 1.0 } ],
  "breathiness": [ { "tSec": 0.0, "value": 0.0 }, { "tSec": 0.5, "value": 0.4 } ],
  "timbreMorph": {
    "ah": [ { "tSec": 0.0, "value": 1.0 }, { "tSec": 0.5, "value": 0.0 } ],
    "oo": [ { "tSec": 0.0, "value": 0.0 }, { "tSec": 0.5, "value": 1.0 } ]
  }
}

| Lane | Value range | Effect |
| --- | --- | --- |
| dynamics | any finite number | Multiplier on output gain (linear, not dB). |
| breathiness | 0..1 | Mix-in noise residual amplitude. 0 = pure tonal, 1 = breathy. |
| timbreMorph | per-timbre 0..1 | Cross-fade weights across the preset’s timbres. Weights are normalised internally before mixing. |

Each lane (and each per-timbre entry in timbreMorph) is an array of { tSec, value } breakpoints sorted by tSec; the engine linearly interpolates between them.
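
The interpolation rule can be sketched as follows. Several details are assumptions, since the text above does not specify them: values before the first and after the last breakpoint are held constant, an empty lane yields a caller-supplied fallback, and "normalised" is taken to mean dividing each timbreMorph weight by the sum of all weights. The names Breakpoint, sampleLane, and morphWeights are hypothetical.

```typescript
// Sketch of breakpoint interpolation; edge-case behaviour is assumed,
// not taken from the engine.
interface Breakpoint { tSec: number; value: number }

/** Linearly interpolate a lane (sorted by tSec) at time t. */
function sampleLane(lane: Breakpoint[], t: number, fallback: number): number {
  if (lane.length === 0) return fallback;           // assumption: absent lane = fallback
  if (t <= lane[0].tSec) return lane[0].value;      // assumption: hold first value
  const last = lane[lane.length - 1];
  if (t >= last.tSec) return last.value;            // assumption: hold last value
  // Find the first breakpoint at or after t; the one before it starts the segment.
  let i = 1;
  while (lane[i].tSec < t) i++;
  const a = lane[i - 1];
  const b = lane[i];
  return a.value + ((t - a.tSec) / (b.tSec - a.tSec)) * (b.value - a.value);
}

/** Sample every timbreMorph lane at t and scale the weights to sum to 1. */
function morphWeights(morph: Record<string, Breakpoint[]>, t: number): Record<string, number> {
  const weights: Record<string, number> = {};
  let sum = 0;
  for (const [timbre, lane] of Object.entries(morph)) {
    weights[timbre] = sampleLane(lane, t, 0);
    sum += weights[timbre];
  }
  if (sum === 0) return weights; // nothing to normalise
  for (const timbre of Object.keys(weights)) weights[timbre] /= sum;
  return weights;
}

const dynamics: Breakpoint[] = [{ tSec: 0.0, value: 0.6 }, { tSec: 1.2, value: 1.0 }];
console.log(sampleLane(dynamics, 0.6, 1)); // approximately 0.8, halfway between 0.6 and 1.0

const timbreMorph = {
  ah: [{ tSec: 0.0, value: 1.0 }, { tSec: 0.5, value: 0.0 }],
  oo: [{ tSec: 0.0, value: 0.0 }, { tSec: 0.5, value: 1.0 }],
};
console.log(morphWeights(timbreMorph, 0.25)); // { ah: 0.5, oo: 0.5 }
```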

A two-second score (four beats at 120 bpm) with lyrics, phonemes, and a breathiness automation lane:

{
  "formatVersion": "1.0.0",
  "bpm": 120,
  "notes": [
    { "id": "n1", "startSec": 0.0, "durationSec": 0.5, "midi": 60, "velocity": 0.8, "timbre": "ah" },
    { "id": "n2", "startSec": 0.5, "durationSec": 0.5, "midi": 64, "velocity": 0.8, "timbre": "ee" },
    { "id": "n3", "startSec": 1.0, "durationSec": 1.0, "midi": 67, "velocity": 0.9, "timbre": "oo",
      "vibrato": { "rateHz": 5.5, "depthCents": 50, "onsetSec": 0.2 }, "portamentoSec": 0.04 }
  ],
  "lyrics": { "text": "la la la", "language": "en" },
  "phonemes": [
    { "tSec": 0.00, "durSec": 0.05, "phoneme": "L", "kind": "consonant", "strength": 0.7 },
    { "tSec": 0.05, "durSec": 0.45, "phoneme": "AH", "kind": "vowel", "timbreHint": "ah" },
    { "tSec": 0.55, "durSec": 0.45, "phoneme": "AH", "kind": "vowel", "timbreHint": "ee" },
    { "tSec": 1.05, "durSec": 0.95, "phoneme": "AH", "kind": "vowel", "timbreHint": "oo" }
  ],
  "lanes": {
    "breathiness": [ { "tSec": 0.0, "value": 0.0 }, { "tSec": 1.5, "value": 0.3 } ]
  }
}

Save this file as score.json and render it from the CLI:

npm run play-score -- --score score.json --preset kokoro-am-michael --out song.wav

…or POST it to /api/render:

curl -X POST http://localhost:3000/api/render \
  -H 'Content-Type: application/json' \
  -d @<(jq -n --slurpfile s score.json '{score: $s[0], config: {presetId: "kokoro-am-michael", maxPolyphony: 4, deterministic: "exact", rngSeed: 123}}')
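
The same request can be issued from TypeScript. This is a minimal sketch that mirrors the body built by the jq expression above; it assumes a runtime with a global fetch (for example Node 18 or later). The RenderConfig type, buildRenderRequest, and renderScore are hypothetical names, and because the response format is not described here, the raw Response is returned to the caller.

```typescript
// Hypothetical client for POST /api/render; only the request body shape
// is taken from the curl example. The response format is not assumed.
interface RenderConfig {
  presetId: string;
  maxPolyphony: number;
  deterministic: string;
  rngSeed: number;
}

/** Build the fetch options, wrapping the score and config as { score, config }. */
function buildRenderRequest(score: unknown, config: RenderConfig) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ score, config }),
  };
}

/** Send the score to the render endpoint and return the raw response. */
async function renderScore(baseUrl: string, score: unknown, config: RenderConfig): Promise<Response> {
  const res = await fetch(`${baseUrl}/api/render`, buildRenderRequest(score, config));
  if (!res.ok) {
    throw new Error(`render failed with HTTP ${res.status}`);
  }
  return res;
}

const config: RenderConfig = {
  presetId: "kokoro-am-michael",
  maxPolyphony: 4,
  deterministic: "exact",
  rngSeed: 123,
};
// Usage (requires a running server):
//   const res = await renderScore("http://localhost:3000", scoreObject, config);
console.log(buildRenderRequest({ bpm: 120, notes: [] }, config).body);
```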

formatVersion is gated against SUPPORTED_SCORE_VERSIONS (see src/types/scoreSchema.ts). A score from a future version fails loudly with UNSUPPORTED_SCORE_VERSION rather than having its unknown fields silently dropped. When the schema gains new required fields, the version is bumped and a migration note is added to the CHANGELOG.
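
The version gate might look roughly like the sketch below. The contents of SUPPORTED_SCORE_VERSIONS shown here are assumed (the real list lives in src/types/scoreSchema.ts), and ScoreVersionError and resolveFormatVersion are hypothetical names; only the default of "1.0.0" and the UNSUPPORTED_SCORE_VERSION code come from the text above.

```typescript
// Assumed contents; the authoritative list is in src/types/scoreSchema.ts.
const SUPPORTED_SCORE_VERSIONS: readonly string[] = ["1.0.0"];

/** Hypothetical error type carrying the documented error code. */
class ScoreVersionError extends Error {
  readonly code = "UNSUPPORTED_SCORE_VERSION";
  constructor(version: string) {
    super(`Unsupported score formatVersion "${version}"`);
    this.name = "ScoreVersionError";
  }
}

/** Apply the "1.0.0" default, then reject any version outside the supported list. */
function resolveFormatVersion(score: { formatVersion?: string }): string {
  const version = score.formatVersion ?? "1.0.0";
  if (!SUPPORTED_SCORE_VERSIONS.includes(version)) {
    throw new ScoreVersionError(version);
  }
  return version;
}

console.log(resolveFormatVersion({})); // 1.0.0 (default applied)
try {
  resolveFormatVersion({ formatVersion: "2.0.0" });
} catch (e) {
  if (e instanceof ScoreVersionError) console.log(e.code); // UNSUPPORTED_SCORE_VERSION
}
```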