Score Format
A VocalScore is the canonical input to the render engine. It is plain JSON, validated at every trust boundary by src/types/scoreSchema.ts (Zod). The same shape is accepted by the REST API (POST /api/render), by the CLI tools (play-score, compare), and by the cockpit’s piano roll.
Top-level shape
```json
{
  "formatVersion": "1.0.0",
  "bpm": 120,
  "notes": [ ... ],
  "lyrics": { ... },
  "phonemes": [ ... ],
  "lanes": { ... }
}
```

| Field | Type | Required | Notes |
|---|---|---|---|
| `formatVersion` | semver string | optional | Defaults to `"1.0.0"`. The engine rejects scores whose version is not in `SUPPORTED_SCORE_VERSIONS`. |
| `bpm` | finite number > 0 | required | Beats per minute. Used by quantization helpers and by phonemizer alignment; the engine does not warp time. |
| `notes` | array of `VocalNote` | required | The pitched voices to render. Empty array is legal (renders silence + tail). |
| `lyrics` | `{text, language?}` | optional | Source text for the phonemize endpoint. Stored but not synthesized directly; phonemes drive timbre. |
| `phonemes` | array of events | optional | Pre-computed phoneme timeline. When absent, the engine uses the `timbre` field on each note. |
| `lanes` | `LanesObject` | optional | Time-varying automation curves applied across the score (`dynamics`, `breathiness`, `timbreMorph`). |
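Because note times are stored in seconds and the engine does not warp time, `bpm` only matters when converting between musical positions and seconds. The sketch below shows that arithmetic; `beatsToSec` and `quantizeSec` are hypothetical names chosen for illustration, not the engine's actual quantization helpers.

```typescript
// Hypothetical helpers; the names are illustrative, not part of the engine's API.

/** Convert a position in beats to seconds at a fixed tempo. */
function beatsToSec(beats: number, bpm: number): number {
  if (!Number.isFinite(bpm) || bpm <= 0) {
    throw new RangeError("bpm must be a finite number > 0");
  }
  return (beats * 60) / bpm;
}

/** Snap a time in seconds to the nearest grid line; stepBeats = 0.25 is a quarter of a beat. */
function quantizeSec(tSec: number, bpm: number, stepBeats: number): number {
  const stepSec = beatsToSec(stepBeats, bpm);
  return Math.round(tSec / stepSec) * stepSec;
}

// At 120 bpm one beat lasts 0.5 s, so beat 2 falls at 1.0 s.
console.log(beatsToSec(2, 120)); // 1
// A note placed slightly early at 0.49 s snaps to the 0.5 s grid line.
console.log(quantizeSec(0.49, 120, 0.25)); // 0.5
```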
VocalNote
```json
{
  "id": "n1",
  "startSec": 0.0,
  "durationSec": 0.5,
  "midi": 60,
  "velocity": 0.8,
  "timbre": "ah",
  "vibrato": { "rateHz": 5.5, "depthCents": 50, "onsetSec": 0.2 },
  "portamentoSec": 0.05,
  "pan": 0.0
}
```

| Field | Type | Required | Constraints |
|---|---|---|---|
| `id` | non-empty string | required | Stable identifier. Used by the cockpit to track edits. |
| `startSec` | finite number ≥ 0 | required | Seconds from start of score. |
| `durationSec` | finite number > 0 | required | Length of the note. |
| `midi` | finite number, 0..127 | required | MIDI note number (60 = middle C). |
| `velocity` | finite number, 0..1 | optional | Default 0.8. |
| `timbre` | non-empty string | optional | Preset-specific timbre id (e.g. `"ah"`, `"oo"`, `"ee"`). |
| `vibrato` | object | optional | `rateHz`, `depthCents`, `onsetSec`; all finite, ≥ 0. |
| `portamentoSec` | finite number ≥ 0 | optional | Pitch-glide duration into this note. |
| `pan` | finite number, -1..1 | optional | Stereo pan when rendering to 2 channels. Ignored on mono renders. |
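The constraints in the table can be read as a set of simple range checks. The following dependency-free sketch restates them in TypeScript; it is not the Zod schema from `src/types/scoreSchema.ts`, and `checkNote` and `inRange` are hypothetical names. It also assumes all three `vibrato` fields are present whenever `vibrato` is given, which the table does not state.

```typescript
// Illustrative only: the real validation lives in src/types/scoreSchema.ts (Zod).
interface Vibrato { rateHz: number; depthCents: number; onsetSec: number }
interface VocalNote {
  id: string;
  startSec: number;
  durationSec: number;
  midi: number;
  velocity?: number;
  timbre?: string;
  vibrato?: Vibrato; // assumption: all three fields are set when vibrato is present
  portamentoSec?: number;
  pan?: number;
}

const inRange = (v: number, lo: number, hi: number): boolean =>
  Number.isFinite(v) && v >= lo && v <= hi;

/** Return a list of constraint violations; an empty list means the note passes. */
function checkNote(n: VocalNote): string[] {
  const errs: string[] = [];
  if (n.id.length === 0) errs.push("id must be non-empty");
  if (!inRange(n.startSec, 0, Infinity)) errs.push("startSec must be finite and >= 0");
  if (!(Number.isFinite(n.durationSec) && n.durationSec > 0)) {
    errs.push("durationSec must be finite and > 0");
  }
  if (!inRange(n.midi, 0, 127)) errs.push("midi must be in 0..127");
  if (n.velocity !== undefined && !inRange(n.velocity, 0, 1)) {
    errs.push("velocity must be in 0..1");
  }
  if (n.timbre !== undefined && n.timbre.length === 0) errs.push("timbre must be non-empty");
  if (n.vibrato !== undefined) {
    const parts: [string, number][] = [
      ["rateHz", n.vibrato.rateHz],
      ["depthCents", n.vibrato.depthCents],
      ["onsetSec", n.vibrato.onsetSec],
    ];
    for (const [key, v] of parts) {
      if (!inRange(v, 0, Infinity)) errs.push(`vibrato.${key} must be finite and >= 0`);
    }
  }
  if (n.portamentoSec !== undefined && !inRange(n.portamentoSec, 0, Infinity)) {
    errs.push("portamentoSec must be finite and >= 0");
  }
  if (n.pan !== undefined && !inRange(n.pan, -1, 1)) errs.push("pan must be in -1..1");
  return errs;
}

const valid: VocalNote = { id: "n1", startSec: 0, durationSec: 0.5, midi: 60, velocity: 0.8 };
console.log(checkNote(valid)); // []
console.log(checkNote({ ...valid, midi: 128, pan: 2 }));
// [ 'midi must be in 0..127', 'pan must be in -1..1' ]
```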
lyrics
```json
{ "text": "hello world", "language": "en" }
```

| Field | Type | Required | Notes |
|---|---|---|---|
| `text` | string | required | Raw lyric text. |
| `language` | string | optional | ISO language tag. Used by the G2P backend; default `"en"`. |
phonemes
Each entry is a phoneme event aligned in time. Events are generated by `POST /api/phonemize` and persisted into the score so that re-rendering is reproducible.

```json
{
  "tSec": 0.0,
  "durSec": 0.18,
  "phoneme": "AH",
  "kind": "vowel",
  "timbreHint": "ah",
  "strength": 0.85
}
```

| Field | Type | Required | Notes |
|---|---|---|---|
| `tSec` | finite number ≥ 0 | required | Start time of this phoneme. |
| `durSec` | finite number > 0 | required | Phoneme duration. |
| `phoneme` | non-empty string | required | Phoneme label (engine vocabulary). |
| `kind` | `"vowel" \| "consonant"` | required | Routing hint. |
| `timbreHint` | non-empty string | optional | Preferred timbre for vowels; engine may override. |
| `strength` | finite number, 0..1 | optional | Vowel strength for consonant-to-vowel transitions. |
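To picture how a phoneme timeline and the per-note `timbre` fallback might interact, here is a minimal sketch. It rests on assumptions the page does not confirm: each event is treated as covering the half-open interval `[tSec, tSec + durSec)`, and a vowel's `timbreHint` is taken whenever one covers the query time. The engine may resolve timbre differently (the table notes it may override hints), and `phonemeAt` and `timbreAt` are hypothetical names.

```typescript
interface PhonemeEvent {
  tSec: number;
  durSec: number;
  phoneme: string;
  kind: "vowel" | "consonant";
  timbreHint?: string;
  strength?: number;
}

/** Find the event whose [tSec, tSec + durSec) interval contains t, if any (assumed convention). */
function phonemeAt(events: PhonemeEvent[], t: number): PhonemeEvent | undefined {
  return events.find((e) => t >= e.tSec && t < e.tSec + e.durSec);
}

/**
 * Pick a timbre id at time t: a covering vowel's timbreHint if present,
 * otherwise the note's own timbre. This ordering is an assumption.
 */
function timbreAt(
  t: number,
  noteTimbre: string | undefined,
  phonemes?: PhonemeEvent[],
): string | undefined {
  const ev = phonemes ? phonemeAt(phonemes, t) : undefined;
  if (ev && ev.kind === "vowel" && ev.timbreHint) return ev.timbreHint;
  return noteTimbre;
}

const timeline: PhonemeEvent[] = [
  { tSec: 0.0, durSec: 0.05, phoneme: "L", kind: "consonant", strength: 0.7 },
  { tSec: 0.05, durSec: 0.45, phoneme: "AH", kind: "vowel", timbreHint: "ah" },
];

console.log(timbreAt(0.2, "oo", timeline)); // "ah" (vowel hint covers t)
console.log(timbreAt(0.02, "oo", timeline)); // "oo" (consonant, so the note timbre is used)
console.log(timbreAt(0.2, "oo")); // "oo" (no phonemes at all)
```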
lanes — automation
Three automation lanes layer over the rendered audio:

```json
{
  "dynamics": [
    { "tSec": 0.0, "value": 0.6 },
    { "tSec": 1.2, "value": 1.0 }
  ],
  "breathiness": [
    { "tSec": 0.0, "value": 0.0 },
    { "tSec": 0.5, "value": 0.4 }
  ],
  "timbreMorph": {
    "ah": [ { "tSec": 0.0, "value": 1.0 }, { "tSec": 0.5, "value": 0.0 } ],
    "oo": [ { "tSec": 0.0, "value": 0.0 }, { "tSec": 0.5, "value": 1.0 } ]
  }
}
```

| Lane | Value range | Effect |
|---|---|---|
| `dynamics` | any finite number | Multiplier on output gain (linear, not dB). |
| `breathiness` | 0..1 | Mix-in noise residual amplitude. 0 = pure tonal, 1 = breathy. |
| `timbreMorph` | per-timbre 0..1 | Cross-fade weights across the preset’s timbres. Weights are normalised internally before mixing. |
Each lane is a sorted array of { tSec, value } breakpoints; the engine linearly interpolates between them.
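The interpolation and weight normalisation described above can be sketched as follows. Two details are assumptions rather than documented behaviour: times outside the breakpoint range hold the nearest endpoint value, and an empty lane yields a caller-supplied fallback. `sampleLane` and `normalizeWeights` are hypothetical names.

```typescript
interface Breakpoint { tSec: number; value: number }

/** Linearly interpolate a sorted breakpoint lane at time t. */
function sampleLane(points: Breakpoint[], t: number, fallback = 1): number {
  if (points.length === 0) return fallback; // assumption: empty lane means "no change"
  const first = points[0];
  const last = points[points.length - 1];
  if (t <= first.tSec) return first.value; // assumption: hold the first value before the lane starts
  if (t >= last.tSec) return last.value; // assumption: hold the last value after the lane ends
  for (let i = 1; i < points.length; i++) {
    const a = points[i - 1];
    const b = points[i];
    if (t <= b.tSec) {
      const span = b.tSec - a.tSec;
      if (span === 0) return b.value; // duplicate timestamps: take the later point
      return a.value + (b.value - a.value) * ((t - a.tSec) / span);
    }
  }
  return last.value;
}

/** Scale weights so they sum to 1; all-zero input is returned unchanged. */
function normalizeWeights(w: Record<string, number>): Record<string, number> {
  const total = Object.values(w).reduce((sum, v) => sum + v, 0);
  if (total <= 0) return w;
  const out: Record<string, number> = {};
  for (const [k, v] of Object.entries(w)) out[k] = v / total;
  return out;
}

const breathiness: Breakpoint[] = [{ tSec: 0.0, value: 0.0 }, { tSec: 0.5, value: 0.4 }];
console.log(sampleLane(breathiness, 0.25)); // 0.2, halfway between 0.0 and 0.4

const morph = {
  ah: sampleLane([{ tSec: 0, value: 1 }, { tSec: 0.5, value: 0 }], 0.25),
  oo: sampleLane([{ tSec: 0, value: 0 }, { tSec: 0.5, value: 1 }], 0.25),
};
console.log(normalizeWeights(morph)); // { ah: 0.5, oo: 0.5 }
```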
Complete example
A two-bar score with lyrics, phonemes, and a breathiness automation lane:

```json
{
  "formatVersion": "1.0.0",
  "bpm": 120,
  "notes": [
    { "id": "n1", "startSec": 0.0, "durationSec": 0.5, "midi": 60, "velocity": 0.8, "timbre": "ah" },
    { "id": "n2", "startSec": 0.5, "durationSec": 0.5, "midi": 64, "velocity": 0.8, "timbre": "ee" },
    { "id": "n3", "startSec": 1.0, "durationSec": 1.0, "midi": 67, "velocity": 0.9, "timbre": "oo",
      "vibrato": { "rateHz": 5.5, "depthCents": 50, "onsetSec": 0.2 },
      "portamentoSec": 0.04 }
  ],
  "lyrics": { "text": "la la la", "language": "en" },
  "phonemes": [
    { "tSec": 0.00, "durSec": 0.05, "phoneme": "L", "kind": "consonant", "strength": 0.7 },
    { "tSec": 0.05, "durSec": 0.45, "phoneme": "AH", "kind": "vowel", "timbreHint": "ah" },
    { "tSec": 0.55, "durSec": 0.45, "phoneme": "AH", "kind": "vowel", "timbreHint": "ee" },
    { "tSec": 1.05, "durSec": 0.95, "phoneme": "AH", "kind": "vowel", "timbreHint": "oo" }
  ],
  "lanes": {
    "breathiness": [
      { "tSec": 0.0, "value": 0.0 },
      { "tSec": 1.5, "value": 0.3 }
    ]
  }
}
```

Save this file as `score.json` and render it from the CLI:

```bash
npm run play-score -- --score score.json --preset kokoro-am-michael --out song.wav
```

Or POST it to `/api/render`:

```bash
curl -X POST http://localhost:3000/api/render \
  -H 'Content-Type: application/json' \
  -d @<(jq -n --slurpfile s score.json '{score: $s[0], config: {presetId: "kokoro-am-michael", maxPolyphony: 4, deterministic: "exact", rngSeed: 123}}')
```

Versioning and migrations
`formatVersion` is gated against `SUPPORTED_SCORE_VERSIONS` (see `src/types/scoreSchema.ts`). A score from a future version fails loudly with `UNSUPPORTED_SCORE_VERSION` rather than silently dropping unknown fields. When the schema gains new required fields, the version is bumped and a migration note is added to the CHANGELOG.
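The gate described above might look roughly like the sketch below. The contents of `SUPPORTED_SCORE_VERSIONS` shown here are a guess (only `"1.0.0"` appears on this page), and `ScoreVersionError` and `resolveFormatVersion` are hypothetical names; the real check lives in `src/types/scoreSchema.ts`.

```typescript
// Assumption: the real list is defined in src/types/scoreSchema.ts; these contents are a guess.
const SUPPORTED_SCORE_VERSIONS: readonly string[] = ["1.0.0"];

/** Hypothetical error type carrying the documented error code. */
class ScoreVersionError extends Error {
  readonly code = "UNSUPPORTED_SCORE_VERSION";
  constructor(version: string) {
    super(`Unsupported score formatVersion: ${version}`);
    this.name = "ScoreVersionError";
  }
}

/** Apply the documented default, then reject anything outside the supported list. */
function resolveFormatVersion(score: { formatVersion?: string }): string {
  const version = score.formatVersion ?? "1.0.0";
  if (!SUPPORTED_SCORE_VERSIONS.includes(version)) {
    throw new ScoreVersionError(version);
  }
  return version;
}

console.log(resolveFormatVersion({})); // "1.0.0" (default applied)

try {
  resolveFormatVersion({ formatVersion: "2.0.0" });
} catch (e) {
  console.log((e as ScoreVersionError).code); // "UNSUPPORTED_SCORE_VERSION"
}
```

Failing before any rendering work starts keeps a score written for a newer schema from being rendered with some of its fields ignored.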
See also
- API Reference: REST endpoint that consumes this shape.
- CLI Reference: `play-score`, `compare`, and `inspect` all accept this format.
- Voice Presets: the `timbre` values valid for each preset.