API Reference
Vocal Synth Engine exposes a REST API and two WebSocket endpoints. All endpoints are served from the same Express server.
Authentication
Authentication is optional. When the AUTH_TOKEN environment variable is set, protected endpoints require a bearer token:
```text
Authorization: Bearer <your-token>
```
Endpoints marked “Auth: Yes” in the tables below are protected when AUTH_TOKEN is configured. When it is unset, all endpoints are open.
Tokens can be supplied in two ways:
- Header: `Authorization: Bearer <your-token>`
- Query parameter: `?token=<your-token>` (useful for `<audio>` `src` URLs, where headers cannot be set)
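Both token forms can be produced by a small client helper. The TypeScript sketch below is hypothetical (`buildRequest`, `withTokenQuery` and `ApiRequest` are not part of Vocal Synth Engine); it returns plain objects whose fields can be passed straight to `fetch`, and it uses the phonemize request body shown later on this page as its POST example.

```typescript
// Hypothetical client helpers, not part of Vocal Synth Engine.
// ApiRequest mirrors the fields passed to fetch(url, { method, headers, body }).
interface ApiRequest {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

// Build a request. A JSON body makes it a POST; the bearer header is
// added only when a token is supplied.
function buildRequest(baseUrl: string, path: string, token?: string, jsonBody?: unknown): ApiRequest {
  const headers: Record<string, string> = {};
  if (token) {
    headers["Authorization"] = `Bearer ${token}`;
  }
  if (jsonBody === undefined) {
    return { url: baseUrl + path, method: "GET", headers };
  }
  headers["Content-Type"] = "application/json";
  return { url: baseUrl + path, method: "POST", headers, body: JSON.stringify(jsonBody) };
}

// Append ?token= (or &token= if a query string already exists) for URLs
// used where headers cannot be set, such as an <audio> element's src.
function withTokenQuery(url: string, token?: string): string {
  if (!token) {
    return url;
  }
  const separator = url.indexOf("?") >= 0 ? "&" : "?";
  return `${url}${separator}token=${encodeURIComponent(token)}`;
}
```

For example, `buildRequest("http://localhost:4321", "/api/phonemize", token, { text: "hello world" })` yields a POST against the default port; `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })` would then send it.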
Rate limiting
The render endpoint (/api/render) enforces a per-IP rate limit of 20 requests per minute by default; override it with the RATE_LIMIT_RPM environment variable. Requests over the limit receive HTTP 429.
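Clients that render in batches can pace themselves below the limit. The sketch below is a hypothetical client-side helper (`RenderPacer` is not part of Vocal Synth Engine) that assumes a sliding 60-second window; the server's actual limiter may count requests differently, so a client should still handle 429 responses.

```typescript
// Hypothetical client-side pacer. It remembers when recent render requests
// were sent and reports how long to wait so that no 60-second sliding
// window contains more than `rpm` requests (sliding window is an assumption).
class RenderPacer {
  private readonly sent: number[] = [];

  constructor(private readonly rpm: number = 20, private readonly windowMs: number = 60000) {}

  // Milliseconds to wait before another request may be sent at time `now`.
  waitMs(now: number): number {
    // Forget requests that have left the window (now - windowMs, now].
    while (this.sent.length > 0 && this.sent[0] <= now - this.windowMs) {
      this.sent.shift();
    }
    if (this.sent.length < this.rpm) {
      return 0;
    }
    // Enough old requests must expire to bring the count below rpm;
    // the last of those is at index length - rpm.
    return this.sent[this.sent.length - this.rpm] + this.windowMs - now;
  }

  // Record that a request was sent at time `now`.
  record(now: number): void {
    this.sent.push(now);
  }
}
```

In use, a client would call `waitMs(Date.now())`, sleep for the returned time, send the request, then call `record(Date.now())`. The default of 20 matches RATE_LIMIT_RPM's default and should be changed if the server overrides it.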
REST endpoints
Health
| Path | /api/health |
|---|---|
| Method | GET |
| Auth | No |
| Description | Server health, version string, and uptime in seconds. |
List presets
| Path | /api/presets |
|---|---|
| Method | GET |
| Auth | No |
| Description | Returns all voice presets with timbres, pitch ranges, and metadata. |
Phonemize
| Path | /api/phonemize |
|---|---|
| Method | POST |
| Auth | Yes |
| Description | Convert lyrics text to a sequence of phoneme events. |
Request body:
```json
{ "text": "hello world" }
```
Render
| Path | /api/render |
|---|---|
| Method | POST |
| Auth | Yes |
| Description | Render a VocalScore to WAV. Returns a URL for retrieving the audio. |
Request body:
```json
{
  "score": {
    "bpm": 120,
    "notes": [
      { "id": "n1", "startSec": 0, "durationSec": 1, "midi": 60, "velocity": 0.8 }
    ]
  },
  "config": {
    "presetPath": "presets/kokoro-af-heart",
    "sampleRateHz": 48000,
    "blockSize": 2048,
    "maxPolyphony": 8,
    "rngSeed": 42,
    "defaultTimbre": "AH",
    "deterministic": "exact"
  }
}
```
Response:
```json
{
  "ok": true,
  "durationSec": 1.1,
  "telemetry": { ... },
  "provenance": { ... },
  "audioUrl": "/api/renders/last/audio.wav"
}
```
Score duration is capped at 60 seconds by default; override the cap with the MAX_RENDER_DURATION_SEC environment variable.
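A client can reject over-long scores before spending a rate-limited render request on them. In this TypeScript sketch the types follow the example request body (config fields omitted), and `scoreDurationSec` and `checkScore` are hypothetical helpers. It assumes the server measures score duration as the end time of the latest note and that notes need a non-negative start and positive length; neither rule is stated on this page.

```typescript
// Types follow the example request body above; config fields are omitted.
interface VocalNote {
  id: string;
  startSec: number;
  durationSec: number;
  midi: number;
  velocity: number;
}

interface VocalScore {
  bpm: number;
  notes: VocalNote[];
}

// Assumption: a score's duration is the end time of its latest note.
function scoreDurationSec(score: VocalScore): number {
  let end = 0;
  for (const note of score.notes) {
    end = Math.max(end, note.startSec + note.durationSec);
  }
  return end;
}

// Hypothetical pre-flight check run before POST /api/render.
// maxDurationSec should match the server's MAX_RENDER_DURATION_SEC (60 by default).
function checkScore(score: VocalScore, maxDurationSec: number = 60): string[] {
  const problems: string[] = [];
  for (const note of score.notes) {
    // Assumed note constraints; the server may validate differently.
    if (note.startSec < 0 || note.durationSec <= 0) {
      problems.push(`note ${note.id}: needs startSec >= 0 and durationSec > 0`);
    }
  }
  const total = scoreDurationSec(score);
  if (total > maxDurationSec) {
    problems.push(`score lasts ${total}s, over the ${maxDurationSec}s cap`);
  }
  return problems;
}
```

An empty result means the score passes these client-side checks; the server remains the authority and may still reject it.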
List renders
| Path | /api/renders |
|---|---|
| Method | GET |
| Auth | Yes |
| Description | List all saved renders with metadata. |
Render audio
| Path | /api/renders/:id/audio.wav |
|---|---|
| Method | GET |
| Auth | Yes |
| Description | Download the rendered WAV file. |
Render score
| Path | /api/renders/:id/score |
|---|---|
| Method | GET |
| Auth | Yes |
| Description | Retrieve the original score JSON used for this render. |
Render metadata
| Path | /api/renders/:id/meta |
|---|---|
| Method | GET |
| Auth | Yes |
| Description | Render metadata including preset, polyphony, seed, and timing. |
Render telemetry
| Path | /api/renders/:id/telemetry |
|---|---|
| Method | GET |
| Auth | Yes |
| Description | Performance telemetry: peak dBFS, real-time factor, click count. |
Render provenance
| Path | /api/renders/:id/provenance |
|---|---|
| Method | GET |
| Auth | Yes |
| Description | Provenance data: commit SHA, score hash, WAV hash, engine config. |
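The five per-render endpoints share the `/api/renders/:id/` prefix, and the example render response returns an `audioUrl` of that form (with `last` in the id position). The sketch below derives all five paths from one id; the helper names are hypothetical, and it assumes the segment taken from `audioUrl` also works as `:id` for the other four endpoints.

```typescript
// The per-render resources documented above.
type RenderResource = "audio.wav" | "score" | "meta" | "telemetry" | "provenance";

const RENDER_RESOURCES: RenderResource[] = ["audio.wav", "score", "meta", "telemetry", "provenance"];

// Path for one resource of one render; the id is URL-encoded defensively.
function renderPath(id: string, resource: RenderResource): string {
  return `/api/renders/${encodeURIComponent(id)}/${resource}`;
}

// Recover the :id segment from an audioUrl such as the one POST /api/render
// returns. Returns null for URLs that do not match the pattern.
function renderIdFromAudioUrl(audioUrl: string): string | null {
  const match = /^\/api\/renders\/([^\/]+)\/audio\.wav$/.exec(audioUrl);
  return match ? decodeURIComponent(match[1]) : null;
}

// Every resource path for one render, keyed by resource name.
function renderPaths(id: string): Record<RenderResource, string> {
  const paths = {} as Record<RenderResource, string>;
  for (const resource of RENDER_RESOURCES) {
    paths[resource] = renderPath(id, resource);
  }
  return paths;
}
```

All five endpoints are protected when AUTH_TOKEN is set; for the audio path used as an `<audio>` `src`, append `?token=<your-token>` instead of sending a header.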
WebSocket endpoints
Live mode
| Path | /ws |
|---|---|
| Purpose | Single-user note playback with real-time audio streaming. |
The live WebSocket accepts note-on and note-off messages and streams PCM audio blocks back to the client. The cockpit UI’s Live tab uses this endpoint.
Jam sessions
| Path | /ws/jam |
|---|---|
| Purpose | Multi-user collaborative sessions with recording. |
The jam WebSocket uses a structured JSON protocol. See the Cockpit and Jams page for the full protocol table and session lifecycle.
Error responses
API errors return a JSON object containing `"ok": false` plus error details. Structured errors include a machine-readable code and a human-readable message:
```json
{
  "ok": false,
  "code": "PRESET_NOT_FOUND",
  "message": "Preset 'presets/unknown' not found",
  "available": ["default-voice", "bright-lab", "kokoro-af-heart"]
}
```
General validation errors return a simpler shape:
```json
{ "ok": false, "error": "Missing score" }
```
| Field | Type | Description |
|---|---|---|
| ok | boolean | Always false on errors |
| code | string | Machine-readable error code (present on structured errors) |
| message | string | Human-readable description (present on structured errors) |
| error | string | Error message (present on general validation errors) |
| available | string[] | Valid alternatives, such as preset names on PRESET_NOT_FOUND (present on some structured errors) |
HTTP status codes follow standard conventions: 400 for bad requests, 401 for a missing or invalid token, 404 for unknown resources, 429 when rate-limited, and 500 for server errors.
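Because the two error shapes carry different fields, a client has to check which ones are present. This sketch models them as a TypeScript union and formats either one, together with the HTTP status, into a single message. `describeApiError` and the status labels are hypothetical, and real responses may include fields beyond those documented here.

```typescript
// The two documented error shapes. `available` appears in the
// PRESET_NOT_FOUND example; other extra fields may exist.
interface StructuredError {
  ok: false;
  code: string;
  message: string;
  available?: string[];
}

interface ValidationError {
  ok: false;
  error: string;
}

type ApiError = StructuredError | ValidationError;

// Short labels for the status codes listed above.
const STATUS_LABELS: Record<number, string> = {
  400: "bad request",
  401: "missing or invalid token",
  404: "not found",
  429: "rate limited",
  500: "server error",
};

// Hypothetical helper: one-line description of a failed API call.
function describeApiError(status: number, body: ApiError): string {
  const label = STATUS_LABELS[status] ?? "error";
  let detail: string;
  if ("code" in body) {
    // Structured error: include the code and, if given, the alternatives.
    detail = `${body.code}: ${body.message}`;
    if (body.available && body.available.length > 0) {
      detail += ` (available: ${body.available.join(", ")})`;
    }
  } else {
    // General validation error.
    detail = body.error;
  }
  return `${label} (${status}): ${detail}`;
}
```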
Environment variables
| Variable | Default | Description |
|---|---|---|
| AUTH_TOKEN | (unset) | Optional bearer token to protect API endpoints |
| PORT | 4321 | Server port |
| RATE_LIMIT_RPM | 20 | Max render requests per minute per IP |
| MAX_RENDER_DURATION_SEC | 60 | Maximum allowed score duration in seconds |
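For deployment scripts or test harnesses, the table can be mirrored as a typed config object. This is a hypothetical loader, not the server's own code: it takes the environment as a plain object (in Node you would pass `process.env`) and falls back to the documented default when a value is unset or not a positive number. That fallback policy, and treating an empty AUTH_TOKEN as unset, are assumptions; the server may handle invalid values differently.

```typescript
interface ServerConfig {
  authToken?: string;
  port: number;
  rateLimitRpm: number;
  maxRenderDurationSec: number;
}

// Parse a positive number; fall back to the default when the value is
// missing, empty, or not a positive finite number (assumed policy).
function positiveNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const n = Number(raw);
  return isFinite(n) && n > 0 ? n : fallback;
}

// Hypothetical loader mirroring the environment variable table above.
function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const token = env["AUTH_TOKEN"];
  return {
    // An empty AUTH_TOKEN is treated as unset here (assumption).
    authToken: token ? token : undefined,
    port: positiveNumber(env["PORT"], 4321),
    rateLimitRpm: positiveNumber(env["RATE_LIMIT_RPM"], 20),
    maxRenderDurationSec: positiveNumber(env["MAX_RENDER_DURATION_SEC"], 60),
  };
}
```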