Beginner's Guide

New to vocal synthesis? This page walks through the fundamentals of Vocal Synth Engine, from core concepts to rendering your first singing voice.

Vocal Synth Engine is a TypeScript-based tool that generates singing voices from structured data. Instead of recording a singer, you describe notes, timing, and expression in JSON, and the engine produces a WAV audio file of a synthetic voice performing your composition. It runs as a local server with a browser-based interface for composing, live performance, and multi-user collaboration.

This guide is written for:

  • Game developers who need procedural or dynamic vocal audio without hiring voice talent
  • Music producers experimenting with synthetic vocal textures and additive synthesis
  • Creative coders building interactive audio experiences with WebSocket-based real-time streaming
  • Researchers exploring vocal synthesis, formant modeling, or phoneme-driven timbre control
  • Hobbyists curious about how singing voices can be constructed from harmonic partials and spectral data

No prior audio programming experience is required, but basic comfort with a terminal and JSON will help.

Before you begin, make sure you have:

  • Node.js 18 or later — check with node --version
  • npm (bundled with Node.js)
  • A modern browser (Chrome, Firefox, or Edge) for the cockpit UI
  • Basic terminal skills — you will run commands like npm ci and npm run dev
  • A text editor if you want to write score JSON by hand (VS Code, Sublime, etc.)

Optional but helpful:

  • A MIDI controller for live mode (any USB MIDI keyboard works)
  • curl for testing the REST API from the command line

Vocal synthesis generates singing voices from structured data rather than recorded audio. Vocal Synth Engine uses additive synthesis — it builds sound by stacking sine waves (harmonics) at integer multiples of a fundamental pitch. A spectral envelope shapes the relative loudness of each harmonic, encoding the formant structure that makes an “ah” sound different from an “ee.” A noise component adds breathiness and consonant texture.

The result is a controllable singing voice that responds to pitch, timing, timbre, and expression parameters in real time.
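
The harmonic stacking described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the engine's code: `renderAdditive` and its `harmonicGains` parameter are hypothetical names, the gain array is a crude stand-in for a real spectral envelope, and the noise component is left out.

```typescript
// Sketch only: sums sine partials; the engine's noise component is left out.
function renderAdditive(
  fundamentalHz: number,
  harmonicGains: number[], // hypothetical stand-in for a spectral envelope
  sampleRateHz: number,
  durationSec: number,
): Float32Array {
  const numSamples = Math.round(sampleRateHz * durationSec);
  const out = new Float32Array(numSamples);
  const nyquistHz = sampleRateHz / 2;

  harmonicGains.forEach((gain, index) => {
    const freqHz = fundamentalHz * (index + 1); // harmonic 1 is the fundamental
    if (freqHz >= nyquistHz) return; // skip partials that would alias
    const phaseStep = (2 * Math.PI * freqHz) / sampleRateHz;
    for (let n = 0; n < numSamples; n++) {
      out[n] += gain * Math.sin(phaseStep * n);
    }
  });
  return out;
}

// Half a second near middle C, with gains that fall off by harmonic number.
const exampleGains = [1.0, 0.5, 0.33, 0.25, 0.2];
const exampleTone = renderAdditive(261.63, exampleGains, 48000, 0.5);
```

Changing the relative values in the gain array while keeping the fundamental fixed changes the vowel-like color of the tone without changing its pitch, which is the role the spectral envelope plays in the engine.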

| Term | Meaning |
| --- | --- |
| VocalScore | A JSON object containing BPM, notes, optional lyrics, and automation lanes. This is the input to the render pipeline. |
| VocalNote | A single note in a score: MIDI pitch, start time, duration, velocity, and optional vibrato/portamento/timbre. |
| Voice preset | A frozen analysis artifact containing harmonic magnitudes, spectral envelope, and noise floor data. Each preset represents a different singing voice. |
| Timbre | A sub-voice within a preset. Many presets have three timbres (AH, EE, OO) representing different vowel shapes. The engine blends between them. |
| Phoneme | A unit of speech sound (ARPAbet notation). The engine maps lyrics text to phonemes, which drive timbre selection and consonant synthesis. |
| Block size | The number of audio samples processed at once. Smaller blocks reduce latency; larger blocks improve throughput. |
| Polyphony | The maximum number of simultaneous notes. When exceeded, the engine steals the oldest voice. |
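
The block size trade-off can be put into numbers. The sketch below computes how long one block lasts at a given sample rate, which is a lower bound on the delay a block adds, and how many blocks per second must be produced to keep up in real time. The function names are illustrative, and real end-to-end latency also depends on transport and output buffering that this arithmetic does not cover.

```typescript
// Duration of one processing block: a lower bound on the latency it adds.
function blockDurationMs(blockSize: number, sampleRateHz: number): number {
  return (blockSize / sampleRateHz) * 1000;
}

// Blocks that must be produced per second to keep up in real time.
function blocksPerSecond(blockSize: number, sampleRateHz: number): number {
  return sampleRateHz / blockSize;
}

const exampleRateHz = 48000;
const blockTable = [128, 512, 2048].map((blockSize) => ({
  blockSize,
  ms: blockDurationMs(blockSize, exampleRateHz),
  perSecond: blocksPerSecond(blockSize, exampleRateHz),
}));
// 128  -> about 2.67 ms per block, 375 blocks per second
// 2048 -> about 42.67 ms per block, about 23.4 blocks per second
```

A block of 2048 samples is comfortable for offline rendering, where latency does not matter, while a live setup would usually favor a smaller value.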

Start by cloning the repository, installing dependencies, and starting the dev server:

    git clone https://github.com/mcp-tool-shop-org/vocal-synth-engine.git
    cd vocal-synth-engine
    npm ci
    npm run dev

The server starts at http://localhost:4321. Open this URL in your browser to see the cockpit UI.

The fastest path to hearing a voice is through the cockpit: click and drag on the piano roll to create a note, choose a preset, and click Render. But you can also skip the UI entirely and render from the command line using a JSON score.

A VocalScore is plain JSON, so you can write one by hand in any text editor:

    {
      "bpm": 100,
      "notes": [
        { "id": "n1", "startSec": 0.0, "durationSec": 0.8, "midi": 60, "velocity": 0.9 },
        { "id": "n2", "startSec": 1.0, "durationSec": 0.8, "midi": 64, "velocity": 0.85 },
        { "id": "n3", "startSec": 2.0, "durationSec": 1.2, "midi": 67, "velocity": 0.9 }
      ]
    }

This creates three notes — C4, E4, G4 — forming a simple C major arpeggio. MIDI note 60 is middle C. Velocity ranges from 0 (silent) to 1 (full volume).
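
If MIDI numbers are unfamiliar, the mapping to note names and frequencies is simple arithmetic. The helpers below are illustrative (they are not part of the engine's API) and assume twelve-tone equal temperament with A4 = 440 Hz at MIDI note 69, and the octave convention in which note 60 is C4.

```typescript
const NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"];

// Equal temperament: each semitone multiplies frequency by 2^(1/12).
function midiToHz(midi: number): number {
  return 440 * Math.pow(2, (midi - 69) / 12);
}

// Valid for the standard MIDI range 0..127; octave numbering puts 60 at C4.
function midiToName(midi: number): string {
  const octave = Math.floor(midi / 12) - 1;
  return `${NOTE_NAMES[midi % 12]}${octave}`;
}

// The three notes from the score above.
const arpeggio = [60, 64, 67].map((midi) => ({
  midi,
  name: midiToName(midi), // "C4", "E4", "G4"
  hz: midiToHz(midi), // roughly 261.63, 329.63, 392.00
}));
```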

To render it, POST the score and a config object to the render API:

    curl -X POST http://localhost:4321/api/render \
      -H "Content-Type: application/json" \
      -d '{
        "score": {
          "bpm": 100,
          "notes": [
            { "id": "n1", "startSec": 0, "durationSec": 0.8, "midi": 60, "velocity": 0.9 },
            { "id": "n2", "startSec": 1, "durationSec": 0.8, "midi": 64, "velocity": 0.85 },
            { "id": "n3", "startSec": 2, "durationSec": 1.2, "midi": 67, "velocity": 0.9 }
          ]
        },
        "config": {
          "presetPath": "presets/default-voice",
          "sampleRateHz": 48000,
          "blockSize": 2048,
          "maxPolyphony": 4,
          "rngSeed": 42,
          "defaultTimbre": "default",
          "deterministic": "exact"
        }
      }'

The response includes an audioUrl field. Fetch it to download the WAV.
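
The same request can be made from a script. The sketch below mirrors the curl call and then downloads the WAV. The type and function names (`FetchLike`, `buildRenderBody`, `renderToWav`) are illustrative, only the `audioUrl` field of the response is taken from this guide, and the handling of a relative `audioUrl` is an assumption about what the server returns.

```typescript
interface VocalNote { id: string; startSec: number; durationSec: number; midi: number; velocity: number }
interface VocalScore { bpm: number; notes: VocalNote[] }
// Field names mirror the curl example above.
interface RenderConfig {
  presetPath: string; sampleRateHz: number; blockSize: number; maxPolyphony: number;
  rngSeed: number; defaultTimbre: string; deterministic: string;
}

// Minimal fetch-shaped types so the sketch does not depend on any library.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
  arrayBuffer(): Promise<ArrayBuffer>;
}
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<HttpResponse>;

function buildRenderBody(score: VocalScore, config: RenderConfig): string {
  return JSON.stringify({ score, config });
}

async function renderToWav(
  fetchFn: FetchLike,
  baseUrl: string,
  score: VocalScore,
  config: RenderConfig,
): Promise<ArrayBuffer> {
  const res = await fetchFn(`${baseUrl}/api/render`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildRenderBody(score, config),
  });
  if (!res.ok) throw new Error(`Render failed with HTTP ${res.status}`);

  const payload = (await res.json()) as { audioUrl?: string };
  if (!payload.audioUrl) throw new Error("Response did not include audioUrl");

  // Assumption: a relative audioUrl is rooted at the server origin.
  const wavUrl = payload.audioUrl.startsWith("http")
    ? payload.audioUrl
    : `${baseUrl}${payload.audioUrl}`;
  const wav = await fetchFn(wavUrl);
  if (!wav.ok) throw new Error(`Download failed with HTTP ${wav.status}`);
  return wav.arrayBuffer();
}
```

In Node.js 18 or later the built-in `fetch` can be passed as `fetchFn`, with `http://localhost:4321` as `baseUrl`. Passing the fetch function in as a parameter also makes the helper easy to exercise against a stub while the server is not running.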

Voice presets are the “singer” in the engine. Each preset directory under presets/ contains a manifest.json and binary .f32 files holding the spectral data.

Preset names follow a naming convention:

  • default-voice — a baseline female voice for testing
  • bright-lab — an experimental lab-created voice with bright formants
  • kokoro-af-* — American female voices (Aoede, Heart, Jessica, Sky)
  • kokoro-am-* — American male voices (Eric, Fenrir, Liam, Onyx)
  • kokoro-bf-* — British female voices (Alice, Emma, Isabella)
  • kokoro-bm-* — British male voices (George, Lewis)
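
The Kokoro prefix can be decoded mechanically: the first letter after `kokoro-` is the accent and the second is the voice gender. The helper below is a hypothetical sketch of that convention. The exact preset IDs (for example a lowercase speaker name after the prefix) are an assumption here; check the output of the presets endpoint for the real ones.

```typescript
interface KokoroPresetInfo {
  accent: "American" | "British";
  gender: "female" | "male";
  speaker: string;
}

// Returns null for presets that do not follow the kokoro-<accent><gender>-<speaker> pattern.
function parseKokoroPreset(presetId: string): KokoroPresetInfo | null {
  const match = /^kokoro-([ab])([fm])-(.+)$/.exec(presetId);
  if (!match) return null;
  return {
    accent: match[1] === "a" ? "American" : "British",
    gender: match[2] === "f" ? "female" : "male",
    speaker: match[3],
  };
}

// Hypothetical ID, used only to show the shape of the result.
const parsedExample = parseKokoroPreset("kokoro-bm-george");
// { accent: "British", gender: "male", speaker: "george" }
```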

Most Kokoro presets ship with three timbres named AH, EE, and OO, corresponding to open, front, and rounded vowel shapes. The engine blends between them based on phoneme data or the XY pad in live mode.

To list all available presets and their timbres:

    curl http://localhost:4321/api/presets

Lyrics add vowel-aware timbre blending and consonant synthesis to a render. Add a lyrics field to your score:

    {
      "bpm": 100,
      "notes": [
        { "id": "n1", "startSec": 0, "durationSec": 1.5, "midi": 60, "velocity": 0.9 }
      ],
      "lyrics": {
        "text": "hello",
        "language": "en-US"
      }
    }

The engine runs a grapheme-to-phoneme pipeline: it looks up words in the CMU Pronouncing Dictionary first, then falls back to letter-based rules for unknown words. Vowel phonemes drive timbre blending (AH/EE/OO weights), and consonant phonemes trigger noise bursts shaped by profiles for fricatives, plosives, and nasals.
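
The lookup-then-fallback order can be illustrated with a toy version. Everything here is a stand-in: `SAMPLE_DICTIONARY` holds two entries in place of the full CMU Pronouncing Dictionary (stress digits omitted), and `LETTER_RULES` is a hypothetical one-letter-at-a-time fallback that is far cruder than a real rule set.

```typescript
// Tiny stand-in for the CMU Pronouncing Dictionary (stress digits omitted).
const SAMPLE_DICTIONARY = new Map<string, string[]>([
  ["hello", ["HH", "AH", "L", "OW"]],
  ["world", ["W", "ER", "L", "D"]],
]);

// Hypothetical one-letter fallback rules; real rule sets look at context.
const LETTER_RULES: Record<string, string[]> = {
  a: ["AE"], b: ["B"], c: ["K"], d: ["D"], e: ["EH"], f: ["F"], g: ["G"],
  h: ["HH"], i: ["IH"], j: ["JH"], k: ["K"], l: ["L"], m: ["M"], n: ["N"],
  o: ["AA"], p: ["P"], q: ["K"], r: ["R"], s: ["S"], t: ["T"], u: ["AH"],
  v: ["V"], w: ["W"], x: ["K", "S"], y: ["Y"], z: ["Z"],
};

const ARPABET_VOWELS = [
  "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY", "IH", "IY", "OW", "OY", "UH", "UW",
];

// Vowels would drive timbre blending; everything else is treated as a consonant.
function isVowel(phoneme: string): boolean {
  return ARPABET_VOWELS.indexOf(phoneme) !== -1;
}

function wordToPhonemes(word: string): string[] {
  const key = word.toLowerCase().replace(/[^a-z]/g, ""); // drop punctuation
  const known = SAMPLE_DICTIONARY.get(key);
  if (known) return known; // a dictionary hit always wins
  const guessed: string[] = [];
  for (const letter of key.split("")) {
    guessed.push(...(LETTER_RULES[letter] ?? []));
  }
  return guessed;
}

function phonemizeText(text: string): string[][] {
  return text.split(/\s+/).filter((w) => w.length > 0).map(wordToPhonemes);
}
```

With these tables, `phonemizeText("hello zap")` takes the dictionary path for the first word and the letter-rule path for the second, giving HH AH L OW and Z AE P.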

You can also call the phonemize endpoint directly to preview the conversion:

    curl -X POST http://localhost:4321/api/phonemize \
      -H "Content-Type: application/json" \
      -d '{"text": "hello world"}'

Beyond offline rendering, the engine supports two real-time modes:

Live mode (/ws WebSocket) lets a single user play notes in real time. The cockpit’s Live tab provides a chromatic keyboard, MIDI input, an XY pad for timbre morphing, and recording. Audio streams back over the WebSocket as PCM blocks.

Jam sessions (/ws/jam WebSocket) let multiple users collaborate. One person creates a session as host, and others join as guests. The host controls transport (play, stop, BPM), recording, and quantization. Guests can play notes on any track. All events are captured in an EventTape with participant attribution and can be exported to WAV.

To try live mode, start the dev server and open the cockpit in your browser. Switch to the Live tab and play notes with your keyboard or mouse.

Server will not start — Make sure you have Node.js 18 or later. Run node --version to check. Delete node_modules and run npm ci again if dependencies look corrupted.

No sound from live mode — Verify the WebSocket connection in your browser’s dev tools (Network tab, filter by WS). The /ws endpoint must show a connected state. Check that transport is set to “play” — notes are silent when transport is stopped.

Render returns 404 for preset — The preset path in config must match a directory under presets/. Use the /api/presets endpoint to list valid preset IDs.

Render returns 429 — You have exceeded the rate limit (20 requests per minute by default). Wait and retry, or increase the limit with the RATE_LIMIT_RPM environment variable.
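
For scripted renders, "wait and retry" can be automated. The helper below is a generic sketch rather than anything the engine provides: `withRetryOn429` and its options are hypothetical names, and the 3-second base delay is only a guess derived from 20 requests per minute, since the limiter's exact window behavior is not described here.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Repeats `request` while it answers 429, doubling the wait each time.
// Any other status (success or failure) is returned to the caller unchanged.
async function withRetryOn429<T extends { status: number }>(
  request: () => Promise<T>,
  { maxAttempts = 4, baseDelayMs = 3000, sleep = defaultSleep }: RetryOptions = {},
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    const res = await request();
    if (res.status !== 429 || attempt >= maxAttempts) return res;
    await sleep(baseDelayMs * 2 ** (attempt - 1)); // 3 s, 6 s, 12 s, ...
  }
}
```

Wrap whatever function issues the render request, for example `withRetryOn429(() => fetch(url, init))`. If the last attempt still returns 429, that response is handed back so the caller can decide what to do next.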

Voice stealing clicks — When polyphony is exceeded, the engine steals the oldest voice with a 10ms crossfade. Increase maxPolyphony in your config or reduce the number of overlapping notes.
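
To see why raising maxPolyphony helps, it can be useful to picture the allocation logic. The class below is a simplified model of oldest-voice stealing, not the engine's implementation: `VoicePool` and `crossfadeSamples` are hypothetical names, and the crossfade itself is reduced to a sample count.

```typescript
interface ActiveVoice {
  noteId: string;
  startedAtSample: number;
}

class VoicePool {
  private voices: ActiveVoice[] = [];
  private readonly maxPolyphony: number;

  constructor(maxPolyphony: number) {
    this.maxPolyphony = maxPolyphony;
  }

  // Starts a note and returns the voice that was stolen to make room, if any.
  noteOn(noteId: string, nowSample: number): ActiveVoice | undefined {
    let stolen: ActiveVoice | undefined;
    if (this.voices.length >= this.maxPolyphony) {
      let oldestIndex = 0;
      for (let i = 1; i < this.voices.length; i++) {
        if (this.voices[i].startedAtSample < this.voices[oldestIndex].startedAtSample) {
          oldestIndex = i;
        }
      }
      stolen = this.voices.splice(oldestIndex, 1)[0];
    }
    this.voices.push({ noteId, startedAtSample: nowSample });
    return stolen;
  }

  noteOff(noteId: string): void {
    this.voices = this.voices.filter((voice) => voice.noteId !== noteId);
  }

  activeIds(): string[] {
    return this.voices.map((voice) => voice.noteId);
  }
}

// Length of the fade applied to a stolen voice, in samples.
function crossfadeSamples(sampleRateHz: number, fadeMs: number = 10): number {
  return Math.round((sampleRateHz * fadeMs) / 1000);
}
```

With a pool of two voices, starting a third note cuts off the first one over `crossfadeSamples(48000)`, which is 480 samples. A long sustained note can therefore be cut short by later notes, and that abrupt ending is what tends to be audible.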

Different output on re-render — Make sure deterministic is set to "exact" and the rngSeed value is the same. The engine is fully deterministic when both are consistent.

Once you have rendered your first score, explore these handbook pages to go deeper:

  • Getting Started — install instructions, dev server setup, environment variables
  • Architecture — how the synthesis pipeline works, directory layout, signal flow
  • Voice Presets — browse the 15 bundled presets, understand timbre data, inspect preset manifests
  • Cockpit UI and Jam Sessions — use the piano roll editor, play live, collaborate in multi-user sessions
  • API Reference — REST endpoints, WebSocket protocol, authentication, error handling

| Term | Definition |
| --- | --- |
| Additive synthesis | Building sound by summing sine waves (harmonics) at integer multiples of a fundamental frequency. |
| ADSR / ASR | Envelope shape controlling amplitude over time. This engine uses Attack-Sustain-Release (no separate decay phase). |
| ARPAbet | A phonetic notation system used by the CMU Pronouncing Dictionary. Phonemes like AH, IY, and K represent speech sounds. |
| Block size | The number of audio samples processed per rendering step. Smaller blocks mean lower latency; larger blocks mean better throughput. |
| BPM | Beats per minute: the tempo of a score or jam session. |
| Breathiness | The amount of noise mixed into the voice signal, controlled per note or via automation lanes. |
| Deterministic | Given the same inputs (score, preset, seed), the engine produces identical output every time. |
| EventTape | A recording of all note events during a jam session, with timestamps and participant attribution. |
| Formant | A resonant frequency of the vocal tract that shapes vowel identity. The spectral envelope encodes formant structure. |
| G2P (grapheme-to-phoneme) | Converting written text (like “hello”) into phoneme sequences (like HH-AH-L-OW). |
| Harmonic | A sine wave at an integer multiple of the fundamental frequency. The first harmonic is the fundamental itself. |
| MIDI note number | A standard pitch identifier where 60 = middle C, 69 = A4 (440 Hz). |
| Polyphony | The number of notes that can sound simultaneously. When exceeded, the engine steals the oldest voice. |
| Preset | A frozen analysis artifact containing spectral data for a specific singing voice. |
| RNG seed | A number that initializes the random number generator, ensuring reproducible noise patterns across renders. |
| Spectral envelope | A smooth curve describing the relative amplitude of harmonics, encoding vowel identity and voice character. |
| Timbre | A sub-voice within a preset (e.g., AH, EE, OO) representing a vowel shape. The engine blends between timbres. |
| VocalScore | The JSON input format describing notes, lyrics, tempo, and automation for a render. |
| Voice stealing | When polyphony is exceeded, the engine recycles the oldest active voice with a short crossfade to avoid clicks. |
| WebSocket | A persistent bidirectional connection used for real-time audio streaming and jam session communication. |