
Beginner's Guide

sonic-runtime is a native audio engine that runs as a sidecar process. It handles audio playback, device management, and text-to-speech synthesis on behalf of sonic-core, which communicates with it over newline-delimited JSON on stdin/stdout.

sonic-runtime is not a standalone application. It expects a parent process (sonic-core) to launch it, send commands, and receive responses and events.

Before you begin, make sure you have:

  • .NET 8 SDK installed (download)
  • Windows (the v1 binary targets win-x64)
  • Git for cloning the repository

For synthesis (text-to-speech), you also need:

  • The Kokoro ONNX model file (~326 MB)
  • Voice embedding files (.bin format)
  • eSpeak-NG installed or available in the espeak/ directory

Playback works without any synthesis assets.

Clone the repository and build:

git clone https://github.com/mcp-tool-shop-org/sonic-runtime
cd sonic-runtime
dotnet build

To create a self-contained native executable (no .NET runtime needed on the target machine):

dotnet publish src/SonicRuntime -c Release -r win-x64

The output binary is at src/SonicRuntime/bin/Release/net8.0/win-x64/publish/SonicRuntime.exe.

Every piece of audio in sonic-runtime is tracked by an opaque handle (e.g., h_000000000001). You get a handle when you load an asset or synthesize speech, and use that handle for all subsequent operations (play, pause, stop, seek, volume, pan).

Handles are internal to sonic-runtime. The parent process (sonic-core) maps them to its own playback IDs — clients never see raw handles.
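
For illustration, a parent process could maintain that mapping with a small lookup table. A Python sketch (the `pb-` ID scheme and the `HandleMap` name are invented for this example, not part of sonic-core):

```python
import itertools

class HandleMap:
    """Maps sonic-runtime handles to parent-side playback IDs (scheme is hypothetical)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._by_playback = {}  # playback_id -> runtime handle
        self._by_handle = {}    # runtime handle -> playback_id

    def register(self, handle):
        """Assign a fresh parent-side ID to a runtime handle."""
        playback_id = f"pb-{next(self._counter)}"
        self._by_playback[playback_id] = handle
        self._by_handle[handle] = playback_id
        return playback_id

    def handle_for(self, playback_id):
        """Resolve a parent-side ID back to the raw runtime handle."""
        return self._by_playback[playback_id]

    def release(self, handle):
        """Drop both directions of the mapping once playback ends."""
        playback_id = self._by_handle.pop(handle)
        del self._by_playback[playback_id]

m = HandleMap()
pid = m.register("h_000000000001")
```

Clients of the parent only ever see `pid`; the raw handle stays between sonic-core and sonic-runtime.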

sonic-runtime communicates using ndjson-stdio-v1 — one JSON object per line on stdin (commands) and stdout (responses and events).

A request looks like:

{"id": 1, "method": "version"}

A response echoes the id:

{"id": 1, "result": {"name": "sonic-runtime", "version": "1.0.1", "protocol": "ndjson-stdio-v1"}}

Errors include a structured error object with a code, message, and retryable flag:

{"id": 2, "error": {"code": "invalid_source", "message": "Asset file not found", "retryable": false}}

Events are pushed by the runtime without a prior request and have no id:

{"event": "playback_ended", "data": {"handle": "h_000000000001", "reason": "completed"}}

All diagnostic logs go to stderr. stdout is exclusively for protocol messages.
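
The framing is simple enough to sketch in a few lines. The helpers below are illustrative Python, not part of sonic-runtime: one builds a newline-terminated command for stdin, the other classifies a stdout line as a response or an event.

```python
import json

def encode_request(req_id, method, params=None):
    """Serialize one command as a newline-terminated JSON object for stdin."""
    msg = {"id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg) + "\n"

def decode_line(line):
    """Classify a stdout line: events carry an 'event' key, responses carry an 'id'."""
    obj = json.loads(line)
    kind = "event" if "event" in obj else "response"
    return kind, obj

req = encode_request(1, "version")
kind, obj = decode_line('{"id": 1, "result": {"name": "sonic-runtime"}}')
```

A real parent would write `req` to the child process's stdin and feed each stdout line through `decode_line`, while reading stderr separately for logs.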

sonic-runtime has three main engine components:

  1. PlaybackEngine — loads WAV files into OpenAL buffers, manages sources, handles play/pause/stop/seek/volume/pan/loop. Detects natural completion via 10ms polling.
  2. DeviceManager — enumerates real hardware audio output devices. Each playback can target a specific device.
  3. SynthesisEngine — converts text to speech using Kokoro ONNX. Pipeline: text normalization, eSpeak G2P, ONNX inference, WAV generation.

Here is a typical command sequence. Each line is one JSON object sent to stdin:

→ {"id":1,"method":"version"}
← {"id":1,"result":{"name":"sonic-runtime","version":"1.0.1","protocol":"ndjson-stdio-v1"}}
→ {"id":2,"method":"load_asset","params":{"asset_ref":"file:///C:/sounds/rain.wav"}}
← {"id":2,"result":{"handle":"h_000000000001"}}
→ {"id":3,"method":"play","params":{"handle":"h_000000000001","volume":0.8,"loop":true}}
← {"id":3,"result":null}
→ {"id":4,"method":"set_volume","params":{"handle":"h_000000000001","level":0.5,"fade_ms":500}}
← {"id":4,"result":null}
→ {"id":5,"method":"stop","params":{"handle":"h_000000000001"}}
← {"id":5,"result":null}
← {"event":"playback_ended","data":{"handle":"h_000000000001","reason":"stopped"}}
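
Because events carry no id and can arrive at any point in the stream, a client typically demultiplexes stdout by checking for that field. A minimal sketch, fed canned lines from the trace above instead of a live process:

```python
import json

def dispatch(lines):
    """Split an NDJSON stream into responses keyed by request id, plus a list of events."""
    responses, events = {}, []
    for line in lines:
        obj = json.loads(line)
        if "event" in obj:
            events.append(obj)       # unsolicited: no id field
        else:
            responses[obj["id"]] = obj  # reply to a command we sent
    return responses, events

stdout_lines = [
    '{"id":5,"result":null}',
    '{"event":"playback_ended","data":{"handle":"h_000000000001","reason":"stopped"}}',
]
responses, events = dispatch(stdout_lines)
```

A production client would do this incrementally as lines arrive rather than in a batch, but the id-vs-event split is the same.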

For synthesis:

→ {"id":6,"method":"synthesize","params":{"engine":"kokoro","voice":"af_heart","text":"Hello world","speed":1.0}}
← {"event":"synthesis_started","data":{"handle":"h_000000000002","engine":"kokoro","voice":"af_heart"}}
← {"id":6,"result":{"handle":"h_000000000002","duration_ms":850,"sample_rate":24000,"channels":1}}
← {"event":"synthesis_completed","data":{"handle":"h_000000000002","duration_ms":850,"inference_ms":270}}
→ {"id":7,"method":"play","params":{"handle":"h_000000000002"}}
← {"id":7,"result":null}

Before running synthesis, you can check that all required assets are in place using the validate_assets command:

→ {"id":1,"method":"validate_assets"}
← {"id":1,"result":{"valid":true,"errors":[],"warnings":[],"model":{"available":true,"path":"..."},"voices":{"available":true,"count":10,"voices":["af_heart","am_onyx",...]},"espeak":{"available":true,"path":"..."},"onnx_runtime":{"available":true,"path":"..."},"asset_root":"..."}}

If any asset is missing, the response includes an errors array and each asset check includes an error message and a hint telling you exactly what to do. For example, a missing model returns:

{"error": "kokoro.onnx not found in models/", "hint": "Download kokoro.onnx (FP32, ~326 MB) to C:\\publish\\models"}
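
A launcher can surface those hints before attempting synthesis. A sketch against a hand-written result object (field names follow the examples above; the function name is invented):

```python
def report_missing_assets(result):
    """Collect human-readable problems from a validate_assets result."""
    problems = list(result.get("errors", []))
    for name in ("model", "voices", "espeak", "onnx_runtime"):
        check = result.get(name, {})
        if not check.get("available", False):
            msg = check.get("error", f"{name} unavailable")
            hint = check.get("hint")
            problems.append(f"{msg} (hint: {hint})" if hint else msg)
    return problems

result = {
    "valid": False,
    "errors": ["model missing"],
    "model": {
        "available": False,
        "error": "kokoro.onnx not found in models/",
        "hint": "Download kokoro.onnx (FP32, ~326 MB) to C:\\publish\\models",
    },
    "voices": {"available": True},
    "espeak": {"available": True},
    "onnx_runtime": {"available": True},
}
problems = report_missing_assets(result)
```

Printing each entry in `problems` gives the user a concrete fix for every missing asset instead of a bare failure.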

You can also check the runtime health at any time:

→ {"id":2,"method":"get_health"}
← {"id":2,"result":{"status":"ok","uptime_ms":12345,"active_handles":0,"model_loaded":true,"voices_loaded":10,"espeak_available":true}}

sonic-runtime supports per-playback device routing. You can list available audio output devices and direct any playback to a specific one:

→ {"id":10,"method":"list_devices"}
← {"id":10,"result":[{"device_id":"openal_0_a1b2c3d4","name":"Speakers (Realtek)","kind":"output","is_default":true,"channels":2,"sample_rates":[44100,48000]},{"device_id":"openal_1_e5f6a7b8","name":"Headphones (USB)","kind":"output","is_default":false,"channels":2,"sample_rates":[44100,48000]}]}
→ {"id":11,"method":"play","params":{"handle":"h_000000000001","volume":0.8,"output_device_id":"openal_1_e5f6a7b8"}}
← {"id":11,"result":null}

Device IDs are opaque strings that change when hardware is reconnected. Always call list_devices before routing to a specific device.
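
Given that, a client should resolve devices by name at play time rather than caching IDs. A sketch using sample data shaped like the list_devices response above (the helper name is invented):

```python
def find_output_device(devices, name_substring):
    """Return the id of the first output device whose name contains the substring."""
    for d in devices:
        if d["kind"] == "output" and name_substring.lower() in d["name"].lower():
            return d["device_id"]
    return None  # caller should fall back to the default device

devices = [
    {"device_id": "openal_0_a1b2c3d4", "name": "Speakers (Realtek)",
     "kind": "output", "is_default": True},
    {"device_id": "openal_1_e5f6a7b8", "name": "Headphones (USB)",
     "kind": "output", "is_default": False},
]
device_id = find_output_device(devices, "headphones")

# The resolved id then goes into the play command's params:
play_params = {"handle": "h_000000000001", "volume": 0.8,
               "output_device_id": device_id}
```

If the lookup returns None, omit output_device_id and let the runtime use the default device.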

To run the test suite:

dotnet test

The test suite covers all protocol methods, engine components, event emission, error handling, and version alignment. Tests that require real audio hardware or synthesis assets are isolated and use mock backends.

Common error codes, what they mean, and what to do:

  • invalid_source: the WAV file path does not exist or is not a valid WAV. Check the asset_ref path; only WAV files are supported.
  • playback_not_found: the handle has already been stopped or never existed. Do not reuse handles after stop; load a new asset.
  • device_unavailable: the requested output device is not connected. Call list_devices first; device IDs change when hardware is reconnected.
  • synthesis_model_missing: the models/kokoro.onnx file is not present. Download the model from HuggingFace and place it in models/ next to the binary.
  • synthesis_voice_not_found: the requested voice ID is not loaded. Check available voices with list_voices; voice files must be .bin files in voices/.
  • synthesis_validation_failed: bad input, such as a wrong engine name, empty text, or speed out of range. The engine must be "kokoro", the text must not be empty, and the speed must be between 0.5 and 2.0.

Where to go next:

  • Read the Architecture page to understand how the components fit together
  • Read the Protocol Reference for the complete list of commands and events
  • Read the Security page for the threat model