Architecture
Overview
stdin (JSON) → CommandLoop → CommandDispatcher → Engine components → stdout (JSON)
  │   ├─ PlaybackEngine (OpenAL Soft)
  │   ├─ DeviceManager (hot-plug, enumeration)
  │   ├─ SynthesisEngine (Kokoro ONNX → WAV → playback)
  │   │   ├─ KokoroTokenizer (eSpeak G2P)
  │   │   ├─ KokoroInference (ONNX Runtime)
  │   │   └─ VoiceRegistry (style embeddings)
  │   └─ RuntimeState (handle tracking)
  └─ IEventWriter → stdout (unsolicited events)

All diagnostic output goes to stderr. stdout is exclusively for protocol messages and events.
CommandLoop
The main entry point. Reads newline-delimited JSON from stdin, dispatches to CommandDispatcher, writes responses to stdout. Single-threaded command processing ensures deterministic ordering.
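The read, dispatch, write cycle described above can be sketched as follows. This is an illustrative Python sketch, not the project's implementation: `run_command_loop`, `echo_dispatch`, and the request/response field names (`id`, `cmd`, `ok`, `error`) are hypothetical stand-ins, and the real error shape may differ.

```python
import io
import json


def run_command_loop(stdin, stdout, dispatch):
    """Read one JSON command per line, dispatch it, write one JSON response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between commands
        try:
            request = json.loads(line)
        except json.JSONDecodeError as exc:
            # Hypothetical error shape; the real protocol may differ.
            response = {"ok": False, "error": f"invalid JSON: {exc.msg}"}
        else:
            response = dispatch(request)
        # One compact JSON object per line keeps the stream newline-delimited.
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()


def echo_dispatch(request):
    # Stand-in for CommandDispatcher: acknowledges the request by id.
    return {"id": request.get("id"), "ok": True}


# In-memory streams stand in for the real stdin/stdout.
demo_in = io.StringIO('{"id": 1, "cmd": "ping"}\nnot json\n')
demo_out = io.StringIO()
run_command_loop(demo_in, demo_out, echo_dispatch)
```

Because each command is fully handled before the next line is read, responses come out in the same order the commands went in.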
PlaybackEngine
Manages audio playback through OpenAL Soft (via Silk.NET bindings):
- Load — parse WAV files into OpenAL buffers
- Play — create OpenAL sources, attach buffers, start playback
- Controls — volume, pan, seek, fade, loop
- Completion — 10 ms polling of AL_SOURCE_STATE detects when playback ends
- Device routing — each playback can target a specific output device via separate OpenAL device/context pairs
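Completion detection by polling can be sketched as below. This is a minimal Python illustration under stated assumptions: `get_state` is a hypothetical callable standing in for a query of AL_SOURCE_STATE, the string states and the timeout are invented for the example, and only the 10 ms interval comes from the text above.

```python
import time


def wait_until_stopped(get_state, poll_interval=0.010, timeout=5.0):
    """Poll get_state() until it no longer reports "playing".

    Returns True if playback ended before the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() != "playing":
            return True
        time.sleep(poll_interval)  # 10 ms between state checks
    return False


# Fake source that reports "playing" three times, then "stopped".
states = iter(["playing", "playing", "playing", "stopped"])
finished = wait_until_stopped(lambda: next(states))
```

Polling trades a small detection delay (at most one interval) for simplicity, since the state is only ever read, never pushed.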
DeviceManager
Enumerates audio output devices via ALC_ENUMERATE_ALL_EXT:
- Lists real hardware endpoints (not just “default”)
- Provides device ID ↔ OpenAL device name mapping
- Handles hot-plug events when devices connect/disconnect
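With ALC_ENUMERATE_ALL_EXT, OpenAL reports the device list as a single buffer of null-separated names ending in a double null. The sketch below shows how such a buffer could be split and turned into a two-way mapping. It is illustrative Python only: the `dev_N` ID scheme and the function names are assumptions, not the project's actual ID format.

```python
def parse_device_list(raw: bytes) -> list[str]:
    """Split a null-separated, double-null-terminated device name buffer."""
    names = []
    for chunk in raw.split(b"\x00"):
        if not chunk:
            break  # an empty entry marks the terminating double null
        names.append(chunk.decode("utf-8", errors="replace"))
    return names


def build_device_map(names):
    """Build ID -> name and name -> ID lookups (hypothetical ID scheme)."""
    id_to_name = {f"dev_{i}": name for i, name in enumerate(names)}
    name_to_id = {name: dev_id for dev_id, name in id_to_name.items()}
    return id_to_name, name_to_id


raw = b"OpenAL Soft on Speakers\x00OpenAL Soft on Headphones\x00\x00"
names = parse_device_list(raw)
id_to_name, name_to_id = build_device_map(names)
```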
SynthesisEngine
Text-to-speech pipeline using Kokoro ONNX:
- KokoroTokenizer — normalizes text (currency, titles, decimals), spawns eSpeak-NG for grapheme-to-phoneme conversion, maps IPA output to 178-token vocab, pads for model input
- VoiceRegistry — loads raw float32 .bin voice embeddings at startup (shape: 510 entries × 256 floats per voice). Selects a 256-float style vector based on token count
- KokoroInference — runs the Kokoro ONNX model (lazy-loaded, held for process lifetime). Inputs: token IDs, style vector, speed. Output: float32 PCM at 24 kHz. Thread-safe via lock
- WavWriter — converts float32 samples to 16-bit PCM WAV
- PlaybackEngine — loads the generated WAV into an OpenAL buffer for playback
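Two of the data-handling steps above, picking a style vector and converting float samples to 16-bit PCM WAV, can be sketched as follows. This is illustrative Python, not the project's code. It assumes the voice file is little-endian float32 laid out row by row, that the row index equals the token count (clamped to the valid range), and that the audio is mono; none of these details are confirmed by the text above.

```python
import io
import struct
import wave

STYLE_ENTRIES = 510  # rows per voice, from the shape stated above
STYLE_DIM = 256      # floats per row


def select_style(voice_blob: bytes, token_count: int) -> tuple:
    """Return the 256-float row for token_count from a raw float32 blob."""
    # Assumption: the row index is the token count, clamped to valid rows.
    index = max(0, min(token_count, STYLE_ENTRIES - 1))
    offset = index * STYLE_DIM * 4  # 4 bytes per float32
    return struct.unpack_from(f"<{STYLE_DIM}f", voice_blob, offset)


def float_to_wav_bytes(samples, sample_rate=24000) -> bytes:
    """Convert float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # avoid integer overflow
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # assumption: mono output
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()
```

Clipping before scaling matters because model output can slightly exceed the nominal range, and an unclipped value would not fit in a signed 16-bit integer.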
Performance: approximately 5x real time on CPU. The speed parameter accepts values in the range 0.5 to 2.0.
RuntimeState
Tracks active handles, maps them to playback slots, and manages lifecycle. Handle format: h_XXXXXXXXXXXX (12-digit hex).
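A handle in this format can be generated and validated as in the sketch below. This is illustrative Python: how the 12 hex digits are actually produced (random, counter-based, or otherwise) and whether they are upper or lower case is not stated above, so random generation and case-insensitive validation are assumptions.

```python
import re
import secrets

# "h_" followed by exactly 12 hex digits; either letter case is accepted.
HANDLE_PATTERN = re.compile(r"h_[0-9a-fA-F]{12}")


def new_handle() -> str:
    """Create a handle from 6 random bytes, which encode as 12 hex digits."""
    return "h_" + secrets.token_hex(6)


def is_valid_handle(value: str) -> bool:
    """Check that the whole string matches the handle format."""
    return HANDLE_PATTERN.fullmatch(value) is not None


handle = new_handle()
```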
Event system
Events flow from engine components to the parent process via IEventWriter. The CommandLoopEventWriter bridges engine events to the CommandLoop’s stdout writer using late binding (via Connect()) to break the circular dependency between engines and the loop.
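The late-binding idea can be sketched as follows: the writer is created first, handed to the engines, and only later connected to the output stream that the loop owns. This is an illustrative Python analogue, not the project's CommandLoopEventWriter. The class name, the `event`/`data` envelope, and the choice to drop events emitted before `connect()` are all assumptions.

```python
import io
import json


class LateBoundEventWriter:
    """Event writer that can be constructed before its output sink exists."""

    def __init__(self):
        self._sink = None  # filled in later by connect()

    def connect(self, sink):
        """Attach the output stream once the loop has been created."""
        self._sink = sink

    def write_event(self, name, data):
        """Write one event line; returns False if no sink is connected yet."""
        if self._sink is None:
            return False  # assumption: events before connect() are dropped
        # Hypothetical envelope; note that it carries no "id" field.
        self._sink.write(json.dumps({"event": name, "data": data}) + "\n")
        return True


# 1. Create the writer, 2. build engines that hold it, 3. connect it last.
writer = LateBoundEventWriter()
dropped = writer.write_event("synthesis_started", {"handle": "h_0123456789ab"})
loop_stdout = io.StringIO()  # stands in for the loop's stdout writer
writer.connect(loop_stdout)
written = writer.write_event(
    "playback_ended", {"handle": "h_0123456789ab", "reason": "completed"}
)
```

Deferring the connection means neither side needs the other to exist at construction time, which is what removes the cycle.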
Event types:
- playback_ended — a playback completed naturally or was stopped. Data: handle, reason (“completed” or “stopped”)
- synthesis_started — TTS inference began. Data: handle, engine, voice
- synthesis_completed — TTS inference finished. Data: handle, duration_ms, inference_ms
- synthesis_failed — TTS inference failed. Data: handle, code, message
Events are unsolicited JSON messages on stdout, distinct from request/response pairs (they have no id field). A NullEventWriter is available for test isolation.
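On the parent-process side, the absence of an id field is enough to tell events apart from responses. The sketch below is illustrative Python under stated assumptions: only the "no id means event" rule comes from the text above, while the `event` and `data` field names and the sample lines are hypothetical.

```python
import json


def classify_message(line: str):
    """Return ("response", msg) if the message has an id, else ("event", msg)."""
    message = json.loads(line)
    kind = "response" if "id" in message else "event"
    return kind, message


# Hypothetical stdout stream mixing one response with one unsolicited event.
stream = [
    '{"id": 7, "ok": true}',
    '{"event": "playback_ended", "data": {"handle": "h_0123456789ab", "reason": "completed"}}',
]
kinds = [classify_message(line)[0] for line in stream]
```

A reader built this way can route responses to whichever caller is waiting on that id and hand everything else to an event handler.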