# Architecture
Voice Soundboard uses a Compiler / Graph / Engine architecture that cleanly separates what is said (intent, emotion, style) from how it is rendered (backend, audio format).
## Pipeline
```
compile_request("text", emotion="happy")
        |
ControlGraph (pure data)
        |
engine.synthesize(graph)
        |
PCM audio (numpy array)
```
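The shape of this pipeline can be illustrated with a pure-stdlib miniature. Everything below (`MiniGraph`, `mini_compile`, `mini_synthesize`) is an invented stand-in for illustration, not the real `voice_soundboard` API:

```python
from dataclasses import dataclass

# Stand-in for ControlGraph: pure, immutable data with no behavior.
@dataclass(frozen=True)
class MiniGraph:
    tokens: tuple   # stand-in for TokenEvents
    rate: float     # prosody parameter derived from emotion

def mini_compile(text: str, emotion: str = "neutral") -> MiniGraph:
    """Compiler layer: all feature logic (emotion -> prosody) lives here."""
    rate = {"happy": 1.1, "sad": 0.9}.get(emotion, 1.0)
    return MiniGraph(tokens=tuple(text.split()), rate=rate)

def mini_synthesize(graph: MiniGraph) -> list:
    """Engine layer: knows nothing about emotions, only graphs."""
    # Pretend each token renders to one 'sample' scaled by rate.
    return [graph.rate for _ in graph.tokens]

graph = mini_compile("Hello world!", emotion="happy")
audio = mini_synthesize(graph)
```

Note that the engine stand-in never sees the word "happy": by the time the graph exists, emotion has already been lowered to plain prosody numbers.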
## The Three Layers

### Compiler
Section titled “Compiler”The compiler transforms intent (text + emotion + style) into a ControlGraph. All feature
logic — emotions, styles, SSML parsing, presets — lives here.
```python
from voice_soundboard.compiler import compile_request

graph = compile_request(
    "Hello world!",
    voice="af_bella",
    emotion="happy",
)
```
### ControlGraph

An immutable data structure containing TokenEvents, SpeakerRefs, and prosody parameters.
This is the contract between the compiler and the engine.
```python
from voice_soundboard.graph import GRAPH_VERSION, ControlGraph

assert GRAPH_VERSION == 1
```
### Engine

The engine transforms graphs into PCM audio. It knows nothing about emotions, styles, or
presets — only how to synthesize a ControlGraph through a backend.
```python
from voice_soundboard.engine import load_backend

backend = load_backend("kokoro")
audio = backend.synthesize(graph)
```
## Why This Separation Matters

- Features are free at runtime: emotion and style are already baked into the graph
- Engine stays tiny, fast, testable — it only does synthesis
- Backends are swappable without touching feature logic
- Graph is serializable — compile once, synthesize many times or on different machines
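Because the graph is plain data, "compile once, synthesize many times" falls out naturally. A sketch of the round trip using a simplified stand-in graph (the real `ControlGraph` fields differ; `MiniGraph` here is invented for illustration):

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class MiniGraph:
    version: int
    tokens: tuple
    rate: float

g = MiniGraph(version=1, tokens=("Hello", "world!"), rate=1.1)

# Serialize once on the compiling machine...
payload = json.dumps(asdict(g))

# ...then rehydrate on any machine that has an engine.
data = json.loads(payload)
g2 = MiniGraph(version=data["version"],
               tokens=tuple(data["tokens"]),
               rate=data["rate"])

assert g2 == g  # round-trip preserves the graph
```

The serialized payload can be cached, shipped to another host, or fed to several backends without re-running any feature logic.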
## Architecture Invariants
These rules are enforced in tests and must never be violated:
- **Engine isolation:** `engine/` never imports from `compiler/`. The engine knows nothing about emotions, styles, or presets, only `ControlGraph`s.
- **Voice cloning boundary:** Raw audio never reaches the engine. The compiler extracts speaker embeddings; the engine receives only embedding vectors via `SpeakerRef`.
- **Graph stability:** `GRAPH_VERSION` (currently 1) is bumped on breaking changes to `ControlGraph`. Backends can check this for compatibility.
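The graph-stability invariant suggests a simple compatibility gate in a backend. The check below is an illustrative pattern, not the library's actual API; `check_graph_version` and `IncompatibleGraphError` are invented names:

```python
GRAPH_VERSION = 1  # mirrors voice_soundboard.graph.GRAPH_VERSION

class IncompatibleGraphError(RuntimeError):
    """Raised when a backend receives a graph it cannot interpret."""

def check_graph_version(graph_version: int,
                        supported: int = GRAPH_VERSION) -> None:
    # A bump in GRAPH_VERSION signals a breaking ControlGraph change,
    # so a backend should refuse loudly rather than mis-render.
    if graph_version != supported:
        raise IncompatibleGraphError(
            f"graph version {graph_version} != supported {supported}"
        )

check_graph_version(1)  # ok: versions match
```

Failing fast here is the point of the invariant: a version mismatch means the graph's contract has changed, and silent best-effort synthesis would hide the break.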
## Package Structure
```
voice_soundboard/
├── graph/           # ControlGraph, TokenEvent, SpeakerRef
├── compiler/        # Text -> Graph (all features live here)
│   ├── text.py      # Tokenization, normalization
│   ├── emotion.py   # Emotion -> prosody
│   ├── style.py     # Natural language style
│   └── compile.py   # Main entry point
├── engine/          # Graph -> PCM (no features, just synthesis)
│   └── backends/    # Kokoro, Piper, OpenAI, Coqui, ElevenLabs, Azure, Mock
├── runtime/         # Streaming, timeline, ducking, batch, cache
├── adapters/        # CLI, public API (thin wrappers)
├── streaming/       # Incremental word-by-word synthesis
├── conversation/    # Multi-speaker dialogue
├── cloning/         # Speaker embedding extraction
├── speakers/        # Speaker database
├── realtime/        # Low-latency streaming engine
├── plugins/         # Plugin architecture
├── quality/         # Voice quality metrics
├── formats/         # Audio format conversion, LUFS
├── debug/           # Graph visualization, profiler
├── testing/         # VoiceMock, AudioAssertions
└── accessibility/   # Screen reader integration, captions
```