For Beginners
What Is Voice Soundboard?
Voice Soundboard is a Python text-to-speech library that turns plain text into spoken audio. It wraps multiple TTS backends (Kokoro, Piper, OpenAI, and others) behind a single API so you can generate speech with one line of code instead of wrestling with model files, sample rates, and audio formats yourself.
The library separates what you want to say (text, emotion, style) from how it gets rendered (which TTS engine, which voice model). This means you can switch from a local GPU backend to a cloud API without changing your application code.
Who Is This For?
- Application developers adding voice output to chatbots, assistants, or accessibility features
- Hobbyists building voice-enabled side projects who want results in minutes, not hours
- AI/ML engineers who need a clean TTS abstraction they can slot into an LLM pipeline
No audio engineering background is required. If you can write basic Python and use pip, you are ready.
1. Prerequisites
You need Python 3.10 or newer. Check your version:
```sh
python --version
```

A virtual environment is recommended to keep dependencies isolated:
```sh
python -m venv .venv

# Linux/macOS
source .venv/bin/activate

# Windows
.venv\Scripts\activate
```

No GPU is required. The Piper backend runs on CPU, and the built-in Mock backend needs no models at all.
2. Installation
Install the core library from PyPI:
```sh
pip install voice-soundboard
```

For real audio output, install a backend. Piper is the easiest choice because it runs on CPU with no extra setup beyond downloading a voice model:
```sh
pip install voice-soundboard[piper]
```

If you have an NVIDIA GPU available, Kokoro produces excellent quality:
```sh
pip install voice-soundboard[kokoro]
```

To install everything at once (all backends):
```sh
pip install voice-soundboard[all]
```

3. Your First Synthesis
Create a file called `hello.py`:
```python
from voice_soundboard import VoiceEngine, Config

# Use the mock backend so this runs without model downloads
engine = VoiceEngine(Config(backend="mock"))
result = engine.speak("Hello world! This is my first synthesis.")
print(f"Audio saved to: {result.audio_path}")
print(f"Duration: {result.duration_seconds:.2f}s")
```

Run it:
```sh
python hello.py
```

The Mock backend produces silence but exercises the full pipeline (compiler, graph, engine). This confirms your installation works before downloading real models.
To generate audible speech, switch the backend to `"piper"` or `"kokoro"` after installing the corresponding extra and downloading its models (see the Backends page).
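When moving from the mock backend to a real one, it can be handy to fall back to `"mock"` if the preferred backend fails to initialize (extra not installed, model files missing). The helper below is an illustrative pattern, not part of the library; it takes a factory function so the fallback logic itself stays backend-agnostic.

```python
def make_engine(preferred: str, factory):
    """Build an engine via factory(backend_name), falling back to "mock".

    `factory` would typically wrap the library constructor, e.g.
    lambda name: VoiceEngine(Config(backend=name)).
    """
    try:
        return factory(preferred)
    except Exception:
        # Backend extra not installed or model files missing; degrade to mock.
        return factory("mock")
```

With the library installed, usage would look like `engine = make_engine("piper", lambda name: VoiceEngine(Config(backend=name)))`, so the rest of your code never needs to know which backend actually loaded.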
4. Key Concepts
Voice Soundboard uses a three-stage pipeline:
- Compiler — Turns your text, voice, emotion, and style choices into a `ControlGraph` (pure data). All the “intelligence” lives here.
- ControlGraph — An immutable data structure that describes exactly what to synthesize. It contains token events, speaker references, and prosody parameters.
- Engine — Takes a `ControlGraph` and produces PCM audio through a backend (Kokoro, Piper, OpenAI, etc.). The engine knows nothing about emotions or styles; those have already been compiled away.
This separation means you can compile once and synthesize many times, swap backends without changing application code, and test the compiler in isolation without any TTS models.
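The split can be pictured with a toy model. The classes and functions below are stand-ins invented for illustration, not the library's actual `ControlGraph` API; they only show why an immutable compiled graph lets you synthesize repeatedly with identical results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable, like ControlGraph
class ToyGraph:
    tokens: tuple  # (word, pitch_scale, duration_scale) triples

def toy_compile(text: str, pitch_scale: float = 1.0) -> ToyGraph:
    """Compiler stage: all decisions happen here; the output is pure data."""
    return ToyGraph(tokens=tuple((w, pitch_scale, 1.0) for w in text.split()))

def toy_render(graph: ToyGraph) -> int:
    """Engine stage: renders the graph, knowing nothing about emotions.
    Returns a fake sample count instead of real PCM audio."""
    return sum(int(1000 * dur) for _, _, dur in graph.tokens)

graph = toy_compile("hello brave new world", pitch_scale=1.2)  # compile once
first = toy_render(graph)   # synthesize many times...
second = toy_render(graph)  # ...from the same frozen graph
```

Because the graph is frozen data, "render" can be called any number of times, by any backend, without recompiling.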
5. Choosing Voices, Emotions, and Presets
Section titled “5. Choosing Voices, Emotions, and Presets”Voices
Voice Soundboard ships with 28 built-in Kokoro voice definitions spanning American and British accents in male and female variants. Pass a voice ID to `speak()`:
```python
result = engine.speak("Good morning!", voice="af_bella")   # American female, warm
result = engine.speak("Good morning!", voice="bm_george")  # British male, authoritative
```

List available voices from the CLI:
```sh
voice-soundboard voices
```

Emotions
Emotions adjust pitch, speed, energy, and pause timing at compile time. They are baked into the graph before the engine ever sees them:
```python
result = engine.speak("I passed the exam!", emotion="excited")
result = engine.speak("I'm sorry to hear that.", emotion="sad")
result = engine.speak("Stay alert.", emotion="serious")
```

Available emotions include: neutral, happy, excited, joyful, enthusiastic, calm, peaceful, relaxed, sad, melancholy, angry, frustrated, fearful, anxious, surprised, confident, and serious.
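Because emotion names are plain strings, a typo only surfaces when synthesis fails. A small guard catches mistakes earlier; the set below simply mirrors the list above, and the helper itself is illustrative, not part of the library.

```python
KNOWN_EMOTIONS = {
    "neutral", "happy", "excited", "joyful", "enthusiastic",
    "calm", "peaceful", "relaxed", "sad", "melancholy",
    "angry", "frustrated", "fearful", "anxious", "surprised",
    "confident", "serious",
}

def check_emotion(name: str) -> str:
    """Normalize an emotion name and fail fast on typos."""
    normalized = name.strip().lower()
    if normalized not in KNOWN_EMOTIONS:
        raise ValueError(f"Unknown emotion: {name!r}")
    return normalized
```

For example, `engine.speak(text, emotion=check_emotion(user_input))` rejects `"hapy"` with a clear error instead of a backend failure deep in the pipeline.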
Presets
Presets bundle a voice, speed, and description into a single name:
```python
result = engine.speak("Breaking news!", preset="announcer")
result = engine.speak("Once upon a time...", preset="storyteller")
```

Built-in presets: assistant, narrator, announcer, storyteller, whisper.
Natural Language Styles
Describe the voice quality in plain English and the compiler maps it to prosody adjustments:
```python
result = engine.speak("Welcome back.", style="warmly and cheerfully")
```

6. Using the CLI
The `voice-soundboard` command is installed automatically with the package:
```sh
# Synthesize text
voice-soundboard speak "Hello from the CLI!"

# Specify a preset and speed
voice-soundboard speak "Breaking news!" --preset announcer --speed 1.1

# Discovery commands
voice-soundboard voices    # List all voices
voice-soundboard presets   # List all presets
voice-soundboard emotions  # List all emotions
```

The CLI is a thin wrapper around the same VoiceEngine API you use in Python.
7. Common Mistakes
- Forgetting to download models. Installing `voice-soundboard[kokoro]` gives you the Python bindings, but Kokoro also needs ONNX model files in a `models/` directory. If you see a “model not found” error, revisit the Backends page for download commands.
- Using the wrong backend name. Backend names are lowercase strings: `"kokoro"`, `"piper"`, `"openai"`, `"mock"`. A typo like `"Kokoro"` or `"piper-tts"` will raise an error. Check with `Config(backend="mock")` first to confirm your code works before switching to a real backend.
- Mixing up `voice` and `preset`. A voice is a single speaker identity (`"af_bella"`). A preset bundles a voice with a speed and description (`"announcer"`). If you pass both, the explicit `voice` parameter wins. Pick one or the other to avoid confusion.
- Expecting word-level streaming. `compile_stream()` yields graphs at sentence boundaries, not after every word. If your LLM produces a long paragraph without punctuation, the compiler waits until it sees a sentence-ending character. Add punctuation to your prompts for smoother streaming.
- Running Kokoro without a GPU. Kokoro uses ONNX Runtime and benefits heavily from GPU acceleration. On a CPU-only machine it will work but may be slow. Use Piper instead for fast CPU-only synthesis.
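The sentence-boundary behavior can be pictured with a toy buffer. This is a sketch of the idea only, not the library's actual `compile_stream()` implementation: tokens accumulate until a sentence-ending character arrives, and anything left over is flushed when the stream ends.

```python
SENTENCE_ENDS = ".!?"

def sentence_chunks(token_stream):
    """Group incremental text tokens into complete sentences."""
    buf = ""
    for tok in token_stream:
        buf += tok
        if buf and buf[-1] in SENTENCE_ENDS:
            yield buf.strip()
            buf = ""
    if buf.strip():
        # Flush the trailing fragment once the stream ends.
        yield buf.strip()

chunks = list(sentence_chunks(["Hello", " world", ".", " How", " are", " you", "?"]))
# chunks == ["Hello world.", "How are you?"]
```

Note what happens with unpunctuated input: the whole paragraph sits in the buffer until the stream closes, which is exactly why adding punctuation to LLM prompts makes streaming smoother.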
8. Next Steps
Now that you have a working setup, explore these handbook pages:
- Architecture — Understand the Compiler / Graph / Engine design and why it matters.
- Backends — Set up Kokoro (GPU), Piper (CPU), cloud backends (OpenAI, ElevenLabs, Azure), or Coqui for your use case.
- Streaming — Wire Voice Soundboard into an LLM token stream for real-time voice output.
- Reference — Full emotion list, style guide, presets, migration notes, and security scope.
For bug reports or feature requests, visit the GitHub repository.
9. Glossary
- Backend — A TTS engine that converts a `ControlGraph` into audio. Examples: Kokoro (GPU), Piper (CPU), OpenAI (cloud), Mock (testing).
- Compiler — The first stage of the pipeline. It takes your text, emotion, style, and voice choices and produces a `ControlGraph`. All “intelligence” lives here.
- ControlGraph — An immutable data structure that describes exactly what to synthesize. It contains token events, speaker references, and prosody parameters. This is the contract between the compiler and the engine.
- Emotion — A compile-time concept that adjusts pitch, speed, energy, and pause timing. After compilation, the emotion name is gone; only numeric prosody values remain in the graph.
- Engine — The second stage of the pipeline. It takes a `ControlGraph` and produces PCM audio through a backend. The engine knows nothing about emotions or styles.
- PCM — Pulse-code modulation. The raw digital audio format produced by synthesis, stored as a NumPy float32 array.
- Preset — A named bundle of voice, speed, and description (e.g., `"announcer"` = Michael voice at 1.1x speed).
- Prosody — The rhythm, stress, and intonation of speech. Emotions and styles modify prosody parameters at compile time.
- SpeakerRef — A reference to a speaker identity inside a `ControlGraph`. Can point to a built-in voice ID or a custom speaker embedding.
- Style — A natural-language description of how speech should sound (e.g., “warmly and cheerfully”). The compiler interprets it into prosody adjustments.
- TokenEvent — A single unit in a `ControlGraph` representing a word or pause, with attached prosody modifiers (pitch scale, energy scale, duration scale).