
Engines

Original Voice Soundboard supports three TTS engines. Kokoro ships by default. The others are optional installs that unlock additional capabilities.

Engine           | Install command                           | What it adds
Kokoro (default) | pip install voice-soundboard              | 54+ voices, 19 emotions, voice presets
Chatterbox       | pip install voice-soundboard[chatterbox]  | Paralinguistic tags, 23 languages, emotion exaggeration
F5-TTS           | pip install voice-soundboard[f5tts]       | Zero-shot voice cloning from 3-10 second audio samples

Kokoro is the primary engine. It provides the full voice library, emotion system, and preset infrastructure.

from voice_soundboard import VoiceEngine
engine = VoiceEngine() # Uses Kokoro by default
result = engine.speak("Hello!", voice="af_bella", emotion="happy")

Kokoro runs via ONNX Runtime, so it works on CPU out of the box. A CUDA GPU speeds things up but is not required.

Chatterbox adds paralinguistic tags and multilingual support.

pip install voice-soundboard[chatterbox]

Paralinguistic tags let you embed non-verbal sounds directly in the text:

engine = VoiceEngine(engine="chatterbox")
result = engine.speak("That's hilarious [laugh] I can't stop [laugh]")
result = engine.speak("I don't know [sigh] it's been a long day")
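The tags are plain bracketed tokens inside the input string, so they can be pre-processed with ordinary string tools. A stdlib sketch (the tag set below is illustrative, not the engine's full list):

```python
import re

# Illustrative subset of paralinguistic tags, not the engine's full list.
TAG_RE = re.compile(r"\[(laugh|sigh|cough|gasp)\]")

def split_tags(text):
    """Return (clean_text, tags) for a string containing paralinguistic tags."""
    tags = TAG_RE.findall(text)
    clean = TAG_RE.sub("", text)
    return " ".join(clean.split()), tags

print(split_tags("That's hilarious [laugh] I can't stop [laugh]"))
# → ("That's hilarious I can't stop", ['laugh', 'laugh'])
```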

Chatterbox supports 23 languages and provides an emotion exaggeration parameter for more dramatic delivery.

F5-TTS is a Diffusion Transformer model that enables zero-shot voice cloning. Give it a 3-10 second audio sample of any voice, and it can synthesize new speech in that voice.

pip install voice-soundboard[f5tts]

from voice_soundboard import VoiceEngine

engine = VoiceEngine(engine="f5tts")
result = engine.speak(
    "This is my cloned voice speaking.",
    reference_audio="path/to/sample.wav"
)

Voice cloning requires explicit consent acknowledgment. The library enforces this to prevent misuse.

pip install voice-soundboard # Core (Kokoro engine)
pip install voice-soundboard[mcp] # + MCP server for AI agents
pip install voice-soundboard[chatterbox] # + Paralinguistic tags & 23 languages
pip install voice-soundboard[f5tts] # + F5-TTS voice cloning
pip install voice-soundboard[websocket] # + WebSocket server
pip install voice-soundboard[web] # + Mobile web UI
pip install voice-soundboard[all] # Everything

F5-TTS can clone a voice from a short audio sample (3-10 seconds). The sample should be clean speech with minimal background noise. See the voice cloning example for a working script.
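Before handing a sample to the engine, it is worth validating its length up front. A stdlib sketch using the wave module (the helper name is ours, not part of the library):

```python
import wave

def check_reference_sample(path, min_s=3.0, max_s=10.0):
    """Reject reference audio outside the 3-10 second window F5-TTS expects."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / float(wf.getframerate())
    if not min_s <= duration <= max_s:
        raise ValueError(f"sample is {duration:.1f}s; need {min_s}-{max_s}s of clean speech")
    return duration
```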

Generate conversations with multiple voices, each with their own emotion and style. Script a dialogue and assign per-character voice settings:

dialogue = [
    {"speaker": "af_bella", "emotion": "happy", "text": "Good morning!"},
    {"speaker": "bm_george", "emotion": "calm", "text": "Morning. Coffee?"},
    {"speaker": "af_bella", "emotion": "excited", "text": "Yes please!"},
]
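A script like this can be driven with a simple loop over speak(). This is a sketch of the idea; the bundled multi-speaker example is the library's own implementation:

```python
def render_dialogue(engine, dialogue):
    """Synthesize each scripted line with its character's voice and emotion."""
    results = []
    for line in dialogue:
        results.append(
            engine.speak(line["text"], voice=line["speaker"], emotion=line["emotion"])
        )
    return results
```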

See the multi-speaker example for the full implementation.

Speech Synthesis Markup Language (SSML) gives you fine-grained control over pauses, emphasis, and prosody. The SSML parser uses defusedxml to guard against XML-based attacks such as entity expansion.

<speak>
  <s>Welcome to <emphasis level="strong">Original Voice Soundboard</emphasis>.</s>
  <break time="500ms"/>
  <s>Let's get started.</s>
</speak>
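The parsing side can be sketched with defusedxml, which the library uses; this sketch falls back to the stdlib parser only so it runs anywhere:

```python
try:
    # defusedxml rejects entity-expansion and similar XML attacks.
    from defusedxml.ElementTree import fromstring
except ImportError:
    from xml.etree.ElementTree import fromstring  # stdlib fallback for this sketch

ssml = (
    "<speak>"
    '<s>Welcome to <emphasis level="strong">Original Voice Soundboard</emphasis>.</s>'
    '<break time="500ms"/>'
    "<s>Let's get started.</s>"
    "</speak>"
)

root = fromstring(ssml)
print(root.tag)                        # → speak
print([child.tag for child in root])   # → ['s', 'break', 's']
print(root.find("break").get("time"))  # → 500ms
```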

For real-time bidirectional communication, install the WebSocket extra:

pip install voice-soundboard[websocket]

This enables streaming audio generation over WebSocket connections, suitable for interactive applications with low-latency requirements.
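A client for such a server might look like the sketch below, built on the third-party websockets package. The URI and message schema here are assumptions for illustration; consult the WebSocket server docs for the actual protocol.

```python
import asyncio
import json

async def stream_tts(text, uri="ws://localhost:8765"):
    """Send a synthesis request and collect streamed audio chunks.

    The endpoint URI and JSON schema are hypothetical placeholders.
    """
    import websockets  # provided by the [websocket] extra

    chunks = []
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"text": text}))
        async for message in ws:
            chunks.append(message)  # binary audio frames from the server
    return b"".join(chunks)

# usage (with a running server):
# audio = asyncio.run(stream_tts("Hello over WebSocket!"))
```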