Beginner's Guide

A hands-on walkthrough that takes you from zero to speaking AI agent. No prior MCP experience required.

MCP Voice Soundboard is a text-to-speech server that follows the Model Context Protocol (MCP). It gives AI agents like Claude the ability to synthesize speech, run multi-speaker dialogues, and produce audio files — all through structured tool calls over stdio or HTTP.

Key facts:

  • 48 voices across 9 language variants (American English, British English, Japanese, Mandarin, Spanish, French, Hindi, Italian, Brazilian Portuguese)
  • 5 tools: voice_speak, voice_dialogue, voice_status, voice_interrupt, voice_inner_monologue
  • Swappable backends: Mock (built-in, zero setup), HTTP proxy, or Python bridge (Kokoro, Coqui, etc.)
  • No telemetry: all processing is local, no data leaves your machine except to the TTS backend you configure

Before you start, make sure you have:

  • Node.js 20 or later — check with node --version
  • An MCP client — Claude Desktop, Cursor, or any MCP-compatible tool
  • A terminal — for running commands and checking output

No GPU or Python installation is needed for the default mock backend. The mock backend generates silent WAV files so you can test the full tool flow without a real TTS engine.
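A silent WAV is just a standard RIFF header followed by zeroed samples, so you can sanity-check the mock backend's output by inspecting the first bytes of the file. The sketch below builds such a file in memory and validates it; the byte layout is the standard WAV/RIFF format, not anything specific to this server, and the function names are invented here for illustration.

```typescript
// Build a minimal silent 16-bit mono WAV in memory (roughly what a mock
// TTS backend might emit), then validate its RIFF/WAVE header.
function makeSilentWav(seconds: number, sampleRate = 22050): Buffer {
  const dataSize = seconds * sampleRate * 2; // 16-bit mono samples
  const buf = Buffer.alloc(44 + dataSize);   // data stays zeroed = silence
  buf.write("RIFF", 0);
  buf.writeUInt32LE(36 + dataSize, 4);       // chunk size
  buf.write("WAVE", 8);
  buf.write("fmt ", 12);
  buf.writeUInt32LE(16, 16);                 // fmt subchunk size (PCM)
  buf.writeUInt16LE(1, 20);                  // audio format: PCM
  buf.writeUInt16LE(1, 22);                  // channels: mono
  buf.writeUInt32LE(sampleRate, 24);
  buf.writeUInt32LE(sampleRate * 2, 28);     // byte rate
  buf.writeUInt16LE(2, 32);                  // block align
  buf.writeUInt16LE(16, 34);                 // bits per sample
  buf.write("data", 36);
  buf.writeUInt32LE(dataSize, 40);
  return buf;
}

// True if the buffer starts with a RIFF/WAVE header.
function isWav(buf: Buffer): boolean {
  return buf.toString("ascii", 0, 4) === "RIFF" &&
         buf.toString("ascii", 8, 12) === "WAVE";
}
```

Reading the first 12 bytes of the file the server returns is enough to confirm you got a WAV rather than an error payload.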

The fastest way to run the server:

npx @mcptoolshop/voice-soundboard-mcp

This downloads and starts the server in stdio mode. To install globally instead:

npm install -g @mcptoolshop/voice-soundboard-mcp
voice-soundboard-mcp

Add the following to your claude_desktop_config.json:

{
  "mcpServers": {
    "voice-soundboard": {
      "command": "npx",
      "args": ["-y", "@mcptoolshop/voice-soundboard-mcp"]
    }
  }
}

Restart Claude Desktop. The voice tools will appear in the tool list.

Once connected, try each of the main tools:

voice_speak({ text: "Hello, world!" })

This uses the default voice (bm_george, a British male) at normal speed. The server returns a file path to the generated audio.
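Under the hood, every tool invocation is a JSON-RPC 2.0 tools/call request, sent as a line of newline-delimited JSON when the server runs in stdio mode. A sketch of the envelope a client sends for the call above (the id value is arbitrary; the envelope follows the MCP spec, while the arguments are this server's):

```typescript
// Shape of the JSON-RPC request an MCP client sends for a tool call.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "voice_speak",
    arguments: { text: "Hello, world!" },
  },
};

// One line of newline-delimited JSON on the server's stdin.
const wire = JSON.stringify(request);
```

Your MCP client builds this for you; seeing the raw shape mainly helps when debugging with a terminal.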

voice_speak({ text: "Breaking news from the studio.", voice: "announcer" })

The announcer preset uses am_eric at 1.1x speed for a bold broadcast style. Five presets are available: narrator, announcer, whisper, storyteller, assistant.

voice_dialogue({
  script: "Alice: Welcome to the show!\nBob: Thanks for having me.",
  cast: { "Alice": "bf_alice", "Bob": "bm_george" }
})

Each speaker gets their own voice. Omit the cast parameter and speakers are auto-assigned from the voice roster.
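The cast parameter is just a speaker-name to voice-ID mapping. Auto-assignment could plausibly work like the round-robin sketch below; the server's actual assignment order is not documented here, and parseDialogue is a name invented for illustration.

```typescript
interface Line { speaker: string; voice: string; text: string }

// Parse "Name: text" lines and assign voices: honor the cast mapping when
// given, otherwise hand out roster voices round-robin per new speaker.
function parseDialogue(
  script: string,
  cast: Record<string, string> = {},
  roster: string[] = ["bf_alice", "bm_george"], // documented voice IDs
): Line[] {
  const assigned: Record<string, string> = { ...cast };
  let next = 0;
  const lines: Line[] = [];
  for (const raw of script.split("\n")) {
    const m = raw.match(/^([^:]+):\s*(.*)$/);
    if (!m) continue; // skip lines without a "Speaker:" prefix
    const speaker = m[1].trim();
    if (!(speaker in assigned)) {
      assigned[speaker] = roster[next++ % roster.length];
    }
    lines.push({ speaker, voice: assigned[speaker], text: m[2] });
  }
  return lines;
}
```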

voice_status()

Returns available voices, active presets, backend health, and configuration details.

Every voice ID follows the pattern {accent}{gender}_{name}. The prefix determines the language automatically:

Prefix      Language
af_ / am_   English (American)
bf_ / bm_   English (British)
jf_ / jm_   Japanese
zf_ / zm_   Mandarin Chinese
ef_ / em_   Spanish
ff_         French
hf_ / hm_   Hindi
if_ / im_   Italian
pf_ / pm_   Brazilian Portuguese
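The prefix rule can be applied mechanically: the first letter selects the language, and the second letter (f or m) the gender. A small lookup sketch (describeVoice is a name invented here for illustration):

```typescript
// First letter of a voice ID -> language, per the prefix table.
const PREFIX_LANGUAGE: Record<string, string> = {
  a: "English (American)",
  b: "English (British)",
  j: "Japanese",
  z: "Mandarin Chinese",
  e: "Spanish",
  f: "French",
  h: "Hindi",
  i: "Italian",
  p: "Brazilian Portuguese",
};

// Decode a voice ID like "bm_george" into language and gender.
function describeVoice(id: string): { language: string; gender: string } {
  const language = PREFIX_LANGUAGE[id[0]] ?? "unknown";
  const gender = id[1] === "f" ? "female" : "male";
  return { language, gender };
}
```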

Wrap text in curly-brace tags to change the voice and speed per segment:

{joy}Great news!{/joy} {calm}Let me explain.{/calm}

Available emotions: neutral, serious, friendly, professional, calm, joy, urgent, whisper. Each maps to a specific voice and speed. Untagged text defaults to neutral.
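A parser for the curly-brace tags might look like the sketch below. It only does the segment splitting; the emotion-to-voice mapping itself is internal to the server and not reproduced here.

```typescript
interface Segment { emotion: string; text: string }

// Split "{joy}Great news!{/joy} plain text" into tagged segments.
// Untagged runs default to "neutral", matching the documented behavior.
function parseEmotions(input: string): Segment[] {
  const segments: Segment[] = [];
  const re = /\{(\w+)\}(.*?)\{\/\1\}/g; // matching open/close tag pairs
  let last = 0;
  for (const m of input.matchAll(re)) {
    const before = input.slice(last, m.index).trim();
    if (before) segments.push({ emotion: "neutral", text: before });
    segments.push({ emotion: m[1], text: m[2] });
    last = m.index! + m[0].length;
  }
  const tail = input.slice(last).trim();
  if (tail) segments.push({ emotion: "neutral", text: tail });
  return segments;
}
```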

Add inline sound effects with square-bracket tags:

[ding] Build complete! [chime] All tests passed.

Six tags available: [ding], [chime], [whoosh], [tada], [pop], [click]. Enable with sfx: true in voice_speak.
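Splitting text into an ordered stream of speech runs and sound-effect events is straightforward; a sketch (the event shape here is invented for illustration, not the server's internal representation):

```typescript
// The six documented inline SFX tags.
const SFX_TAGS = new Set(["ding", "chime", "whoosh", "tada", "pop", "click"]);

type SoundEvent =
  | { kind: "sfx"; name: string }
  | { kind: "speech"; text: string };

// Split text into speech and sound-effect events, preserving order.
// Unknown [bracket] tokens are treated as ordinary speech.
function parseSfx(input: string): SoundEvent[] {
  const events: SoundEvent[] = [];
  for (const part of input.split(/(\[\w+\])/)) {
    const m = part.match(/^\[(\w+)\]$/);
    if (m && SFX_TAGS.has(m[1])) events.push({ kind: "sfx", name: m[1] });
    else if (part.trim()) events.push({ kind: "speech", text: part.trim() });
  }
  return events;
}
```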

For finer timing and emphasis control:

<break time="500ms"/> <emphasis level="strong">important</emphasis>
<prosody rate="slow">Take your time.</prosody>
Goal                    How
Read code aloud         voice_speak with narrator preset
Announce build results  voice_speak with announcer preset and SFX
Explain a concept       voice_speak with storyteller preset
Quick notification      voice_speak with short text
Conversational demo     voice_dialogue with a cast mapping
Check engine health     voice_status (no arguments)
Comedy delivery         voice_speak with mood: "dry" (or roast, chaotic, cheeky, cynic, zoomer)

Backend not available: if you are using the default mock backend, this should not happen. For HTTP or Python backends, check that the backend URL or Python environment is reachable.

Unknown voice: the voice ID you passed is not in the 48-voice approved roster. Call voice_status to see the full list of valid voice IDs.

Text too long: the limit is 12,000 characters per request. Split long content into multiple calls.
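Long content can be chunked client-side before sending. A sketch that greedily packs whole sentences under the documented 12,000-character cap (splitForTts is a name invented here; a single sentence longer than the cap is left intact):

```typescript
const MAX_CHARS = 12_000; // documented per-request limit

// Greedily pack whole sentences into chunks no longer than `max`.
function splitForTts(text: string, max = MAX_CHARS): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const s of sentences) {
    if (current && current.length + s.length > max) {
      chunks.push(current.trim());
      current = "";
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk can then be passed to voice_speak as a separate call.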

Rate limited: the server applies per-tool rate limiting as a safety guardrail against runaway synthesis. Wait a moment and retry.

Too many concurrent requests: the concurrency semaphore is full. The default limit is 3 concurrent requests; wait for an active request to finish, or raise the limit with --max-concurrent.
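This guardrail behaves like a counting semaphore: at most N acquisitions may be outstanding at once, and further requests queue. A minimal sketch of the pattern, not the server's actual implementation:

```typescript
// A counting semaphore: at most `limit` acquisitions outstanding at once.
class Semaphore {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // At capacity: park until a release wakes us.
    await new Promise<void>(resolve => this.waiters.push(resolve));
    this.active++;
  }

  release(): void {
    this.active--;
    this.waiters.shift()?.(); // wake one queued waiter, if any
  }
}
```

Wrapping each synthesis request in acquire/release caps in-flight work at the configured limit.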

Ambient tool disabled: the voice_inner_monologue tool requires explicit opt-in. Start the server with --ambient or set VOICE_SOUNDBOARD_AMBIENT_ENABLED=1.

Audio files disappearing: the server auto-cleans files older than 240 minutes by default. Adjust with --retention-minutes=<n>, or set it to 0 to keep files indefinitely.
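Age-based cleanup amounts to comparing file modification times against a cutoff. A sketch of the pattern (cleanOldFiles is a name invented here, not the server's code):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Delete files in `dir` whose modification time is older than `minutes`.
// A retention of 0 (or less) means "keep forever", as the flag documents.
function cleanOldFiles(dir: string, minutes: number): string[] {
  if (minutes <= 0) return [];
  const cutoff = Date.now() - minutes * 60_000;
  const removed: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    const full = path.join(dir, name);
    const stat = fs.statSync(full);
    if (stat.isFile() && stat.mtimeMs < cutoff) {
      fs.unlinkSync(full);
      removed.push(full);
    }
  }
  return removed;
}
```

If you keep retention at 0, remember to clear the output directory yourself occasionally.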