Architecture
Core concepts
Chapter
A Chapter is a chunk of source text with a title and index. After compilation, it contains a list of utterances. After rendering, it carries an audio path and duration.
Utterance
An Utterance is the atomic unit of speech output:
- speaker: who is speaking (e.g., narrator, Alice)
- text: what to speak
- type: narration vs dialogue
- emotion: optional style hint (e.g., angry, whisper)
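As a rough illustration, the two models could be sketched as Python dataclasses. The Utterance field names come from the list above; the Chapter field names (`raw_text`, `audio_path`, `duration_seconds`) and all defaults are assumptions, not the actual definitions in models.py.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Utterance:
    speaker: str                   # e.g. "narrator" or "Alice"
    text: str                      # what to speak
    type: str = "narration"        # "narration" or "dialogue"
    emotion: Optional[str] = None  # optional style hint, e.g. "angry"


@dataclass
class Chapter:
    index: int
    title: str
    raw_text: str                                    # hypothetical field name
    utterances: List[Utterance] = field(default_factory=list)  # set by compilation
    audio_path: Optional[str] = None                 # set by rendering
    duration_seconds: Optional[float] = None         # set by rendering


# A chapter after compilation but before rendering.
chapter = Chapter(index=0, title="Chapter 1", raw_text='"Hello," said Alice.')
chapter.utterances = [
    Utterance(speaker="Alice", text="Hello,", type="dialogue"),
    Utterance(speaker="narrator", text="said Alice."),
]
```

The optional fields start empty so that one object can move through compilation and rendering, picking up utterances first and audio metadata later.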
Casting table
The CastingTable maps speaker names to voice IDs (from voice-soundboard), plus optional default emotions. It also defines how unknown speakers are handled and which fallback voice to use.
Mapping priority:
- Exact character entry in the casting table
- Default narrator (if set and cast)
- Fallback voice (fallback_voice_id) as the last resort
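The priority order above can be sketched as a small lookup function. This is a minimal sketch: the function name, the plain-dict casting table, and the voice IDs are hypothetical, and it ignores whatever unknown-speaker policy the real CastingTable applies.

```python
from typing import Dict, Optional


def resolve_voice(
    speaker: str,
    cast: Dict[str, str],
    narrator: Optional[str] = "narrator",
    fallback_voice_id: str = "fallback_voice",
) -> str:
    """Return a voice ID for `speaker` using the three-step priority."""
    # 1. Exact character entry in the casting table.
    if speaker in cast:
        return cast[speaker]
    # 2. Default narrator, but only if one is set and has been cast.
    if narrator is not None and narrator in cast:
        return cast[narrator]
    # 3. Fallback voice as the last resort.
    return fallback_voice_id


# Hypothetical voice IDs, for illustration only.
cast = {"narrator": "voice_a", "Alice": "voice_b"}
print(resolve_voice("Alice", cast))  # exact entry
print(resolve_voice("Bob", cast))    # uncast speaker falls through to the narrator
```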
Project
A Project (saved as .audiobooker) is the persistent state: metadata, chapters, compiled utterances, casting table, config settings, and render cache pointers.
Pipeline
    Source (EPUB/TXT/MD) -> Parser -> Chapters
      -> Dialogue Detection -> Speaker Attribution -> Utterances
      -> Review Export/Import (optional)
      -> TTS (voice-soundboard) -> Chapter WAVs
      -> FFmpeg assembly -> M4B (or M4A fallback when chapters cannot embed)

Key design principles:
- Stage separation — parsing, attribution, review, rendering, and assembly are cleanly separated.
- Human control point — the review workflow allows correcting mistakes before you spend compute.
- Resumability — chapter WAVs and a manifest allow continuing after failure without re-rendering.
Repository structure
    audiobooker/
      parser/      EPUB/TXT parsing
      casting/     dialogue detection, attribution, voice registry, voice suggester
      language/    language profiles (en, extensible)
      nlp/         BookNLP adapter, emotion inference, speaker resolver
      renderer/    TTS engine, cache manifest, FFmpeg assembly, progress, failure reports
      review.py    review format import/export
      models.py    core data models (Chapter, Utterance, Character, CastingTable, ProjectConfig)
      project.py   AudiobookProject orchestration
      cli.py       CLI entrypoint

Rendering and cache
Rendering has two phases:
- Synthesis — utterances to chapter WAV files (via voice-soundboard)
- Assembly — chapter WAVs to final M4B (via FFmpeg)
Cache structure
    <project_dir>/.audiobooker/cache/
      chapters/
        chapter_0000.wav
        chapter_0001.wav
      manifests/
        render_v1.json

A manifest entry tracks validity by hashing the chapter text, casting table inputs, and audio-affecting render parameters. If the hashes match and the WAV exists, the chapter is skipped on rerun.
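One way such a validity check could work is sketched below. The function names, the `"hash"` key, and the use of SHA-256 over canonical JSON are assumptions for illustration; the actual layout of render_v1.json is not documented here.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def chapter_fingerprint(text: str, casting: dict, render_params: dict) -> str:
    """Hash everything that affects the rendered audio for one chapter."""
    payload = json.dumps(
        {"text": text, "casting": casting, "render": render_params},
        sort_keys=True,       # stable key order so equal inputs hash equally
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_skip(entry: Optional[dict], fingerprint: str, wav_path: Path) -> bool:
    """A chapter is skipped only if its hash matches and the WAV exists."""
    return (
        entry is not None
        and entry.get("hash") == fingerprint  # hypothetical manifest key
        and wav_path.exists()
    )
```

Checking both conditions matters: a matching hash with a missing WAV must re-render, and so must an existing WAV whose inputs have changed.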
Resume behavior
- Completed chapters are not re-rendered
- If a render fails at chapter 15, chapters 0-14 remain usable
- Rerun audiobooker render to continue
- Use --no-resume to force full re-render or --clean-cache to wipe cache
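The resume behavior can be pictured as a loop that records a chapter as complete only after it renders successfully. This is a simplified sketch: `render_chapters`, the in-memory `completed` set, and the `resume` parameter are hypothetical stand-ins for the manifest-backed logic and the --no-resume flag.

```python
from typing import Callable, Iterable, List, Set


def render_chapters(
    indices: Iterable[int],
    render_one: Callable[[int], None],
    completed: Set[int],
    resume: bool = True,
) -> List[int]:
    """Render chapters in order, returning the indices rendered in this run."""
    if not resume:
        completed.clear()  # force a full re-render
    rendered = []
    for index in indices:
        if index in completed:
            continue  # finished in an earlier run
        render_one(index)     # may raise; earlier chapters stay completed
        completed.add(index)  # recorded only after a successful render
        rendered.append(index)
    return rendered
```

If `render_one` raises partway through, the exception propagates, but `completed` still holds every chapter that finished, so the next call picks up at the failed chapter.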
Voice suggestion engine
The cast-suggest command uses a scoring engine that ranks voices per speaker based on:
- Gender inference — pronoun and name cues in sample utterances hint at a preferred gender
- Role match — narrator roles prefer calm/neutral voices tagged for narration; dialogue roles prefer expressive voices
- Diversity — voices already assigned to other speakers receive a penalty to avoid reuse
- Curated metadata — voices with known style notes (calm, powerful, warm) receive small bonuses for appropriate roles
Each suggestion includes a human-readable reason string explaining why it scored the way it did. The top suggestion is applied when using cast-apply --auto.
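To make the four signals concrete, here is a toy scorer that combines them and builds a reason string. The `Voice` class, tag names, and numeric weights are invented for illustration; the real engine's weights and metadata are not specified in this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Voice:
    voice_id: str
    gender: Optional[str] = None
    tags: Set[str] = field(default_factory=set)


def score_voice(
    voice: Voice, preferred_gender: Optional[str], role: str, already_used: Set[str]
) -> Tuple[float, str]:
    """Return (score, reason) for one voice; all weights are illustrative."""
    score, reasons = 0.0, []
    if preferred_gender and voice.gender == preferred_gender:
        score += 2.0  # gender inference
        reasons.append(f"gender matches ({preferred_gender})")
    wanted_tag = "narration" if role == "narrator" else "expressive"
    if wanted_tag in voice.tags:
        score += 1.5  # role match
        reasons.append(f"tagged {wanted_tag}")
    if voice.voice_id in already_used:
        score -= 1.0  # diversity penalty
        reasons.append("already assigned to another speaker")
    if role == "narrator" and "calm" in voice.tags:
        score += 0.5  # curated metadata bonus
        reasons.append("curated style note: calm")
    return score, "; ".join(reasons) or "no strong signal"


def suggest(
    voices: List[Voice], preferred_gender: Optional[str], role: str, already_used: Set[str]
) -> List[Tuple[str, float, str]]:
    """Rank voices for one speaker, best first."""
    scored = [
        (v.voice_id, *score_voice(v, preferred_gender, role, already_used))
        for v in voices
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Returning the reason alongside the score keeps each suggestion explainable, and the first element of the ranked list is what an automatic apply step would pick.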
Language profiles
Audiobooker separates language-specific heuristics into a LanguageProfile that controls:
- Supported quote characters (straight quotes, smart quotes)
- Speaker attribution verbs and patterns
- Blacklist words to avoid false-positive names
- Valid-name heuristics
- Chapter heading patterns
The default profile is en (English). Choose language at project creation with --lang.
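A profile of this kind might be modeled as an immutable bundle of language data plus a few helpers. Every field name, word list, and pattern below is an assumption chosen for illustration, not the contents of the shipped en profile.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class LanguageProfile:
    code: str
    quote_pairs: Tuple[Tuple[str, str], ...]  # (open, close) quote characters
    attribution_verbs: FrozenSet[str]
    name_blacklist: FrozenSet[str]            # words that are never names
    chapter_heading: re.Pattern

    def is_valid_name(self, word: str) -> bool:
        """Capitalized, alphabetic, and not blacklisted."""
        return (
            word[:1].isupper()
            and word.isalpha()
            and word.lower() not in self.name_blacklist
        )

    def speaker_after_quote(self, trailing: str) -> Optional[str]:
        """Find '<verb> <Name>' in the text that follows a closing quote."""
        words = [w.strip(".,;:!?") for w in trailing.split()]
        for verb, name in zip(words, words[1:]):
            if verb.lower() in self.attribution_verbs and self.is_valid_name(name):
                return name
        return None


# Illustrative English profile; the word lists are deliberately tiny.
EN = LanguageProfile(
    code="en",
    quote_pairs=(('"', '"'), ("\u201c", "\u201d")),
    attribution_verbs=frozenset({"said", "asked", "replied", "whispered"}),
    name_blacklist=frozenset({"the", "then", "but", "he", "she"}),
    chapter_heading=re.compile(r"^(chapter|part)\s+\w+", re.IGNORECASE),
)
```

Keeping the data separate from the detection code means a new language only needs a new profile instance, not new parsing logic.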
Inline overrides
You can override speaker and emotion inline in the source text:

    [Alice|angry] "How dare you!"
    [Bob|whisper] "Shh."
    [narrator] The room fell silent.

Inline overrides are parsed during compilation and take precedence for that specific line.
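A tag in this format can be recognized with a single regular expression. The sketch below is one possible parser, not the project's own; the names `OVERRIDE` and `parse_override` are hypothetical, and it assumes the tag always appears at the start of the line.

```python
import re
from typing import Optional, Tuple

# Matches a leading "[speaker]" or "[speaker|emotion]" tag, then the spoken text.
OVERRIDE = re.compile(
    r"^\[(?P<speaker>[^\]|]+)(?:\|(?P<emotion>[^\]]+))?\]\s*(?P<text>.*)$"
)


def parse_override(line: str) -> Tuple[Optional[str], Optional[str], str]:
    """Return (speaker, emotion, text); speaker and emotion are None without a tag."""
    stripped = line.strip()
    match = OVERRIDE.match(stripped)
    if match is None:
        return None, None, stripped  # no tag: leave attribution to the pipeline
    emotion = match.group("emotion")
    return (
        match.group("speaker").strip(),
        emotion.strip() if emotion else None,
        match.group("text"),
    )
```

Lines without a tag come back with `None` for speaker and emotion, which lets the caller fall back to automatic attribution for everything that is not explicitly overridden.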