Architecture

Core concepts

Chapter

A Chapter is a chunk of source text with a title and index. After compilation, it contains a list of utterances. After rendering, it carries an audio path and duration.

Utterance

An Utterance is the atomic unit of speech output:

speaker — who is speaking (e.g., narrator, Alice)
text — what to speak
type — narration vs dialogue
emotion — optional style hint (e.g., angry, whisper)
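
As a rough illustration, the four fields above could be modelled as a small dataclass. This is a sketch only: the enum name UtteranceType, the default values, and the field types are assumptions, and the real definitions in models.py may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class UtteranceType(Enum):
    """Hypothetical enum for the narration-vs-dialogue distinction."""
    NARRATION = "narration"
    DIALOGUE = "dialogue"


@dataclass
class Utterance:
    speaker: str                                   # who is speaking, e.g. "narrator" or "Alice"
    text: str                                      # what to speak
    type: UtteranceType = UtteranceType.NARRATION  # assumed default
    emotion: Optional[str] = None                  # optional style hint, e.g. "angry"


line = Utterance(
    speaker="Alice",
    text="How dare you!",
    type=UtteranceType.DIALOGUE,
    emotion="angry",
)
```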

Casting table

The CastingTable maps speaker names to voice IDs (from voice-soundboard) and, optionally, a default emotion per speaker. It also defines how unknown speakers are handled and which fallback voice to use.

Mapping priority, from highest to lowest:

Exact character entry in the casting table
Default narrator (if set and cast)
Fallback voice (fallback_voice_id) as the last resort
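
The lookup order can be sketched as a single function. Everything here is illustrative: resolve_voice, the plain dict used for the cast, and the voice IDs are hypothetical stand-ins, not the actual CastingTable API. Only fallback_voice_id is a name taken from this document.

```python
from typing import Dict, Optional


def resolve_voice(
    speaker: str,
    cast: Dict[str, str],
    default_narrator: Optional[str],
    fallback_voice_id: str,
) -> str:
    """Return a voice ID for `speaker` using the three-step priority."""
    # 1. Exact character entry in the casting table.
    if speaker in cast:
        return cast[speaker]
    # 2. Default narrator, but only if it is set and has itself been cast.
    if default_narrator is not None and default_narrator in cast:
        return cast[default_narrator]
    # 3. Fallback voice as the last resort.
    return fallback_voice_id


# Placeholder voice IDs, not real voice-soundboard identifiers.
cast = {"narrator": "voice_narr", "Alice": "voice_a"}
alice_voice = resolve_voice("Alice", cast, "narrator", "voice_fallback")
unknown_voice = resolve_voice("Stranger", cast, "narrator", "voice_fallback")
```

With this ordering, an uncast speaker such as "Stranger" is read in the narrator's voice, and the fallback is reached only when no narrator has been cast.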

Project

A Project (saved as .audiobooker) is the persistent state: metadata, chapters, compiled utterances, casting table, config settings, and render cache pointers.

Pipeline

Source (EPUB/TXT/MD)
  -> Parser
  -> Chapters
  -> Dialogue Detection
  -> Speaker Attribution
  -> Utterances
  -> Review Export/Import (optional)
  -> TTS (voice-soundboard)
  -> Chapter WAVs
  -> FFmpeg assembly
  -> M4B (or M4A fallback when chapters cannot embed)

Key design principles:

Stage separation — parsing, attribution, review, rendering, and assembly are cleanly separated.
Human control point — the review workflow allows correcting mistakes before you spend compute.
Resumability — chapter WAVs and a manifest allow continuing after failure without re-rendering.

Repository structure

audiobooker/
  parser/            EPUB/TXT parsing
  casting/           dialogue detection, attribution, voice registry, voice suggester
  language/          language profiles (en, extensible)
  nlp/               BookNLP adapter, emotion inference, speaker resolver
  renderer/          TTS engine, cache manifest, FFmpeg assembly, progress, failure reports
  review.py          review format import/export
  models.py          core data models (Chapter, Utterance, Character, CastingTable, ProjectConfig)
  project.py         AudiobookProject orchestration
  cli.py             CLI entrypoint

Rendering and cache

Rendering has two phases:

Synthesis — utterances to chapter WAV files (via voice-soundboard)
Assembly — chapter WAVs to final M4B (via FFmpeg)

Cache structure

<project_dir>/.audiobooker/cache/
  chapters/
    chapter_0000.wav
    chapter_0001.wav
  manifests/
    render_v1.json

A manifest entry tracks validity by hashing the chapter text, casting table inputs, and audio-affecting render parameters. If hashes match and the WAV exists, the chapter is skipped on rerun.
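
One way such a validity check could work is sketched below. The function names, the "hash" key in the manifest entry, the choice of SHA-256, and the sample_rate parameter are all assumptions made for illustration; the actual manifest schema in render_v1.json is not described here.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Optional


def chapter_fingerprint(
    text: str,
    casting: Dict[str, str],
    render_params: Dict[str, Any],
) -> str:
    """Hash everything that affects the rendered audio for one chapter."""
    # sort_keys makes the serialization independent of dict ordering,
    # so equal inputs always produce the same digest.
    payload = json.dumps(
        {"text": text, "casting": casting, "render": render_params},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_skip(entry: Optional[Dict[str, Any]], fingerprint: str, wav_path: Path) -> bool:
    """A chapter is skipped only if the stored hash matches and the WAV exists."""
    return (
        entry is not None
        and entry.get("hash") == fingerprint
        and wav_path.exists()
    )


fp = chapter_fingerprint("Hello.", {"Alice": "voice_a"}, {"sample_rate": 24000})
```

Any change to the text, the casting, or a render parameter produces a different fingerprint, so the chapter is rendered again on the next run.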

Resume behavior

Completed chapters are not re-rendered
If a render fails at chapter 15, chapters 0-14 remain usable
Rerun audiobooker render to continue where the previous run left off
Use --no-resume to force a full re-render, or --clean-cache to wipe the cache

Voice suggestion engine

The cast-suggest command uses a scoring engine that ranks voices per speaker based on:

Gender inference — pronoun and name cues in sample utterances hint at a preferred gender
Role match — narrator roles prefer calm/neutral voices tagged for narration; dialogue roles prefer expressive voices
Diversity — voices already assigned to other speakers receive a penalty to avoid reuse
Curated metadata — voices with known style notes (calm, powerful, warm) receive small bonuses for appropriate roles

Each suggestion includes a human-readable reason string explaining why it scored the way it did. The top suggestion is applied when using cast-apply --auto.
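
The four signals can be combined into an additive score, as in the sketch below. The Voice class, the tag names, and every weight are invented for illustration; the real engine's weights and data model are not documented here.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Set, Tuple


@dataclass
class Voice:
    voice_id: str
    gender: Optional[str] = None
    tags: Set[str] = field(default_factory=set)


def score_voice(
    voice: Voice,
    role: str,
    preferred_gender: Optional[str],
    already_used: FrozenSet[str],
) -> Tuple[float, str]:
    """Return (score, reason) for one voice; all weights are assumptions."""
    score, reasons = 0.0, []
    if preferred_gender and voice.gender == preferred_gender:
        score += 2.0
        reasons.append(f"matches inferred gender ({preferred_gender})")
    if role == "narrator" and "narration" in voice.tags:
        score += 1.0
        reasons.append("tagged for narration")
    elif role == "dialogue" and "expressive" in voice.tags:
        score += 1.0
        reasons.append("expressive voice for dialogue")
    if voice.voice_id in already_used:
        score -= 1.5  # diversity penalty
        reasons.append("already assigned to another speaker")
    if role == "narrator" and "calm" in voice.tags:
        score += 0.25  # small curated-metadata bonus
        reasons.append("curated note: calm")
    return score, "; ".join(reasons) or "no signals matched"


def suggest(
    voices: List[Voice],
    role: str,
    preferred_gender: Optional[str] = None,
    already_used: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, float, str]]:
    """Rank voices for one speaker, best first."""
    scored = [
        (v.voice_id, *score_voice(v, role, preferred_gender, already_used))
        for v in voices
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Keeping the reason string next to the score means the ranking can be explained to the user without recomputing anything.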

Language profiles

Audiobooker separates language-specific heuristics into a LanguageProfile that controls:

Supported quote characters (straight quotes, smart quotes)
Speaker attribution verbs and patterns
Blacklist words to avoid false-positive names
Valid-name heuristics
Chapter heading patterns

The default profile is en (English). Choose the language at project creation with --lang.
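
A profile of this kind might be represented as an immutable dataclass. The field names, the is_valid_name heuristic, and the small English word lists below are assumptions for illustration, not the contents of the actual en profile.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class LanguageProfile:
    code: str
    quote_pairs: Tuple[Tuple[str, str], ...]  # (opening, closing) quote characters
    attribution_verbs: FrozenSet[str]         # e.g. "said" in: "...," said Alice
    name_blacklist: FrozenSet[str]            # capitalized words that are not names
    chapter_heading: re.Pattern               # matches chapter heading lines

    def is_valid_name(self, candidate: str) -> bool:
        """Simplified valid-name heuristic: capitalized, alphabetic, not blacklisted."""
        return (
            candidate[:1].isupper()
            and candidate.isalpha()
            and candidate.lower() not in self.name_blacklist
        )


# Illustrative English profile; the word lists are deliberately tiny.
EN = LanguageProfile(
    code="en",
    quote_pairs=(('"', '"'), ("\u201c", "\u201d")),
    attribution_verbs=frozenset({"said", "asked", "replied"}),
    name_blacklist=frozenset({"the", "then", "but"}),
    chapter_heading=re.compile(r"^chapter\s+\w+", re.IGNORECASE),
)
```

Under this layout, supporting another language would mean supplying a second instance with its own quote characters, verbs, and heading pattern.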

Inline overrides

You can override speaker and emotion inline in the source text:

[Alice|angry] "How dare you!"
[Bob|whisper] "Shh."
[narrator] The room fell silent.

Inline overrides are parsed during compilation and take precedence over automatic attribution for that specific line.
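
The bracket syntax shown above can be recognized with one regular expression. This is a minimal sketch; the function name and the exact pattern are assumptions, and the real parser may accept more variations.

```python
import re
from typing import Optional, Tuple

# [speaker] or [speaker|emotion] at the start of a line, followed by the text.
OVERRIDE_RE = re.compile(
    r"^\[(?P<speaker>[^\]|]+)(?:\|(?P<emotion>[^\]]+))?\]\s*(?P<text>.*)$"
)


def parse_inline_override(line: str) -> Optional[Tuple[str, Optional[str], str]]:
    """Return (speaker, emotion, text), or None if the line has no override."""
    match = OVERRIDE_RE.match(line.strip())
    if match is None:
        return None
    emotion = match.group("emotion")
    return (
        match.group("speaker").strip(),
        emotion.strip() if emotion else None,
        match.group("text"),
    )


parsed = parse_inline_override('[Alice|angry] "How dare you!"')
```

Lines without a leading bracket return None, which leaves them to the normal dialogue detection and attribution stages.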