Chautauqua
Self-hostable audiobook pipeline: raw text in, chaptered M4B out. Cast differentiation, incremental caching, multiple TTS backends.
System requirements
Minimum
| Component | Requirement |
|---|---|
| OS | macOS 13+, Ubuntu 22.04+, or Windows 11 (WSL2) |
| Python | 3.12+ |
| RAM | 4 GB (cloud backends only) |
| Disk | 5 GB (Python deps + Docker images) |
| Docker | 24+ with Compose V2 (for the full stack) |
Recommended
| Use case | RAM | Disk | Notes |
|---|---|---|---|
| Cloud TTS only (Voxtral, Gemini) | 4 GB | 5 GB | Fastest setup, needs API keys |
| Piper ONNX CPU | 4 GB | 6 GB | Small local CPU voices, downloaded on demand |
| CPU TTS (Kokoro via PyTorch) | 8 GB | 10 GB | No GPU needed, slower inference |
| MLX local (Kokoro) | 8 GB | 6 GB | Apple Silicon only, fast |
| MLX local (Chatterbox / Dia) | 16 GB | 10 GB | Voice cloning, expressive |
| MLX local (Voxtral 4B) | 16 GB | 15 GB | Multilingual, 20 voices |
| MLX local (kugelaudio 7B) | 32 GB | 25 GB | SOTA quality, 24 languages |
| vLLM remote (NVIDIA GPU) | 8 GB host | 5 GB host | GPU server needs 8+ GB VRAM |
MLX model weights and Piper ONNX voices are downloaded on first use to ~/.cache/huggingface/. The disk estimates above include model and voice weights.
How it works
text -> Ingest -> BNM -> Transform -> directed BNM -> Pre-plan -> voice map
                                                                      |
                                                          Render <----+
                                                             |
                                                       chaptered M4B
| Stage | Input | Output |
|---|---|---|
| Ingest | Plain text | book.bnm.md + book.lock.yaml |
| Transform | BNM | Directed BNM (LLM-enriched stage directions) |
| Pre-plan | Directed BNM | Approved voice map |
| Render | BNM + voice map | Chaptered M4B + per-cue WAVs |
Cache-aware: same text + model + voice = cache hit. Editing one sentence re-renders only that cue.
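The cache key is conceptually a hash over everything that affects a cue's audio. A minimal sketch of that idea (the field names and hashing scheme here are illustrative assumptions, not the project's actual implementation):

```python
import hashlib
import json

def cue_cache_key(text: str, backend: str, model: str, voice: str) -> str:
    """Illustrative cache key: any change to the cue text, backend, model,
    or voice produces a new key, so only edited cues are re-rendered."""
    payload = json.dumps(
        {"text": text, "backend": backend, "model": model, "voice": voice},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same inputs -> same key -> cache hit; editing one sentence changes only that cue's key.
print(cue_cache_key("I am a rather elderly man.", "mlx", "kokoro", "am_adam"))
```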
Quick start
Prerequisites (all platforms)
- Python 3.12+
- uv (Python package manager)
- Docker and Docker Compose (for full stack)
- ffmpeg (for M4B assembly)
- SoX (optional fallback for WAV concatenation if ffmpeg concat fails)
- Node.js 20+ and pnpm (for the web UI)
macOS (Apple Silicon)
Apple Silicon Macs can run the MLX backend natively for fast local TTS with no cloud API keys needed.
# 1. Install system deps
brew install uv ffmpeg sox node
npm install -g pnpm
# 2. Clone and install
git clone <repo-url> && cd chautauqua
uv sync && uv sync --extra mlx
# 3. Set up environment
cp .env.example .env
# Edit .env — defaults work for local dev (see Environment Variables below)
# 4. Start the full stack (Docker services + host MLX workers)
./dev.sh up --mlx
# 5. Open the web UI
open http://localhost:5173
The --mlx flag tells dev.sh to start MLX TTS workers on the host (Metal GPU is not accessible inside Docker). Docker handles Redis, Temporal, MinIO, the API server, and the web UI.
CLI-only (no Docker):
uv sync && uv sync --extra mlx
uv run chautauqua ingest book.txt --auto --output-dir output
uv run chautauqua render output/book.bnm.md --backend mlx --model kokoro --output-dir output
Linux
Linux machines can use the local CPU backends, the vLLM backend with an NVIDIA GPU, or cloud backends (Voxtral, Gemini).
# 1. Install system deps
# Debian/Ubuntu:
sudo apt update && sudo apt install -y ffmpeg sox
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone and install
git clone <repo-url> && cd chautauqua
uv sync
# 3. Set up environment
cp .env.example .env
# Edit .env — see Environment Variables below
# 4. Start the stack with the small Piper ONNX CPU TTS worker
docker compose --profile piper up -d
# 5. Open the web UI
xdg-open http://localhost:5173
With an NVIDIA GPU (vLLM):
# Start a vLLM server on the GPU host (see docs/guides/mlx.md for model setup)
# Then set VLLM_SERVER_URL in .env and:
docker compose --profile vllm up -d
With cloud TTS (no GPU needed):
# Voxtral (Mistral API) — set MISTRAL_API_KEY in .env
docker compose --profile voxtral up -d
# Gemini (Google) — set GEMINI_API_KEY in .env
docker compose --profile gemini up -d
With Piper ONNX CPU (small local voices):
docker compose --profile piper up -d
Piper voice files are resolved from rhasspy/piper-voices and downloaded on first render.
With Kokoro CPU (PyTorch):
docker compose --profile cpu up -d
Kokoro CPU uses CPU-only PyTorch wheels. It avoids CUDA packages, but still downloads Torch because Kokoro depends on PyTorch.
Windows
Windows support is through WSL2 (Windows Subsystem for Linux). Native Windows is not supported.
# 1. Install WSL2 (PowerShell as admin, then restart)
wsl --install
After restarting, open your WSL2 terminal (Ubuntu by default):
# 2. Install system deps inside WSL2
sudo apt update && sudo apt install -y ffmpeg sox
curl -LsSf https://astral.sh/uv/install.sh | sh
# 3. Install Docker Desktop for Windows and enable the WSL2 backend
# https://docs.docker.com/desktop/install/windows-install/
# In Docker Desktop Settings > Resources > WSL Integration, enable your distro.
# 4. Clone and install
git clone <repo-url> && cd chautauqua
uv sync
# 5. Set up environment
cp .env.example .env
# Edit .env — see Environment Variables below
# 6. Start the stack with the small Piper ONNX CPU TTS worker
docker compose --profile piper up -d
# 7. Open the web UI
explorer.exe http://localhost:5173
For NVIDIA GPU support on Windows, install the NVIDIA CUDA drivers for WSL2 and follow the Linux vLLM instructions above.
Environment variables
Copy .env.example to .env and configure:
Storage
| Variable | Default | Description |
|---|---|---|
| `STORAGE_BACKEND` | `local` | `local` for filesystem, `minio` for S3-compatible storage |
| `CHAUTAUQUA_STORAGE_ROOT` | `~/.chautauqua/storage` | Root directory when using `local` storage |
| `MINIO_ENDPOINT` | `localhost:9000` | MinIO server address (Docker sets this automatically) |
| `MINIO_ACCESS_KEY` | `minioadmin` | MinIO access key |
| `MINIO_SECRET_KEY` | `minioadmin` | MinIO secret key |
TTS backends
| Variable | Required for | Description |
|---|---|---|
| `MISTRAL_API_KEY` | Voxtral backend | API key from Mistral |
| `GEMINI_API_KEY` | Gemini backend | API key from Google AI Studio |
| `HF_TOKEN` | MLX model downloads | HuggingFace token for gated model access |
| `VLLM_SERVER_URL` | vLLM backend | URL of your vLLM inference server (e.g. http://gpu-host:8000) |
Infrastructure
| Variable | Default | Description |
|---|---|---|
| `TEMPORAL_ADDRESS` | `localhost:7233` | Temporal server gRPC endpoint |
| `TEMPORAL_NAMESPACE` | `default` | Temporal namespace |
| `REDIS_URL` | — | Redis connection string (e.g. redis://localhost:6379/0). Persists job state across restarts |
LLM (for Ingest and Transform)
| Variable | Default | Description |
|---|---|---|
| `LLM_BASE_URL` | — | OpenAI-compatible API base URL (e.g. https://api.openai.com/v1) |
| `LLM_API_KEY` | — | API key for the LLM endpoint |
| `LLM_MODEL` | — | Model name (e.g. gpt-4o-mini) |
Tip: For local dev without Docker, only `STORAGE_BACKEND=local` is required. Everything else is optional depending on which backends and features you use.
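As a sanity check, the two local-storage settings behave roughly like this (a sketch using the documented defaults; this is not the project's config loader):

```python
import os
from pathlib import Path

# Defaults copied from the Storage table above; the resolution logic is illustrative.
backend = os.environ.get("STORAGE_BACKEND", "local")
root = Path(os.environ.get("CHAUTAUQUA_STORAGE_ROOT", "~/.chautauqua/storage")).expanduser()

if backend == "local":
    root.mkdir(parents=True, exist_ok=True)
    print(f"local storage -> {root}")
else:
    print(f"minio storage -> {os.environ.get('MINIO_ENDPOINT', 'localhost:9000')}")
```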
Install extras
uv sync # core
uv sync --extra mlx # Apple Silicon TTS (Kokoro, Chatterbox, Qwen3-TTS)
uv sync --extra kokoro-cpu # CPU-only Kokoro (PyTorch, slower)
uv sync --extra kokoro-gpu # Kokoro via PyTorch on CUDA
uv sync --extra piper-cpu # Piper ONNX CPU voices
uv sync --extra vllm # remote CUDA vLLM server
uv sync --extra voxtral # Mistral Voxtral cloud API
uv sync --extra gemini # Google Gemini cloud TTS
uv sync --extra stt # Whisper STT for word alignment and batch splitting
uv sync --extra temporal # Temporal workflow engine
uv sync --extra ingest # spaCy NLP for ingest
uv sync --extra transform # LLM-based transform pipeline
uv sync --extra convert # PDF/EPUB -> text conversion
uv sync --extra convert-ocr # + OCR support (Tesseract)
uv sync --extra convert-ml # + ML-based conversion (Marker, Docling)
uv sync --extra redis # Redis job state persistence
uv sync --extra storage-minio # MinIO S3 storage
uv sync --extra dev # pytest, hypothesis
uv sync --extra all # everything (except convert-ocr and convert-ml)
CLI
uv run chautauqua ingest book.txt --auto # text -> BNM
uv run chautauqua render book.bnm.md --backend mlx --model kokoro # BNM -> audio
uv run chautauqua voices list --backend mlx --model kokoro
uv run chautauqua voices list --backend piper --model piper
uv run chautauqua validate book.bnm.md
uv run chautauqua doctor
uv run chautauqua serve # web UI + API on :8080
CLI smoke test
Use the included fixtures to verify the command-line generation path without Docker.
# Validate a known-good BNM file.
uv run chautauqua validate fixtures/tiny.bnm.md --summary
# Compile the BNM into render metadata and cue prompts.
uv run chautauqua compile fixtures/tiny.bnm.md \
--backend mlx \
--model kokoro \
--output-dir /tmp/chautauqua-cli-compile
# Exercise the render planner without loading a TTS model.
uv run chautauqua render fixtures/tiny.bnm.md \
--backend mlx \
--model kokoro \
--limit 1 \
--dry-run
On Apple Silicon with MLX installed, run one real cue render:
uv run chautauqua render fixtures/tiny.bnm.md \
--backend mlx \
--model kokoro \
--limit 1 \
--force \
--storage local \
--storage-root /tmp/chautauqua-cli-storage \
--output-dir /tmp/chautauqua-cli-render
Expected outputs include:
- Per-cue WAV: `/tmp/chautauqua-cli-render/<job_id>/cue-0001.wav`
- Stitched chapter WAV: `/tmp/chautauqua-cli-render/<job_id>/chapters/chapter-01.wav`
- Final M4B: `/tmp/chautauqua-cli-render/<job_id>/final/Tiny Test Book.m4b`
- Stored artifact copy: `/tmp/chautauqua-cli-storage/chautauqua-artifacts/<job_id>/cues/cue-0001.wav`
To test raw text to BNM generation:
uv run chautauqua ingest fixtures/simple-dialogue.txt \
--auto \
--output-dir /tmp/chautauqua-cli-ingest
uv run chautauqua validate /tmp/chautauqua-cli-ingest/simple-dialogue.bnm.md --summary
For the ONNX CPU backend:
uv sync --extra piper-cpu
uv run chautauqua voices list --backend piper --model piper
uv run chautauqua render fixtures/tiny.bnm.md \
--backend piper \
--model piper \
--limit 1
The default voice is en_US-lessac-medium. Other built-in aliases include amy, amy-low, and ryan; the corresponding .onnx and .onnx.json files download from Hugging Face on first use.
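To pre-fetch a voice instead of waiting for the first render, the files can be pulled from rhasspy/piper-voices with huggingface_hub. The directory layout inside that repo is an assumption here; check the repo if a path does not resolve:

```python
from huggingface_hub import hf_hub_download

# Assumed repo layout: <lang>/<locale>/<name>/<quality>/<voice>.onnx(.json).
# Verify the exact paths against rhasspy/piper-voices before relying on them.
for suffix in (".onnx", ".onnx.json"):
    local_path = hf_hub_download(
        repo_id="rhasspy/piper-voices",
        filename=f"en/en_US/lessac/medium/en_US-lessac-medium{suffix}",
    )
    print(local_path)
```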
Backends
| Backend | Flag | Hardware | API key needed |
|---|---|---|---|
| MLX | `--backend mlx` | Apple Silicon | No |
| Kokoro CPU | `--backend cpu` | Any CPU | No |
| Kokoro CUDA | `--backend cuda` | NVIDIA CUDA | No |
| Piper ONNX CPU | `--backend piper` | Any CPU | No |
| vLLM | `--backend vllm` | NVIDIA GPU (remote server) | No |
| Voxtral | `--backend voxtral` | Cloud | `MISTRAL_API_KEY` |
| Gemini | `--backend gemini` | Cloud | `GEMINI_API_KEY` |
Two render modes: Single (default) and Overlay (narrator base + character dialogue spliced via Whisper alignment).
Backend selector (UI)
The web UI groups backends by compute tier, then offers a model dropdown per tier. This differs from the flat --backend flag taxonomy above — at the CLI each row is its own backend; in the UI cloud vendors are split out and Piper sits under CPU as a model.
| UI tier | Wire backend | Models exposed |
|---|---|---|
| Cloud — Gemini | `gemini` | `gemini-3.1-flash-tts-preview`, `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts` |
| Cloud — Voxtral | `voxtral` | `voxtral-mini-tts` |
| MLX (Apple GPU) | `mlx` | `kokoro`, `chatterbox`, `voxtral` (local) |
| CUDA (NVIDIA GPU) | `cuda` | `kokoro` |
| vLLM (remote GPU) | `vllm` | `kokoro`, `voxtral-mini-tts` |
| CPU | `cpu` | `kokoro`, `piper` |
Selecting CPU × piper in the UI translates to `--backend piper --model piper` on the wire at the API boundary, so jobs land on the existing gpu-tts-piper-piper queue. Everything else passes through with the wire backend matching the tier name. The taxonomy lives in ui/src/lib/backend-options.ts.
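The queue naming implied here and by the host worker example further down (gpu-tts-<backend>-<model>) can be sketched as a one-line mapping. This illustrates the routing described in this section, not the actual code in backend-options.ts:

```python
def task_queue(wire_backend: str, model: str) -> str:
    """Illustrative routing: the UI's CPU x piper selection becomes
    --backend piper --model piper, which lands on gpu-tts-piper-piper."""
    return f"gpu-tts-{wire_backend}-{model}"

assert task_queue("piper", "piper") == "gpu-tts-piper-piper"
assert task_queue("mlx", "kokoro") == "gpu-tts-mlx-kokoro"
```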
Which backend should I use?
- Just want to try it out? Use `gemini` or `voxtral` — cloud-based, no hardware requirements; sign up for a free API key and go.
- Apple Silicon Mac (M1/M2/M3/M4)? Use `mlx` with the `kokoro` model for the best speed/quality tradeoff. Upgrade to `chatterbox` for voice cloning or `voxtral` (the MLX model, not the API) for multilingual support.
- Linux with NVIDIA GPU? Either run `cuda` (Kokoro on PyTorch CUDA — `docker compose --profile cuda up -d`, see docs/guides/cuda.md) or set up a `vllm` server for higher-throughput batch rendering.
- No GPU, no API key? Use `piper` for the smallest local ONNX path, or `cpu` with `kokoro` for higher-quality PyTorch inference.
- Production audiobooks? Start with `kokoro` for drafting, then re-render final output with `chatterbox` or `kugelaudio` for higher quality.
Example: mid-range PC (16 GB RAM, integrated GPU, quad-core CPU)
AMD/Intel integrated graphics (Vega, UHD, etc.) are not supported by any TTS backend — MLX needs Apple Silicon and vLLM needs NVIDIA CUDA. Three good options:
Option A — Cloud TTS (recommended). Offload rendering to Gemini or Voxtral. Your PC runs only the orchestration stack (Docker), which is lightweight. Best quality-per-dollar and fastest turnaround.
cp .env.example .env
# Add your API key:
# GEMINI_API_KEY=your_key (free tier available at aistudio.google.com)
# — or —
# MISTRAL_API_KEY=your_key (free tier available at console.mistral.ai)
docker compose --profile gemini up -d # or --profile voxtral
Option B — Piper ONNX CPU. Runs small local Piper voices with no API key. This is the lightest local backend and downloads voice files on first use.
docker compose --profile piper up -d
Option C — Kokoro CPU. Runs Kokoro inference on your CPU with no API key. Expect ~5-10x real-time on a quad-core (a 1-hour audiobook takes 5-10 hours to render). Good for offline/batch work or if you prefer not to use cloud APIs.
docker compose --profile cpu up -d
You can also mix backends: use piper for fast local checks, cpu with kokoro for Kokoro previews, and gemini or voxtral for the final render.
BNM format
Intermediate representation — Markdown + YAML front matter:
---
bnm: "0.3"
title: "Bartleby, the Scrivener"
cast:
  narrator:
    preferred:
      kokoro: { voice: am_adam, lang_code: a }
---
:::chapter {#ch-001 title="Chapter I"}
:::cue {#cue-001 speaker="narrator"}
I am a rather elderly man.
:::
:::
Full spec: docs/SPEC_BNM.md
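To peek at what that header contains from a script, the front matter can be split off and read with PyYAML. This is a hand-inspection sketch, not the project's parser or validator:

```python
import yaml  # PyYAML

def read_front_matter(path: str) -> dict:
    """Split off the leading --- ... --- block and parse it as YAML."""
    text = open(path, encoding="utf-8").read()
    _, header, _body = text.split("---", 2)
    return yaml.safe_load(header)

# Keys mirror the example above; a generated BNM file may carry more metadata.
meta = read_front_matter("output/book.bnm.md")
print(meta["bnm"], meta["title"])
print(meta.get("cast", {}))
```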
Web UI + API
uv run chautauqua serve # API on :8080
cd ui && pnpm install && pnpm dev # UI dev server on :5173 (proxies /api to :8080)
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/jobs` | GET / POST | List / create jobs |
| `/api/jobs/{id}/progress` | GET (SSE) | Live progress stream |
| `/api/jobs/{id}/pause` | POST | Pause / resume / cancel |
| `/api/ingest/upload` | POST | Upload text for ingest |
| `/api/preplan/{id}` | GET / POST | Preplan status / approve |
| `/api/voices` | GET | List voices |
| `/api/voices/sample` | POST | Render a voice sample |
| `/api/artifacts/{id}/{path}` | GET | Download artifacts |
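A minimal client for the two most common calls might look like the following. Only the endpoint paths come from the table above; the request body and the id field are illustrative assumptions, so check the actual API schema (for example via the UI's network tab) before relying on them:

```python
import json
import requests

API = "http://localhost:8080"

# Create a job. The payload shown here is hypothetical; the real schema is
# defined by the API server.
job = requests.post(f"{API}/api/jobs", json={"backend": "mlx", "model": "kokoro"}).json()
job_id = job["id"]  # assumed response field

# Follow live progress over SSE: events arrive as "data: {...}" lines.
with requests.get(f"{API}/api/jobs/{job_id}/progress", stream=True) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            print(json.loads(line[len(b"data:"):]))
```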
Docker Compose
The docker-compose.yml provides the full stack. Core services start by default; TTS workers are activated via profiles:
docker compose up -d # core (Redis, Temporal, MinIO, API, UI, general worker)
docker compose --profile cpu up -d # + Kokoro CPU TTS worker (PyTorch)
docker compose --profile cuda up -d # + Kokoro CUDA worker (NVIDIA GPU, see docs/guides/cuda.md)
docker compose --profile piper up -d # + Piper ONNX CPU worker
docker compose --profile voxtral up -d # + Voxtral cloud TTS worker
docker compose --profile vllm up -d # + vLLM remote GPU worker
docker compose --profile gemini up -d # + Gemini cloud TTS worker
docker compose --profile stt up -d # + Whisper STT worker on CPU
docker compose --profile stt-cuda up -d # + Whisper STT worker on NVIDIA CUDA
The STT workers poll the audiobook-stt queue and are used for Whisper-heavy work such as listen-along word alignment and marker-based one-shot batch splitting.
| Service | Port | Description |
|---|---|---|
| API | `localhost:8080` | FastAPI server |
| UI | `localhost:5173` | Vite dev server |
| Redis | `localhost:6379` | Job state persistence |
| Temporal | `localhost:7233` | Workflow orchestration (gRPC) |
| Temporal UI | `localhost:8233` | Temporal web dashboard |
| MinIO S3 | `localhost:9000` | Object storage API |
| MinIO Console | `localhost:9001` | Object storage web UI |
Building images
Four Dockerfiles, all built with the chautauqua/ directory as the build context — the chautauqua subtree is fully self-contained (its own pyproject.toml and uv.lock) and builds without needing a parent workspace:
| Image | Dockerfile | Base | Size |
|---|---|---|---|
| `chautauqua-api` | `Dockerfile` | `python:3.12-slim-bookworm` | ~650 MB |
| `chautauqua-worker` | `Dockerfile.worker` | `python:3.12-slim-bookworm` | ~1 GB |
| `chautauqua-worker-cuda` | `Dockerfile.worker.cuda` | `nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04` | ~6 GB |
| `chautauqua-ui` | `ui/Dockerfile` | `node:20-alpine` | ~690 MB |
# Build all images
docker compose build
# Build a single image
docker compose build api
docker compose build worker
docker compose build ui
# Rebuild after changing pyproject.toml, uv.lock, package.json, or pnpm-lock.yaml
docker compose build api ui worker && docker compose up -d
The API and worker images pin uv to v0.10.4 (matching the host lockfile format). If you upgrade uv locally, update the FROM ghcr.io/astral-sh/uv: line in both Dockerfiles.
The API and worker images include ffmpeg and SoX for recorder/STT normalization, audio stitching, and M4B composition fallback. The Kokoro CPU variant (--profile cpu) builds with INSTALL_KOKORO_CPU=true for PyTorch-based inference, which increases image size. The Piper variant (--profile piper) builds with INSTALL_PIPER_CPU=true for small ONNX Runtime CPU inference.
Lockfile management
This subtree carries its own uv.lock so it can be built standalone from a zip
or a subtree-only checkout — no parent workspace required. When this directory
is cloned as part of the larger audiobook-generator workspace, uv prefers the
parent's audiobook-generator/uv.lock (workspace rules) and the local
chautauqua/uv.lock is dormant. Inside Docker the build context is just the
chautauqua subtree, so the local lock is what actually pins versions.
When you change anything in chautauqua/pyproject.toml (deps, extras, sources),
regenerate both locks so they don't drift:
# 1. Parent workspace lock
cd /path/to/audiobook-generator
uv lock
# 2. Standalone chautauqua lock — copy to a temp dir so uv doesn't detect the
# parent workspace, then run uv lock and copy the result back.
TMP=$(mktemp -d)
cp chautauqua/pyproject.toml "$TMP/"
cp -r chautauqua/chautauqua "$TMP/"
( cd "$TMP" && uv lock )
cp "$TMP/uv.lock" chautauqua/uv.lock
rm -rf "$TMP"
Both locks should resolve cleanly with uv lock --check.
MLX workers cannot run in Docker on macOS — Metal GPU is inaccessible inside Docker's Linux VM. Run them on the host via ./dev.sh up --mlx or directly:
python -m chautauqua.temporal.worker gpu-tts-mlx-kokoro \
--backend mlx --model kokoro --temporal-address localhost:7233
Local orchestration (dev.sh)
./dev.sh up --mlx # Docker stack + host MLX workers
./dev.sh up --cpu # Docker stack + Kokoro CPU worker (PyTorch)
./dev.sh up --piper # Docker stack + Piper ONNX CPU worker
./dev.sh down # stop Docker + host workers
./dev.sh restart --mlx # full stop/start cycle
./dev.sh rebuild --mlx # rebuild Docker images, then restart
./dev.sh status # Docker + worker status
./dev.sh worker-restart kokoro
Host worker PIDs live under .dev/run/ and logs under .dev/logs/.
Development
uv run pytest # all tests
uv run pytest -m "not slow" # skip model-loading tests
cd ui && pnpm typecheck && pnpm test # frontend type check + vitest
See CLAUDE.md for architecture, conventions, and full docs index.
Docs
| File | What |
|---|---|
| docs/SPEC.md | Product spec (architecture, modules, phases) |
| docs/SPEC_BNM.md | BNM format (syntax, validation, plugins) |
| docs/FLOW.md | API lifecycle per workflow phase |
| docs/guides/transformer.md | T1-T6 LLM pipeline design |
| docs/guides/mlx.md | MLX backend setup + model presets |
| docs/guides/bnm-mvp.md | MVP BNM constraints |
License
See LICENSE.