Voice-based interview for the profile module — users choose between text, voice (question read aloud + mic for answer), or conversation mode (fully automatic flow with auto-save). Interview audio: - 92 pre-rendered MP3 files (23 questions × 4 voices) via Edge TTS - Voices: Seraphina (DE-f), Florian (DE-m), Leni (CH-f), Jan (CH-m) - User picks voice via dropdown, persisted in localStorage - Web Speech API fallback for missing audio files Profile UI: - Interview hero block on overview with 3 start modes (text/voice/conversation) - Voice/conversation toggle + voice picker in interview view - Mic button on text/textarea/tags inputs for per-question voice input - Conversation mode: auto-save + auto-advance after STT transcription - Recording/transcribing/speaking state indicators mana-tts service: - New Orpheus TTS backend (German finetune, SNAC codec) - New Zonos TTS backend (Zyphra, 200k hours, emotion control) - Endpoints: POST /synthesize/orpheus, POST /synthesize/zonos - espeak-ng installed on GPU server for Zonos phonemizer - Compare script for side-by-side voice quality testing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5.3 KiB
mana-tts
Text-to-Speech microservice. Wraps Kokoro (English presets), Piper (German, local ONNX), and F5-TTS (voice cloning) behind a small FastAPI surface. Lives on the Windows GPU server (mana-server-gpu, RTX 3090).
⚠️ Earlier history: this directory used to contain MLX-optimized Mac-Mini code (
f5-tts-mlx,mlx-audio,setup.shwith Apple Silicon checks,com.mana.mana-tts.plistlaunchd setup). All of that moved to the Windows GPU box and was removed from the repo. If you need the MLX path, see git history.
Tech Stack
| Layer | Technology |
|---|---|
| Runtime | Python 3.11 + uvicorn (Windows) |
| Framework | FastAPI |
| English (preset) | Kokoro-82M (kokoro_service.py) |
| German (local) | Piper ONNX with kerstin_low.onnx and thorsten_medium.onnx voices (piper_service.py) |
| German (high-quality) | Orpheus-3B German finetune (orpheus_service.py) — best for pre-generation |
| Multilingual (expressive) | Zonos v0.1 by Zyphra (zonos_service.py) — emotion control, 200k hours training |
| Voice cloning | F5-TTS on CUDA (f5_service.py) |
| Audio I/O | soundfile, pydub |
| Auth | Per-key + internal-key API auth (auth.py) + JWT via mana-auth (external_auth.py) |
| VRAM | Shared vram_manager.py (same module as mana-stt + mana-image-gen) |
| Process supervision | Windows Scheduled Task ManaTTS (AtLogOn) |
Port: 3022
Where it runs
| Host | Path on disk | Entrypoint |
|---|---|---|
Windows GPU server (192.168.178.11) |
C:\mana\services\mana-tts\ |
service.pyw via Scheduled Task ManaTTS |
Public URL: https://gpu-tts.mana.how.
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health |
Liveness + which backends are loaded |
| GET | /models |
Available TTS models |
| GET | /voices |
List all voices (preset + custom) |
| POST | /voices |
Register a custom voice (reference audio + transcript) |
| DELETE | /voices/{voice_id} |
Delete a custom voice |
| POST | /synthesize/kokoro |
Kokoro synthesis (English presets) |
| POST | /synthesize |
F5-TTS voice cloning |
| POST | /synthesize/orpheus |
Orpheus synthesis (German, high-quality, pre-generation) |
| POST | /synthesize/zonos |
Zonos synthesis (multilingual, expressive, emotion control) |
| POST | /synthesize/auto |
Routing helper — picks the right backend for the requested voice |
All non-health endpoints require Authorization: Bearer <token> (per-app key, internal key, or mana-auth JWT).
Voices
Kokoro-82M (English presets)
~300 MB download. 30+ preset English voices. Fast, no reference audio needed.
Piper (German, local ONNX)
~63 MB per voice. 100% local, GDPR-compliant. Available:
de_kerstin(female, default)de_thorsten(male)
Fallback to Edge TTS cloud voices if Piper isn't loaded.
Orpheus-3B German (high-quality pre-generation)
~8 GB VRAM. German finetune (Kartoffel/Orpheus-3B_german_natural-v0.1). Natural intonation, built-in speaker voices (tara, leo, emma, ...). Best quality for pre-generating static audio files. Not real-time.
Zonos v0.1 (expressive multilingual)
~5 GB VRAM. By Zyphra, trained on 200k hours. Explicit German support. Fine-grained control: emotion (neutral/friendly/warm/curious), speaking rate, pitch variation. Can clone voices from 5s reference audio.
F5-TTS (voice cloning)
~6 GB. Requires reference audio + transcript. Higher quality, slower. Custom voices live in voices/ (reference audio + transcript per voice ID).
Configuration (.env on the Windows GPU box)
PORT=3022
PRELOAD_MODELS=false
MAX_TEXT_LENGTH=1000
REQUIRE_AUTH=true
API_KEYS=sk-app1:app1,sk-app2:app2
INTERNAL_API_KEY=...
CORS_ORIGINS=https://mana.how,https://chat.mana.how
Code layout
services/mana-tts/
├── app/
│ ├── __init__.py
│ ├── main.py # FastAPI endpoints
│ ├── kokoro_service.py # Kokoro (English presets)
│ ├── piper_service.py # Piper (German, local ONNX)
│ ├── f5_service.py # F5-TTS (voice cloning, CUDA)
│ ├── orpheus_service.py # Orpheus-3B German (high-quality)
│ ├── zonos_service.py # Zonos v0.1 (expressive multilingual)
│ ├── voice_manager.py # Custom voice registry
│ ├── audio_utils.py # Format conversion, resampling
│ ├── auth.py # API-key auth
│ ├── external_auth.py # JWT validation via mana-auth
│ └── vram_manager.py # Shared VRAM accountant
└── service.pyw # Windows runner (used by ManaTTS scheduled task)
The Piper voice ONNX files live alongside the service on the GPU box (C:\mana\services\mana-tts\piper_voices\*.onnx) — too big to commit, downloaded once during setup.
Operations
# Status
Get-ScheduledTask -TaskName "ManaTTS" | Format-List TaskName, State
Get-NetTCPConnection -LocalPort 3022 -State Listen
# Restart
Stop-ScheduledTask -TaskName "ManaTTS"
Start-ScheduledTask -TaskName "ManaTTS"
# Logs
Get-Content C:\mana\services\mana-tts\service.log -Tail 50
Reference
docs/WINDOWS_GPU_SERVER_SETUP.md— Windows box setup, scheduled tasks, firewall, Cloudflare tunneldocs/PORT_SCHEMA.md— port assignments across services