mirror of https://github.com/Memo-2023/mana-monorepo.git synced 2026-05-14 20:21:09 +02:00

Till JS 8823cc0bf0 feat(profile): voice interview with pre-rendered TTS audio + Orpheus/Zonos backends

Voice-based interview for the profile module — users choose between text,
voice (question read aloud + mic for answer), or conversation mode (fully
automatic flow with auto-save).

Interview audio:
- 92 pre-rendered MP3 files (23 questions × 4 voices) via Edge TTS
- Voices: Seraphina (DE-f), Florian (DE-m), Leni (CH-f), Jan (CH-m)
- User picks voice via dropdown, persisted in localStorage
- Web Speech API fallback for missing audio files

Profile UI:
- Interview hero block on overview with 3 start modes (text/voice/conversation)
- Voice/conversation toggle + voice picker in interview view
- Mic button on text/textarea/tags inputs for per-question voice input
- Conversation mode: auto-save + auto-advance after STT transcription
- Recording/transcribing/speaking state indicators

mana-tts service:
- New Orpheus TTS backend (German finetune, SNAC codec)
- New Zonos TTS backend (Zyphra, 200k hours, emotion control)
- Endpoints: POST /synthesize/orpheus, POST /synthesize/zonos
- espeak-ng installed on GPU server for Zonos phonemizer
- Compare script for side-by-side voice quality testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-17 15:22:52 +02:00

5.3 KiB

Raw Blame History

mana-tts

Text-to-Speech microservice. Wraps Kokoro (English presets), Piper (German, local ONNX), and F5-TTS (voice cloning) behind a small FastAPI surface. Lives on the Windows GPU server (mana-server-gpu, RTX 3090).

⚠️ Earlier history: this directory used to contain MLX-optimized Mac-Mini code (f5-tts-mlx, mlx-audio, setup.sh with Apple Silicon checks, com.mana.mana-tts.plist launchd setup). All of that moved to the Windows GPU box and was removed from the repo. If you need the MLX path, see git history.

Tech Stack

Layer	Technology
Runtime	Python 3.11 + uvicorn (Windows)
Framework	FastAPI
English (preset)	Kokoro-82M (`kokoro_service.py`)
German (local)	Piper ONNX with `kerstin_low.onnx` and `thorsten_medium.onnx` voices (`piper_service.py`)
German (high-quality)	Orpheus-3B German finetune (`orpheus_service.py`) — best for pre-generation
Multilingual (expressive)	Zonos v0.1 by Zyphra (`zonos_service.py`) — emotion control, 200k hours training
Voice cloning	F5-TTS on CUDA (`f5_service.py`)
Audio I/O	`soundfile`, `pydub`
Auth	Per-key + internal-key API auth (`auth.py`) + JWT via mana-auth (`external_auth.py`)
VRAM	Shared `vram_manager.py` (same module as mana-stt + mana-image-gen)
Process supervision	Windows Scheduled Task `ManaTTS` (AtLogOn)

Port: 3022

Where it runs

Host	Path on disk	Entrypoint
Windows GPU server (`192.168.178.11`)	`C:\mana\services\mana-tts\`	`service.pyw` via Scheduled Task `ManaTTS`

Public URL: https://gpu-tts.mana.how.

API Endpoints

Method	Path	Description
GET	`/health`	Liveness + which backends are loaded
GET	`/models`	Available TTS models
GET	`/voices`	List all voices (preset + custom)
POST	`/voices`	Register a custom voice (reference audio + transcript)
DELETE	`/voices/{voice_id}`	Delete a custom voice
POST	`/synthesize/kokoro`	Kokoro synthesis (English presets)
POST	`/synthesize`	F5-TTS voice cloning
POST	`/synthesize/orpheus`	Orpheus synthesis (German, high-quality, pre-generation)
POST	`/synthesize/zonos`	Zonos synthesis (multilingual, expressive, emotion control)
POST	`/synthesize/auto`	Routing helper — picks the right backend for the requested voice

All non-health endpoints require Authorization: Bearer <token> (per-app key, internal key, or mana-auth JWT).

Voices

Kokoro-82M (English presets)

~300 MB download. 30+ preset English voices. Fast, no reference audio needed.

Piper (German, local ONNX)

~63 MB per voice. 100% local, GDPR-compliant. Available:

de_kerstin (female, default)
de_thorsten (male)

Fallback to Edge TTS cloud voices if Piper isn't loaded.

Orpheus-3B German (high-quality pre-generation)

~8 GB VRAM. German finetune (Kartoffel/Orpheus-3B_german_natural-v0.1). Natural intonation, built-in speaker voices (tara, leo, emma, ...). Best quality for pre-generating static audio files. Not real-time.

Zonos v0.1 (expressive multilingual)

~5 GB VRAM. By Zyphra, trained on 200k hours. Explicit German support. Fine-grained control: emotion (neutral/friendly/warm/curious), speaking rate, pitch variation. Can clone voices from 5s reference audio.

F5-TTS (voice cloning)

~6 GB. Requires reference audio + transcript. Higher quality, slower. Custom voices live in voices/ (reference audio + transcript per voice ID).

Configuration (`.env` on the Windows GPU box)

PORT=3022
PRELOAD_MODELS=false
MAX_TEXT_LENGTH=1000
REQUIRE_AUTH=true
API_KEYS=sk-app1:app1,sk-app2:app2
INTERNAL_API_KEY=...
CORS_ORIGINS=https://mana.how,https://chat.mana.how

Code layout

services/mana-tts/
├── app/
│   ├── __init__.py
│   ├── main.py             # FastAPI endpoints
│   ├── kokoro_service.py   # Kokoro (English presets)
│   ├── piper_service.py    # Piper (German, local ONNX)
│   ├── f5_service.py       # F5-TTS (voice cloning, CUDA)
│   ├── orpheus_service.py  # Orpheus-3B German (high-quality)
│   ├── zonos_service.py    # Zonos v0.1 (expressive multilingual)
│   ├── voice_manager.py    # Custom voice registry
│   ├── audio_utils.py      # Format conversion, resampling
│   ├── auth.py             # API-key auth
│   ├── external_auth.py    # JWT validation via mana-auth
│   └── vram_manager.py     # Shared VRAM accountant
└── service.pyw             # Windows runner (used by ManaTTS scheduled task)

The Piper voice ONNX files live alongside the service on the GPU box (C:\mana\services\mana-tts\piper_voices\*.onnx) — too big to commit, downloaded once during setup.

Operations

# Status
Get-ScheduledTask -TaskName "ManaTTS" | Format-List TaskName, State
Get-NetTCPConnection -LocalPort 3022 -State Listen

# Restart
Stop-ScheduledTask -TaskName "ManaTTS"
Start-ScheduledTask -TaskName "ManaTTS"

# Logs
Get-Content C:\mana\services\mana-tts\service.log -Tail 50

Reference

docs/WINDOWS_GPU_SERVER_SETUP.md — Windows box setup, scheduled tasks, firewall, Cloudflare tunnel
docs/PORT_SCHEMA.md — port assignments across services

5.3 KiB Raw Blame History