mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-18 17:41:23 +02:00
- Add Piper TTS section to mana-tts CLAUDE.md - Document available German voices (local and cloud) - Update matrix-tts-bot CLAUDE.md with new default voice - Add language auto-detection documentation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
125 lines
3.5 KiB
Markdown
125 lines
3.5 KiB
Markdown
# CLAUDE.md - Mana TTS Service
|
|
|
|
## Service Overview
|
|
|
|
Text-to-Speech microservice using MLX-optimized models for Apple Silicon:
|
|
|
|
- **Port**: 3022
|
|
- **Framework**: Python + FastAPI
|
|
- **Models**: Kokoro-82M (fast), F5-TTS (voice cloning)
|
|
|
|
## Commands
|
|
|
|
```bash
|
|
# Setup
|
|
./setup.sh
|
|
|
|
# Development
|
|
source .venv/bin/activate
|
|
uvicorn app.main:app --host 0.0.0.0 --port 3022 --reload
|
|
|
|
# Production (Mac Mini)
|
|
../../scripts/mac-mini/setup-tts.sh
|
|
|
|
# Test
|
|
curl http://localhost:3022/health
|
|
|
|
# English (Kokoro)
|
|
curl -X POST http://localhost:3022/synthesize/kokoro \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"text": "Hello world", "voice": "af_heart"}' \
|
|
--output test_en.wav
|
|
|
|
# German (Piper) - use /synthesize/auto
|
|
curl -X POST http://localhost:3022/synthesize/auto \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"text": "Hallo Welt", "voice": "de_kerstin"}' \
|
|
--output test_de.wav
|
|
```
|
|
|
|
## File Structure
|
|
|
|
```
|
|
services/mana-tts/
|
|
├── app/
|
|
│ ├── __init__.py
|
|
│ ├── main.py # FastAPI endpoints
|
|
│ ├── kokoro_service.py # Kokoro TTS (English preset voices)
|
|
│ ├── piper_service.py # Piper TTS (German voices, local)
|
|
│ ├── f5_service.py # F5-TTS (voice cloning)
|
|
│ ├── voice_manager.py # Custom voice registry
|
|
│ └── audio_utils.py # Audio format conversion
|
|
├── piper_voices/ # Piper voice models (.onnx)
|
|
├── voices/ # Custom F5 voice storage
|
|
├── mlx_models/ # MLX model cache
|
|
├── setup.sh # Setup script
|
|
├── requirements.txt
|
|
└── README.md
|
|
```
|
|
|
|
## API Endpoints
|
|
|
|
| Endpoint | Method | Purpose |
|
|
|----------|--------|---------|
|
|
| `/health` | GET | Health check |
|
|
| `/models` | GET | Model info |
|
|
| `/voices` | GET | List all voices |
|
|
| `/voices` | POST | Register custom voice |
|
|
| `/voices/{id}` | DELETE | Delete custom voice |
|
|
| `/synthesize/kokoro` | POST | Kokoro synthesis |
|
|
| `/synthesize` | POST | F5-TTS voice cloning |
|
|
| `/synthesize/auto` | POST | Auto-select model |
|
|
|
|
## Models
|
|
|
|
### Kokoro-82M (English)
|
|
- ~300 MB download
|
|
- 30+ preset English voices
|
|
- Fast inference
|
|
- No reference audio needed
|
|
|
|
### Piper TTS (German)
|
|
- ~63 MB per voice model
|
|
- 100% local, GDPR-compliant
|
|
- Fast inference on CPU
|
|
- Available voices:
|
|
- `de_kerstin` - Female (default)
|
|
- `de_thorsten` - Male
|
|
- Fallback to Edge TTS (cloud) if Piper unavailable:
|
|
- `de_katja` - Female (cloud)
|
|
- `de_conrad` - Male (cloud)
|
|
- `de_amala` - Female young (cloud)
|
|
- `de_florian` - Male young (cloud)
|
|
|
|
### F5-TTS (Voice Cloning)
|
|
- ~6 GB download
|
|
- Voice cloning capability
|
|
- Requires reference audio + transcript
|
|
- Higher quality, slower
|
|
|
|
## Environment Variables
|
|
|
|
| Variable | Default | Description |
|
|
|----------|---------|-------------|
|
|
| `PORT` | `3022` | Service port |
|
|
| `PRELOAD_MODELS` | `false` | Load on startup |
|
|
| `MAX_TEXT_LENGTH` | `1000` | Max chars |
|
|
| `CORS_ORIGINS` | (production URLs) | CORS config |
|
|
|
|
## Key Dependencies
|
|
|
|
- `fastapi` - Web framework
|
|
- `f5-tts-mlx` - Voice cloning model
|
|
- `mlx-audio` - Kokoro implementation
|
|
- `mlx` - Apple Silicon ML framework
|
|
- `piper-tts` - German TTS (local)
|
|
- `edge-tts` - German TTS fallback (cloud)
|
|
- `soundfile` - Audio I/O
|
|
- `pydub` - MP3 conversion
|
|
|
|
## Development Notes
|
|
|
|
- Models load lazily on first request (unless `PRELOAD_MODELS=true`)
|
|
- Custom voices stored in `voices/` with reference audio + transcript
|
|
- Singleton pattern for model instances
|
|
- Audio returned as raw bytes with headers for metadata
|