mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-17 00:59:40 +02:00
CLAUDE.md - Mana Voice Bot
Service Overview
German voice-to-voice assistant combining:
- STT: Whisper via mana-stt (Port 3020)
- LLM: Ollama with Gemma/Qwen (Port 11434)
- TTS: Edge TTS (Microsoft, cloud API)
Port: 3050
Architecture
Audio Input → Whisper (STT) → Ollama (LLM) → Edge TTS → Audio Output
     ↓              ↓              ↓            ↓
 [WAV/MP3]    [German Text]    [Response]  [MP3 Audio]
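The chain in the diagram can be sketched as three awaited calls in a row. This is a minimal illustration, not the service's actual code: `run_pipeline`, `PipelineResult`, and the `fake_*` stand-in stages are hypothetical names, and the real stages would call mana-stt, Ollama, and Edge TTS.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# Hypothetical stage signatures (assumption): each stage is an async callable.
SttFn = Callable[[bytes], Awaitable[str]]
LlmFn = Callable[[str], Awaitable[str]]
TtsFn = Callable[[str], Awaitable[bytes]]


@dataclass
class PipelineResult:
    transcript: str  # text produced by the STT stage
    response: str    # reply produced by the LLM stage
    audio: bytes     # speech produced by the TTS stage


async def run_pipeline(audio_in: bytes, stt: SttFn, llm: LlmFn, tts: TtsFn) -> PipelineResult:
    """Run STT -> LLM -> TTS in sequence; each stage feeds the next."""
    transcript = await stt(audio_in)
    response = await llm(transcript)
    audio_out = await tts(response)
    return PipelineResult(transcript, response, audio_out)


# Stand-in stages so the sketch runs without any backend.
async def fake_stt(audio: bytes) -> str:
    return "Hallo"


async def fake_llm(text: str) -> str:
    return f"Antwort auf: {text}"


async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    result = asyncio.run(run_pipeline(b"\x00\x01", fake_stt, fake_llm, fake_tts))
    print(result.transcript, "|", result.response, "|", len(result.audio), "bytes")
```

Passing the stages in as parameters keeps the chain testable without Whisper, Ollama, or network access.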
Commands
# Setup
./setup.sh
# Development
source venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 3050 --reload
# Production
./start.sh
# Test
curl http://localhost:3050/health
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Service health check |
| /voices | GET | List German TTS voices |
| /models | GET | List available Ollama models |
| /transcribe | POST | Audio → Text (STT only) |
| /chat | POST | Text → Text (LLM only) |
| /chat/audio | POST | Text → Audio (LLM + TTS) |
| /tts | POST | Text → Audio (TTS only) |
| /voice | POST | Audio → Audio (Full pipeline) |
| /voice/metadata | POST | Audio → JSON (Full pipeline, no audio) |
Usage Examples
Full Voice Pipeline
# Send a recorded audio file to the voice bot
curl -X POST http://localhost:3050/voice \
-F "audio=@input.wav" \
-F "model=gemma3:4b" \
-F "voice=de-DE-ConradNeural" \
-o response.mp3
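The same upload can be built from Python with only the standard library. This is a sketch under assumptions: it mirrors the form fields of the curl call above (`audio`, `model`, `voice`), the helper names `encode_multipart` and `build_voice_request` are hypothetical, and the `audio/wav` part type is a guess at what the server accepts.

```python
import uuid
import urllib.request


def encode_multipart(fields, file_field, filename, file_bytes, boundary,
                     file_type="audio/wav"):
    """Encode text fields plus one file part as multipart/form-data."""
    chunks = []
    for name, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        )
        chunks.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary
    return b"".join(chunks)


def build_voice_request(audio_bytes, model="gemma3:4b", voice="de-DE-ConradNeural",
                        base_url="http://localhost:3050"):
    """Build (but do not send) the POST request for /voice."""
    boundary = uuid.uuid4().hex
    body = encode_multipart({"model": model, "voice": voice},
                            "audio", "input.wav", audio_bytes, boundary)
    return urllib.request.Request(
        f"{base_url}/voice",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen(...)` and write `resp.read()` to `response.mp3`.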
Text to Audio
curl -X POST http://localhost:3050/chat/audio \
-H "Content-Type: application/json" \
-d '{"message": "Was ist die Hauptstadt von Deutschland?", "voice": "de-DE-KatjaNeural"}' \
-o response.mp3
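A Python equivalent of this JSON request, again as a sketch: the function names are hypothetical, and the payload keys `message` and `voice` are copied from the curl call above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3050"  # service port as documented above


def build_chat_audio_request(message: str, voice: str = "de-DE-KatjaNeural") -> urllib.request.Request:
    """Build (but do not send) the POST request for /chat/audio."""
    body = json.dumps({"message": message, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/audio",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_chat_audio(message: str, out_path: str = "response.mp3") -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_chat_audio_request(message)
    with urllib.request.urlopen(request, timeout=60) as resp:
        with open(out_path, "wb") as fh:
            fh.write(resp.read())
```

Building the request separately from sending it lets you inspect the payload without a running server.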
TTS Only
curl -X POST http://localhost:3050/tts \
-F "text=Hallo, wie geht es dir?" \
-F "voice=de-DE-ConradNeural" \
-o hello.mp3
German Voices
| Voice ID | Description |
|---|---|
| de-DE-ConradNeural | Male - Professional (Default) |
| de-DE-KatjaNeural | Female - Natural |
| de-DE-AmalaNeural | Female - Friendly |
| de-DE-BerndNeural | Male - Calm |
| de-DE-ChristophNeural | Male - News |
| de-DE-ElkeNeural | Female - Warm |
| de-DE-KillianNeural | Male - Casual |
| de-DE-KlarissaNeural | Female - Cheerful |
| de-DE-KlausNeural | Male - Storyteller |
| de-DE-LouisaNeural | Female - Assistant |
| de-DE-TanjaNeural | Female - Business |
Environment Variables
| Variable | Default | Description |
|---|---|---|
| PORT | 3050 | Service port |
| STT_URL | http://localhost:3020 | mana-stt URL |
| OLLAMA_URL | http://localhost:11434 | Ollama URL |
| DEFAULT_MODEL | gemma3:4b | Default LLM model |
| DEFAULT_VOICE | de-DE-ConradNeural | Default TTS voice |
| SYSTEM_PROMPT | (German assistant) | LLM system prompt |
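One way to read these variables with their documented defaults is sketched below. `Settings` and `load_settings` are hypothetical names and the service may load its configuration differently. `SYSTEM_PROMPT` is left out because its default text is not given here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    port: int
    stt_url: str
    ollama_url: str
    default_model: str
    default_voice: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        port=int(env.get("PORT", "3050")),
        stt_url=env.get("STT_URL", "http://localhost:3020"),
        ollama_url=env.get("OLLAMA_URL", "http://localhost:11434"),
        default_model=env.get("DEFAULT_MODEL", "gemma3:4b"),
        default_voice=env.get("DEFAULT_VOICE", "de-DE-ConradNeural"),
    )


if __name__ == "__main__":
    print(load_settings())
```

Accepting the environment as a mapping makes it possible to pass a plain dict in tests instead of mutating `os.environ`.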
Dependencies
- fastapi - Web framework
- uvicorn - ASGI server
- aiohttp - Async HTTP client
- edge-tts - Microsoft TTS
- python-multipart - File uploads
Performance
Typical latency breakdown:
- STT (Whisper): 0.5-2s
- LLM (Gemma 4B): 1-5s
- TTS (Edge): 0.3-0.5s
- Total (stages run in sequence): roughly 2-7s
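The total follows from adding the per-stage bounds, assuming the stages run one after another with no overlap. A small sketch of that arithmetic, with the ranges copied from the list above (`STAGES` and `total_range` are illustrative names):

```python
# Per-stage latency ranges in seconds, copied from the breakdown above.
STAGES = {
    "STT (Whisper)": (0.5, 2.0),
    "LLM (Gemma 4B)": (1.0, 5.0),
    "TTS (Edge)": (0.3, 0.5),
}


def total_range(stages):
    """Sum lower and upper bounds, assuming strictly sequential stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 1), round(high, 1)


if __name__ == "__main__":
    low, high = total_range(STAGES)
    print(f"Total: {low}-{high}s")
```

The bounds add up to 1.8 s and 7.5 s, which matches the rounded 2-7s figure.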
Mac Mini Deployment
# On Mac Mini
cd ~/projects/manacore-monorepo/services/mana-voice-bot
./setup.sh
./start.sh
# Or with launchd (autostart)
# See scripts/mac-mini/setup-voice-bot.sh