mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-14 20:21:09 +02:00
🌐 feat: add i18n support to 6 web apps
Add internationalization (DE + EN) to previously missing apps:
- todo: task management translations
- skilltree: skill/XP system translations
- nutriphi: nutrition tracking translations
- planta: plant care translations
- questions: research app translations
- matrix: chat client translations (layout integration)
Each app includes:
- svelte-i18n setup with SSR support
- localStorage persistence ({app}_locale pattern)
- i18n loading state in +layout.svelte
- German (default) and English translations
Updated CONSISTENCY_REPORT.md to mark i18n task as complete.
Also includes:
- mana-tts service placeholder files
This commit is contained in:
parent a938ed86d4
commit 5a0815708c

35 changed files with 3440 additions and 56 deletions
services/mana-tts/CLAUDE.md (new file, 100 lines)
@@ -0,0 +1,100 @@
# CLAUDE.md - Mana TTS Service

## Service Overview

Text-to-Speech microservice using MLX-optimized models for Apple Silicon:

- **Port**: 3022
- **Framework**: Python + FastAPI
- **Models**: Kokoro-82M (fast), F5-TTS (voice cloning)

## Commands

```bash
# Setup
./setup.sh

# Development
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 3022 --reload

# Production (Mac Mini)
../../scripts/mac-mini/setup-tts.sh

# Test
curl http://localhost:3022/health
curl -X POST http://localhost:3022/synthesize/kokoro \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart"}' \
  --output test.wav
```

## File Structure

```
services/mana-tts/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI endpoints
│   ├── kokoro_service.py    # Kokoro TTS (preset voices)
│   ├── f5_service.py        # F5-TTS (voice cloning)
│   ├── voice_manager.py     # Custom voice registry
│   └── audio_utils.py       # Audio format conversion
├── voices/                  # Custom voice storage
├── mlx_models/              # Model cache
├── setup.sh                 # Setup script
├── requirements.txt
└── README.md
```

## API Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/health` | GET | Health check |
| `/models` | GET | Model info |
| `/voices` | GET | List all voices |
| `/voices` | POST | Register custom voice |
| `/voices/{id}` | DELETE | Delete custom voice |
| `/synthesize/kokoro` | POST | Kokoro synthesis |
| `/synthesize` | POST | F5-TTS voice cloning |
| `/synthesize/auto` | POST | Auto-select model |

## Models

### Kokoro-82M
- ~300 MB download
- 30+ preset voices
- Fast inference
- No reference audio needed

### F5-TTS
- ~6 GB download
- Voice cloning capability
- Requires reference audio + transcript
- Higher quality, slower

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `3022` | Service port |
| `PRELOAD_MODELS` | `false` | Load on startup |
| `MAX_TEXT_LENGTH` | `1000` | Max chars per request |
| `CORS_ORIGINS` | (production URLs) | CORS config |

## Development Notes

- Models load lazily on first request (unless `PRELOAD_MODELS=true`)
- Custom voices stored in `voices/` with reference audio + transcript
- Singleton pattern for model instances
- Audio returned as raw bytes with headers for metadata
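The "singleton pattern for model instances" noted above is a module-level cache keyed by model name, the same shape `kokoro_service.py` and `f5_service.py` use below. A minimal, dependency-free sketch, with `load_model` standing in for the real (expensive) MLX loaders:

```python
_model = None
_model_name = None


def load_model(name: str) -> dict:
    """Stand-in for the real model load (hypothetical, for illustration)."""
    return {"name": name}


def get_model(name: str = "default-model") -> dict:
    """Return the cached model, loading it only on first use or name change."""
    global _model, _model_name
    if _model is not None and _model_name == name:
        return _model  # cache hit: no reload
    _model = load_model(name)
    _model_name = name
    return _model


# Two calls with the same name return the identical instance.
a = get_model("kokoro")
b = get_model("kokoro")
```

Requesting a different model name evicts the cached instance, which matches the `_f5_model_name`/`_kokoro_model_name` checks in the services below.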
services/mana-tts/README.md (new file, 237 lines)
@@ -0,0 +1,237 @@
# Mana TTS

Text-to-Speech microservice with voice cloning support, optimized for Apple Silicon.

## Features

- **Kokoro TTS**: Fast preset voices (~300 MB model)
- **F5-TTS**: Voice cloning with reference audio (~6 GB model)
- **MLX Optimized**: Runs efficiently on Apple Silicon
- **REST API**: FastAPI with OpenAPI documentation

## Quick Start

### Setup

```bash
# Run setup script
./setup.sh

# Or manually
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Start Service

```bash
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 3022
```

### Test

```bash
# Health check
curl http://localhost:3022/health

# Synthesize with Kokoro
curl -X POST http://localhost:3022/synthesize/kokoro \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "af_heart"}' \
  --output test.wav

# Play audio (macOS)
afplay test.wav
```

## API Endpoints

### Health & Info

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/models` | GET | Available models |
| `/voices` | GET | All available voices |

### Synthesis

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/synthesize/kokoro` | POST | Kokoro preset voices |
| `/synthesize` | POST | F5-TTS voice cloning |
| `/synthesize/auto` | POST | Auto-select model |

### Voice Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/voices` | POST | Register custom voice |
| `/voices/{id}` | DELETE | Delete custom voice |

## Synthesis Examples

### Kokoro (Fast Preset Voices)

```bash
curl -X POST http://localhost:3022/synthesize/kokoro \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Welcome to Mana TTS, your personal voice synthesis service.",
    "voice": "af_heart",
    "speed": 1.0,
    "output_format": "wav"
  }' \
  --output output.wav
```

### F5-TTS (Voice Cloning)

```bash
# With reference audio upload
curl -X POST http://localhost:3022/synthesize \
  -F "text=Hello, this is a cloned voice speaking." \
  -F "reference_audio=@reference.wav" \
  -F "reference_text=This is what the reference audio says." \
  -F "output_format=wav" \
  --output cloned.wav

# With registered voice
curl -X POST http://localhost:3022/synthesize \
  -F "text=Hello from my registered voice." \
  -F "voice_id=my_custom_voice" \
  --output output.wav
```

### Auto-Select

```bash
# Uses Kokoro for preset voices, F5-TTS for custom
curl -X POST http://localhost:3022/synthesize/auto \
  -H "Content-Type: application/json" \
  -d '{"text": "Auto-selected synthesis", "voice": "af_bella"}' \
  --output output.wav
```

## Available Kokoro Voices

### American Female
- `af_heart` - Warm, emotional (default)
- `af_alloy` - Neutral, professional
- `af_bella` - Friendly, approachable
- `af_jessica` - Confident, clear
- `af_nicole` - Bright, energetic
- `af_nova` - Modern, dynamic
- `af_sarah` - Warm, conversational
- ... and more

### American Male
- `am_adam` - Deep, authoritative
- `am_echo` - Resonant, clear
- `am_eric` - Professional, neutral
- `am_michael` - Warm, trustworthy
- ... and more

### British Female
- `bf_alice` - Refined, elegant
- `bf_emma` - Clear, professional
- `bf_lily` - Soft, gentle

### British Male
- `bm_daniel` - Classic, authoritative
- `bm_fable` - Storyteller, expressive
- `bm_george` - Traditional, clear

## Voice Registration

Register a custom voice for F5-TTS voice cloning:

```bash
curl -X POST http://localhost:3022/voices \
  -F "voice_id=my_voice" \
  -F "name=My Custom Voice" \
  -F "description=A sample voice for testing" \
  -F "transcript=Hello, this is the text spoken in the reference audio." \
  -F "reference_audio=@my_reference.wav"
```

Pre-defined voices can also be placed in the `voices/` directory:

```
voices/
└── my_voice/
    ├── reference.wav     # Reference audio (required)
    ├── transcript.txt    # Transcript of reference (required)
    └── metadata.json     # Name and description (optional)
```
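The optional `metadata.json` is not spelled out here; a plausible minimal shape, mirroring the `name` and `description` fields the `/voices` endpoint accepts (the exact schema lives in `voice_manager.py`, so these field names are an assumption), would be:

```json
{
  "name": "My Custom Voice",
  "description": "A sample voice for testing"
}
```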
## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `3022` | API port |
| `PRELOAD_MODELS` | `false` | Load models on startup |
| `MAX_TEXT_LENGTH` | `1000` | Max characters per request |
| `CORS_ORIGINS` | `https://mana.how,...` | Allowed CORS origins |
| `F5_MODEL` | `lucasnewman/f5-tts-mlx` | F5-TTS model |
| `KOKORO_MODEL` | `mlx-community/Kokoro-82M-bf16` | Kokoro model |

## Mac Mini Deployment

```bash
# Install and start as launchd service
../../scripts/mac-mini/setup-tts.sh

# Service management
launchctl list | grep com.manacore.tts
launchctl unload ~/Library/LaunchAgents/com.manacore.tts.plist
launchctl load ~/Library/LaunchAgents/com.manacore.tts.plist

# View logs
tail -f /tmp/manacore-tts.log
```

## Requirements

- Python 3.10+
- macOS with Apple Silicon (recommended)
- ~7 GB disk space for models
- 16 GB RAM recommended
- ffmpeg (for MP3 output)

## Troubleshooting

### Models Not Loading

```bash
# Check MLX installation
python -c "import mlx; print(mlx.__version__)"

# Check mlx-audio
python -c "import mlx_audio; print('OK')"

# Check f5-tts-mlx
python -c "from f5_tts_mlx import F5TTS; print('OK')"
```

### MP3 Output Not Working

```bash
# Install ffmpeg
brew install ffmpeg

# Verify
ffmpeg -version
```

### Memory Issues

- Reduce `MAX_TEXT_LENGTH` for less memory usage
- Set `PRELOAD_MODELS=false` for lazy loading
- F5-TTS requires ~6 GB, Kokoro ~500 MB

## API Documentation

When running, visit http://localhost:3022/docs for interactive API documentation.
services/mana-tts/app/__init__.py (new file, empty)

services/mana-tts/app/audio_utils.py (new file, 224 lines)
@@ -0,0 +1,224 @@
"""
|
||||
Audio conversion utilities for the TTS service.
|
||||
Handles format conversion between WAV and MP3.
|
||||
"""
|
||||
|
||||
import io
|
||||
import logging
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import numpy as np
|
||||
import soundfile as sf
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Supported output formats
|
||||
SUPPORTED_FORMATS = ["wav", "mp3"]
|
||||
DEFAULT_FORMAT = "wav"
|
||||
DEFAULT_SAMPLE_RATE = 24000
|
||||
|
||||
|
||||
def audio_to_wav_bytes(
|
||||
audio_data: np.ndarray,
|
||||
sample_rate: int = DEFAULT_SAMPLE_RATE,
|
||||
) -> bytes:
|
||||
"""
|
||||
Convert numpy audio array to WAV bytes.
|
||||
|
||||
Args:
|
||||
audio_data: Audio samples as numpy array
|
||||
sample_rate: Sample rate in Hz
|
||||
|
||||
Returns:
|
||||
WAV file as bytes
|
||||
"""
|
||||
buffer = io.BytesIO()
|
||||
sf.write(buffer, audio_data, sample_rate, format="WAV")
|
||||
buffer.seek(0)
|
||||
return buffer.read()
|
||||
|
||||
|
||||
def audio_to_mp3_bytes(
|
||||
audio_data: np.ndarray,
|
||||
sample_rate: int = DEFAULT_SAMPLE_RATE,
|
||||
bitrate: str = "192k",
|
||||
) -> bytes:
|
||||
"""
|
||||
Convert numpy audio array to MP3 bytes.
|
||||
Requires ffmpeg to be installed.
|
||||
|
||||
Args:
|
||||
audio_data: Audio samples as numpy array
|
||||
sample_rate: Sample rate in Hz
|
||||
bitrate: MP3 bitrate (e.g., "128k", "192k", "320k")
|
||||
|
||||
Returns:
|
||||
MP3 file as bytes
|
||||
"""
|
||||
try:
|
||||
from pydub import AudioSegment
|
||||
except ImportError:
|
||||
logger.error("pydub not installed, falling back to WAV")
|
||||
return audio_to_wav_bytes(audio_data, sample_rate)
|
||||
|
||||
# First convert to WAV
|
||||
wav_bytes = audio_to_wav_bytes(audio_data, sample_rate)
|
||||
|
||||
# Then convert to MP3 using pydub
|
||||
try:
|
||||
audio_segment = AudioSegment.from_wav(io.BytesIO(wav_bytes))
|
||||
buffer = io.BytesIO()
|
||||
audio_segment.export(buffer, format="mp3", bitrate=bitrate)
|
||||
buffer.seek(0)
|
||||
return buffer.read()
|
||||
except Exception as e:
|
||||
logger.error(f"MP3 conversion failed: {e}, falling back to WAV")
|
||||
return wav_bytes
|
||||
|
||||
|
||||
def convert_audio(
|
||||
audio_data: np.ndarray,
|
||||
sample_rate: int = DEFAULT_SAMPLE_RATE,
|
||||
output_format: str = DEFAULT_FORMAT,
|
||||
) -> tuple[bytes, str]:
|
||||
"""
|
||||
Convert audio data to the specified format.
|
||||
|
||||
Args:
|
||||
audio_data: Audio samples as numpy array
|
||||
sample_rate: Sample rate in Hz
|
||||
output_format: Output format ("wav" or "mp3")
|
||||
|
||||
Returns:
|
||||
Tuple of (audio bytes, content type)
|
||||
"""
|
||||
output_format = output_format.lower()
|
||||
|
||||
if output_format not in SUPPORTED_FORMATS:
|
||||
logger.warning(f"Unsupported format '{output_format}', using WAV")
|
||||
output_format = "wav"
|
||||
|
||||
if output_format == "mp3":
|
||||
return audio_to_mp3_bytes(audio_data, sample_rate), "audio/mpeg"
|
||||
else:
|
||||
return audio_to_wav_bytes(audio_data, sample_rate), "audio/wav"
|
||||
|
||||
|
||||
def get_content_type(format: str) -> str:
|
||||
"""Get MIME content type for audio format."""
|
||||
content_types = {
|
||||
"wav": "audio/wav",
|
||||
"mp3": "audio/mpeg",
|
||||
}
|
||||
return content_types.get(format.lower(), "audio/wav")
|
||||
|
||||
|
||||
def load_reference_audio(
|
||||
file_path: str | Path,
|
||||
) -> tuple[np.ndarray, int]:
|
||||
"""
|
||||
Load reference audio file for voice cloning.
|
||||
|
||||
Args:
|
||||
file_path: Path to the audio file
|
||||
|
||||
Returns:
|
||||
Tuple of (audio data as numpy array, sample rate)
|
||||
"""
|
||||
audio_data, sample_rate = sf.read(file_path)
|
||||
|
||||
# Convert to mono if stereo
|
||||
if len(audio_data.shape) > 1:
|
||||
audio_data = np.mean(audio_data, axis=1)
|
||||
|
||||
return audio_data, sample_rate
|
||||
|
||||
|
||||
def resample_audio(
|
||||
audio_data: np.ndarray,
|
||||
original_sr: int,
|
||||
target_sr: int = DEFAULT_SAMPLE_RATE,
|
||||
) -> np.ndarray:
|
||||
"""
|
||||
Resample audio to target sample rate.
|
||||
|
||||
Args:
|
||||
audio_data: Audio samples as numpy array
|
||||
original_sr: Original sample rate
|
||||
target_sr: Target sample rate
|
||||
|
||||
Returns:
|
||||
Resampled audio data
|
||||
"""
|
||||
if original_sr == target_sr:
|
||||
return audio_data
|
||||
|
||||
from scipy import signal
|
||||
|
||||
# Calculate resampling ratio
|
||||
num_samples = int(len(audio_data) * target_sr / original_sr)
|
||||
resampled = signal.resample(audio_data, num_samples)
|
||||
|
||||
return resampled.astype(np.float32)
|
||||
|
||||
|
||||
def normalize_audio(
|
||||
audio_data: np.ndarray,
|
||||
target_db: float = -3.0,
|
||||
) -> np.ndarray:
|
||||
"""
|
||||
Normalize audio to target dB level.
|
||||
|
||||
Args:
|
||||
audio_data: Audio samples as numpy array
|
||||
target_db: Target peak level in dB
|
||||
|
||||
Returns:
|
||||
Normalized audio data
|
||||
"""
|
||||
# Calculate current peak
|
||||
peak = np.max(np.abs(audio_data))
|
||||
|
||||
if peak == 0:
|
||||
return audio_data
|
||||
|
||||
# Calculate target peak from dB
|
||||
target_peak = 10 ** (target_db / 20)
|
||||
|
||||
# Apply gain
|
||||
gain = target_peak / peak
|
||||
return audio_data * gain
|
||||
|
||||
|
||||
def save_temp_audio(
|
||||
audio_bytes: bytes,
|
||||
suffix: str = ".wav",
|
||||
) -> str:
|
||||
"""
|
||||
Save audio bytes to a temporary file.
|
||||
|
||||
Args:
|
||||
audio_bytes: Audio data as bytes
|
||||
suffix: File extension
|
||||
|
||||
Returns:
|
||||
Path to temporary file
|
||||
"""
|
||||
with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
|
||||
tmp.write(audio_bytes)
|
||||
return tmp.name
|
||||
|
||||
|
||||
def cleanup_temp_file(file_path: str) -> None:
|
||||
"""
|
||||
Clean up a temporary file.
|
||||
|
||||
Args:
|
||||
file_path: Path to the file to delete
|
||||
"""
|
||||
try:
|
||||
Path(file_path).unlink()
|
||||
except Exception:
|
||||
pass # Silent cleanup failure
|
||||
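As a quick check of the dB-to-gain arithmetic in `normalize_audio`: a target of -3 dB corresponds to a linear peak of 10^(-3/20) ≈ 0.708. A dependency-free sketch of the same gain computation (plain lists instead of numpy, for illustration only):

```python
def normalize_peak(samples: list[float], target_db: float = -3.0) -> list[float]:
    """Scale samples so the absolute peak sits at 10 ** (target_db / 20)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return samples  # silence: nothing to scale
    target_peak = 10 ** (target_db / 20)  # -3 dB -> ~0.708 linear
    gain = target_peak / peak
    return [s * gain for s in samples]


# Input peak 0.5 is scaled up to ~0.708; relative shape is preserved.
out = normalize_peak([0.1, -0.5, 0.25])
```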
services/mana-tts/app/f5_service.py (new file, 208 lines)
@@ -0,0 +1,208 @@
"""
|
||||
F5-TTS Service for voice cloning synthesis.
|
||||
Uses f5-tts-mlx optimized for Apple Silicon.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import os
|
||||
import tempfile
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import numpy as np
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Global singleton for lazy initialization
|
||||
_f5_model = None
|
||||
_f5_model_name = None
|
||||
|
||||
# Default model
|
||||
DEFAULT_F5_MODEL = os.getenv("F5_MODEL", "lucasnewman/f5-tts-mlx")
|
||||
|
||||
# Default generation parameters
|
||||
DEFAULT_DURATION = 10.0 # seconds
|
||||
DEFAULT_STEPS = 32
|
||||
DEFAULT_CFG_STRENGTH = 2.0
|
||||
DEFAULT_SWAY_COEF = -1.0
|
||||
DEFAULT_SPEED = 1.0
|
||||
|
||||
|
||||
@dataclass
|
||||
class F5Result:
|
||||
"""Result from F5-TTS synthesis."""
|
||||
|
||||
audio: np.ndarray
|
||||
sample_rate: int
|
||||
duration: float
|
||||
voice_id: Optional[str] = None
|
||||
|
||||
|
||||
def get_f5_model(model_name: str = DEFAULT_F5_MODEL):
|
||||
"""
|
||||
Get or create F5-TTS model instance (singleton pattern).
|
||||
|
||||
Args:
|
||||
model_name: HuggingFace model identifier
|
||||
|
||||
Returns:
|
||||
F5TTS model instance
|
||||
"""
|
||||
global _f5_model, _f5_model_name
|
||||
|
||||
# Return existing model if same model name
|
||||
if _f5_model is not None and _f5_model_name == model_name:
|
||||
return _f5_model
|
||||
|
||||
logger.info(f"Loading F5-TTS model: {model_name}")
|
||||
|
||||
try:
|
||||
from f5_tts_mlx import F5TTS
|
||||
|
||||
_f5_model = F5TTS(model_name=model_name)
|
||||
_f5_model_name = model_name
|
||||
logger.info("F5-TTS model loaded successfully")
|
||||
return _f5_model
|
||||
|
||||
except ImportError as e:
|
||||
logger.error(f"Failed to import f5_tts_mlx: {e}")
|
||||
raise RuntimeError(
|
||||
"f5-tts-mlx not installed. Run: pip install f5-tts-mlx"
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load F5-TTS model: {e}")
|
||||
raise
|
||||
|
||||
|
||||
def is_f5_loaded() -> bool:
|
||||
"""Check if F5-TTS model is currently loaded."""
|
||||
return _f5_model is not None
|
||||
|
||||
|
||||
async def synthesize_f5(
|
||||
text: str,
|
||||
reference_audio_path: str,
|
||||
reference_text: str,
|
||||
duration: Optional[float] = None,
|
||||
steps: int = DEFAULT_STEPS,
|
||||
cfg_strength: float = DEFAULT_CFG_STRENGTH,
|
||||
sway_coef: float = DEFAULT_SWAY_COEF,
|
||||
speed: float = DEFAULT_SPEED,
|
||||
model_name: str = DEFAULT_F5_MODEL,
|
||||
) -> F5Result:
|
||||
"""
|
||||
Synthesize speech using F5-TTS with voice cloning.
|
||||
|
||||
Args:
|
||||
text: Text to synthesize
|
||||
reference_audio_path: Path to reference audio file
|
||||
reference_text: Transcript of the reference audio
|
||||
duration: Target duration in seconds (auto-calculated if None)
|
||||
steps: Number of diffusion steps
|
||||
cfg_strength: Classifier-free guidance strength
|
||||
sway_coef: Sway sampling coefficient
|
||||
speed: Speech speed multiplier
|
||||
model_name: HuggingFace model identifier
|
||||
|
||||
Returns:
|
||||
F5Result with audio data
|
||||
"""
|
||||
# Get model
|
||||
model = get_f5_model(model_name)
|
||||
|
||||
logger.info(
|
||||
f"Synthesizing with F5-TTS: text_length={len(text)}, "
|
||||
f"ref_audio={reference_audio_path}, steps={steps}"
|
||||
)
|
||||
|
||||
try:
|
||||
# Generate audio
|
||||
audio, sample_rate = model.generate(
|
||||
text=text,
|
||||
ref_audio_path=reference_audio_path,
|
||||
ref_audio_text=reference_text,
|
||||
duration=duration,
|
||||
steps=steps,
|
||||
cfg_strength=cfg_strength,
|
||||
sway_coef=sway_coef,
|
||||
speed=speed,
|
||||
)
|
||||
|
||||
# Calculate duration
|
||||
audio_duration = len(audio) / sample_rate
|
||||
|
||||
logger.info(f"F5-TTS synthesis complete: duration={audio_duration:.2f}s")
|
||||
|
||||
return F5Result(
|
||||
audio=audio,
|
||||
sample_rate=sample_rate,
|
||||
duration=audio_duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"F5-TTS synthesis failed: {e}")
|
||||
raise RuntimeError(f"Voice cloning synthesis failed: {e}")
|
||||
|
||||
|
||||
async def synthesize_f5_from_bytes(
|
||||
text: str,
|
||||
reference_audio_bytes: bytes,
|
||||
reference_text: str,
|
||||
audio_extension: str = ".wav",
|
||||
**kwargs,
|
||||
) -> F5Result:
|
||||
"""
|
||||
Synthesize speech using F5-TTS with reference audio as bytes.
|
||||
|
||||
Args:
|
||||
text: Text to synthesize
|
||||
reference_audio_bytes: Reference audio as bytes
|
||||
reference_text: Transcript of the reference audio
|
||||
audio_extension: File extension for temp file
|
||||
**kwargs: Additional arguments passed to synthesize_f5
|
||||
|
||||
Returns:
|
||||
F5Result with audio data
|
||||
"""
|
||||
# Save reference audio to temp file
|
||||
with tempfile.NamedTemporaryFile(
|
||||
suffix=audio_extension,
|
||||
delete=False,
|
||||
) as tmp:
|
||||
tmp.write(reference_audio_bytes)
|
||||
tmp_path = tmp.name
|
||||
|
||||
try:
|
||||
result = await synthesize_f5(
|
||||
text=text,
|
||||
reference_audio_path=tmp_path,
|
||||
reference_text=reference_text,
|
||||
**kwargs,
|
||||
)
|
||||
return result
|
||||
finally:
|
||||
# Clean up temp file
|
||||
try:
|
||||
Path(tmp_path).unlink()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
def estimate_duration(text: str, speed: float = 1.0) -> float:
|
||||
"""
|
||||
Estimate audio duration from text.
|
||||
|
||||
Args:
|
||||
text: Text to synthesize
|
||||
speed: Speech speed multiplier
|
||||
|
||||
Returns:
|
||||
Estimated duration in seconds
|
||||
"""
|
||||
# Rough estimate: ~150 words per minute at normal speed
|
||||
# Average word length: ~5 characters
|
||||
words = len(text) / 5
|
||||
minutes = words / 150
|
||||
seconds = minutes * 60
|
||||
return seconds / speed
|
||||
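The 150-wpm heuristic in `estimate_duration` works out to 750 characters per minute of speech (150 words x 5 chars), i.e. 12.5 characters per second. A standalone check of that arithmetic:

```python
def estimate_duration(text: str, speed: float = 1.0) -> float:
    """Same heuristic as above: ~5 chars per word, ~150 words per minute."""
    words = len(text) / 5
    return (words / 150) * 60 / speed


# 750 characters -> 150 words -> one minute at normal speed.
one_minute = estimate_duration("x" * 750)
# Doubling the speed halves the estimate.
half_minute = estimate_duration("x" * 750, speed=2.0)
```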
services/mana-tts/app/kokoro_service.py (new file, 187 lines)
@@ -0,0 +1,187 @@
"""
|
||||
Kokoro TTS Service for fast preset voice synthesis.
|
||||
Uses mlx-audio's Kokoro implementation optimized for Apple Silicon.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
import numpy as np
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Global singleton for lazy initialization
|
||||
_kokoro_model = None
|
||||
_kokoro_model_name = None
|
||||
|
||||
# Default model
|
||||
DEFAULT_KOKORO_MODEL = "mlx-community/Kokoro-82M-bf16"
|
||||
|
||||
# Available Kokoro voices (American Female/Male, British Female/Male)
|
||||
KOKORO_VOICES = {
|
||||
# American Female voices
|
||||
"af_heart": "American Female - Heart (warm, emotional)",
|
||||
"af_alloy": "American Female - Alloy (neutral, professional)",
|
||||
"af_aoede": "American Female - Aoede (clear, articulate)",
|
||||
"af_bella": "American Female - Bella (friendly, approachable)",
|
||||
"af_jessica": "American Female - Jessica (confident, clear)",
|
||||
"af_kore": "American Female - Kore (calm, measured)",
|
||||
"af_nicole": "American Female - Nicole (bright, energetic)",
|
||||
"af_nova": "American Female - Nova (modern, dynamic)",
|
||||
"af_river": "American Female - River (smooth, flowing)",
|
||||
"af_sarah": "American Female - Sarah (warm, conversational)",
|
||||
"af_sky": "American Female - Sky (light, airy)",
|
||||
# American Male voices
|
||||
"am_adam": "American Male - Adam (deep, authoritative)",
|
||||
"am_echo": "American Male - Echo (resonant, clear)",
|
||||
"am_eric": "American Male - Eric (professional, neutral)",
|
||||
"am_fenrir": "American Male - Fenrir (strong, commanding)",
|
||||
"am_liam": "American Male - Liam (friendly, casual)",
|
||||
"am_michael": "American Male - Michael (warm, trustworthy)",
|
||||
"am_onyx": "American Male - Onyx (deep, smooth)",
|
||||
"am_puck": "American Male - Puck (playful, light)",
|
||||
# British Female voices
|
||||
"bf_alice": "British Female - Alice (refined, elegant)",
|
||||
"bf_emma": "British Female - Emma (clear, professional)",
|
||||
"bf_isabella": "British Female - Isabella (sophisticated, warm)",
|
||||
"bf_lily": "British Female - Lily (soft, gentle)",
|
||||
# British Male voices
|
||||
"bm_daniel": "British Male - Daniel (classic, authoritative)",
|
||||
"bm_fable": "British Male - Fable (storyteller, expressive)",
|
||||
"bm_george": "British Male - George (traditional, clear)",
|
||||
"bm_lewis": "British Male - Lewis (modern, approachable)",
|
||||
}
|
||||
|
||||
DEFAULT_VOICE = "af_heart"
|
||||
|
||||
|
||||
@dataclass
|
||||
class KokoroResult:
|
||||
"""Result from Kokoro TTS synthesis."""
|
||||
|
||||
audio: np.ndarray
|
||||
sample_rate: int
|
||||
voice: str
|
||||
duration: float
|
||||
|
||||
|
||||
def get_kokoro_model(model_name: str = DEFAULT_KOKORO_MODEL):
|
||||
"""
|
||||
Get or create Kokoro model instance (singleton pattern).
|
||||
|
||||
Args:
|
||||
model_name: HuggingFace model identifier
|
||||
|
||||
Returns:
|
||||
Kokoro model instance
|
||||
"""
|
||||
global _kokoro_model, _kokoro_model_name
|
||||
|
||||
# Return existing model if same model name
|
||||
if _kokoro_model is not None and _kokoro_model_name == model_name:
|
||||
return _kokoro_model
|
||||
|
||||
logger.info(f"Loading Kokoro model: {model_name}")
|
||||
|
||||
try:
|
||||
from mlx_audio.tts import load
|
||||
|
||||
_kokoro_model = load(model_name)
|
||||
_kokoro_model_name = model_name
|
||||
logger.info("Kokoro model loaded successfully")
|
||||
return _kokoro_model
|
||||
|
||||
except ImportError as e:
|
||||
logger.error(f"Failed to import mlx_audio: {e}")
|
||||
raise RuntimeError(
|
||||
"mlx-audio not installed. Run: pip install mlx-audio"
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load Kokoro model: {e}")
|
||||
raise
|
||||
|
||||
|
||||
def is_kokoro_loaded() -> bool:
|
||||
"""Check if Kokoro model is currently loaded."""
|
||||
return _kokoro_model is not None
|
||||
|
||||
|
||||
def get_available_voices() -> dict[str, str]:
|
||||
"""Get dictionary of available Kokoro voices."""
|
||||
return KOKORO_VOICES.copy()
|
||||
|
||||
|
||||
async def synthesize_kokoro(
|
||||
text: str,
|
||||
voice: str = DEFAULT_VOICE,
|
||||
speed: float = 1.0,
|
||||
model_name: str = DEFAULT_KOKORO_MODEL,
|
||||
) -> KokoroResult:
|
||||
"""
|
||||
Synthesize speech using Kokoro TTS.
|
||||
|
||||
Args:
|
||||
text: Text to synthesize
|
||||
voice: Voice ID from KOKORO_VOICES
|
||||
speed: Speech speed multiplier (0.5-2.0)
|
||||
model_name: HuggingFace model identifier
|
||||
|
||||
Returns:
|
||||
KokoroResult with audio data
|
||||
"""
|
||||
# Validate voice
|
||||
if voice not in KOKORO_VOICES:
|
||||
logger.warning(f"Unknown voice '{voice}', using default '{DEFAULT_VOICE}'")
|
||||
voice = DEFAULT_VOICE
|
||||
|
||||
# Clamp speed to valid range
|
||||
speed = max(0.5, min(2.0, speed))
|
||||
|
||||
# Get model
|
||||
model = get_kokoro_model(model_name)
|
||||
|
||||
logger.info(f"Synthesizing with Kokoro: voice={voice}, speed={speed}, text_length={len(text)}")
|
||||
|
||||
try:
|
||||
# Generate audio using mlx-audio's generate method
|
||||
# Returns a generator of GenerationResult objects
|
||||
result_gen = model.generate(
|
||||
text=text,
|
||||
voice=voice,
|
||||
speed=speed,
|
||||
)
|
||||
|
||||
# Collect all audio chunks from the generator
|
||||
audio_chunks = []
|
||||
sample_rate = 24000 # Default, will be updated from result
|
||||
|
||||
for result in result_gen:
|
||||
# Each result has audio, sample_rate, audio_duration (string)
|
||||
sample_rate = result.sample_rate
|
||||
|
||||
# Convert MLX array to numpy
|
||||
audio_np = np.array(result.audio, dtype=np.float32)
|
||||
audio_chunks.append(audio_np)
|
||||
|
||||
# Concatenate all chunks
|
||||
if audio_chunks:
|
||||
full_audio = np.concatenate(audio_chunks)
|
||||
else:
|
||||
raise RuntimeError("No audio generated")
|
||||
|
||||
# Calculate duration from audio length
|
||||
total_duration = len(full_audio) / sample_rate
|
||||
|
||||
logger.info(f"Kokoro synthesis complete: duration={total_duration:.2f}s")
|
||||
|
||||
return KokoroResult(
|
||||
audio=full_audio,
|
||||
sample_rate=sample_rate,
|
||||
voice=voice,
|
||||
duration=total_duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Kokoro synthesis failed: {e}")
|
||||
raise RuntimeError(f"TTS synthesis failed: {e}")
|
||||
services/mana-tts/app/main.py (new file, 625 lines)
@@ -0,0 +1,625 @@
"""
|
||||
Mana TTS - Text-to-Speech Microservice
|
||||
|
||||
Provides TTS synthesis using:
|
||||
- Kokoro: Fast preset voices
|
||||
- F5-TTS: Voice cloning with reference audio
|
||||
|
||||
Optimized for Apple Silicon (MLX).
|
||||
"""
|
||||
|
||||
import logging
|
||||
import os
|
||||
from contextlib import asynccontextmanager
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from fastapi import FastAPI, HTTPException, UploadFile, File, Form, Response
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
from .audio_utils import convert_audio, SUPPORTED_FORMATS, cleanup_temp_file, save_temp_audio
|
||||
from .kokoro_service import (
|
||||
synthesize_kokoro,
|
||||
get_kokoro_model,
|
||||
is_kokoro_loaded,
|
||||
KOKORO_VOICES,
|
||||
DEFAULT_VOICE as DEFAULT_KOKORO_VOICE,
|
||||
DEFAULT_KOKORO_MODEL,
|
||||
)
|
||||
from .f5_service import (
|
||||
synthesize_f5,
|
||||
synthesize_f5_from_bytes,
|
||||
get_f5_model,
|
||||
is_f5_loaded,
|
||||
DEFAULT_F5_MODEL,
|
||||
)
|
||||
from .voice_manager import get_voice_manager, CustomVoice
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Configuration from environment
|
||||
PORT = int(os.getenv("PORT", "3022"))
|
||||
PRELOAD_MODELS = os.getenv("PRELOAD_MODELS", "false").lower() == "true"
|
||||
MAX_TEXT_LENGTH = int(os.getenv("MAX_TEXT_LENGTH", "1000"))
|
||||
CORS_ORIGINS = os.getenv(
|
||||
"CORS_ORIGINS",
|
||||
"https://mana.how,https://chat.mana.how,https://todo.mana.how,http://localhost:5173",
|
||||
).split(",")
|
||||
|
||||
# Supported audio extensions for uploads
|
||||
SUPPORTED_AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def lifespan(app: FastAPI):
|
||||
"""Application lifespan manager for startup/shutdown."""
|
||||
logger.info(f"Starting Mana TTS service on port {PORT}")
|
||||
|
||||
# Initialize voice manager (scans voices directory)
|
||||
voice_manager = get_voice_manager()
|
||||
logger.info(f"Voice manager initialized with {len(voice_manager.list_voices())} custom voices")
|
||||
|
||||
if PRELOAD_MODELS:
|
||||
logger.info("Pre-loading models (PRELOAD_MODELS=true)...")
|
||||
try:
|
||||
get_kokoro_model()
|
||||
logger.info("Kokoro model pre-loaded")
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to pre-load Kokoro: {e}")
|
||||
|
||||
try:
|
||||
get_f5_model()
|
||||
logger.info("F5-TTS model pre-loaded")
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to pre-load F5-TTS: {e}")
|
||||
else:
|
||||
logger.info("Models will be loaded on first request (lazy loading)")
|
||||
|
||||
yield
|
||||
|
||||
logger.info("Shutting down Mana TTS service")
|
||||
|
||||
|
||||
# Create FastAPI app
|
||||
app = FastAPI(
|
||||
title="Mana TTS",
|
||||
description="Text-to-Speech service with voice cloning support",
|
||||
version="1.0.0",
|
||||
lifespan=lifespan,
|
||||
)
|
||||
|
||||
# CORS middleware
|
||||
app.add_middleware(
|
||||
CORSMiddleware,
|
||||
allow_origins=CORS_ORIGINS,
|
||||
allow_credentials=True,
|
||||
allow_methods=["*"],
|
||||
allow_headers=["*"],
|
||||
)
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Request/Response Models
|
||||
# ============================================================================
|
||||
|
||||
|
||||
class KokoroRequest(BaseModel):
|
||||
"""Request for Kokoro TTS synthesis."""
|
||||
|
||||
text: str = Field(..., description="Text to synthesize", max_length=5000)
|
||||
voice: str = Field(DEFAULT_KOKORO_VOICE, description="Voice ID")
|
||||
speed: float = Field(1.0, ge=0.5, le=2.0, description="Speech speed")
|
||||
output_format: str = Field("wav", description="Output format (wav, mp3)")
|
||||
|
||||
|
||||
class AutoRequest(BaseModel):
|
||||
"""Request for auto-selection TTS synthesis."""
|
||||
|
||||
text: str = Field(..., description="Text to synthesize", max_length=5000)
|
||||
voice: Optional[str] = Field(None, description="Voice ID (Kokoro preset or registered)")
|
||||
speed: float = Field(1.0, ge=0.5, le=2.0, description="Speech speed")
|
||||
output_format: str = Field("wav", description="Output format (wav, mp3)")
|
||||
|
||||
|
||||
class RegisterVoiceRequest(BaseModel):
|
||||
"""Request to register a new custom voice."""
|
||||
|
||||
voice_id: str = Field(..., description="Unique voice identifier", min_length=2, max_length=50)
|
||||
name: str = Field(..., description="Display name")
|
||||
description: str = Field("", description="Voice description")
|
||||
transcript: str = Field(..., description="Transcript of the reference audio")
|
||||
|
||||
|
||||
class HealthResponse(BaseModel):
|
||||
"""Health check response."""
|
||||
|
||||
status: str
|
||||
service: str
|
||||
models_loaded: dict
|
||||
|
||||
|
||||
class ModelsResponse(BaseModel):
|
||||
"""Available models response."""
|
||||
|
||||
kokoro: dict
|
||||
f5: dict
|
||||
|
||||
|
||||
class VoiceInfo(BaseModel):
|
||||
"""Voice information."""
|
||||
|
||||
id: str
|
||||
name: str
|
||||
description: str
|
||||
type: str # "kokoro" or "f5_custom"
|
||||
|
||||
|
||||
class VoicesResponse(BaseModel):
|
||||
"""Available voices response."""
|
||||
|
||||
kokoro_voices: list[VoiceInfo]
|
||||
custom_voices: list[VoiceInfo]
|
||||
|
||||
|
||||
class VoiceRegisteredResponse(BaseModel):
|
||||
"""Response after registering a voice."""
|
||||
|
||||
voice_id: str
|
||||
message: str
|
||||
|
||||
|
||||
class VoiceDeletedResponse(BaseModel):
|
||||
"""Response after deleting a voice."""
|
||||
|
||||
voice_id: str
|
||||
message: str
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Health & Info Endpoints
|
||||
# ============================================================================
|
||||
|
||||
|
||||
@app.get("/health", response_model=HealthResponse)
|
||||
async def health_check():
|
||||
"""Check service health and model status."""
|
||||
return HealthResponse(
|
||||
status="healthy",
|
||||
service="mana-tts",
|
||||
models_loaded={
|
||||
"kokoro": is_kokoro_loaded(),
|
||||
"f5": is_f5_loaded(),
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
@app.get("/models", response_model=ModelsResponse)
|
||||
async def get_models():
|
||||
"""Get information about available models."""
|
||||
return ModelsResponse(
|
||||
kokoro={
|
||||
"name": "Kokoro-82M",
|
||||
"description": "Fast TTS with preset voices",
|
||||
"model_id": DEFAULT_KOKORO_MODEL,
|
||||
"loaded": is_kokoro_loaded(),
|
||||
"voice_count": len(KOKORO_VOICES),
|
||||
},
|
||||
f5={
|
||||
"name": "F5-TTS",
|
||||
"description": "Voice cloning with reference audio",
|
||||
"model_id": DEFAULT_F5_MODEL,
|
||||
"loaded": is_f5_loaded(),
|
||||
"supports_cloning": True,
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Voice Management Endpoints
|
||||
# ============================================================================
|
||||
|
||||
|
||||
@app.get("/voices", response_model=VoicesResponse)
|
||||
async def get_voices():
|
||||
"""Get all available voices."""
|
||||
# Kokoro preset voices
|
||||
kokoro_voices = [
|
||||
VoiceInfo(
|
||||
id=voice_id,
|
||||
name=voice_id,
|
||||
description=description,
|
||||
type="kokoro",
|
||||
)
|
||||
for voice_id, description in KOKORO_VOICES.items()
|
||||
]
|
||||
|
||||
# Custom voices from voice manager
|
||||
voice_manager = get_voice_manager()
|
||||
custom_voices = [
|
||||
VoiceInfo(
|
||||
id=voice.id,
|
||||
name=voice.name,
|
||||
description=voice.description,
|
||||
type="f5_custom",
|
||||
)
|
||||
for voice in voice_manager.list_voices()
|
||||
]
|
||||
|
||||
return VoicesResponse(
|
||||
kokoro_voices=kokoro_voices,
|
||||
custom_voices=custom_voices,
|
||||
)
|
||||
|
||||
|
||||
@app.post("/voices", response_model=VoiceRegisteredResponse)
|
||||
async def register_voice(
|
||||
voice_id: str = Form(..., description="Unique voice identifier"),
|
||||
name: str = Form(..., description="Display name"),
|
||||
description: str = Form("", description="Voice description"),
|
||||
transcript: str = Form(..., description="Transcript of the reference audio"),
|
||||
reference_audio: UploadFile = File(..., description="Reference audio file"),
|
||||
):
|
||||
"""
|
||||
Register a new custom voice for F5-TTS voice cloning.
|
||||
|
||||
Requires:
|
||||
- Reference audio file (WAV, MP3, M4A, FLAC, OGG)
|
||||
- Transcript of what is said in the audio
|
||||
"""
|
||||
# Validate file extension
|
||||
if reference_audio.filename:
|
||||
ext = Path(reference_audio.filename).suffix.lower()
|
||||
if ext not in SUPPORTED_AUDIO_EXTENSIONS:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Unsupported audio format. Use one of: {SUPPORTED_AUDIO_EXTENSIONS}",
|
||||
)
|
||||
else:
|
||||
ext = ".wav"
|
||||
|
||||
# Read audio bytes
|
||||
audio_bytes = await reference_audio.read()
|
||||
|
||||
if len(audio_bytes) == 0:
|
||||
raise HTTPException(status_code=400, detail="Audio file is empty")
|
||||
|
||||
if len(audio_bytes) > 50 * 1024 * 1024: # 50 MB limit
|
||||
raise HTTPException(status_code=400, detail="Audio file too large (max 50 MB)")
|
||||
|
||||
# Register voice
|
||||
voice_manager = get_voice_manager()
|
||||
try:
|
||||
voice_manager.register_voice(
|
||||
voice_id=voice_id,
|
||||
name=name,
|
||||
description=description,
|
||||
audio_bytes=audio_bytes,
|
||||
transcript=transcript,
|
||||
audio_extension=ext,
|
||||
)
|
||||
except ValueError as e:
|
||||
raise HTTPException(status_code=400, detail=str(e))
|
||||
|
||||
return VoiceRegisteredResponse(
|
||||
voice_id=voice_id,
|
||||
message=f"Voice '{voice_id}' registered successfully",
|
||||
)
|
||||
|
||||
|
||||
@app.delete("/voices/{voice_id}", response_model=VoiceDeletedResponse)
|
||||
async def delete_voice(voice_id: str):
|
||||
"""Delete a registered custom voice."""
|
||||
voice_manager = get_voice_manager()
|
||||
|
||||
if not voice_manager.delete_voice(voice_id):
|
||||
raise HTTPException(status_code=404, detail=f"Voice '{voice_id}' not found")
|
||||
|
||||
return VoiceDeletedResponse(
|
||||
voice_id=voice_id,
|
||||
message=f"Voice '{voice_id}' deleted successfully",
|
||||
)
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Kokoro TTS Endpoint
|
||||
# ============================================================================
|
||||
|
||||
|
||||
@app.post("/synthesize/kokoro")
|
||||
async def synthesize_with_kokoro(request: KokoroRequest):
|
||||
"""
|
||||
Synthesize speech using Kokoro with preset voices.
|
||||
|
||||
Fast synthesis with high-quality preset voices.
|
||||
"""
|
||||
# Validate text length
|
||||
if len(request.text) > MAX_TEXT_LENGTH:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Text exceeds maximum length of {MAX_TEXT_LENGTH} characters",
|
||||
)
|
||||
|
||||
if not request.text.strip():
|
||||
raise HTTPException(status_code=400, detail="Text cannot be empty")
|
||||
|
||||
# Validate output format
|
||||
output_format = request.output_format.lower()
|
||||
if output_format not in SUPPORTED_FORMATS:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Unsupported format. Use one of: {SUPPORTED_FORMATS}",
|
||||
)
|
||||
|
||||
try:
|
||||
# Synthesize
|
||||
result = await synthesize_kokoro(
|
||||
text=request.text,
|
||||
voice=request.voice,
|
||||
speed=request.speed,
|
||||
)
|
||||
|
||||
# Convert to requested format
|
||||
audio_bytes, content_type = convert_audio(
|
||||
result.audio,
|
||||
result.sample_rate,
|
||||
output_format,
|
||||
)
|
||||
|
||||
# Return audio response
|
||||
return Response(
|
||||
content=audio_bytes,
|
||||
media_type=content_type,
|
||||
headers={
|
||||
"X-Voice": result.voice,
|
||||
"X-Duration": str(result.duration),
|
||||
"X-Sample-Rate": str(result.sample_rate),
|
||||
},
|
||||
)
|
||||
|
||||
except RuntimeError as e:
|
||||
raise HTTPException(status_code=500, detail=str(e))
|
||||
except Exception as e:
|
||||
logger.error(f"Kokoro synthesis error: {e}")
|
||||
raise HTTPException(status_code=500, detail=f"Synthesis failed: {e}")
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# F5-TTS Endpoint
|
||||
# ============================================================================
|
||||
|
||||
|
||||
@app.post("/synthesize")
|
||||
async def synthesize_with_f5(
|
||||
text: str = Form(..., description="Text to synthesize"),
|
||||
voice_id: Optional[str] = Form(None, description="Registered voice ID"),
|
||||
reference_audio: Optional[UploadFile] = File(None, description="Reference audio for cloning"),
|
||||
reference_text: Optional[str] = Form(None, description="Transcript of reference audio"),
|
||||
output_format: str = Form("wav", description="Output format (wav, mp3)"),
|
||||
speed: float = Form(1.0, ge=0.5, le=2.0, description="Speech speed"),
|
||||
steps: int = Form(32, ge=8, le=64, description="Diffusion steps"),
|
||||
):
|
||||
"""
|
||||
Synthesize speech using F5-TTS with voice cloning.
|
||||
|
||||
Provide either:
|
||||
- voice_id: Use a pre-registered voice
|
||||
- reference_audio + reference_text: Clone voice from audio sample
|
||||
"""
|
||||
# Validate text
|
||||
if len(text) > MAX_TEXT_LENGTH:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Text exceeds maximum length of {MAX_TEXT_LENGTH} characters",
|
||||
)
|
||||
|
||||
if not text.strip():
|
||||
raise HTTPException(status_code=400, detail="Text cannot be empty")
|
||||
|
||||
# Validate output format
|
||||
output_format = output_format.lower()
|
||||
if output_format not in SUPPORTED_FORMATS:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Unsupported format. Use one of: {SUPPORTED_FORMATS}",
|
||||
)
|
||||
|
||||
voice_manager = get_voice_manager()
|
||||
ref_audio_path: Optional[str] = None
|
||||
ref_text: Optional[str] = None
|
||||
temp_file_path: Optional[str] = None
|
||||
|
||||
try:
|
||||
# Option 1: Use registered voice
|
||||
if voice_id:
|
||||
voice = voice_manager.get_voice(voice_id)
|
||||
if not voice:
|
||||
raise HTTPException(
|
||||
status_code=404,
|
||||
detail=f"Voice '{voice_id}' not found. Register it first or provide reference audio.",
|
||||
)
|
||||
ref_audio_path = voice.audio_path
|
||||
ref_text = voice.transcript
|
||||
|
||||
# Option 2: Use uploaded reference audio
|
||||
elif reference_audio and reference_text:
|
||||
# Get file extension
|
||||
ext = ".wav"
|
||||
if reference_audio.filename:
|
||||
ext = Path(reference_audio.filename).suffix.lower()
|
||||
if ext not in SUPPORTED_AUDIO_EXTENSIONS:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Unsupported audio format. Use one of: {SUPPORTED_AUDIO_EXTENSIONS}",
|
||||
)
|
||||
|
||||
# Read and save to temp file
|
||||
audio_bytes = await reference_audio.read()
|
||||
if len(audio_bytes) == 0:
|
||||
raise HTTPException(status_code=400, detail="Reference audio is empty")
|
||||
|
||||
temp_file_path = save_temp_audio(audio_bytes, suffix=ext)
|
||||
ref_audio_path = temp_file_path
|
||||
ref_text = reference_text
|
||||
|
||||
else:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail="Provide either voice_id or reference_audio + reference_text",
|
||||
)
|
||||
|
||||
# Synthesize with F5-TTS
|
||||
result = await synthesize_f5(
|
||||
text=text,
|
||||
reference_audio_path=ref_audio_path,
|
||||
reference_text=ref_text,
|
||||
speed=speed,
|
||||
steps=steps,
|
||||
)
|
||||
|
||||
# Convert to requested format
|
||||
audio_bytes, content_type = convert_audio(
|
||||
result.audio,
|
||||
result.sample_rate,
|
||||
output_format,
|
||||
)
|
||||
|
||||
# Return audio response
|
||||
return Response(
|
||||
content=audio_bytes,
|
||||
media_type=content_type,
|
||||
headers={
|
||||
"X-Model": "f5-tts",
|
||||
"X-Voice-ID": voice_id or "custom",
|
||||
"X-Duration": str(result.duration),
|
||||
"X-Sample-Rate": str(result.sample_rate),
|
||||
},
|
||||
)
|
||||
|
||||
except HTTPException:
|
||||
raise
|
||||
except RuntimeError as e:
|
||||
raise HTTPException(status_code=500, detail=str(e))
|
||||
except Exception as e:
|
||||
logger.error(f"F5-TTS synthesis error: {e}")
|
||||
raise HTTPException(status_code=500, detail=f"Voice cloning synthesis failed: {e}")
|
||||
finally:
|
||||
# Clean up temp file
|
||||
if temp_file_path:
|
||||
cleanup_temp_file(temp_file_path)
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Auto-Selection Endpoint
|
||||
# ============================================================================
|
||||
|
||||
|
||||
@app.post("/synthesize/auto")
|
||||
async def synthesize_auto(request: AutoRequest):
|
||||
"""
|
||||
Auto-select the best TTS model based on voice parameter.
|
||||
|
||||
- If voice is a Kokoro preset: Use Kokoro
|
||||
- If voice is a registered custom voice: Use F5-TTS
|
||||
- If no voice specified: Use Kokoro with default voice
|
||||
"""
|
||||
# Validate text
|
||||
if len(request.text) > MAX_TEXT_LENGTH:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Text exceeds maximum length of {MAX_TEXT_LENGTH} characters",
|
||||
)
|
||||
|
||||
if not request.text.strip():
|
||||
raise HTTPException(status_code=400, detail="Text cannot be empty")
|
||||
|
||||
# Determine which model to use
|
||||
voice = request.voice or DEFAULT_KOKORO_VOICE
|
||||
|
||||
# Check if it's a Kokoro voice
|
||||
if voice in KOKORO_VOICES:
|
||||
kokoro_request = KokoroRequest(
|
||||
text=request.text,
|
||||
voice=voice,
|
||||
speed=request.speed,
|
||||
output_format=request.output_format,
|
||||
)
|
||||
return await synthesize_with_kokoro(kokoro_request)
|
||||
|
||||
# Check if it's a registered custom voice
|
||||
voice_manager = get_voice_manager()
|
||||
if voice_manager.voice_exists(voice):
|
||||
# Use F5-TTS with registered voice
|
||||
# Create a form-like context for the F5 endpoint
|
||||
custom_voice = voice_manager.get_voice(voice)
|
||||
try:
|
||||
result = await synthesize_f5(
|
||||
text=request.text,
|
||||
reference_audio_path=custom_voice.audio_path,
|
||||
reference_text=custom_voice.transcript,
|
||||
speed=request.speed,
|
||||
)
|
||||
|
||||
# Convert to requested format
|
||||
output_format = request.output_format.lower()
|
||||
audio_bytes, content_type = convert_audio(
|
||||
result.audio,
|
||||
result.sample_rate,
|
||||
output_format,
|
||||
)
|
||||
|
||||
return Response(
|
||||
content=audio_bytes,
|
||||
media_type=content_type,
|
||||
headers={
|
||||
"X-Model": "f5-tts",
|
||||
"X-Voice-ID": voice,
|
||||
"X-Duration": str(result.duration),
|
||||
"X-Sample-Rate": str(result.sample_rate),
|
||||
},
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"F5-TTS auto synthesis error: {e}")
|
||||
raise HTTPException(status_code=500, detail=f"Voice synthesis failed: {e}")
|
||||
|
||||
# Unknown voice - fall back to Kokoro with default
|
||||
logger.warning(f"Unknown voice '{voice}', falling back to Kokoro default")
|
||||
kokoro_request = KokoroRequest(
|
||||
text=request.text,
|
||||
voice=DEFAULT_KOKORO_VOICE,
|
||||
speed=request.speed,
|
||||
output_format=request.output_format,
|
||||
)
|
||||
return await synthesize_with_kokoro(kokoro_request)
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Error Handler
|
||||
# ============================================================================
|
||||
|
||||
|
||||
@app.exception_handler(Exception)
|
||||
async def global_exception_handler(request, exc):
|
||||
"""Handle uncaught exceptions."""
|
||||
logger.error(f"Unhandled exception: {exc}")
|
||||
return Response(
|
||||
content=f'{{"error": "Internal server error", "detail": "{str(exc)}"}}',
|
||||
status_code=500,
|
||||
media_type="application/json",
|
||||
)
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# Main
|
||||
# ============================================================================
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import uvicorn
|
||||
|
||||
uvicorn.run(app, host="0.0.0.0", port=PORT)
|
||||
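The `/synthesize/auto` selection order (Kokoro preset first, then registered custom voice, then fallback to the Kokoro default) can be sketched as a pure function. This is an illustrative restatement only: `KOKORO_VOICES`, `CUSTOM_VOICES`, and the voice IDs below are hypothetical stand-ins for the service's real state, not its actual data.

```python
from typing import Optional, Tuple

# Hypothetical stand-ins for service state (illustration only)
KOKORO_VOICES = {"af_heart": "preset voice"}   # Kokoro preset IDs
CUSTOM_VOICES = {"memo_clone"}                 # voices registered for F5-TTS cloning
DEFAULT_KOKORO_VOICE = "af_heart"


def pick_engine(voice: Optional[str]) -> Tuple[str, str]:
    """Mirror the auto-selection order: preset, then custom, then fallback."""
    resolved = voice or DEFAULT_KOKORO_VOICE
    if resolved in KOKORO_VOICES:
        return ("kokoro", resolved)
    if resolved in CUSTOM_VOICES:
        return ("f5-tts", resolved)
    # Unknown voice: fall back to the Kokoro default (the endpoint logs a warning here)
    return ("kokoro", DEFAULT_KOKORO_VOICE)


print(pick_engine("memo_clone"))  # ('f5-tts', 'memo_clone')
print(pick_engine("unknown"))     # ('kokoro', 'af_heart')
```

Note that the fallback means an unknown voice never returns a 404 from this endpoint; only the explicit `/synthesize` (F5) route rejects unknown `voice_id` values.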
services/mana-tts/app/voice_manager.py (new file, +275 lines)
"""
|
||||
Voice Manager for registering and managing custom voices.
|
||||
Handles pre-defined voices from the voices/ directory and runtime-registered voices.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
from dataclasses import dataclass, asdict
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Base directory for voices
|
||||
VOICES_DIR = Path(__file__).parent.parent / "voices"
|
||||
|
||||
# Registry file for custom voices
|
||||
REGISTRY_FILE = VOICES_DIR / "registry.json"
|
||||
|
||||
|
||||
@dataclass
|
||||
class CustomVoice:
|
||||
"""Custom voice registration."""
|
||||
|
||||
id: str
|
||||
name: str
|
||||
description: str
|
||||
audio_path: str
|
||||
transcript: str
|
||||
created_at: str # ISO format timestamp
|
||||
|
||||
|
||||
class VoiceManager:
|
||||
"""Manages custom voice registrations for F5-TTS."""
|
||||
|
||||
def __init__(self, voices_dir: Path = VOICES_DIR):
|
||||
self.voices_dir = voices_dir
|
||||
self.registry_file = voices_dir / "registry.json"
|
||||
self._voices: dict[str, CustomVoice] = {}
|
||||
self._load_registry()
|
||||
self._scan_predefined_voices()
|
||||
|
||||
def _load_registry(self) -> None:
|
||||
"""Load voice registry from disk."""
|
||||
if not self.registry_file.exists():
|
||||
logger.info("No voice registry found, starting fresh")
|
||||
return
|
||||
|
||||
try:
|
||||
with open(self.registry_file, "r") as f:
|
||||
data = json.load(f)
|
||||
|
||||
for voice_id, voice_data in data.items():
|
||||
# Verify audio file exists
|
||||
if Path(voice_data["audio_path"]).exists():
|
||||
self._voices[voice_id] = CustomVoice(**voice_data)
|
||||
else:
|
||||
logger.warning(
|
||||
f"Voice '{voice_id}' audio file not found: {voice_data['audio_path']}"
|
||||
)
|
||||
|
||||
logger.info(f"Loaded {len(self._voices)} custom voices from registry")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load voice registry: {e}")
|
||||
|
||||
def _save_registry(self) -> None:
|
||||
"""Save voice registry to disk."""
|
||||
try:
|
||||
data = {
|
||||
voice_id: asdict(voice)
|
||||
for voice_id, voice in self._voices.items()
|
||||
}
|
||||
with open(self.registry_file, "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
logger.info("Voice registry saved")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to save voice registry: {e}")
|
||||
|
||||
def _scan_predefined_voices(self) -> None:
|
||||
"""Scan voices directory for pre-defined voices."""
|
||||
if not self.voices_dir.exists():
|
||||
return
|
||||
|
||||
# Look for voice directories with audio + transcript
|
||||
for voice_dir in self.voices_dir.iterdir():
|
||||
if not voice_dir.is_dir():
|
||||
continue
|
||||
|
||||
voice_id = voice_dir.name
|
||||
if voice_id in self._voices:
|
||||
continue # Already registered
|
||||
|
||||
# Look for audio file
|
||||
audio_file = None
|
||||
for ext in [".wav", ".mp3", ".m4a", ".flac"]:
|
||||
candidate = voice_dir / f"reference{ext}"
|
||||
if candidate.exists():
|
||||
audio_file = candidate
|
||||
break
|
||||
|
||||
# Look for transcript
|
||||
transcript_file = voice_dir / "transcript.txt"
|
||||
if not transcript_file.exists():
|
||||
continue
|
||||
|
||||
if not audio_file:
|
||||
logger.warning(f"No reference audio found in {voice_dir}")
|
||||
continue
|
||||
|
||||
# Load transcript
|
||||
try:
|
||||
transcript = transcript_file.read_text().strip()
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to read transcript for {voice_id}: {e}")
|
||||
continue
|
||||
|
||||
# Load metadata if exists
|
||||
metadata_file = voice_dir / "metadata.json"
|
||||
name = voice_id
|
||||
description = f"Pre-defined voice: {voice_id}"
|
||||
|
||||
if metadata_file.exists():
|
||||
try:
|
||||
with open(metadata_file, "r") as f:
|
||||
metadata = json.load(f)
|
||||
name = metadata.get("name", name)
|
||||
description = metadata.get("description", description)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Register pre-defined voice
|
||||
from datetime import datetime
|
||||
|
||||
self._voices[voice_id] = CustomVoice(
|
||||
id=voice_id,
|
||||
name=name,
|
||||
description=description,
|
||||
audio_path=str(audio_file),
|
||||
transcript=transcript,
|
||||
created_at=datetime.now().isoformat(),
|
||||
)
|
||||
logger.info(f"Found pre-defined voice: {voice_id}")
|
||||
|
||||
def register_voice(
|
||||
self,
|
||||
voice_id: str,
|
||||
name: str,
|
||||
description: str,
|
||||
audio_bytes: bytes,
|
||||
transcript: str,
|
||||
audio_extension: str = ".wav",
|
||||
) -> CustomVoice:
|
||||
"""
|
||||
Register a new custom voice.
|
||||
|
||||
Args:
|
||||
voice_id: Unique voice identifier
|
||||
name: Display name
|
||||
description: Voice description
|
||||
audio_bytes: Reference audio data
|
||||
transcript: Transcript of the reference audio
|
||||
audio_extension: Audio file extension
|
||||
|
||||
Returns:
|
||||
Registered CustomVoice
|
||||
|
||||
Raises:
|
||||
ValueError: If voice_id already exists
|
||||
"""
|
||||
if voice_id in self._voices:
|
||||
raise ValueError(f"Voice '{voice_id}' already exists")
|
||||
|
||||
# Validate voice_id format
|
||||
if not voice_id.replace("_", "").replace("-", "").isalnum():
|
||||
raise ValueError("Voice ID must be alphanumeric (with _ or -)")
|
||||
|
||||
# Create voice directory
|
||||
voice_dir = self.voices_dir / voice_id
|
||||
voice_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Save audio file
|
||||
audio_path = voice_dir / f"reference{audio_extension}"
|
||||
with open(audio_path, "wb") as f:
|
||||
f.write(audio_bytes)
|
||||
|
||||
# Save transcript
|
||||
transcript_file = voice_dir / "transcript.txt"
|
||||
with open(transcript_file, "w") as f:
|
||||
f.write(transcript)
|
||||
|
||||
# Create voice entry
|
||||
from datetime import datetime
|
||||
|
||||
voice = CustomVoice(
|
||||
id=voice_id,
|
||||
name=name,
|
||||
description=description,
|
||||
audio_path=str(audio_path),
|
||||
transcript=transcript,
|
||||
created_at=datetime.now().isoformat(),
|
||||
)
|
||||
|
||||
# Save metadata
|
||||
metadata_file = voice_dir / "metadata.json"
|
||||
with open(metadata_file, "w") as f:
|
||||
json.dump(
|
||||
{"name": name, "description": description},
|
||||
f,
|
||||
indent=2,
|
||||
)
|
||||
|
||||
# Add to registry
|
||||
self._voices[voice_id] = voice
|
||||
self._save_registry()
|
||||
|
||||
logger.info(f"Registered new voice: {voice_id}")
|
||||
return voice
|
||||
|
||||
def get_voice(self, voice_id: str) -> Optional[CustomVoice]:
|
||||
"""Get a voice by ID."""
|
||||
return self._voices.get(voice_id)
|
||||
|
||||
def delete_voice(self, voice_id: str) -> bool:
|
||||
"""
|
||||
Delete a custom voice.
|
||||
|
||||
Args:
|
||||
voice_id: Voice to delete
|
||||
|
||||
Returns:
|
||||
True if deleted, False if not found
|
||||
"""
|
||||
if voice_id not in self._voices:
|
||||
return False
|
||||
|
||||
voice = self._voices[voice_id]
|
||||
|
||||
# Remove voice directory
|
||||
voice_dir = self.voices_dir / voice_id
|
||||
if voice_dir.exists():
|
||||
import shutil
|
||||
|
||||
try:
|
||||
shutil.rmtree(voice_dir)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to delete voice directory: {e}")
|
||||
|
||||
# Remove from registry
|
||||
del self._voices[voice_id]
|
||||
self._save_registry()
|
||||
|
||||
logger.info(f"Deleted voice: {voice_id}")
|
||||
return True
|
||||
|
||||
def list_voices(self) -> list[CustomVoice]:
|
||||
"""List all registered custom voices."""
|
||||
return list(self._voices.values())
|
||||
|
||||
def voice_exists(self, voice_id: str) -> bool:
|
||||
"""Check if a voice exists."""
|
||||
return voice_id in self._voices
|
||||
|
||||
|
||||
# Global singleton instance
|
||||
_voice_manager: Optional[VoiceManager] = None
|
||||
|
||||
|
||||
def get_voice_manager() -> VoiceManager:
|
||||
"""Get the global VoiceManager instance."""
|
||||
global _voice_manager
|
||||
if _voice_manager is None:
|
||||
_voice_manager = VoiceManager()
|
||||
return _voice_manager
|
||||
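The `voice_id` format check in `register_voice` doubles as path safety, since the ID becomes a directory name under `voices/`. A minimal restatement of that exact rule:

```python
def is_valid_voice_id(voice_id: str) -> bool:
    """Same rule as register_voice: alphanumeric plus '_' and '-' only."""
    # Strip the two allowed separators, then require the rest to be alphanumeric;
    # dots and slashes (and therefore path traversal like "../x") never pass.
    return voice_id.replace("_", "").replace("-", "").isalnum()


print(is_valid_voice_id("my_voice-2"))  # True
print(is_valid_voice_id("../escape"))   # False
```

Because `"".isalnum()` is `False`, IDs consisting only of `_`/`-` (or the empty string) are rejected as well.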
services/mana-tts/requirements.txt (new file, +22 lines)
# Web Framework
fastapi>=0.115.0
uvicorn[standard]>=0.34.0
python-multipart>=0.0.20

# TTS Models (MLX optimized for Apple Silicon)
f5-tts-mlx>=0.2.6
mlx-audio>=0.1.0
mlx>=0.21.0

# Kokoro dependencies (phonemizer)
misaki[en]>=0.9.0

# Audio Processing
soundfile>=0.13.0
scipy>=1.11.0
numpy>=1.26.0
pydub>=0.25.1
tqdm>=4.67.0

# Utilities
aiofiles>=24.1.0
services/mana-tts/setup.sh (new executable file, +150 lines)
#!/bin/bash
# Setup script for Mana TTS service
# Optimized for Apple Silicon (MLX)

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="$SCRIPT_DIR/.venv"
PYTHON_VERSION="3.11"

echo "=========================================="
echo "Mana TTS Setup"
echo "=========================================="
echo ""

# Check platform
if [[ "$(uname)" != "Darwin" ]]; then
    echo "Warning: This service is optimized for macOS with Apple Silicon."
    echo "Some features may not work on other platforms."
    echo ""
fi

# Check for Apple Silicon
if [[ "$(uname -m)" != "arm64" ]]; then
    echo "Warning: This service is optimized for Apple Silicon (arm64)."
    echo "Performance may be reduced on Intel Macs."
    echo ""
fi

# Find Python
if command -v python3.11 &> /dev/null; then
    PYTHON_CMD="python3.11"
elif command -v python3 &> /dev/null; then
    PYTHON_CMD="python3"
else
    echo "Error: Python 3 not found. Please install Python 3.11 or later."
    exit 1
fi

echo "Using Python: $PYTHON_CMD"
$PYTHON_CMD --version
echo ""

# Check Python version (compare major first so e.g. a 4.x would not be rejected)
PYTHON_MAJOR=$($PYTHON_CMD -c "import sys; print(sys.version_info.major)")
PYTHON_MINOR=$($PYTHON_CMD -c "import sys; print(sys.version_info.minor)")

if [[ $PYTHON_MAJOR -lt 3 ]] || { [[ $PYTHON_MAJOR -eq 3 ]] && [[ $PYTHON_MINOR -lt 10 ]]; }; then
    echo "Error: Python 3.10 or later required. Found $PYTHON_MAJOR.$PYTHON_MINOR"
    exit 1
fi

# Create or recreate virtual environment
if [[ -d "$VENV_DIR" ]]; then
    echo "Virtual environment exists at $VENV_DIR"
    read -p "Recreate it? (y/N) " -n 1 -r
    echo ""
    if [[ $REPLY =~ ^[Yy]$ ]]; then
        echo "Removing existing virtual environment..."
        rm -rf "$VENV_DIR"
        echo "Creating new virtual environment..."
        $PYTHON_CMD -m venv "$VENV_DIR"
    fi
else
    echo "Creating virtual environment..."
    $PYTHON_CMD -m venv "$VENV_DIR"
fi

# Activate virtual environment
echo "Activating virtual environment..."
source "$VENV_DIR/bin/activate"

# Upgrade pip
echo ""
echo "Upgrading pip..."
pip install --upgrade pip

# Install dependencies
echo ""
echo "Installing dependencies..."
pip install -r "$SCRIPT_DIR/requirements.txt"

# Check for ffmpeg (required for MP3 support)
echo ""
echo "Checking for ffmpeg (required for MP3 output)..."
if command -v ffmpeg &> /dev/null; then
    echo "ffmpeg found: $(which ffmpeg)"
else
    echo "Warning: ffmpeg not found. MP3 output will not work."
    echo "Install with: brew install ffmpeg"
fi

# Verify installations
echo ""
echo "Verifying installations..."

# Test FastAPI
python -c "import fastapi; print(f'FastAPI {fastapi.__version__}')" || {
    echo "Error: FastAPI not installed correctly"
    exit 1
}

# Test soundfile
python -c "import soundfile; print(f'soundfile {soundfile.__version__}')" || {
    echo "Error: soundfile not installed correctly"
    exit 1
}

# Test MLX (on Apple Silicon)
if [[ "$(uname -m)" == "arm64" ]]; then
    python -c "import mlx; print(f'MLX {mlx.__version__}')" || {
        echo "Warning: MLX not installed correctly. TTS may not work."
    }
fi

# Test mlx-audio
python -c "import mlx_audio; print('mlx-audio installed')" 2>/dev/null || {
    echo "Warning: mlx-audio did not import successfully."
    echo "You may need to install it manually or models won't load."
}

# Create directories
echo ""
echo "Creating required directories..."
mkdir -p "$SCRIPT_DIR/voices"
mkdir -p "$SCRIPT_DIR/mlx_models"

echo ""
echo "=========================================="
echo "Setup Complete!"
echo "=========================================="
echo ""
echo "To start the service:"
echo ""
echo "  cd $SCRIPT_DIR"
echo "  source .venv/bin/activate"
echo "  uvicorn app.main:app --host 0.0.0.0 --port 3022"
echo ""
echo "Or for development with auto-reload:"
echo ""
echo "  uvicorn app.main:app --host 0.0.0.0 --port 3022 --reload"
echo ""
echo "Test the service:"
echo ""
echo "  curl http://localhost:3022/health"
echo ""
echo "For Mac Mini deployment, run:"
echo ""
echo "  ./../../scripts/mac-mini/setup-tts.sh"
echo ""
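The script's Python version gate is, in intent, a lexicographic comparison of `(major, minor)` against `(3, 10)`. A minimal Python restatement of that intent, useful as a sanity check on the bash logic:

```python
def version_ok(major: int, minor: int) -> bool:
    """True when (major, minor) is at least (3, 10), compared tuple-wise."""
    # Tuple comparison handles the major version first, so a hypothetical
    # Python 4.5 passes even though its minor (5) is below 10.
    return (major, minor) >= (3, 10)


print(version_ok(3, 11))  # True
print(version_ok(3, 9))   # False
```

Comparing tuple-wise avoids the classic pitfall of testing the minor version alone, which would misjudge any future major-version bump.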
services/mana-tts/voices/.gitkeep (new file, empty)