# ManaCore STT Service

Speech-to-Text API service with **Whisper (Lightning MLX)** and **Voxtral (Mistral API)**.

Optimized for Mac Mini M4 (Apple Silicon).

## Architecture

```
                    ┌─────────────────────┐
                    │   mana-stt (3020)   │
                    │       FastAPI       │
                    └─────────┬───────────┘
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
    │   Whisper    │  │ Voxtral API  │  │     vLLM     │
    │ MLX (Local)  │  │  (Mistral)   │  │  (Optional)  │
    └──────────────┘  └──────────────┘  └──────────────┘
```

## Features

- **Whisper Large V3** - Best quality, 99+ languages, German WER 6-9% (local, MLX)
- **Voxtral Mini** - Mistral API, speaker diarization support (cloud)
- **Apple Silicon Optimized** - Uses MLX for fast local inference
- **Automatic Fallback** - Falls back between backends automatically
- **REST API** - Simple HTTP endpoints for integration

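The feature list does not say how the automatic fallback orders or retries backends, so the following is only a sketch of one way such a chain could behave. It assumes each backend is a callable that either returns text or raises. `transcribe_with_fallback`, `BackendError`, and the stub backends are hypothetical names for illustration, not taken from the service code.

```python
from typing import Callable, Sequence


class BackendError(Exception):
    """Hypothetical error raised when a single backend fails."""


# A backend is modelled as (name, callable taking raw audio bytes and returning text).
Backend = tuple[str, Callable[[bytes], str]]


def transcribe_with_fallback(audio: bytes, backends: Sequence[Backend]) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, text) from the first success."""
    errors: list[str] = []
    for name, run in backends:
        try:
            return name, run(audio)
        except BackendError as exc:
            # Remember why this backend failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise BackendError("all backends failed: " + "; ".join(errors))


# Stub backends standing in for local Whisper and the Voxtral API.
def failing_whisper(audio: bytes) -> str:
    raise BackendError("model not loaded")


def stub_voxtral(audio: bytes) -> str:
    return "Das ist ein Beispieltext..."


used, text = transcribe_with_fallback(
    b"fake-audio", [("whisper", failing_whisper), ("voxtral", stub_voxtral)]
)
print(used, text)
```

Collecting the per-backend errors keeps the final failure message useful when every backend is unavailable.
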
## Quick Start

### Installation

```bash
cd services/mana-stt
./setup.sh
```

### Run Locally

```bash
source .venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 3020
```

### Setup as System Service (Mac Mini)

```bash
./scripts/mac-mini/setup-stt.sh
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/models` | GET | List available models |
| `/transcribe` | POST | Whisper transcription |
| `/transcribe/voxtral` | POST | Voxtral transcription |
| `/transcribe/auto` | POST | Auto-select best model |

## Usage Examples

### Transcribe with Whisper (Recommended)

```bash
curl -X POST http://localhost:3020/transcribe \
  -F "file=@recording.mp3" \
  -F "language=de"
```

Response:

```json
{
  "text": "Das ist ein Beispieltext...",
  "language": "de",
  "model": "whisper-large-v3-turbo"
}
```

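For Python callers, the response shown above can be read into a small typed object. This is a minimal sketch that assumes only the three fields from the sample are needed; the real service may return more, which this parser simply ignores. `Transcription` and `parse_transcription` are illustrative names, not part of the service.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcription:
    text: str
    language: str
    model: str


def parse_transcription(body: str) -> Transcription:
    """Parse a /transcribe JSON body, keeping only the three fields from the sample."""
    data = json.loads(body)
    return Transcription(
        text=data["text"],
        language=data["language"],
        model=data["model"],
    )


sample = (
    '{"text": "Das ist ein Beispieltext...", '
    '"language": "de", "model": "whisper-large-v3-turbo"}'
)
result = parse_transcription(sample)
print(result.model)
```
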
### Transcribe with Voxtral

```bash
curl -X POST http://localhost:3020/transcribe/voxtral \
  -F "file=@recording.mp3" \
  -F "language=de"
```

### Auto-Select Model

```bash
curl -X POST http://localhost:3020/transcribe/auto \
  -F "file=@recording.mp3" \
  -F "prefer=whisper"
```

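The same multipart request can be sent from Python using only the standard library. The sketch below mirrors the `curl` call to `/transcribe/auto`; the helper names `build_multipart` and `transcribe_auto` are hypothetical, and the file part is sent with a generic `application/octet-stream` content type, which is an assumption about what the service accepts.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields: dict[str, str], filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file part named "file" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    body = bytearray()
    for name, value in fields.items():
        body += f"--{boundary}\r\n".encode()
        body += f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        body += f"{value}\r\n".encode()
    body += f"--{boundary}\r\n".encode()
    body += f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    body += b"Content-Type: application/octet-stream\r\n\r\n"
    body += file_bytes + b"\r\n"
    body += f"--{boundary}--\r\n".encode()
    return bytes(body), f"multipart/form-data; boundary={boundary}"


def transcribe_auto(path: str, prefer: str = "whisper",
                    base_url: str = "http://localhost:3020") -> dict:
    """POST an audio file to /transcribe/auto and return the decoded JSON body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"prefer": prefer}, os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url}/transcribe/auto",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build a request body without touching the network.
body, content_type = build_multipart({"prefer": "whisper"}, "recording.mp3", b"\x00\x01")
print(content_type.split(";")[0])
```

Calling `transcribe_auto("recording.mp3")` would send the request to a running service.
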
## Configuration

Environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `3020` | API server port |
| `WHISPER_MODEL` | `large-v3` | Default Whisper model |
| `PRELOAD_MODELS` | `false` | Load models on startup |
| `CORS_ORIGINS` | `https://mana.how,...` | Allowed CORS origins |
| `MISTRAL_API_KEY` | - | Required for Voxtral API |
| `USE_VLLM` | `false` | Enable vLLM backend (experimental) |
| `VLLM_URL` | `http://localhost:8100` | vLLM server URL |

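As an illustration of how these variables and their defaults fit together, here is a sketch of a settings loader. It is not the service's actual configuration code: the `Settings` class, `load_settings`, and the set of strings treated as "true" are assumptions. `CORS_ORIGINS` is left out because its full default is not shown in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    port: int
    whisper_model: str
    preload_models: bool
    mistral_api_key: Optional[str]
    use_vllm: bool
    vllm_url: str


def parse_bool(value: str) -> bool:
    # Assumption: these spellings count as true; everything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        port=int(env.get("PORT", "3020")),
        whisper_model=env.get("WHISPER_MODEL", "large-v3"),
        preload_models=parse_bool(env.get("PRELOAD_MODELS", "false")),
        mistral_api_key=env.get("MISTRAL_API_KEY") or None,
        use_vllm=parse_bool(env.get("USE_VLLM", "false")),
        vllm_url=env.get("VLLM_URL", "http://localhost:8100"),
    )


# Passing a plain dict makes the loader easy to try without exporting variables.
settings = load_settings({"PORT": "8080", "USE_VLLM": "true"})
print(settings)
```
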
## Supported Audio Formats

- MP3, WAV, M4A, FLAC, OGG, WebM, MP4
- Max file size: 100 MB
- Any sample rate (automatically resampled to 16 kHz)

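Clients can check these limits before uploading. The sketch below is a client-side pre-check, not the service's own validation: it judges the format by file extension only and assumes "100 MB" means 100 MiB. `validate_upload` is a hypothetical helper.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"}
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # assumption: the limit is 100 MiB


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems: list[str] = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


print(validate_upload("recording.MP3", 5_000_000))
print(validate_upload("notes.txt", 200 * 1024 * 1024))
```
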
## Model Comparison

| Model | German WER | Speed | VRAM | License |
|-------|------------|-------|------|---------|
| Whisper Large V3 Turbo | 6-9% | Fast | ~6 GB | MIT |
| Voxtral Mini (3B) | 8-12% | Medium | ~4 GB | Apache 2.0 |

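WER (word error rate) is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below computes it with a word-level edit distance. It illustrates the metric only and makes no claim about how the figures in the table were measured; real evaluations usually also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the processed part of ref and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One of five reference words is missing from the hypothesis.
wer = word_error_rate("das ist ein kurzer beispieltext", "das ist ein beispieltext")
print(f"{wer:.0%}")
```
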
## Logs

```bash
# Service logs
tail -f /tmp/manacore-stt.log

# Error logs
tail -f /tmp/manacore-stt.error.log
```

## Troubleshooting

### Model Download Slow

The first run downloads ~1.6 GB for Whisper and ~6 GB for Voxtral, so it can take a while.

### Out of Memory

Reduce the batch size or use a smaller model:

```bash
export WHISPER_MODEL=medium
```

### MPS Not Available

Ensure PyTorch is installed with MPS support:

```bash
pip install torch torchvision torchaudio
python -c "import torch; print(torch.backends.mps.is_available())"
```

## Integration

### From Chat Backend (NestJS)

```typescript
// Assumes the built-in FormData and fetch (Node 18+), where a filename
// may only be passed together with a Blob or File, so the Buffer is wrapped.
const formData = new FormData();
formData.append('file', new Blob([audioBuffer]), 'recording.webm');
formData.append('language', 'de');

const response = await fetch('http://localhost:3020/transcribe', {
  method: 'POST',
  body: formData,
});
if (!response.ok) {
  throw new Error(`STT request failed with status ${response.status}`);
}

const { text } = await response.json();
```

### From SvelteKit Web

```typescript
const formData = new FormData();
formData.append('file', audioBlob, 'recording.webm');

const response = await fetch('https://stt-api.mana.how/transcribe', {
  method: 'POST',
  body: formData,
});
if (!response.ok) {
  throw new Error(`STT request failed with status ${response.status}`);
}

const { text } = await response.json();
```