Mirror of https://github.com/Memo-2023/mana-monorepo.git (synced 2026-05-14 19:21:10 +02:00)

chore(cutover): remove services/mana-stt/ — moved to mana-platform

Live containers on the Mac Mini have been building out of
`../mana/services/mana-stt/` since the 8-Doppel-Cutover commit (774852ba2).
Smoke test green on 2026-05-08 — health endpoints, JWKS, login flow, and the
Stripe webhook are all reachable from the new build path. This removes the
now-stale duplicate (132K in this repo). The active code lives in
`Code/mana/services/mana-stt/` (see ../mana/CLAUDE.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

parent 879975b665
commit 3c4a6d4f69
16 changed files with 0 additions and 2973 deletions
@@ -1,70 +0,0 @@
# Mana STT Service Configuration
# Copy to .env and adjust values as needed

# Server
PORT=3020

# Whisper (Lightning MLX)
WHISPER_MODEL=large-v3

# Voxtral (Local Models)
# Options: voxtral-mini-3b, voxtral-realtime-4b, voxtral-small-24b
VOXTRAL_MODEL=voxtral-realtime-4b

# WhisperX (CUDA GPU Server)
# Enable WhisperX for rich transcription (diarization, word alignment)
# Requires NVIDIA GPU + requirements-cuda.txt
USE_WHISPERX=false

# WhisperX batch size (higher = faster but more VRAM, 16 works well for RTX 3090)
WHISPERX_BATCH_SIZE=16

# Device and compute type for CUDA
# WHISPER_DEVICE=cuda
# WHISPER_COMPUTE_TYPE=float16

# HuggingFace token for pyannote speaker diarization models
# Required for diarization. Accept terms at:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
HF_TOKEN=

# Model Loading
# Set to true to preload models on startup (slower startup, faster first request)
PRELOAD_MODELS=false

# Load Management
# Maximum concurrent transcription requests before API fallback
MAX_CONCURRENT_REQUESTS=3

# API Fallback
# Enable automatic fallback to Mistral API when overloaded
API_FALLBACK_ENABLED=true

# Mistral API Key (required for API fallback)
# Get your key at https://console.mistral.ai/
MISTRAL_API_KEY=

# CORS Origins (comma-separated)
CORS_ORIGINS=https://mana.how,https://chat.mana.how,http://localhost:5173

# ===========================================
# Authentication
# ===========================================

# Enable API key authentication (default: true for production)
REQUIRE_AUTH=true

# API Keys (comma-separated, format: key:name)
# Example: sk-abc123:myapp,sk-def456:testuser
API_KEYS=

# Internal API key (no rate limit, for internal services)
# Generate with: openssl rand -hex 32
INTERNAL_API_KEY=

# Rate Limiting
# Requests per window per API key
RATE_LIMIT_REQUESTS=60
# Window size in seconds
RATE_LIMIT_WINDOW=60
@@ -1,96 +0,0 @@
# mana-stt

Speech-to-Text microservice. Wraps Whisper (CUDA, with WhisperX for word-level timestamps + diarization), local Voxtral via vLLM, and Mistral's hosted Voxtral API behind a small FastAPI surface. Lives on the Windows GPU server (`mana-server-gpu`, RTX 3090).

> ⚠️ **Earlier history**: this directory used to contain Mac-Mini–targeted
> code (Whisper Lightning MLX, com.mana.mana-stt.plist launchd setup,
> setup.sh with Apple-Silicon checks). That all moved to the Windows
> GPU box and was removed from the repo. If you're looking for the MLX
> path, see git history.

## Tech Stack

| Layer | Technology |
|-------|------------|
| **Runtime** | Python 3.11 + uvicorn (Windows) |
| **Framework** | FastAPI |
| **Whisper** | `whisperx` on CUDA (large-v3 + word alignment + pyannote diarization) |
| **Voxtral (local)** | vLLM serving Voxtral 3B/4B/24B (`vllm_service.py`) |
| **Voxtral (cloud)** | Mistral API (`voxtral_api_service.py`) |
| **Auth** | Per-key + internal-key API auth (`app/auth.py`, JWT via mana-auth in `app/external_auth.py`) |
| **VRAM** | Shared `vram_manager.py` accountant — coordinated with mana-tts and mana-image-gen so multiple GPU services don't OOM each other |
| **Process supervision** | Windows Scheduled Task `ManaSTT` (AtLogOn) |

## Port: 3020

## Where it runs

| Host | Path on disk | Entrypoint |
|------|--------------|------------|
| Windows GPU server (`192.168.178.11`) | `C:\mana\services\mana-stt\` | `service.pyw` via Scheduled Task `ManaSTT` |

Public URL: `https://gpu-stt.mana.how` (via Cloudflare Tunnel + Mac Mini gpu-proxy).

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Liveness + which backends are loaded |
| GET | `/models` | Available STT models |
| POST | `/transcribe` | Whisper (WhisperX, default) — multipart `file` + optional `language` |
| POST | `/transcribe/voxtral` | Local Voxtral via vLLM |
| POST | `/transcribe/auto` | Routing helper — picks the best backend for the input |

All endpoints (except `/health`) require `Authorization: Bearer <token>`. Tokens are validated against `API_KEYS` (per-app keys) or `INTERNAL_API_KEY` (no rate limit), and JWTs from mana-auth are also accepted via `external_auth.py`.
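
A minimal client sketch in Python (a hypothetical helper, not a repo script; note that `app/auth.py` below actually reads the key from an `X-API-Key` header, while mana-auth JWTs travel as `Authorization: Bearer`):

```python
import requests

STT_URL = "https://gpu-stt.mana.how"  # public URL from this doc

def transcribe(path: str, api_key: str, language: str = "de") -> dict:
    """POST an audio file to /transcribe and return the parsed JSON."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{STT_URL}/transcribe",
            headers={"X-API-Key": api_key},  # per APIKeyHeader in app/auth.py
            files={"file": f},
            data={"language": language, "diarize": "false"},
            timeout=300,  # long audio can take a while
        )
    resp.raise_for_status()
    return resp.json()
```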

## Backends (`app/`)

| File | What it loads |
|------|---------------|
| `whisper_service.py` | WhisperX on CUDA (large-v3 + alignment + pyannote diarization) |
| `voxtral_service.py` | Local Voxtral via vLLM (slower start, richer multilingual) |
| `voxtral_api_service.py` | Mistral hosted Voxtral API (cloud, no GPU needed) |
| `vllm_service.py` | vLLM client primitives shared by Voxtral |
| `vram_manager.py` | Shared VRAM accounting — same module also used by mana-tts and mana-image-gen |
| `auth.py` | API-key auth (internal + per-app keys) |
| `external_auth.py` | JWT validation via mana-auth |

Backends are loaded lazily during the FastAPI lifespan and reported by `/health`.
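
A quick way to see which backends actually came up (a sketch against the `HealthResponse` shape defined in `app/main.py` below; `/health` needs no API key):

```python
import requests

health = requests.get("https://gpu-stt.mana.how/health", timeout=10).json()
print("whisper loaded:", health["whisper_loaded"])
print("vLLM available:", health["vllm_available"])
print("Mistral API available:", health["mistral_api_available"])
```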

## Configuration (`.env` on the Windows GPU box)

```env
PORT=3020
WHISPER_MODEL=large-v3
WHISPER_DEVICE=cuda
WHISPER_COMPUTE_TYPE=float16
WHISPER_DEFAULT_LANGUAGE=de
PRELOAD_MODELS=true
USE_VLLM=false
HF_TOKEN=...          # required for pyannote diarization models
REQUIRE_AUTH=true
API_KEYS=sk-app1:app1,sk-app2:app2
INTERNAL_API_KEY=...  # cross-service, no rate limit
CORS_ORIGINS=https://mana.how,https://chat.mana.how
```
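
The `.env.example` comments above suggest `openssl rand -hex 32` for key material; a Python equivalent for minting `API_KEYS` entries or an `INTERNAL_API_KEY` (a sketch, with a hypothetical `sk-` prefix matching the doc's examples):

```python
import secrets

key = f"sk-{secrets.token_hex(32)}"
print(f"{key}:myapp")  # an API_KEYS entry in key:name format
print(key)             # or use bare as INTERNAL_API_KEY
```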

## Operations

```powershell
# Status
Get-ScheduledTask -TaskName "ManaSTT" | Format-List TaskName, State
Get-NetTCPConnection -LocalPort 3020 -State Listen

# Restart
Stop-ScheduledTask -TaskName "ManaSTT"
Start-ScheduledTask -TaskName "ManaSTT"

# Logs
Get-Content C:\mana\services\mana-stt\service.log -Tail 50
```

## Reference

- `docs/WINDOWS_GPU_SERVER_SETUP.md` — Windows box setup, scheduled tasks, firewall, Cloudflare tunnel
- `docs/LOCAL_STT_MODELS.md` — model comparisons (WER, latency, language coverage)
- `services/mana-stt/grafana-dashboard.json` — Prometheus metrics dashboard
@@ -1,31 +0,0 @@
# Mana STT Service

Speech-to-Text API service running on the Windows GPU server (`mana-server-gpu`, RTX 3090). Wraps **WhisperX** (CUDA, large-v3 + word alignment + pyannote diarization), local **Voxtral via vLLM**, and the hosted **Mistral Voxtral API**.

For architecture, deployment, configuration, and operations see [`CLAUDE.md`](./CLAUDE.md) and [`docs/WINDOWS_GPU_SERVER_SETUP.md`](../../docs/WINDOWS_GPU_SERVER_SETUP.md).

## Port: 3020

## Public URL

`https://gpu-stt.mana.how` (via Cloudflare Tunnel + Mac Mini gpu-proxy)

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check + which backends are loaded |
| `/models` | GET | List available models |
| `/transcribe` | POST | Whisper / WhisperX transcription |
| `/transcribe/voxtral` | POST | Voxtral transcription (local vLLM) |
| `/transcribe/auto` | POST | Auto-select best backend for the input |

All endpoints (except `/health`) require `Authorization: Bearer <token>`.

## Quick Test

```bash
curl -F "file=@audio.wav" -F "language=de" \
  -H "Authorization: Bearer $INTERNAL_API_KEY" \
  https://gpu-stt.mana.how/transcribe
```
@@ -1 +0,0 @@
# Mana STT Service
@@ -1,271 +0,0 @@
"""
API Key Authentication for ManaCore STT Service

Supports two authentication modes:
1. Local API keys: Configured via environment variables
2. External API keys: Validated via mana-core-auth service (when EXTERNAL_AUTH_ENABLED=true)

Usage:
    # Local keys
    API_KEYS=sk-key1:name1,sk-key2:name2
    INTERNAL_API_KEY=sk-internal-xxx

    # External auth (for user-created keys via mana.how)
    EXTERNAL_AUTH_ENABLED=true
    MANA_CORE_AUTH_URL=http://localhost:3001
"""

import os
import time
import logging
from typing import Optional
from collections import defaultdict
from dataclasses import dataclass, field

from fastapi import HTTPException, Security, Request
from fastapi.security import APIKeyHeader

from .external_auth import (
    is_external_auth_enabled,
    validate_api_key_external,
    ExternalValidationResult,
)

logger = logging.getLogger(__name__)

# Configuration
API_KEYS_ENV = os.getenv("API_KEYS", "")  # Format: "sk-key1:name1,sk-key2:name2"
INTERNAL_API_KEY = os.getenv("INTERNAL_API_KEY", "")  # Unlimited internal key
REQUIRE_AUTH = os.getenv("REQUIRE_AUTH", "true").lower() == "true"
RATE_LIMIT_REQUESTS = int(os.getenv("RATE_LIMIT_REQUESTS", "60"))  # Per minute
RATE_LIMIT_WINDOW = int(os.getenv("RATE_LIMIT_WINDOW", "60"))  # Seconds


@dataclass
class APIKey:
    """API Key with metadata."""
    key: str
    name: str
    is_internal: bool = False
    rate_limit: int = RATE_LIMIT_REQUESTS  # Requests per window


@dataclass
class RateLimitInfo:
    """Rate limit tracking per key."""
    requests: list = field(default_factory=list)

    def is_allowed(self, limit: int, window: int) -> bool:
        """Check if request is allowed within rate limit."""
        now = time.time()
        # Remove old requests outside window
        self.requests = [t for t in self.requests if now - t < window]

        if len(self.requests) >= limit:
            return False

        self.requests.append(now)
        return True

    def remaining(self, limit: int, window: int) -> int:
        """Get remaining requests in current window."""
        now = time.time()
        self.requests = [t for t in self.requests if now - t < window]
        return max(0, limit - len(self.requests))


# Parse API keys from environment
def _parse_api_keys() -> dict[str, APIKey]:
    """Parse API keys from environment variables."""
    keys = {}

    # Parse comma-separated keys
    if API_KEYS_ENV:
        for entry in API_KEYS_ENV.split(","):
            entry = entry.strip()
            if ":" in entry:
                key, name = entry.split(":", 1)
            else:
                key, name = entry, "default"
            keys[key.strip()] = APIKey(key=key.strip(), name=name.strip())

    # Add internal key with no rate limit
    if INTERNAL_API_KEY:
        keys[INTERNAL_API_KEY] = APIKey(
            key=INTERNAL_API_KEY,
            name="internal",
            is_internal=True,
            rate_limit=999999,  # Effectively unlimited
        )

    return keys


# Global state
_api_keys = _parse_api_keys()
_rate_limits: dict[str, RateLimitInfo] = defaultdict(RateLimitInfo)

# Security scheme
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


@dataclass
class AuthResult:
    """Result of authentication check."""
    authenticated: bool
    key_name: Optional[str] = None
    is_internal: bool = False
    rate_limit_remaining: Optional[int] = None
    user_id: Optional[str] = None  # Set when using external auth


async def verify_api_key(
    request: Request,
    api_key: Optional[str] = Security(api_key_header),
) -> AuthResult:
    """
    Verify API key and check rate limits.

    Supports two authentication modes:
    1. External auth via mana-core-auth (for sk_live_ keys)
    2. Local auth via environment variables

    Returns AuthResult with authentication status.
    Raises HTTPException if auth fails or rate limited.
    """
    # Skip auth for health and docs endpoints
    path = request.url.path
    if path in ["/health", "/docs", "/openapi.json", "/redoc"]:
        return AuthResult(authenticated=True, key_name="public")

    # If auth not required, allow all
    if not REQUIRE_AUTH:
        return AuthResult(authenticated=True, key_name="anonymous")

    # Check for API key
    if not api_key:
        logger.warning(f"Missing API key for {path} from {request.client.host if request.client else 'unknown'}")
        raise HTTPException(
            status_code=401,
            detail="Missing API key. Provide X-API-Key header.",
            headers={"WWW-Authenticate": "ApiKey"},
        )

    # Try external auth first for sk_live_ keys (user-created keys via mana.how)
    if api_key.startswith("sk_live_") and is_external_auth_enabled():
        external_result = await validate_api_key_external(api_key, "stt")

        if external_result is not None:
            if external_result.valid:
                # Use rate limits from external auth
                rate_info = _rate_limits[api_key]
                limit = external_result.rate_limit_requests
                window = external_result.rate_limit_window

                if not rate_info.is_allowed(limit, window):
                    remaining = rate_info.remaining(limit, window)
                    logger.warning("Rate limit exceeded for external key")
                    raise HTTPException(
                        status_code=429,
                        detail=f"Rate limit exceeded. Try again in {window} seconds.",
                        headers={
                            "X-RateLimit-Limit": str(limit),
                            "X-RateLimit-Remaining": str(remaining),
                            "X-RateLimit-Reset": str(int(time.time()) + window),
                            "Retry-After": str(window),
                        },
                    )

                remaining = rate_info.remaining(limit, window)
                logger.debug(f"Authenticated external request from user {external_result.user_id} to {path}")

                return AuthResult(
                    authenticated=True,
                    key_name="external",
                    is_internal=False,
                    rate_limit_remaining=remaining,
                    user_id=external_result.user_id,
                )
            else:
                # External auth returned invalid
                logger.warning(f"External auth failed: {external_result.error}")
                raise HTTPException(
                    status_code=401,
                    detail=external_result.error or "Invalid API key.",
                    headers={"WWW-Authenticate": "ApiKey"},
                )
        # If external_result is None, fall through to local auth

    # Local auth: Validate key against environment variables
    if api_key not in _api_keys:
        logger.warning(f"Invalid API key attempt for {path}")
        raise HTTPException(
            status_code=401,
            detail="Invalid API key.",
            headers={"WWW-Authenticate": "ApiKey"},
        )

    key_info = _api_keys[api_key]

    # Check rate limit (skip for internal keys)
    if not key_info.is_internal:
        rate_info = _rate_limits[api_key]
        if not rate_info.is_allowed(key_info.rate_limit, RATE_LIMIT_WINDOW):
            remaining = rate_info.remaining(key_info.rate_limit, RATE_LIMIT_WINDOW)
            logger.warning(f"Rate limit exceeded for key '{key_info.name}'")
            raise HTTPException(
                status_code=429,
                detail=f"Rate limit exceeded. Try again in {RATE_LIMIT_WINDOW} seconds.",
                headers={
                    "X-RateLimit-Limit": str(key_info.rate_limit),
                    "X-RateLimit-Remaining": str(remaining),
                    "X-RateLimit-Reset": str(int(time.time()) + RATE_LIMIT_WINDOW),
                    "Retry-After": str(RATE_LIMIT_WINDOW),
                },
            )
        remaining = rate_info.remaining(key_info.rate_limit, RATE_LIMIT_WINDOW)
    else:
        remaining = None

    logger.debug(f"Authenticated request from '{key_info.name}' to {path}")

    return AuthResult(
        authenticated=True,
        key_name=key_info.name,
        is_internal=key_info.is_internal,
        rate_limit_remaining=remaining,
    )


def get_api_key_stats() -> dict:
    """Get statistics about API keys (for admin endpoint)."""
    stats = {
        "total_keys": len(_api_keys),
        "auth_required": REQUIRE_AUTH,
        "rate_limit": {
            "requests_per_window": RATE_LIMIT_REQUESTS,
            "window_seconds": RATE_LIMIT_WINDOW,
        },
        "keys": [],
    }

    for key, info in _api_keys.items():
        # Don't expose actual keys, just metadata
        masked_key = key[:8] + "..." if len(key) > 8 else "***"
        rate_info = _rate_limits.get(key, RateLimitInfo())
        stats["keys"].append({
            "name": info.name,
            "key_prefix": masked_key,
            "is_internal": info.is_internal,
            "requests_in_window": len(rate_info.requests),
            "remaining": rate_info.remaining(info.rate_limit, RATE_LIMIT_WINDOW),
        })

    return stats


def reload_api_keys():
    """Reload API keys from environment (for runtime updates)."""
    global _api_keys
    _api_keys = _parse_api_keys()
    logger.info(f"Reloaded {len(_api_keys)} API keys")
@@ -1,145 +0,0 @@
"""
External API Key Validation via mana-core-auth

When EXTERNAL_AUTH_ENABLED=true, API keys are validated against the
central mana-core-auth service. This allows users to create and manage
API keys from the mana.how web interface.

Results are cached for 5 minutes to reduce load on the auth service.
"""

import os
import time
import logging
import httpx
from typing import Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Configuration
EXTERNAL_AUTH_ENABLED = os.getenv("EXTERNAL_AUTH_ENABLED", "false").lower() == "true"
MANA_CORE_AUTH_URL = os.getenv("MANA_CORE_AUTH_URL", "http://localhost:3001")
API_KEY_CACHE_TTL = int(os.getenv("API_KEY_CACHE_TTL", "300"))  # 5 minutes
EXTERNAL_AUTH_TIMEOUT = float(os.getenv("EXTERNAL_AUTH_TIMEOUT", "5.0"))  # seconds


@dataclass
class ExternalValidationResult:
    """Result from external API key validation."""
    valid: bool
    user_id: Optional[str] = None
    scopes: Optional[list] = None
    rate_limit_requests: int = 60
    rate_limit_window: int = 60
    error: Optional[str] = None
    cached_at: float = 0.0


# In-memory cache for validation results
# Key: API key, Value: ExternalValidationResult
_validation_cache: dict[str, ExternalValidationResult] = {}


def is_external_auth_enabled() -> bool:
    """Check if external authentication is enabled."""
    return EXTERNAL_AUTH_ENABLED


def _get_cached_result(api_key: str) -> Optional[ExternalValidationResult]:
    """Get cached validation result if still valid."""
    result = _validation_cache.get(api_key)
    if result and (time.time() - result.cached_at) < API_KEY_CACHE_TTL:
        return result
    return None


def _cache_result(api_key: str, result: ExternalValidationResult):
    """Cache a validation result."""
    result.cached_at = time.time()
    _validation_cache[api_key] = result

    # Clean up old entries periodically (keep cache size manageable)
    if len(_validation_cache) > 1000:
        now = time.time()
        expired_keys = [
            k for k, v in _validation_cache.items()
            if (now - v.cached_at) >= API_KEY_CACHE_TTL
        ]
        for k in expired_keys:
            del _validation_cache[k]


async def validate_api_key_external(api_key: str, scope: str) -> Optional[ExternalValidationResult]:
    """
    Validate an API key against mana-core-auth service.

    Args:
        api_key: The API key to validate (e.g., "sk_live_...")
        scope: The required scope (e.g., "stt" or "tts")

    Returns:
        ExternalValidationResult if external auth is enabled and the key was validated.
        None if external auth is disabled or the service is unavailable (fallback to local).
    """
    if not EXTERNAL_AUTH_ENABLED:
        return None

    # Check cache first
    cached = _get_cached_result(api_key)
    if cached:
        logger.debug(f"Using cached validation result for key prefix: {api_key[:12]}...")
        # Check scope against cached result
        if cached.valid and cached.scopes and scope not in cached.scopes:
            return ExternalValidationResult(
                valid=False,
                error=f"API key does not have scope: {scope}",
            )
        return cached

    # Call mana-core-auth validation endpoint
    try:
        async with httpx.AsyncClient(timeout=EXTERNAL_AUTH_TIMEOUT) as client:
            response = await client.post(
                f"{MANA_CORE_AUTH_URL}/api/v1/api-keys/validate",
                json={"apiKey": api_key, "scope": scope},
            )

            if response.status_code == 200:
                data = response.json()
                result = ExternalValidationResult(
                    valid=data.get("valid", False),
                    user_id=data.get("userId"),
                    scopes=data.get("scopes", []),
                    rate_limit_requests=data.get("rateLimit", {}).get("requests", 60),
                    rate_limit_window=data.get("rateLimit", {}).get("window", 60),
                    error=data.get("error"),
                )
                _cache_result(api_key, result)
                return result
            else:
                logger.warning(
                    f"External auth returned status {response.status_code}: {response.text}"
                )
                # Don't cache errors - allow retry
                return ExternalValidationResult(
                    valid=False,
                    error=f"Auth service returned {response.status_code}",
                )

    except httpx.TimeoutException:
        logger.warning("External auth service timeout - falling back to local auth")
        return None
    except httpx.ConnectError:
        logger.warning("Cannot connect to external auth service - falling back to local auth")
        return None
    except Exception as e:
        logger.error(f"External auth error: {e}")
        return None


def clear_cache():
    """Clear the validation cache (for testing or runtime updates)."""
    global _validation_cache
    _validation_cache.clear()
    logger.info("External auth cache cleared")
@@ -1,392 +0,0 @@
"""
ManaCore STT API Service (WhisperX Edition)
Speech-to-Text with WhisperX: transcription, word timestamps, speaker diarization.

Run with: uvicorn app.main:app --host 0.0.0.0 --port 3020
"""

import os
import logging
import time
from typing import Optional
from contextlib import asynccontextmanager

from fastapi import FastAPI, File, UploadFile, Form, HTTPException, Depends, Response
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from pydantic import BaseModel

from app.auth import verify_api_key, AuthResult, get_api_key_stats, REQUIRE_AUTH

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

# Environment
PORT = int(os.getenv("PORT", "3020"))
DEFAULT_WHISPER_MODEL = os.getenv("WHISPER_MODEL", "large-v3")
PRELOAD_MODELS = os.getenv("PRELOAD_MODELS", "false").lower() == "true"
CORS_ORIGINS = os.getenv(
    "CORS_ORIGINS",
    "https://mana.how,https://chat.mana.how,http://localhost:5173"
).split(",")

# vLLM configuration
VLLM_URL = os.getenv("VLLM_URL", "http://localhost:8100")
USE_VLLM = os.getenv("USE_VLLM", "false").lower() == "true"


# Response models
class WordInfo(BaseModel):
    word: str
    start: float
    end: float
    score: Optional[float] = None
    speaker: Optional[str] = None


class SegmentInfo(BaseModel):
    start: float
    end: float
    text: str
    speaker: Optional[str] = None


class TranscriptionResponse(BaseModel):
    text: str
    language: Optional[str] = None
    model: str
    latency_ms: Optional[float] = None
    duration_seconds: Optional[float] = None
    words: Optional[list[WordInfo]] = None
    segments: Optional[list[SegmentInfo]] = None
    speakers: Optional[list[str]] = None


class HealthResponse(BaseModel):
    status: str
    whisper_loaded: bool
    whisperx: bool
    vllm_available: bool
    vllm_url: Optional[str] = None
    mistral_api_available: bool
    auth_required: bool
    models: dict


class ModelsResponse(BaseModel):
    whisper: list
    voxtral_vllm: list
    default_whisper: str


# Track loaded models
models_status = {
    "whisper_loaded": False,
    "vllm_available": False,
}


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Startup and shutdown events."""
    logger.info("Starting ManaCore STT Service (WhisperX Edition)...")

    # Check vLLM availability
    if USE_VLLM:
        from app.vllm_service import check_health
        health = await check_health()
        models_status["vllm_available"] = health.get("status") == "healthy"

    # Check Mistral API
    from app.voxtral_api_service import is_available as api_available
    if api_available():
        logger.info("Mistral API fallback configured")

    # Always preload WhisperX model at startup (avoids timeout on first request)
    logger.info("Preloading WhisperX model...")
    try:
        from app.whisper_service import get_whisper_model
        get_whisper_model(DEFAULT_WHISPER_MODEL)
        models_status["whisper_loaded"] = True
        logger.info("WhisperX model preloaded successfully")
    except Exception as e:
        logger.warning(f"Failed to preload WhisperX: {e}")

    logger.info(f"STT Service ready on port {PORT}")
    yield
    logger.info("Shutting down STT Service...")


# Create FastAPI app
app = FastAPI(
    title="ManaCore STT Service",
    description="Speech-to-Text API with WhisperX (word timestamps + speaker diarization)",
    version="3.0.0",
    lifespan=lifespan,
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=CORS_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint."""
    from app.voxtral_api_service import is_available as api_available
    from app.vllm_service import check_health

    vllm_health = await check_health()

    return HealthResponse(
        status="healthy",
        whisper_loaded=models_status["whisper_loaded"],
        whisperx=True,
        vllm_available=vllm_health.get("status") == "healthy",
        vllm_url=VLLM_URL if USE_VLLM else None,
        mistral_api_available=api_available(),
        auth_required=REQUIRE_AUTH,
        models={
            "default_whisper": DEFAULT_WHISPER_MODEL,
            "engine": "whisperx",
            "features": ["transcription", "word_timestamps", "speaker_diarization"],
        },
    )


@app.get("/models", response_model=ModelsResponse)
async def list_models(auth: AuthResult = Depends(verify_api_key)):
    """List available models."""
    from app.whisper_service import AVAILABLE_MODELS as whisper_models
    from app.vllm_service import get_models

    vllm_models = await get_models()

    return ModelsResponse(
        whisper=whisper_models,
        voxtral_vllm=vllm_models,
        default_whisper=DEFAULT_WHISPER_MODEL,
    )


@app.post("/transcribe", response_model=TranscriptionResponse)
async def transcribe_whisper(
    response: Response,
    file: UploadFile = File(..., description="Audio file to transcribe"),
    language: Optional[str] = Form(None, description="Language code (auto-detect if not provided)"),
    model: Optional[str] = Form(None, description="Whisper model to use"),
    align: bool = Form(True, description="Enable word-level timestamp alignment"),
    diarize: bool = Form(False, description="Enable speaker diarization"),
    min_speakers: Optional[int] = Form(None, description="Min expected speakers (helps diarization)"),
    max_speakers: Optional[int] = Form(None, description="Max expected speakers"),
    auth: AuthResult = Depends(verify_api_key),
):
    """
    Transcribe audio using WhisperX.

    Features:
    - Word-level timestamps (align=true, default)
    - Speaker diarization (diarize=true, opt-in)

    Supported formats: mp3, wav, m4a, flac, ogg, webm, mp4
    Max file size: 100MB
    """
    if auth.rate_limit_remaining is not None:
        response.headers["X-RateLimit-Remaining"] = str(auth.rate_limit_remaining)

    if not file.filename:
        raise HTTPException(status_code=400, detail="No file provided")

    allowed_extensions = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"}
    ext = os.path.splitext(file.filename)[1].lower()
    if ext not in allowed_extensions:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported file type: {ext}. Allowed: {allowed_extensions}"
        )

    start_time = time.time()

    try:
        from app.whisper_service import transcribe_audio_bytes

        audio_bytes = await file.read()
        if len(audio_bytes) > 100 * 1024 * 1024:
            raise HTTPException(status_code=400, detail="File too large (max 100MB)")

        model_name = model or DEFAULT_WHISPER_MODEL

        result = await transcribe_audio_bytes(
            audio_bytes=audio_bytes,
            filename=file.filename,
            language=language,
            model_name=model_name,
            align=align,
            diarize=diarize,
            min_speakers=min_speakers,
            max_speakers=max_speakers,
        )

        models_status["whisper_loaded"] = True
        latency_ms = (time.time() - start_time) * 1000

        # Build response
        resp = TranscriptionResponse(
            text=result.text,
            language=result.language,
            model=f"whisperx-{model_name}",
            latency_ms=latency_ms,
            duration_seconds=result.duration,
        )

        # Add word timestamps if available
        if result.words:
            resp.words = [
                WordInfo(
                    word=w.word,
                    start=w.start,
                    end=w.end,
                    score=w.score,
                    speaker=w.speaker,
                )
                for w in result.words
            ]

        # Add segments
        if result.segments:
            resp.segments = [
                SegmentInfo(**s) for s in result.segments
            ]

        # Add speakers
        if result.speakers:
            resp.speakers = result.speakers

        return resp

    except Exception as e:
        logger.error(f"WhisperX transcription error: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/transcribe/voxtral", response_model=TranscriptionResponse)
async def transcribe_voxtral(
    response: Response,
    file: UploadFile = File(..., description="Audio file to transcribe"),
    language: str = Form("de", description="Language code"),
    use_realtime: bool = Form(False, description="Use Realtime 4B model"),
    auth: AuthResult = Depends(verify_api_key),
):
    """Transcribe audio using Voxtral via vLLM or Mistral API."""
    if auth.rate_limit_remaining is not None:
        response.headers["X-RateLimit-Remaining"] = str(auth.rate_limit_remaining)

    if not file.filename:
        raise HTTPException(status_code=400, detail="No file provided")

    from app.vllm_service import (
        SUPPORTED_LANGUAGES,
        is_available as vllm_available,
        transcribe_audio_bytes as vllm_transcribe,
        transcribe_with_realtime,
        check_health,
    )
    from app.voxtral_api_service import (
        is_available as api_available,
        transcribe_audio_bytes as api_transcribe,
    )

    if language not in SUPPORTED_LANGUAGES:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported language: {language}. Supported: {SUPPORTED_LANGUAGES}"
        )

    try:
        audio_bytes = await file.read()
        if len(audio_bytes) > 100 * 1024 * 1024:
            raise HTTPException(status_code=400, detail="File too large (max 100MB)")

        # Try vLLM first
        if USE_VLLM:
            health = await check_health()
            if health.get("status") == "healthy":
                if use_realtime:
                    result = await transcribe_with_realtime(
                        audio_bytes=audio_bytes, filename=file.filename, language=language,
                    )
                else:
                    result = await vllm_transcribe(
                        audio_bytes=audio_bytes, filename=file.filename, language=language,
                    )
                return TranscriptionResponse(
                    text=result.text, language=result.language, model=result.model,
                    latency_ms=result.latency_ms, duration_seconds=result.duration_seconds,
                )

        # Fallback to Mistral API
        if api_available():
            result = await api_transcribe(
                audio_bytes=audio_bytes, filename=file.filename, language=language,
            )
            return TranscriptionResponse(
                text=result.text, language=result.language, model=result.model,
                duration_seconds=result.duration_seconds,
            )

        raise HTTPException(status_code=503, detail="Voxtral not available.")

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Voxtral transcription error: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/transcribe/auto", response_model=TranscriptionResponse)
async def transcribe_auto(
    response: Response,
    file: UploadFile = File(..., description="Audio file to transcribe"),
    language: Optional[str] = Form(None, description="Language hint"),
    prefer: str = Form("whisper", description="Preferred: 'whisper' or 'voxtral'"),
    auth: AuthResult = Depends(verify_api_key),
):
    """Auto-select best model with fallback chain."""
    if auth.rate_limit_remaining is not None:
        response.headers["X-RateLimit-Remaining"] = str(auth.rate_limit_remaining)

    if prefer == "voxtral":
        try:
            return await transcribe_voxtral(response, file, language or "de", False, auth)
        except Exception:
            await file.seek(0)
            return await transcribe_whisper(response, file, language, None, True, False, None, None, auth)
    else:
        try:
            return await transcribe_whisper(response, file, language, None, True, False, None, None, auth)
        except Exception:
            await file.seek(0)
            return await transcribe_voxtral(response, file, language or "de", False, auth)


@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    logger.error(f"Unhandled error: {exc}")
    return JSONResponse(
        status_code=500,
        content={"detail": "Internal server error", "error": str(exc)},
    )


if __name__ == "__main__":
    import uvicorn
    uvicorn.run("app.main:app", host="0.0.0.0", port=PORT, reload=False)
@@ -1,178 +0,0 @@
"""
vLLM Voxtral Service - Proxy to vLLM server for Voxtral transcription

vLLM provides optimized inference for Voxtral models with an OpenAI-compatible API.
This service proxies requests to the vLLM server.

Requirements:
- vLLM server running on VLLM_URL (default: http://localhost:8100)
- Model loaded: Voxtral-Mini-3B-2507 or Voxtral-Mini-4B-Realtime-2602
"""

import os
import logging
import time
import tempfile
import httpx
from pathlib import Path
from typing import Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# vLLM server configuration
VLLM_URL = os.getenv("VLLM_URL", "http://localhost:8100")
VLLM_TIMEOUT = int(os.getenv("VLLM_TIMEOUT", "300"))  # 5 minutes for long audio

# Model IDs
VOXTRAL_3B = "mistralai/Voxtral-Mini-3B-2507"
VOXTRAL_4B_REALTIME = "mistralai/Voxtral-Mini-4B-Realtime-2602"


@dataclass
class VllmTranscriptionResult:
    text: str
    language: Optional[str] = None
    model: str = "voxtral-vllm"
    latency_ms: Optional[float] = None
    duration_seconds: Optional[float] = None


async def check_health() -> dict:
    """Check if vLLM server is healthy."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.get(f"{VLLM_URL}/health")
            if response.status_code == 200:
                return {"status": "healthy", "url": VLLM_URL}
            return {"status": "unhealthy", "url": VLLM_URL, "code": response.status_code}
    except Exception as e:
        return {"status": "unavailable", "url": VLLM_URL, "error": str(e)}


async def get_models() -> list:
    """Get available models from vLLM server."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            response = await client.get(f"{VLLM_URL}/v1/models")
            if response.status_code == 200:
                data = response.json()
                return [m["id"] for m in data.get("data", [])]
            return []
    except Exception:
        return []


def is_available() -> bool:
    """Check if vLLM server is configured."""
    return bool(VLLM_URL)


async def transcribe_audio_bytes(
    audio_bytes: bytes,
    filename: str,
    language: Optional[str] = "de",
    model: Optional[str] = None,
) -> VllmTranscriptionResult:
    """
    Transcribe audio using vLLM Voxtral server.

    Args:
        audio_bytes: Raw audio bytes
        filename: Original filename (for format detection)
        language: Language code (de, en, fr, etc.)
        model: Model to use (defaults to Voxtral-Mini-3B-2507)

    Returns:
        VllmTranscriptionResult with transcription
    """
    start_time = time.time()
    model_id = model or VOXTRAL_3B

    logger.info(f"Transcribing via vLLM: {filename} ({len(audio_bytes)} bytes)")

    # Save to temp file (vLLM API accepts file uploads)
    ext = Path(filename).suffix or ".wav"
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(audio_bytes)
        tmp_path = tmp.name

    try:
        async with httpx.AsyncClient(timeout=VLLM_TIMEOUT) as client:
            # Use OpenAI-compatible transcription endpoint
            with open(tmp_path, "rb") as f:
                files = {"file": (filename, f, "audio/wav")}
                data = {
                    "model": model_id,
                    "language": language or "de",
                    "response_format": "json",
                    "temperature": 0.0,  # Deterministic for transcription
                }

                response = await client.post(
                    f"{VLLM_URL}/v1/audio/transcriptions",
                    files=files,
                    data=data,
                )

            if response.status_code != 200:
                error_detail = response.text
                logger.error(f"vLLM error: {response.status_code} - {error_detail}")
                raise RuntimeError(f"vLLM transcription failed: {error_detail}")

            result = response.json()
            text = result.get("text", "")
            duration = result.get("duration")

            latency_ms = (time.time() - start_time) * 1000
            logger.info(f"vLLM transcription complete: {len(text)} chars in {latency_ms:.0f}ms")

            return VllmTranscriptionResult(
                text=text.strip(),
                language=language,
                model=f"vllm-{model_id.split('/')[-1]}",
                latency_ms=latency_ms,
                duration_seconds=duration,
            )

    finally:
        try:
            os.unlink(tmp_path)
        except Exception:
            pass


async def transcribe_with_realtime(
    audio_bytes: bytes,
    filename: str,
    language: Optional[str] = "de",
) -> VllmTranscriptionResult:
    """
    Transcribe using Voxtral 4B Realtime model.

    Optimized for low latency (<500ms).
    """
    return await transcribe_audio_bytes(
        audio_bytes=audio_bytes,
        filename=filename,
        language=language,
        model=VOXTRAL_4B_REALTIME,
    )


# Supported languages (same as Voxtral)
SUPPORTED_LANGUAGES = [
    "en",  # English
    "zh",  # Chinese
    "hi",  # Hindi
    "es",  # Spanish
    "ar",  # Arabic
    "fr",  # French
    "pt",  # Portuguese
    "ru",  # Russian
    "de",  # German
    "ja",  # Japanese
    "ko",  # Korean
    "it",  # Italian
    "nl",  # Dutch
]
@@ -1,213 +0,0 @@
"""
Voxtral API Service - Mistral Cloud API Fallback
Uses Mistral's hosted Voxtral Mini Transcribe V2 when local service is overloaded.

Features:
- Speaker diarization
- Word-level timestamps
- Context biasing for domain-specific terms
- 13 language support
"""

import os
import logging
import tempfile
from pathlib import Path
from typing import Optional, Literal
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

# Lazy load client
_mistral_client = None

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
DEFAULT_MODEL = "voxtral-mini-latest"  # voxtral-mini-2602


@dataclass
class Speaker:
    """Speaker information from diarization."""
    id: str
    start: float
    end: float


@dataclass
class WordTimestamp:
    """Word-level timestamp."""
    word: str
    start: float
    end: float


@dataclass
class SegmentTimestamp:
    """Segment-level timestamp."""
    text: str
    start: float
    end: float
    speaker: Optional[str] = None


@dataclass
class VoxtralApiResult:
    """Result from Voxtral API transcription."""
    text: str
    language: Optional[str] = None
    model: str = "voxtral-api"
    duration_seconds: Optional[float] = None
    words: list[WordTimestamp] = field(default_factory=list)
    segments: list[SegmentTimestamp] = field(default_factory=list)
    speakers: list[Speaker] = field(default_factory=list)


def get_mistral_client():
    """Get or create Mistral client instance."""
    global _mistral_client

    if _mistral_client is None:
        if not MISTRAL_API_KEY:
            raise RuntimeError(
                "MISTRAL_API_KEY environment variable not set. "
                "Get your API key at https://console.mistral.ai/"
            )

        try:
            from mistralai import Mistral
            _mistral_client = Mistral(api_key=MISTRAL_API_KEY)
            logger.info("Mistral API client initialized")
        except ImportError:
            raise RuntimeError(
                "mistralai package not installed. "
                "Run: pip install mistralai"
            )

    return _mistral_client


def is_available() -> bool:
    """Check if Mistral API is configured and available."""
    return bool(MISTRAL_API_KEY)


async def transcribe_audio_bytes(
    audio_bytes: bytes,
    filename: str,
    language: Optional[str] = None,
    timestamp_granularity: Optional[Literal["word", "segment"]] = None,
    diarization: bool = False,
    context_bias: Optional[list[str]] = None,
) -> VoxtralApiResult:
    """
    Transcribe audio using Mistral's Voxtral API.

    Args:
        audio_bytes: Raw audio bytes
        filename: Original filename (for extension detection)
        language: Language code (de, en, fr, etc.) - auto-detect if None
        timestamp_granularity: "word" or "segment" for timestamps
        diarization: Enable speaker diarization
        context_bias: List of domain-specific terms to improve accuracy (max 100)

    Returns:
        VoxtralApiResult with transcription and optional metadata
    """
    client = get_mistral_client()

    logger.info(f"Transcribing via Mistral API: {filename} ({len(audio_bytes)} bytes)")

    try:
        # Build request parameters
        request_params = {
            "model": DEFAULT_MODEL,
            "file": {
                "content": audio_bytes,
                "file_name": filename,
            },
        }

        # Language and timestamps are mutually exclusive in current API
        if language and not timestamp_granularity:
            request_params["language"] = language

        if timestamp_granularity:
            request_params["timestamp_granularities"] = [timestamp_granularity]

        if diarization:
            request_params["diarization"] = True

        if context_bias:
            # API accepts comma-separated string, max 100 terms
            bias_terms = context_bias[:100]
            request_params["context_bias"] = ",".join(bias_terms)

        # Make API call
        response = client.audio.transcriptions.complete(**request_params)

        # Parse response
        result = VoxtralApiResult(
            text=response.text,
            language=getattr(response, "language", language),
            model=f"voxtral-api-{DEFAULT_MODEL}",
            duration_seconds=getattr(response, "duration", None),
        )

        # Parse word timestamps if present
        if hasattr(response, "words") and response.words:
            result.words = [
                WordTimestamp(
                    word=w.word,
                    start=w.start,
                    end=w.end,
                )
                for w in response.words
            ]

        # Parse segment timestamps if present
        if hasattr(response, "segments") and response.segments:
            result.segments = [
                SegmentTimestamp(
                    text=s.text,
                    start=s.start,
                    end=s.end,
                    speaker=getattr(s, "speaker", None),
                )
                for s in response.segments
            ]

        # Parse speakers if diarization enabled
        if hasattr(response, "speakers") and response.speakers:
            result.speakers = [
                Speaker(
                    id=sp.id,
                    start=sp.start,
                    end=sp.end,
                )
                for sp in response.speakers
            ]

        logger.info(f"Mistral API transcription complete: {len(result.text)} characters")
        return result

    except Exception as e:
        logger.error(f"Mistral API transcription failed: {e}")
        raise


# Supported languages by Voxtral API (13 languages)
SUPPORTED_LANGUAGES = [
    "en",  # English
    "zh",  # Chinese
    "hi",  # Hindi
    "es",  # Spanish
    "ar",  # Arabic
    "fr",  # French
    "pt",  # Portuguese
    "ru",  # Russian
    "de",  # German
    "ja",  # Japanese
    "ko",  # Korean
    "it",  # Italian
    "nl",  # Dutch
]
@ -1,267 +0,0 @@
|
|||
"""
|
||||
Voxtral STT Service using Hugging Face Transformers
|
||||
Mistral AI's Speech-to-Text model (Apache 2.0 License)
|
||||
|
||||
Uses VoxtralForConditionalGeneration with apply_transcription_request
|
||||
as per official HuggingFace documentation.
|
||||
"""
|
||||
|
||||
import os
|
||||
import tempfile
|
||||
import logging
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from dataclasses import dataclass
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Lazy load to avoid import errors
|
||||
_voxtral_model = None
|
||||
_voxtral_processor = None
|
||||
_model_name = None
|
||||
|
||||
# Default model
|
||||
DEFAULT_MODEL = "mistralai/Voxtral-Mini-3B-2507"
|
||||
|
||||
|
||||
@dataclass
|
||||
class VoxtralTranscriptionResult:
|
||||
text: str
|
||||
language: Optional[str] = None
|
||||
model: str = "voxtral-mini-3b"
|
||||
latency_ms: Optional[float] = None
|
||||
|
||||
|
||||
def get_voxtral_model(model_name: str = DEFAULT_MODEL):
|
||||
"""
|
||||
Get or create Voxtral model instance.
|
||||
|
||||
Uses VoxtralForConditionalGeneration (the correct class for Voxtral).
|
||||
"""
|
||||
global _voxtral_model, _voxtral_processor, _model_name
|
||||
|
||||
# Reload if different model requested
|
||||
if _voxtral_model is not None and _model_name != model_name:
|
||||
logger.info(f"Switching model from {_model_name} to {model_name}")
|
||||
_voxtral_model = None
|
||||
_voxtral_processor = None
|
||||
|
||||
if _voxtral_model is None:
|
||||
logger.info(f"Loading Voxtral model: {model_name}")
|
||||
try:
|
||||
import torch
|
||||
from transformers import VoxtralForConditionalGeneration, AutoProcessor
|
||||
|
||||
# Determine device and dtype
|
||||
if torch.backends.mps.is_available():
|
||||
device = "mps"
|
||||
# MPS works better with float16
|
||||
torch_dtype = torch.float16
|
||||
elif torch.cuda.is_available():
|
||||
device = "cuda"
|
||||
torch_dtype = torch.bfloat16
|
||||
else:
|
||||
device = "cpu"
|
||||
torch_dtype = torch.float32
|
||||
|
||||
logger.info(f"Using device: {device}, dtype: {torch_dtype}")
|
||||
|
||||
# Load processor
|
||||
_voxtral_processor = AutoProcessor.from_pretrained(model_name)
|
||||
|
||||
# Load model with VoxtralForConditionalGeneration
|
||||
if device == "mps":
|
||||
# MPS doesn't support device_map, load to CPU first then move
|
||||
_voxtral_model = VoxtralForConditionalGeneration.from_pretrained(
|
||||
model_name,
|
||||
torch_dtype=torch_dtype,
|
||||
)
|
||||
_voxtral_model = _voxtral_model.to(device)
|
||||
else:
|
||||
_voxtral_model = VoxtralForConditionalGeneration.from_pretrained(
|
||||
model_name,
|
||||
torch_dtype=torch_dtype,
|
||||
device_map=device,
|
||||
)
|
||||
|
||||
_model_name = model_name
|
||||
logger.info(f"Voxtral model loaded successfully on {device}")
|
||||
|
||||
except ImportError as e:
|
||||
logger.error(f"Failed to import transformers: {e}")
|
||||
raise RuntimeError(
|
||||
"transformers >= 4.54.0 required. "
|
||||
"Run: pip install --upgrade transformers"
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load Voxtral model: {e}")
|
||||
raise
|
||||
|
||||
return _voxtral_model, _voxtral_processor
|
||||
|
||||
|
||||
def transcribe_audio(
|
||||
audio_path: str,
|
||||
language: Optional[str] = "de",
|
||||
model_name: str = DEFAULT_MODEL,
|
||||
) -> VoxtralTranscriptionResult:
|
||||
"""
|
||||
Transcribe audio file using Voxtral.
|
||||
|
||||
Uses the official apply_transcription_request method.
|
||||
|
||||
Args:
|
||||
audio_path: Path to audio file
|
||||
language: Language code (de, en, fr, etc.)
|
||||
model_name: Hugging Face model ID
|
||||
|
||||
Returns:
|
||||
VoxtralTranscriptionResult with transcribed text
|
||||
"""
|
||||
import torch
|
||||
|
||||
model, processor = get_voxtral_model(model_name)
|
||||
device = next(model.parameters()).device
|
||||
dtype = next(model.parameters()).dtype
|
||||
|
||||
logger.info(f"Transcribing with Voxtral: {audio_path}")
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
# Use apply_transcription_request (official method)
|
||||
# This handles audio loading and preprocessing internally
|
||||
inputs = processor.apply_transcription_request(
|
||||
language=language or "en",
|
||||
audio=audio_path,
|
||||
model_id=model_name,
|
||||
)
|
||||
|
||||
# Move inputs to device and dtype
|
||||
inputs = inputs.to(device, dtype=dtype)
|
||||
|
||||
# Generate transcription
|
||||
with torch.no_grad():
|
||||
outputs = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=500,
|
||||
do_sample=False,
|
||||
)
|
||||
|
||||
# Decode - skip input tokens
|
||||
input_len = inputs.input_ids.shape[1]
|
||||
decoded = processor.batch_decode(
|
||||
outputs[:, input_len:],
|
||||
skip_special_tokens=True,
|
||||
)
|
||||
|
||||
text = decoded[0] if decoded else ""
|
||||
latency_ms = (time.time() - start_time) * 1000
|
||||
|
||||
logger.info(f"Voxtral transcription complete: {len(text)} chars in {latency_ms:.0f}ms")
|
||||
|
||||
return VoxtralTranscriptionResult(
|
||||
text=text.strip(),
|
||||
language=language,
|
||||
model=model_name.split("/")[-1],
|
||||
latency_ms=latency_ms,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Voxtral transcription failed: {e}")
|
||||
raise
|
||||
|
||||
|
||||
async def transcribe_audio_bytes(
|
||||
audio_bytes: bytes,
|
||||
filename: str,
|
||||
language: Optional[str] = "de",
|
||||
model_name: str = DEFAULT_MODEL,
|
||||
) -> VoxtralTranscriptionResult:
|
||||
"""
|
||||
Transcribe audio from bytes (for API uploads).
|
||||
"""
|
||||
ext = Path(filename).suffix or ".wav"
|
||||
|
||||
with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
|
||||
tmp.write(audio_bytes)
|
||||
tmp_path = tmp.name
|
||||
|
||||
try:
|
||||
result = transcribe_audio(
|
||||
audio_path=tmp_path,
|
||||
language=language,
|
||||
model_name=model_name,
|
||||
)
|
||||
return result
|
||||
finally:
|
||||
try:
|
||||
os.unlink(tmp_path)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
def unload_model():
|
||||
"""Unload model to free memory."""
|
||||
global _voxtral_model, _voxtral_processor, _model_name
|
||||
|
||||
if _voxtral_model is not None:
|
||||
del _voxtral_model
|
||||
del _voxtral_processor
|
||||
_voxtral_model = None
|
||||
_voxtral_processor = None
|
||||
_model_name = None
|
||||
|
||||
import gc
|
||||
gc.collect()
|
||||
|
||||
try:
|
||||
import torch
|
||||
if torch.backends.mps.is_available():
|
||||
torch.mps.empty_cache()
|
||||
elif torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
logger.info("Voxtral model unloaded")
|
||||
|
||||
|
||||
def is_loaded() -> bool:
|
||||
"""Check if model is currently loaded."""
|
||||
return _voxtral_model is not None
|
||||
|
||||
|
||||
def get_loaded_model_name() -> Optional[str]:
|
||||
"""Get name of currently loaded model."""
|
||||
return _model_name
|
||||
|
||||
|
||||
# Supported languages (13 languages as per Mistral docs)
|
||||
SUPPORTED_LANGUAGES = [
|
||||
"en", # English
|
||||
"zh", # Chinese
|
||||
"hi", # Hindi
|
||||
"es", # Spanish
|
||||
"ar", # Arabic
|
||||
"fr", # French
|
||||
"pt", # Portuguese
|
||||
"ru", # Russian
|
||||
"de", # German
|
||||
"ja", # Japanese
|
||||
"ko", # Korean
|
||||
"it", # Italian
|
||||
"nl", # Dutch
|
||||
]
|
||||
|
||||
# Available models
|
||||
AVAILABLE_MODELS = [
|
||||
{
|
||||
"id": "voxtral-mini-3b",
|
||||
"name": "Voxtral-Mini-3B-2507",
|
||||
"huggingface_id": "mistralai/Voxtral-Mini-3B-2507",
|
||||
"params": "3B",
|
||||
"vram": "~6GB",
|
||||
"description": "Balanced quality and speed for local deployment",
|
||||
},
|
||||
]
|
||||
|
|
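As a usage sketch, the load-state helpers above could back a small status/unload surface; the FastAPI router object and route paths here are assumptions for illustration, not part of the removed file:

```python
# Hypothetical wiring; the router object and route paths are assumptions.
from fastapi import APIRouter

router = APIRouter()

@router.get("/voxtral/status")
def voxtral_status():
    # is_loaded() / get_loaded_model_name() are defined in the module above
    return {"loaded": is_loaded(), "model": get_loaded_model_name()}

@router.post("/voxtral/unload")
def voxtral_unload():
    # Frees the memory held by the loaded Voxtral model (see unload_model above)
    unload_model()
    return {"unloaded": True}
```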
@ -1,114 +0,0 @@
"""
VRAM Manager — Automatic model unloading after idle timeout.

Tracks last usage time per model and unloads after configurable timeout.
Designed for shared GPU environments (multiple services on one RTX 3090).

Usage in a service:
    from vram_manager import VramManager

    vram = VramManager(idle_timeout=300)  # 5 min

    # Before using a model
    vram.touch()

    # Call periodically (e.g., from health check or background task)
    vram.check_and_unload(unload_fn=my_unload_function)
"""

import os
import time
import logging
import threading
from typing import Optional, Callable

logger = logging.getLogger(__name__)

DEFAULT_IDLE_TIMEOUT = int(os.getenv("VRAM_IDLE_TIMEOUT", "300"))  # 5 minutes


class VramManager:
    def __init__(self, idle_timeout: int = DEFAULT_IDLE_TIMEOUT, service_name: str = "unknown"):
        self.idle_timeout = idle_timeout
        self.service_name = service_name
        self.last_used: float = 0.0
        self.model_loaded: bool = False
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def touch(self):
        """Mark the model as recently used. Call before/after each inference."""
        with self._lock:
            self.last_used = time.time()
            self.model_loaded = True
            self._schedule_check()

    def mark_loaded(self):
        """Mark that a model has been loaded into VRAM."""
        with self._lock:
            self.model_loaded = True
            self.last_used = time.time()
            self._schedule_check()
            logger.info(f"[{self.service_name}] Model loaded, idle timeout: {self.idle_timeout}s")

    def mark_unloaded(self):
        """Mark that a model has been unloaded from VRAM."""
        with self._lock:
            self.model_loaded = False
            if self._timer:
                self._timer.cancel()
                self._timer = None
            logger.info(f"[{self.service_name}] Model unloaded, VRAM freed")

    def is_idle(self) -> bool:
        """Check if the model has been idle longer than the timeout."""
        if not self.model_loaded:
            return False
        return (time.time() - self.last_used) > self.idle_timeout

    def seconds_until_unload(self) -> Optional[float]:
        """Seconds until the model will be unloaded, or None if not loaded."""
        if not self.model_loaded:
            return None
        remaining = self.idle_timeout - (time.time() - self.last_used)
        return max(0, remaining)

    def check_and_unload(self, unload_fn: Callable[[], None]) -> bool:
        """Check if idle and unload if so. Returns True if unloaded."""
        if self.is_idle():
            logger.info(f"[{self.service_name}] Idle for >{self.idle_timeout}s, unloading model...")
            try:
                unload_fn()
                self.mark_unloaded()
                return True
            except Exception as e:
                logger.error(f"[{self.service_name}] Failed to unload: {e}")
        return False

    def _schedule_check(self):
        """Schedule an idle check after the timeout period."""
        if self._timer:
            self._timer.cancel()

        self._timer = threading.Timer(
            self.idle_timeout + 5,  # Small buffer
            self._auto_check,
        )
        self._timer.daemon = True
        self._timer.start()

    def _auto_check(self):
        """Auto-triggered idle check (called by timer)."""
        # This is just a log — actual unloading needs the unload_fn,
        # which depends on the service. The service should call check_and_unload.
        if self.is_idle():
            logger.info(f"[{self.service_name}] Model idle for >{self.idle_timeout}s — ready to unload")

    def status(self) -> dict:
        """Get current VRAM manager status."""
        return {
            "model_loaded": self.model_loaded,
            "idle_seconds": round(time.time() - self.last_used, 1) if self.model_loaded else None,
            "idle_timeout": self.idle_timeout,
            "seconds_until_unload": round(self.seconds_until_unload(), 1) if self.model_loaded else None,
        }
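The "call periodically" step from the docstring is left to each service; a minimal sketch of one way to drive it from an asyncio background task follows. The names `vram` and `unload_models` are illustrative, not part of the removed file:

```python
# Minimal sketch: periodic idle check from an asyncio background task.
# Assumes `vram` is a VramManager instance and `unload_models` is the
# service's own unload callback; both names are illustrative.
import asyncio

async def vram_idle_loop(vram, unload_models, interval: float = 30.0):
    """Poll the manager and free VRAM once the idle timeout has passed."""
    while True:
        await asyncio.sleep(interval)
        # check_and_unload() is a no-op unless is_idle() is True
        vram.check_and_unload(unload_fn=unload_models)
```

Note that the built-in `threading.Timer` in `_schedule_check` only logs readiness; a loop like the one above (or a health-check hook) has to call `check_and_unload` for VRAM to actually be freed.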
@ -1,358 +0,0 @@
"""
Whisper STT Service using WhisperX (CUDA)
Provides: transcription, word-level timestamps, speaker diarization.

WhisperX pipeline:
1. faster-whisper for transcription
2. wav2vec2 for forced alignment (precise word timestamps)
3. pyannote-audio for speaker diarization
"""

import os
import tempfile
import logging
from pathlib import Path
from typing import Optional
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

# Lazy load
_whisperx_model = None
_align_model = None
_align_metadata = None
_align_language = None
_diarize_pipeline = None

# Config
HF_TOKEN = os.getenv("HF_TOKEN", "")

# VRAM management — unload after 10 min idle (STT uses ~3GB)
from app.vram_manager import VramManager

_vram = VramManager(
    idle_timeout=int(os.getenv("VRAM_IDLE_TIMEOUT", "600")),
    service_name="mana-stt",
)


def unload_models():
    """Unload all WhisperX models from GPU to free VRAM."""
    global _whisperx_model, _align_model, _align_metadata, _align_language, _diarize_pipeline
    import torch

    if _whisperx_model is not None:
        del _whisperx_model
        _whisperx_model = None
    if _align_model is not None:
        del _align_model
        _align_model = None
        _align_metadata = None
        _align_language = None
    if _diarize_pipeline is not None:
        del _diarize_pipeline
        _diarize_pipeline = None

    torch.cuda.empty_cache()
    _vram.mark_unloaded()
    logger.info("WhisperX models unloaded, VRAM freed")


@dataclass
class WordSegment:
    word: str
    start: float
    end: float
    score: Optional[float] = None
    speaker: Optional[str] = None


@dataclass
class TranscriptionResult:
    text: str
    language: Optional[str] = None
    duration: Optional[float] = None
    segments: Optional[list] = None
    words: Optional[list[WordSegment]] = field(default_factory=list)
    speakers: Optional[list[str]] = field(default_factory=list)


def get_whisper_model(model_name: str = "large-v3", **kwargs):
    """Get or create WhisperX model instance (singleton)."""
    global _whisperx_model

    if _whisperx_model is not None:
        return _whisperx_model

    logger.info(f"Loading WhisperX model: {model_name}")
    try:
        import whisperx

        device = os.getenv("WHISPER_DEVICE", "cuda")
        compute_type = os.getenv("WHISPER_COMPUTE_TYPE", "float16")

        default_language = os.getenv("WHISPER_DEFAULT_LANGUAGE", "de")
        _whisperx_model = whisperx.load_model(
            model_name,
            device=device,
            compute_type=compute_type,
            language=default_language,
        )
        logger.info(f"WhisperX model loaded: {model_name} on {device} ({compute_type})")
        _vram.mark_loaded()
    except ImportError as e:
        logger.error(f"Failed to import whisperx: {e}")
        raise RuntimeError("whisperx not installed. Run: pip install whisperx")
    except Exception as e:
        logger.error(f"Failed to load WhisperX model: {e}")
        raise

    return _whisperx_model


def _get_align_model(language: str, device: str = "cuda"):
    """Get or create alignment model for a language."""
    global _align_model, _align_metadata, _align_language

    import whisperx

    # Reload if language changed (alignment models are language-specific)
    if _align_model is None or _align_language != language:
        logger.info(f"Loading alignment model for language: {language}")
        _align_model, _align_metadata = whisperx.load_align_model(
            language_code=language,
            device=device,
        )
        _align_language = language
        logger.info("Alignment model loaded")

    return _align_model, _align_metadata


def _get_diarize_pipeline(device: str = "cuda"):
    """Get or create speaker diarization pipeline."""
    global _diarize_pipeline

    if _diarize_pipeline is not None:
        return _diarize_pipeline

    import torch
    from pyannote.audio import Pipeline

    token = HF_TOKEN or os.getenv("HUGGING_FACE_HUB_TOKEN", "")
    if not token:
        logger.warning("No HF_TOKEN set — speaker diarization may fail for gated models")

    logger.info("Loading speaker diarization pipeline (pyannote)...")
    _diarize_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        token=token,
    )
    _diarize_pipeline.to(torch.device(device))
    logger.info("Diarization pipeline loaded")
    return _diarize_pipeline


def transcribe_audio(
    audio_path: str,
    language: Optional[str] = None,
    model_name: str = "large-v3",
    align: bool = True,
    diarize: bool = False,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> TranscriptionResult:
    """
    Transcribe audio using WhisperX with optional alignment and diarization.

    Args:
        audio_path: Path to audio file
        language: Language code (auto-detect if None)
        model_name: Whisper model to use
        align: Enable word-level timestamp alignment
        diarize: Enable speaker diarization
        min_speakers: Minimum expected speakers (helps diarization)
        max_speakers: Maximum expected speakers

    Returns:
        TranscriptionResult with text, word timestamps, and speaker info
    """
    import whisperx

    device = os.getenv("WHISPER_DEVICE", "cuda")
    model = get_whisper_model(model_name)

    logger.info(f"Transcribing: {audio_path} (align={align}, diarize={diarize})")

    # Check and unload if idle, then reload
    _vram.check_and_unload(unload_models)
    _vram.touch()

    # Step 1: Load audio
    audio = whisperx.load_audio(audio_path)

    # Step 2: Transcribe with faster-whisper
    transcribe_kwargs = {"batch_size": 16}
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio, **transcribe_kwargs)
    detected_language = result.get("language", language or "en")

    # Step 3: Align (word-level timestamps)
    if align and result["segments"]:
        try:
            align_model, metadata = _get_align_model(detected_language, device)
            result = whisperx.align(
                result["segments"],
                align_model,
                metadata,
                audio,
                device,
                return_char_alignments=False,
            )
            logger.info("Word alignment complete")
        except Exception as e:
            logger.warning(f"Alignment failed (continuing without): {e}")

    # Step 4: Diarize (speaker identification)
    if diarize:
        try:
            import torch
            import torchaudio

            diarize_pipe = _get_diarize_pipeline(device)

            # pyannote needs waveform as tensor, not the whisperx audio array
            waveform = torch.from_numpy(audio).unsqueeze(0).float()
            diarize_input = {"waveform": waveform, "sample_rate": 16000}

            diarize_kwargs = {}
            if min_speakers is not None:
                diarize_kwargs["min_speakers"] = min_speakers
            if max_speakers is not None:
                diarize_kwargs["max_speakers"] = max_speakers

            diarize_output = diarize_pipe(diarize_input, **diarize_kwargs)

            # pyannote 4.x returns DiarizeOutput, extract the Annotation
            if hasattr(diarize_output, "speaker_diarization"):
                diarize_annotation = diarize_output.speaker_diarization
            else:
                diarize_annotation = diarize_output

            # Convert pyannote output to DataFrame for whisperx
            import pandas as pd

            diarize_rows = []
            for turn, _, speaker in diarize_annotation.itertracks(yield_label=True):
                diarize_rows.append({
                    "start": turn.start,
                    "end": turn.end,
                    "speaker": speaker,
                })

            diarize_df = pd.DataFrame(diarize_rows)
            result = whisperx.assign_word_speakers(diarize_df, result)
            logger.info("Speaker diarization complete")
        except Exception as e:
            logger.warning(f"Diarization failed (continuing without): {e}")
            import traceback
            traceback.print_exc()

    # Build response
    segments = result.get("segments", [])
    full_text_parts = []
    all_words = []
    speaker_set = set()

    for seg in segments:
        full_text_parts.append(seg.get("text", ""))
        speaker = seg.get("speaker")
        if speaker:
            speaker_set.add(speaker)

        for word_info in seg.get("words", []):
            all_words.append(WordSegment(
                word=word_info.get("word", ""),
                start=word_info.get("start", 0.0),
                end=word_info.get("end", 0.0),
                score=word_info.get("score"),
                speaker=word_info.get("speaker", speaker),
            ))

    text = " ".join(full_text_parts)

    _vram.touch()
    logger.info(
        f"Transcription complete: {len(text)} chars, "
        f"{len(all_words)} words, {len(speaker_set)} speakers"
    )

    return TranscriptionResult(
        text=text.strip(),
        language=detected_language,
        segments=[{
            "start": s.get("start", 0),
            "end": s.get("end", 0),
            "text": s.get("text", ""),
            "speaker": s.get("speaker"),
        } for s in segments],
        words=all_words,
        speakers=sorted(speaker_set),
    )


async def transcribe_audio_bytes(
    audio_bytes: bytes,
    filename: str,
    language: Optional[str] = None,
    model_name: str = "large-v3",
    align: bool = True,
    diarize: bool = False,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> TranscriptionResult:
    """Transcribe audio from bytes (for API uploads)."""
    import asyncio

    ext = Path(filename).suffix or ".wav"

    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(audio_bytes)
        tmp_path = tmp.name

    try:
        # Run in thread pool to avoid blocking the event loop
        loop = asyncio.get_running_loop()
        result = await loop.run_in_executor(
            None,
            lambda: transcribe_audio(
                audio_path=tmp_path,
                language=language,
                model_name=model_name,
                align=align,
                diarize=diarize,
                min_speakers=min_speakers,
                max_speakers=max_speakers,
            ),
        )
        return result
    finally:
        try:
            os.unlink(tmp_path)
        except Exception:
            pass


# Available models
AVAILABLE_MODELS = [
    "tiny",
    "tiny.en",
    "base",
    "base.en",
    "small",
    "small.en",
    "medium",
    "medium.en",
    "large-v1",
    "large-v2",
    "large-v3",
    "large-v3-turbo",
    "distil-large-v2",
    "distil-large-v3",
]
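For reference, a minimal sketch of exercising this module directly; the file name and speaker bounds are placeholders:

```python
# Sketch: direct call with alignment and diarization enabled.
# "meeting.wav" and the speaker bounds are placeholder values.
result = transcribe_audio(
    audio_path="meeting.wav",
    language="de",
    align=True,
    diarize=True,
    min_speakers=2,
    max_speakers=4,
)

print(result.text)
# Word-level timestamps with speaker labels from the alignment + diarization steps
for w in result.words[:10]:
    print(f"{w.start:6.2f}-{w.end:6.2f}  {w.speaker or '?'}  {w.word}")
```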
@ -1,740 +0,0 @@
{
  "annotations": {
    "list": []
  },
  "description": "Mana Speech-to-Text Service Monitoring",
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 1,
  "links": [],
  "panels": [
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 },
      "id": 100,
      "panels": [],
      "title": "Overview",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [
            {
              "options": {
                "0": { "color": "red", "index": 1, "text": "DOWN" },
                "1": { "color": "green", "index": 0, "text": "UP" }
              },
              "type": "value"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "red", "value": null },
              { "color": "green", "value": 1 }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 0, "y": 1 },
      "id": 1,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "pluginVersion": "10.4.1",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "up{job=\"mana-stt\"}",
          "refId": "A"
        }
      ],
      "title": "Service Status",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [
            {
              "options": {
                "0": { "color": "yellow", "index": 0, "text": "Not Loaded" },
                "1": { "color": "green", "index": 1, "text": "Loaded" }
              },
              "type": "value"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "yellow", "value": null },
              { "color": "green", "value": 1 }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 3, "y": 1 },
      "id": 2,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "mana_stt_model_loaded{model=\"whisper\"}",
          "refId": "A"
        }
      ],
      "title": "Whisper Model",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [
            {
              "options": {
                "0": { "color": "yellow", "index": 0, "text": "Not Loaded" },
                "1": { "color": "green", "index": 1, "text": "Loaded" }
              },
              "type": "value"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "yellow", "value": null },
              { "color": "green", "value": 1 }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 6, "y": 1 },
      "id": 3,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "mana_stt_model_loaded{model=\"voxtral\"}",
          "refId": "A"
        }
      ],
      "title": "Voxtral Model",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "mappings": [],
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 9, "y": 1 },
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(mana_stt_requests_total{status=\"success\"})",
          "refId": "A"
        }
      ],
      "title": "Total Transcriptions",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "mappings": [],
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 12, "y": 1 },
      "id": 5,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(mana_stt_characters_transcribed_total)",
          "refId": "A"
        }
      ],
      "title": "Characters Transcribed",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 3 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 15, "y": 1 },
      "id": 6,
      "options": {
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(mana_stt_active_requests)",
          "refId": "A"
        }
      ],
      "title": "Active Requests",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1 }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 18, "y": 1 },
      "id": 7,
      "options": {
        "colorMode": "background",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(mana_stt_requests_total{status=\"error\"})",
          "refId": "A"
        }
      ],
      "title": "Total Errors",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "mappings": [],
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": { "h": 4, "w": 3, "x": 21, "y": 1 },
      "id": 8,
      "options": {
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "histogram_quantile(0.50, sum(rate(mana_stt_transcription_duration_seconds_bucket[5m])) by (le))",
          "refId": "A"
        }
      ],
      "title": "Median Duration",
      "type": "stat"
    },
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 5 },
      "id": 101,
      "panels": [],
      "title": "Performance",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 20,
            "gradientMode": "none",
            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
            "insertNulls": false,
            "lineInterpolation": "smooth",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": { "type": "linear" },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": { "group": "A", "mode": "none" },
            "thresholdsStyle": { "mode": "off" }
          },
          "mappings": [],
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 },
      "id": 10,
      "options": {
        "legend": {
          "calcs": ["mean", "max"],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "histogram_quantile(0.50, sum(rate(mana_stt_transcription_duration_seconds_bucket{model=\"whisper\"}[5m])) by (le))",
          "legendFormat": "Whisper p50",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "histogram_quantile(0.95, sum(rate(mana_stt_transcription_duration_seconds_bucket{model=\"whisper\"}[5m])) by (le))",
          "legendFormat": "Whisper p95",
          "refId": "B"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "histogram_quantile(0.50, sum(rate(mana_stt_transcription_duration_seconds_bucket{model=\"voxtral\"}[5m])) by (le))",
          "legendFormat": "Voxtral p50",
          "refId": "C"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "histogram_quantile(0.95, sum(rate(mana_stt_transcription_duration_seconds_bucket{model=\"voxtral\"}[5m])) by (le))",
          "legendFormat": "Voxtral p95",
          "refId": "D"
        }
      ],
      "title": "Transcription Duration (p50 / p95)",
      "type": "timeseries"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 20,
            "gradientMode": "none",
            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
            "insertNulls": false,
            "lineInterpolation": "smooth",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": { "type": "linear" },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": { "group": "A", "mode": "none" },
            "thresholdsStyle": { "mode": "off" }
          },
          "mappings": [],
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "reqps"
        },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 },
      "id": 11,
      "options": {
        "legend": {
          "calcs": ["mean", "sum"],
          "displayMode": "table",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(mana_stt_requests_total{model=\"whisper\", status=\"success\"}[5m]))",
          "legendFormat": "Whisper Success",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(mana_stt_requests_total{model=\"voxtral\", status=\"success\"}[5m]))",
          "legendFormat": "Voxtral Success",
          "refId": "B"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(mana_stt_requests_total{status=\"error\"}[5m]))",
          "legendFormat": "Errors",
          "refId": "C"
        }
      ],
      "title": "Request Rate",
      "type": "timeseries"
    },
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 14 },
      "id": 102,
      "panels": [],
      "title": "Details",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": { "hideFrom": { "legend": false, "tooltip": false, "viz": false } },
          "mappings": []
        },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 6, "x": 0, "y": 15 },
      "id": 12,
      "options": {
        "legend": { "displayMode": "list", "placement": "right", "showLegend": true },
        "pieType": "pie",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "tooltip": { "mode": "single", "sort": "none" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(mana_stt_requests_total{status=\"success\"}) by (model)",
          "legendFormat": "{{model}}",
          "refId": "A"
        }
      ],
      "title": "Requests by Model",
      "type": "piechart"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": { "hideFrom": { "legend": false, "tooltip": false, "viz": false } },
          "mappings": []
        },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 6, "x": 6, "y": 15 },
      "id": 13,
      "options": {
        "legend": { "displayMode": "list", "placement": "right", "showLegend": true },
        "pieType": "pie",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "tooltip": { "mode": "single", "sort": "none" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(mana_stt_requests_total{status=\"success\"}) by (language)",
          "legendFormat": "{{language}}",
          "refId": "A"
        }
      ],
      "title": "Requests by Language",
      "type": "piechart"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "bars",
            "fillOpacity": 80,
            "gradientMode": "none",
            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
            "insertNulls": false,
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": { "type": "linear" },
            "showPoints": "never",
            "spanNulls": false,
            "stacking": { "group": "A", "mode": "normal" },
            "thresholdsStyle": { "mode": "off" }
          },
          "mappings": [],
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "decbytes"
        },
        "overrides": []
      },
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 15 },
      "id": 14,
      "options": {
        "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(mana_stt_file_size_mb_sum{model=\"whisper\"}[5m])) * 1024 * 1024",
          "legendFormat": "Whisper",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(mana_stt_file_size_mb_sum{model=\"voxtral\"}[5m])) * 1024 * 1024",
          "legendFormat": "Voxtral",
          "refId": "B"
        }
      ],
      "title": "Data Processed",
      "type": "timeseries"
    },
    {
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 23 },
      "id": 103,
      "panels": [],
      "title": "Model Loading",
      "type": "row"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 30 },
              { "color": "red", "value": 60 }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": { "h": 6, "w": 8, "x": 0, "y": 24 },
      "id": 15,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "mana_stt_model_load_duration_seconds_sum{model=\"whisper\"} / mana_stt_model_load_duration_seconds_count{model=\"whisper\"}",
          "legendFormat": "Whisper",
          "refId": "A"
        }
      ],
      "title": "Whisper Load Time",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "thresholds" },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 60 },
              { "color": "red", "value": 120 }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": { "h": 6, "w": 8, "x": 8, "y": 24 },
      "id": 16,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": { "calcs": ["lastNotNull"], "fields": "", "values": false },
        "textMode": "auto"
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "mana_stt_model_load_duration_seconds_sum{model=\"voxtral\"} / mana_stt_model_load_duration_seconds_count{model=\"voxtral\"}",
          "legendFormat": "Voxtral",
          "refId": "A"
        }
      ],
      "title": "Voxtral Load Time",
      "type": "stat"
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisBorderShow": false,
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
            "insertNulls": false,
            "lineInterpolation": "smooth",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": { "type": "linear" },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": { "group": "A", "mode": "none" },
            "thresholdsStyle": { "mode": "off" }
          },
          "mappings": [],
          "thresholds": { "mode": "absolute", "steps": [{ "color": "green", "value": null }] },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": { "h": 6, "w": 8, "x": 16, "y": 24 },
      "id": 17,
      "options": {
        "legend": { "calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true },
        "tooltip": { "mode": "multi", "sort": "desc" }
      },
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(mana_stt_characters_transcribed_total{model=\"whisper\"}[5m]))",
          "legendFormat": "Whisper",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "prometheus" },
          "expr": "sum(rate(mana_stt_characters_transcribed_total{model=\"voxtral\"}[5m]))",
          "legendFormat": "Voxtral",
          "refId": "B"
        }
      ],
      "title": "Characters/sec Transcribed",
      "type": "timeseries"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 39,
  "tags": ["mana", "stt", "ai"],
  "templating": { "list": [] },
  "time": { "from": "now-1h", "to": "now" },
  "timepicker": {},
  "timezone": "browser",
  "title": "Mana STT Service",
  "uid": "mana-stt-dashboard",
  "version": 1,
  "weekStart": "monday"
}
@ -1,35 +0,0 @@
# Mana STT Service Dependencies
# For GPU Server (NVIDIA RTX 3090 / CUDA)

# Web Framework
fastapi==0.115.6
uvicorn[standard]==0.34.0
python-multipart==0.0.20

# Audio Processing
pydub==0.25.1
soundfile==0.13.1

# WhisperX (CUDA) — includes faster-whisper + alignment
whisperx @ git+https://github.com/m-bain/whisperX.git

# faster-whisper with CTranslate2 (CUDA backend)
faster-whisper>=1.1.0

# Speaker Diarization (pyannote.audio)
# Requires HF_TOKEN with accepted terms:
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
pyannote.audio>=3.3.0

# PyTorch CUDA — install separately for your CUDA version:
# pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
torch>=2.5.0
torchaudio>=2.5.0

# Utilities
numpy>=1.26.0
tqdm>=4.67.0

# External Auth (mana-core-auth integration)
httpx>=0.27.0
@ -1,28 +0,0 @@
# Mana STT Service Dependencies
# For Mac Mini M4 (Apple Silicon)

# Web Framework
fastapi==0.115.6
uvicorn[standard]==0.34.0
python-multipart==0.0.20

# Audio Processing
pydub==0.25.1
soundfile==0.13.1

# Whisper (Apple Silicon optimized)
lightning-whisper-mlx==0.0.10
mlx>=0.21.0

# Voxtral (Hugging Face Transformers)
transformers>=4.47.0
torch>=2.5.0
accelerate>=1.2.0
sentencepiece>=0.2.0

# Utilities
numpy>=1.26.0
tqdm>=4.67.0

# External Auth (mana-core-auth integration)
httpx>=0.27.0
@ -1,34 +0,0 @@
"""mana-stt service runner."""
import os
import sys

os.chdir(r"C:\mana\services\mana-stt")
sys.path.insert(0, r"C:\mana\services\mana-stt")

# Redirect stdout/stderr to log file FIRST (before any imports that warn)
log = open(r"C:\mana\services\mana-stt\service.log", "w", buffering=1)
sys.stdout = log
sys.stderr = log

# Load .env file
from dotenv import load_dotenv
load_dotenv(r"C:\mana\services\mana-stt\.env")

# Ensure FFmpeg is in PATH
ffmpeg_dir = r"C:\Users\tills\AppData\Local\Microsoft\WinGet\Links"
if ffmpeg_dir not in os.environ.get("PATH", ""):
    os.environ["PATH"] = ffmpeg_dir + os.pathsep + os.environ.get("PATH", "")

# Set HF token
hf_token = os.environ.get("HF_TOKEN", "")
if hf_token:
    os.environ["HUGGING_FACE_HUB_TOKEN"] = hf_token

# Pre-initialize CUDA before importing whisperx (avoids hangs)
import torch
if torch.cuda.is_available():
    torch.cuda.init()
    print(f"CUDA initialized: {torch.cuda.get_device_name(0)}", flush=True)

import uvicorn
uvicorn.run("app.main:app", host="0.0.0.0", port=3020, log_level="info")