mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-20 02:21:25 +02:00
- `X-Mana-LLM-Resolved: <provider>/<model>` header on non-streaming
responses. Streaming clients read the same info from each chunk's
`model` field (SSE headers go out before the chain is walked).
- Three new Prometheus metrics: `mana_llm_alias_resolved_total{alias,
target}` (which concrete model an alias resolved to per request),
`mana_llm_fallback_total{from_model, to_model, reason}` (each
fallback transition), `mana_llm_provider_healthy{provider}` (gauge,
mirrors the circuit-breaker).
- New debug endpoints: `GET /v1/aliases` (registry inspection — chain
+ description per alias, useful for confirming SIGHUP reloads),
`GET /v1/health` (full per-provider liveness snapshot — failure
counter, last error, unhealthy-until backoff).
- `kill -HUP <pid>` reloads `aliases.yaml`. Parse errors leave the
previous good state in memory and log the rejection.
- `ProviderHealthCache.add_listener()` for cache→metrics decoupling:
the gauge is updated via a transition-only listener wired in main.py
rather than the cache importing prometheus_client itself.
- Request-side metrics now use the requested model string; success-side
metrics use the resolved one. So `mana_llm_llm_requests_total{provider="ollama",
model="gemma3:12b"}` reflects actual upstream load even when callers
used `mana/long-form` aliases.
16 new observability tests (test_m4_observability.py): listener
fire-on-transition semantics, exception-isolation, multi-listener,
counter increments, gauge writes, end-to-end alias→metric flow,
v1/aliases + v1/health endpoint shape, response.model carries the
resolved target after fallback. Total suite: 115/115 in 1.6s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
376 lines
13 KiB
Markdown
# mana-llm

Central LLM abstraction service providing a unified OpenAI-compatible API for Ollama and cloud LLM providers.

## Overview

mana-llm acts as a central gateway for all LLM requests in the monorepo, providing:

- Unified OpenAI-compatible API
- Provider routing (Ollama, OpenRouter, Groq, Together)
- Streaming via Server-Sent Events (SSE)
- Vision/multimodal support
- Embeddings generation
- Prometheus metrics

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                            Consumer Apps                            │
│  chat-backend  │  mana web  │  todo (LLM enrich)  │  etc.           │
└────────────────────────────────┬────────────────────────────────────┘
                                 │ HTTP/SSE
                                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        mana-llm (Port 3025)                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │
│  │   Router    │  │    Cache    │  │   Metrics   │                  │
│  │ (Provider)  │  │   (Redis)   │  │ (Prometheus)│                  │
│  └──────┬──────┘  └─────────────┘  └─────────────┘                  │
│         │                                                           │
│  ┌──────┴──────────────────────────────────────────┐                │
│  │                Provider Adapters                │                │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │                │
│  │  │  Ollama  │  │  OpenAI  │  │  OpenRouter  │   │                │
│  │  │ Adapter  │  │ Adapter  │  │   Adapter    │   │                │
│  │  └──────────┘  └──────────┘  └──────────────┘   │                │
│  └─────────────────────────────────────────────────┘                │
└─────────────────────────────────────────────────────────────────────┘
```

## Quick Start

### Prerequisites

- Python 3.11+
- Ollama running locally (http://localhost:11434)
- Redis (optional, for caching)

### Development

```bash
cd services/mana-llm

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Copy environment file
cp .env.example .env

# Start Redis (optional)
docker-compose -f docker-compose.dev.yml up -d

# Run service
python -m uvicorn src.main:app --port 3025 --reload
```

### Docker

```bash
# Full stack (mana-llm + Redis)
docker-compose up -d

# View logs
docker-compose logs -f mana-llm
```

## API Endpoints

### Chat Completions

```bash
# Non-streaming
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/gemma3:4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming (SSE)
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/gemma3:4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

### Vision/Multimodal

```bash
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llava:7b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }]
  }'
```

### Models

```bash
# List all models
curl http://localhost:3025/v1/models

# Get specific model
curl http://localhost:3025/v1/models/ollama/gemma3:4b
```

### Embeddings

```bash
curl -X POST http://localhost:3025/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/nomic-embed-text",
    "input": "Text to embed"
  }'
```

### Health & Metrics

```bash
# Liveness summary (legacy, terse shape: only status + per-provider status string)
curl http://localhost:3025/health

# Detailed per-provider liveness snapshot (M4)
curl http://localhost:3025/v1/health

# Prometheus metrics
curl http://localhost:3025/metrics
```

### Aliases (M4; see `aliases.yaml` and the fallback section below)

```bash
# What does each `mana/<class>` resolve to?
curl http://localhost:3025/v1/aliases
```

## Aliases & Fallback

> Background: [`docs/plans/llm-fallback-aliases.md`](../../docs/plans/llm-fallback-aliases.md)

### What callers send

The `model` field of `/v1/chat/completions` accepts two shapes:

1. **Aliases** in the reserved `mana/` namespace (recommended for product code).
   The router resolves them via `aliases.yaml` to a chain of concrete
   `provider/model` strings and tries them in order.
2. **Direct `provider/model`**: bypasses the alias layer, so there is no fallback.
   Useful for tests, debugging, and one-off integrations.

| Alias | Class |
|---|---|
| `mana/fast-text` | Short answers, classification, single-shot Q&A |
| `mana/long-form` | Writing, essays, stories, longer prose |
| `mana/structured` | JSON output (comic storyboards, research subqueries, tag suggestions) |
| `mana/reasoning` | Agent missions, tool calls, multi-step plans |
| `mana/vision` | Multimodal (image + text) |

The chain for each alias lives in `services/mana-llm/aliases.yaml`. Edit
the file and run `kill -HUP <pid>` to reload; no restart is needed. A reload
that fails to parse keeps the previous good state in memory and logs the
rejection, so check the service logs after reloading.

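The "keep the previous good state" rule can be sketched as parse first, swap second. This is a minimal illustration, not the service's code: `AliasRegistry` and `install_sighup_handler` are hypothetical names, the flat alias-to-chain layout is an assumed file shape, and `json.loads` stands in for a YAML parser so the sketch stays stdlib-only.

```python
import json
import logging
import signal

log = logging.getLogger("alias-registry")


class AliasRegistry:
    """Holds the alias -> chain mapping and swaps it only on a valid reload."""

    def __init__(self, initial_text: str) -> None:
        # The first load must succeed; there is no earlier state to fall back on.
        self._aliases = self._parse(initial_text)

    @staticmethod
    def _parse(text: str) -> dict[str, list[str]]:
        # ASSUMPTION: json.loads stands in for a YAML parser (stdlib-only sketch).
        raw = json.loads(text)
        if not isinstance(raw, dict):
            raise ValueError("top level must be a mapping")
        parsed: dict[str, list[str]] = {}
        for alias, chain in raw.items():
            if not alias.startswith("mana/"):
                raise ValueError(f"alias outside mana/ namespace: {alias!r}")
            if not isinstance(chain, list) or not chain:
                raise ValueError(f"{alias}: chain must be a non-empty list")
            if not all(isinstance(m, str) and "/" in m for m in chain):
                raise ValueError(f"{alias}: entries must be provider/model strings")
            parsed[alias] = list(chain)
        return parsed

    def reload(self, text: str) -> bool:
        """Parse first, swap second; a bad file never replaces good state."""
        try:
            fresh = self._parse(text)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            log.error("alias reload rejected: %s", exc)
            return False
        self._aliases = fresh
        return True

    def chain(self, alias: str) -> list[str]:
        return list(self._aliases[alias])


def install_sighup_handler(registry: AliasRegistry, path: str) -> None:
    """Re-read `path` whenever the process receives SIGHUP (POSIX only)."""

    def _on_hup(signum, frame):
        try:
            with open(path, encoding="utf-8") as fh:
                registry.reload(fh.read())
        except OSError as exc:
            log.error("alias reload rejected: %s", exc)

    signal.signal(signal.SIGHUP, _on_hup)
```

Because `reload` only assigns after `_parse` returns, a half-valid file can never leave the registry in a mixed state.
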
### Fallback semantics

Every chain is tried in order. The router skips an entry if the provider
isn't configured in this deployment (no API key) or is currently marked
unhealthy by the health cache. For each remaining entry the request is
attempted; on a **retryable** error (connection failure, timeout, 5xx,
rate limit, `RemoteProtocolError`) the provider is marked unhealthy and
the next entry is tried. **Non-retryable** errors (auth, capability,
content-blocked, other 4xx, unknown exception types) propagate
immediately: there is no fallback, and the cache is not poisoned.

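The walk described above might look roughly like the following. It is a sketch under stated assumptions rather than the router's actual code: `RetryableError`, `AllProvidersFailed`, `walk_chain` and the injected callables are hypothetical names, and the skip reasons reuse the `unconfigured` / `cache-unhealthy` strings from the metrics table.

```python
class RetryableError(Exception):
    """Stand-in for connection failures, timeouts, 5xx and rate limits."""


class AllProvidersFailed(Exception):
    """Raised when every entry was skipped or failed."""

    def __init__(self, attempts):
        super().__init__("no provider in the chain could serve the request")
        self.attempts = attempts  # list of (model, reason) tuples


def walk_chain(chain, call, is_configured, is_healthy, mark_unhealthy):
    """Try each provider/model in order and return (model, result)."""
    attempts = []
    for model in chain:
        provider = model.split("/", 1)[0]
        if not is_configured(provider):
            attempts.append((model, "unconfigured"))
            continue
        if not is_healthy(provider):
            attempts.append((model, "cache-unhealthy"))
            continue
        try:
            return model, call(model)
        except RetryableError as exc:
            # Retryable: record the failure and move on to the next entry.
            mark_unhealthy(provider)
            attempts.append((model, type(exc).__name__))
        # Any other exception propagates untouched: no fallback and no
        # health change, so one bad request cannot poison the cache.
    raise AllProvidersFailed(attempts)
```

Catching only the retryable class is the design point: anything unknown is treated as non-retryable by default, so an unexpected error type never triggers a silent fallback.
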
Streaming follows the same logic up to the **first byte**. Once a chunk
has been yielded the provider is committed; mid-stream errors surface
as-is so we never splice two providers' voices into one output.

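The first-byte commit rule can be illustrated with plain generators. This is a simplified, synchronous sketch (the service itself streams over SSE); `stream_with_fallback`, `open_stream` and `RetryableError` are hypothetical names, and health-cache bookkeeping is left out to keep the commit point visible.

```python
class RetryableError(Exception):
    """Stand-in for errors that justify trying the next provider."""


def stream_with_fallback(chain, open_stream):
    """Yield chunks from the first provider that produces a first chunk."""
    last_error = None
    for model in chain:
        try:
            stream = iter(open_stream(model))
            first = next(stream)
        except RetryableError as exc:
            # Nothing has reached the caller yet, so falling back is safe.
            last_error = exc
            continue
        except StopIteration:
            return  # the provider produced an empty stream
        # Committed: from here on, errors from this provider propagate as-is.
        yield first
        yield from stream
        return
    raise RuntimeError("every provider failed before the first chunk") from last_error
```

Only the `next(stream)` call sits inside the `try`; everything after the first `yield` is deliberately unguarded, which is what prevents output from two providers being spliced together.
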
If every entry was skipped or failed, the response is `503` carrying a
structured `attempts: list[(model, reason)]` log so the cause is
visible to the caller, not only in service logs.

### Resolved-model header

Non-streaming responses carry `X-Mana-LLM-Resolved: <provider>/<model>`
(e.g. `groq/llama-3.3-70b-versatile`), naming the concrete model that
actually answered. Use this for token-cost attribution when the request
used an alias. For streaming, each chunk's `model` field carries the
same info, because the SSE headers go out before the chain is walked.

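On the client side, cost attribution could read the resolved model with a small helper like the one below. Both function names are hypothetical, and the sketch assumes the response headers arrive as a plain mapping and that stream chunks are OpenAI-style `data: {...}` lines.

```python
import json


def resolved_model(headers, body=None):
    """Return the concrete provider/model that served a response, if known."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    from_header = lowered.get("x-mana-llm-resolved")
    if from_header:
        return from_header
    # No header (e.g. streaming): fall back to the `model` field of the body.
    if body is not None:
        return body.get("model")
    return None


def resolved_from_sse_line(line):
    """Extract `model` from one `data: {...}` SSE line; None for other lines."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload).get("model")
```
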
### Health-cache + probe

`ProviderHealthCache` keeps a per-provider circuit-breaker:

* 1 failure: still healthy (transient blip, don't bounce).
* 2 consecutive failures: `is_healthy → False` for 60 s; the router
  fails fast and moves straight to the next chain entry.
* After 60 s: half-open. The next call exercises the provider; success
  fully resets, failure re-arms the backoff.

A background `HealthProbe` task runs every 30 s with a 3 s timeout per
provider, calling cheap endpoints (`/api/tags` for Ollama, `/v1/models`
for OpenAI-compatible providers). One bad probe can't sink the loop;
results feed into the same cache as the call-site fallback.

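The three breaker states above can be captured in a few lines. This is an illustrative sketch, not the real `ProviderHealthCache`: the class name, method names and the injectable `clock` parameter are assumptions, with the 2-failure threshold and 60 s backoff taken from the list above.

```python
import time


class ProviderHealthSketch:
    """Per-provider circuit breaker: 2 consecutive failures -> 60 s backoff."""

    def __init__(self, threshold=2, backoff_s=60.0, clock=time.monotonic):
        self._threshold = threshold
        self._backoff_s = backoff_s
        self._clock = clock          # injectable so tests can control time
        self._failures = {}          # provider -> consecutive failure count
        self._unhealthy_until = {}   # provider -> monotonic deadline

    def is_healthy(self, provider):
        deadline = self._unhealthy_until.get(provider)
        # Healthy if never tripped, or if the backoff has expired (half-open).
        return deadline is None or self._clock() >= deadline

    def record_failure(self, provider):
        count = self._failures.get(provider, 0) + 1
        self._failures[provider] = count
        if count >= self._threshold:
            # Trips on the 2nd failure and re-arms on a half-open failure.
            self._unhealthy_until[provider] = self._clock() + self._backoff_s

    def record_success(self, provider):
        # A success fully resets the breaker.
        self._failures.pop(provider, None)
        self._unhealthy_until.pop(provider, None)
```

Half-open needs no separate state here: once the deadline passes, `is_healthy` returns `True` again, and because the failure count is still at the threshold, a single further failure re-arms the backoff.
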
### Prometheus metrics added in M4

| Metric | Labels | Purpose |
|---|---|---|
| `mana_llm_alias_resolved_total` | `alias`, `target` | How often an alias resolved to which concrete model; useful for spotting cases where the primary always falls through. |
| `mana_llm_fallback_total` | `from_model`, `to_model`, `reason` | Each fallback transition. `reason` is the exception class name or `cache-unhealthy` / `unconfigured`. |
| `mana_llm_provider_healthy` | `provider` | Gauge: 1 healthy, 0 in backoff. Mirrors the circuit-breaker. |

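One way to let a gauge mirror the circuit-breaker without the cache importing a metrics library is a transition-only listener, in the spirit of `ProviderHealthCache.add_listener()`. The sketch below is hypothetical throughout: `HealthState` and `set_healthy` are illustrative names, and a plain dict stands in for the Prometheus gauge.

```python
import logging

log = logging.getLogger("health-listener")


class HealthState:
    """Tracks healthy/unhealthy per provider and notifies on transitions only."""

    def __init__(self):
        self._healthy = {}     # provider -> bool; unseen providers count as healthy
        self._listeners = []

    def add_listener(self, callback):
        """Register callback(provider: str, healthy: bool)."""
        self._listeners.append(callback)

    def set_healthy(self, provider, healthy):
        previous = self._healthy.get(provider, True)
        self._healthy[provider] = healthy
        if previous == healthy:
            return  # no transition, so listeners stay quiet
        for callback in self._listeners:
            try:
                callback(provider, healthy)
            except Exception:
                # One failing listener must not block the others or the caller.
                log.exception("health listener failed for %s", provider)


# Stand-in for the gauge: provider -> 1 (healthy) or 0 (in backoff).
gauge = {}
state = HealthState()
state.add_listener(lambda provider, healthy: gauge.__setitem__(provider, int(healthy)))
```

Wiring the listener at the application entry point keeps the dependency pointing one way: the metrics code knows about the cache, but not the reverse.
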
## Provider Routing

Models use the format `provider/model`:

| Model | Provider | Target |
|-------|----------|--------|
| `ollama/gemma3:4b` | Ollama | localhost:11434 |
| `ollama/llava:7b` | Ollama | localhost:11434 |
| `openrouter/meta-llama/llama-3.1-8b-instruct` | OpenRouter | openrouter.ai |
| `groq/llama-3.1-8b-instant` | Groq | api.groq.com |
| `together/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo` | Together | api.together.xyz |

**Default:** If no provider prefix is given (e.g., `gemma3:4b`), Ollama is used.

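Splitting a model string under these rules can be sketched as below. `split_model` and the constants are hypothetical names; treating an unrecognized prefix as part of an Ollama model name, and rejecting unresolved `mana/` aliases at this step, are assumptions of the sketch rather than documented behavior.

```python
KNOWN_PROVIDERS = frozenset({"ollama", "openrouter", "groq", "together"})
DEFAULT_PROVIDER = "ollama"
ALIAS_NAMESPACE = "mana"


def split_model(model: str) -> tuple[str, str]:
    """Split a model string into (provider, model_name)."""
    # Split on the first slash only: model names may contain slashes themselves.
    prefix, sep, rest = model.partition("/")
    if sep and prefix == ALIAS_NAMESPACE:
        # ASSUMPTION: aliases are expanded to a chain before this step.
        raise ValueError(f"{model!r} is an alias; resolve it to a chain first")
    if sep and prefix in KNOWN_PROVIDERS and rest:
        return prefix, rest
    # No recognized provider prefix: treat the whole string as an Ollama model.
    return DEFAULT_PROVIDER, model
```
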
## Configuration

Environment variables (see `.env.example`):

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | 3025 | Service port |
| `LOG_LEVEL` | info | Logging level |
| `OLLAMA_URL` | http://localhost:11434 | Ollama server URL |
| `OLLAMA_DEFAULT_MODEL` | gemma3:4b | Default Ollama model |
| `OLLAMA_TIMEOUT` | 120 | Ollama request timeout (seconds) |
| `OPENROUTER_API_KEY` | - | OpenRouter API key |
| `GROQ_API_KEY` | - | Groq API key |
| `TOGETHER_API_KEY` | - | Together API key |
| `REDIS_URL` | - | Redis URL for caching |
| `CACHE_TTL` | 3600 | Cache TTL in seconds |
| `CORS_ORIGINS` | localhost | Allowed CORS origins |

## Project Structure

```
services/mana-llm/
├── src/
│   ├── main.py              # FastAPI app entry point
│   ├── config.py            # Settings via pydantic-settings
│   ├── providers/
│   │   ├── base.py          # Abstract provider interface
│   │   ├── ollama.py        # Ollama provider
│   │   ├── openai_compat.py # OpenAI-compatible provider
│   │   └── router.py        # Provider routing logic
│   ├── models/
│   │   ├── requests.py      # Request Pydantic models
│   │   └── responses.py     # Response Pydantic models
│   ├── streaming/
│   │   └── sse.py           # SSE response handling
│   └── utils/
│       ├── cache.py         # Redis caching
│       └── metrics.py       # Prometheus metrics
├── tests/
│   ├── test_api.py          # API endpoint tests
│   ├── test_providers.py    # Provider tests
│   └── test_streaming.py    # Streaming tests
├── aliases.yaml             # Alias chains (reloaded on SIGHUP)
├── Dockerfile
├── docker-compose.yml
├── docker-compose.dev.yml
├── requirements.txt
├── pyproject.toml
└── .env.example
```

## Testing

```bash
# Run tests
pytest

# Run with coverage
pytest --cov=src

# Run specific test file
pytest tests/test_providers.py -v
```

## Integration Example

### TypeScript/Node.js Client

```typescript
// Using fetch
const response = await fetch('http://localhost:3025/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama/gemma3:4b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```

### Streaming with fetch

`EventSource` only issues GET requests, so the POST-based completions endpoint is read through `fetch` and a stream reader instead.

```typescript
const response = await fetch('http://localhost:3025/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama/gemma3:4b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

if (!response.ok || !response.body) {
  throw new Error(`Request failed: ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = ''; // holds a partial line when an event spans two reads
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? '';

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6).trim();
    if (data === '[DONE]') {
      finished = true; // end-of-stream sentinel: stop reading as well
      break;
    }

    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}
```

## Related Services

| Service | Port | Description |
|---------|------|-------------|
| mana-tts | 3022 | Text-to-speech service |
| mana-stt | 3023 | Speech-to-text service |
| mana-search | 3021 | Web search & extraction |