mana-llm
Central LLM abstraction service providing a unified OpenAI-compatible API for Ollama and cloud LLM providers.
Overview
mana-llm acts as a central gateway for all LLM requests in the monorepo, providing:
- Unified OpenAI-compatible API
- Provider routing (Ollama, OpenRouter, Groq, Together)
- Streaming via Server-Sent Events (SSE)
- Vision/multimodal support
- Embeddings generation
- Prometheus metrics
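Because the API is OpenAI-compatible, existing OpenAI SDKs can be pointed at the gateway unchanged. A minimal sketch with the official `openai` Python package (the `api_key` value is a placeholder; whether the gateway validates it is an assumption):

```python
from openai import OpenAI

# Point the standard OpenAI client at mana-llm instead of api.openai.com.
# The client requires an api_key argument; the gateway is assumed to ignore it.
client = OpenAI(base_url="http://localhost:3025/v1", api_key="unused")

resp = client.chat.completions.create(
    model="ollama/gemma3:4b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```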
Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Consumer Apps │
│ matrix-ollama-bot │ telegram-ollama-bot │ chat-backend │ etc. │
└────────────────────────────────┬────────────────────────────────────┘
│ HTTP/SSE
▼
┌─────────────────────────────────────────────────────────────────────┐
│ mana-llm (Port 3025) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Router │ │ Cache │ │ Metrics │ │
│ │ (Provider) │ │ (Redis) │ │ (Prometheus)│ │
│ └──────┬──────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ┌──────┴──────────────────────────────────────────┐ │
│ │ Provider Adapters │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │ │
│ │ │ Ollama │ │ OpenAI │ │ OpenRouter │ │ │
│ │ │ Adapter │ │ Adapter │ │ Adapter │ │ │
│ │ └──────────┘ └──────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
Quick Start
Prerequisites
- Python 3.11+
- Ollama running locally (http://localhost:11434)
- Redis (optional, for caching)
Development
```bash
cd services/mana-llm

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Copy environment file
cp .env.example .env

# Start Redis (optional)
docker-compose -f docker-compose.dev.yml up -d

# Run service
python -m uvicorn src.main:app --port 3025 --reload
```
Docker
```bash
# Full stack (mana-llm + Redis)
docker-compose up -d

# View logs
docker-compose logs -f mana-llm
```
API Endpoints
Chat Completions
```bash
# Non-streaming
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/gemma3:4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming (SSE)
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/gemma3:4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
Vision/Multimodal
```bash
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llava:7b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }]
  }'
```
Models
```bash
# List all models
curl http://localhost:3025/v1/models

# Get specific model
curl http://localhost:3025/v1/models/ollama/gemma3:4b
```
Embeddings
```bash
curl -X POST http://localhost:3025/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/nomic-embed-text",
    "input": "Text to embed"
  }'
```
Health & Metrics
```bash
# Health check
curl http://localhost:3025/health

# Prometheus metrics
curl http://localhost:3025/metrics
```
Provider Routing
Models use the format `provider/model`:

| Model | Provider | Target |
|---|---|---|
| `ollama/gemma3:4b` | Ollama | `localhost:11434` |
| `ollama/llava:7b` | Ollama | `localhost:11434` |
| `openrouter/meta-llama/llama-3.1-8b-instruct` | OpenRouter | `openrouter.ai` |
| `groq/llama-3.1-8b-instant` | Groq | `api.groq.com` |
| `together/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo` | Together | `api.together.xyz` |

Default: If no provider prefix is given (e.g., `gemma3:4b`), Ollama is used. A sketch of this rule follows the table.
Configuration
Environment variables (see `.env.example`):

| Variable | Default | Description |
|---|---|---|
| `PORT` | `3025` | Service port |
| `LOG_LEVEL` | `info` | Logging level |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_DEFAULT_MODEL` | `gemma3:4b` | Default Ollama model |
| `OLLAMA_TIMEOUT` | `120` | Ollama request timeout (seconds) |
| `OPENROUTER_API_KEY` | - | OpenRouter API key |
| `GROQ_API_KEY` | - | Groq API key |
| `TOGETHER_API_KEY` | - | Together API key |
| `REDIS_URL` | - | Redis URL for caching |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
| `CORS_ORIGINS` | `localhost` | Allowed CORS origins |
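Since `src/config.py` loads settings via pydantic-settings (see Project Structure below), the table maps to a settings class along these lines. Field names here are assumptions mirroring the variable names; the authoritative definitions live in `src/config.py`:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Defaults mirror the table above; env vars and .env override them.
    model_config = SettingsConfigDict(env_file=".env")

    port: int = 3025
    log_level: str = "info"
    ollama_url: str = "http://localhost:11434"
    ollama_default_model: str = "gemma3:4b"
    ollama_timeout: int = 120
    openrouter_api_key: str | None = None
    groq_api_key: str | None = None
    together_api_key: str | None = None
    redis_url: str | None = None
    cache_ttl: int = 3600
    cors_origins: str = "localhost"

settings = Settings()
```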
Project Structure
```
services/mana-llm/
├── src/
│ ├── main.py # FastAPI app entry point
│ ├── config.py # Settings via pydantic-settings
│ ├── providers/
│ │ ├── base.py # Abstract provider interface
│ │ ├── ollama.py # Ollama provider
│ │ ├── openai_compat.py # OpenAI-compatible provider
│ │ └── router.py # Provider routing logic
│ ├── models/
│ │ ├── requests.py # Request Pydantic models
│ │ └── responses.py # Response Pydantic models
│ ├── streaming/
│ │ └── sse.py # SSE response handling
│ └── utils/
│ ├── cache.py # Redis caching
│ └── metrics.py # Prometheus metrics
├── tests/
│ ├── test_api.py # API endpoint tests
│ ├── test_providers.py # Provider tests
│ └── test_streaming.py # Streaming tests
├── Dockerfile
├── docker-compose.yml
├── docker-compose.dev.yml
├── requirements.txt
├── pyproject.toml
└── .env.example
```
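The abstract interface in `src/providers/base.py` boils down to a small contract that each adapter implements. The method names below are hypothetical; only the shape is implied by the endpoints above:

```python
from abc import ABC, abstractmethod
from typing import Any, AsyncIterator

class Provider(ABC):
    """Hypothetical provider contract: one adapter per upstream API."""

    @abstractmethod
    async def chat(self, request: dict[str, Any]) -> dict[str, Any]:
        """Return a complete chat response (stream=false)."""

    @abstractmethod
    def chat_stream(self, request: dict[str, Any]) -> AsyncIterator[dict[str, Any]]:
        """Yield OpenAI-style chunks to be relayed as SSE (stream=true)."""

    @abstractmethod
    async def embeddings(self, request: dict[str, Any]) -> dict[str, Any]:
        """Return embeddings for the given input."""
```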
Testing
```bash
# Run tests
pytest

# Run with coverage
pytest --cov=src

# Run specific test file
pytest tests/test_providers.py -v
```
Integration Examples
TypeScript/Node.js Client
```typescript
// Using fetch
const response = await fetch('http://localhost:3025/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama/gemma3:4b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```
Streaming with fetch

EventSource only supports GET requests, so the POSTed SSE stream is read manually from the response body:
```typescript
const response = await fetch('http://localhost:3025/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama/gemma3:4b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

outer: while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // SSE events may be split across network chunks: buffer partial lines.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? '';

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') break outer; // stop reading entirely, not just this chunk
    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}
```
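Python Client

For Python consumers, the same SSE stream can be read with httpx (an assumed dependency here; any HTTP client that exposes the response line by line works):

```python
import json

import httpx

payload = {
    "model": "ollama/gemma3:4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

# Stream the response and print delta content as it arrives.
with httpx.stream(
    "POST", "http://localhost:3025/v1/chat/completions", json=payload, timeout=None
) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        content = chunk["choices"][0].get("delta", {}).get("content")
        if content:
            print(content, end="", flush=True)
```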
Related Services
| Service | Port | Description |
|---|---|---|
| mana-tts | 3022 | Text-to-speech service |
| mana-stt | 3023 | Speech-to-text service |
| mana-search | 3021 | Web search & extraction |
| matrix-ollama-bot | - | Matrix bot (consumer) |
| telegram-ollama-bot | - | Telegram bot (consumer) |