mirror of https://github.com/Memo-2023/mana-monorepo.git synced 2026-05-14 20:21:09 +02:00

Till-JS 1495dbe476 ✨ feat(mana-llm): add central LLM abstraction service

Python/FastAPI service providing unified OpenAI-compatible API for
Ollama and cloud LLM providers (OpenRouter, Groq, Together).

Features:
- Chat completions with streaming (SSE)
- Vision/multimodal support
- Embeddings generation
- Multi-provider routing (provider/model format)
- Prometheus metrics
- Optional Redis caching

2026-01-29 22:01:00 +01:00

9.2 KiB

Raw Blame History

mana-llm

Central LLM abstraction service providing a unified OpenAI-compatible API for Ollama and cloud LLM providers.

Overview

mana-llm acts as a central gateway for all LLM requests in the monorepo, providing:

Unified OpenAI-compatible API
Provider routing (Ollama, OpenRouter, Groq, Together)
Streaming via Server-Sent Events (SSE)
Vision/multimodal support
Embeddings generation
Prometheus metrics

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Consumer Apps                                 │
│  matrix-ollama-bot │ telegram-ollama-bot │ chat-backend │ etc.     │
└────────────────────────────────┬────────────────────────────────────┘
                                 │ HTTP/SSE
                                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     mana-llm (Port 3025)                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                 │
│  │   Router    │  │   Cache     │  │   Metrics   │                 │
│  │ (Provider)  │  │  (Redis)    │  │ (Prometheus)│                 │
│  └──────┬──────┘  └─────────────┘  └─────────────┘                 │
│         │                                                           │
│  ┌──────┴──────────────────────────────────────────┐               │
│  │              Provider Adapters                   │               │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │               │
│  │  │  Ollama  │  │ OpenAI   │  │  OpenRouter  │  │               │
│  │  │  Adapter │  │ Adapter  │  │   Adapter    │  │               │
│  │  └──────────┘  └──────────┘  └──────────────┘  │               │
│  └─────────────────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────────────────┘

Quick Start

Prerequisites

Python 3.11+
Ollama running locally (http://localhost:11434)
Redis (optional, for caching)

Development

cd services/mana-llm

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Copy environment file
cp .env.example .env

# Start Redis (optional)
docker-compose -f docker-compose.dev.yml up -d

# Run service
python -m uvicorn src.main:app --port 3025 --reload

Docker

# Full stack (mana-llm + Redis)
docker-compose up -d

# View logs
docker-compose logs -f mana-llm

API Endpoints

Chat Completions

# Non-streaming
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/gemma3:4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming (SSE)
curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/gemma3:4b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Vision/Multimodal

curl -X POST http://localhost:3025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llava:7b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }]
  }'

Models

# List all models
curl http://localhost:3025/v1/models

# Get specific model
curl http://localhost:3025/v1/models/ollama/gemma3:4b

Embeddings

curl -X POST http://localhost:3025/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/nomic-embed-text",
    "input": "Text to embed"
  }'

Health & Metrics

# Health check
curl http://localhost:3025/health

# Prometheus metrics
curl http://localhost:3025/metrics

Provider Routing

Models use the format provider/model:

Model	Provider	Target
`ollama/gemma3:4b`	Ollama	localhost:11434
`ollama/llava:7b`	Ollama	localhost:11434
`openrouter/meta-llama/llama-3.1-8b-instruct`	OpenRouter	api.openrouter.ai
`groq/llama-3.1-8b-instant`	Groq	api.groq.com
`together/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`	Together	api.together.xyz

Default: If no provider prefix is given (e.g., gemma3:4b), Ollama is used.

Configuration

Environment variables (see .env.example):

Variable	Default	Description
`PORT`	3025	Service port
`LOG_LEVEL`	info	Logging level
`OLLAMA_URL`	http://localhost:11434	Ollama server URL
`OLLAMA_DEFAULT_MODEL`	gemma3:4b	Default Ollama model
`OLLAMA_TIMEOUT`	120	Ollama request timeout (seconds)
`OPENROUTER_API_KEY`	-	OpenRouter API key
`GROQ_API_KEY`	-	Groq API key
`TOGETHER_API_KEY`	-	Together API key
`REDIS_URL`	-	Redis URL for caching
`CACHE_TTL`	3600	Cache TTL in seconds
`CORS_ORIGINS`	localhost	Allowed CORS origins

Project Structure

services/mana-llm/
├── src/
│   ├── main.py                 # FastAPI app entry point
│   ├── config.py               # Settings via pydantic-settings
│   ├── providers/
│   │   ├── base.py             # Abstract provider interface
│   │   ├── ollama.py           # Ollama provider
│   │   ├── openai_compat.py    # OpenAI-compatible provider
│   │   └── router.py           # Provider routing logic
│   ├── models/
│   │   ├── requests.py         # Request Pydantic models
│   │   └── responses.py        # Response Pydantic models
│   ├── streaming/
│   │   └── sse.py              # SSE response handling
│   └── utils/
│       ├── cache.py            # Redis caching
│       └── metrics.py          # Prometheus metrics
├── tests/
│   ├── test_api.py             # API endpoint tests
│   ├── test_providers.py       # Provider tests
│   └── test_streaming.py       # Streaming tests
├── Dockerfile
├── docker-compose.yml
├── docker-compose.dev.yml
├── requirements.txt
├── pyproject.toml
└── .env.example

Testing

# Run tests
pytest

# Run with coverage
pytest --cov=src

# Run specific test file
pytest tests/test_providers.py -v

Integration Example

TypeScript/Node.js Client

// Using fetch
const response = await fetch('http://localhost:3025/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama/gemma3:4b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);

Streaming with EventSource

const response = await fetch('http://localhost:3025/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama/gemma3:4b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = response.body?.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader!.read();
  if (done) break;

  const chunk = decoder.decode(value);
  const lines = chunk.split('\n').filter(line => line.startsWith('data: '));

  for (const line of lines) {
    const data = line.slice(6);
    if (data === '[DONE]') break;

    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content;
    if (content) process.stdout.write(content);
  }
}

Service	Port	Description
mana-tts	3022	Text-to-speech service
mana-stt	3023	Speech-to-text service
mana-search	3021	Web search & extraction
matrix-ollama-bot	-	Matrix bot (consumer)
telegram-ollama-bot	-	Telegram bot (consumer)

9.2 KiB Raw Blame History