7 commits

Author SHA1 Message Date
Till JS
3be4612f04 fix(mana-llm): google-genai v1.73 keyword-only Part.from_text()
google-genai >=1.70 made the text argument of Part.from_text()
keyword-only (it was previously positional). The production container
installed v1.73.1 and crashed on startup with "Part.from_text() takes
1 positional argument but 2 were given".

Fix: Part.from_text(msg.content) → Part.from_text(text=msg.content)
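
A minimal sketch of why the positional call fails, using a hypothetical stand-in `Part` class rather than the real google-genai type. The only thing it borrows from the message above is the keyword-only `text` parameter; `build_part` and `positional_ok` are illustrative names.

```python
class Part:
    """Hypothetical stand-in for the SDK's Part type; not the real class."""

    def __init__(self, text: str) -> None:
        self.text = text

    @classmethod
    def from_text(cls, *, text: str) -> "Part":
        # The bare `*` makes `text` keyword-only, as described above.
        return cls(text)


def build_part(content: str) -> Part:
    # The keyword form works whether `text` is keyword-only or not,
    # so it is safe across both old and new signatures.
    return Part.from_text(text=content)


try:
    Part.from_text("hello")  # old positional call style
    positional_ok = True
except TypeError:
    # Raised because `text` can no longer be passed positionally.
    positional_ok = False
```

Passing `text=` by keyword is the form that should keep working regardless of which signature is installed.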

Tested live: curl https://llm.mana.how/v1/chat/completions with
model=google/gemini-2.5-flash returns correct response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:47:23 +02:00
Till JS
8a0bf93699 chore(cloud-tier): upgrade default model gemini-2.0-flash → gemini-2.5-flash
gemini-2.0-flash is due to be deprecated on June 1, 2026. gemini-2.5-flash
has been stable since Q1 2026 with similar pricing ($0.15/$0.60 per 1M
tokens vs. $0.10/$0.40 for gemini-2.0-flash; the pricing table already had
the entry).

Three files touched:
- packages/shared-llm/src/backends/cloud.ts — client default
- services/mana-llm/src/config.py — server default
- services/mana-llm/src/providers/google.py — Ollama→Gemini fallback
  map + constructor default + deduplicated model list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:32:03 +02:00
Till JS
5520f1385e fix(mana-llm): add response_format to ChatCompletionRequest model
The first iteration of the Ollama response_format passthrough crashed
with 'ChatCompletionRequest object has no attribute response_format'
because the Pydantic request model didn't declare the field at all —
incoming response_format from OpenAI-compatible clients was being
silently dropped at the parsing layer before the provider could see it.

Fix: declare a typed ResponseFormat sub-model with the two OpenAI shapes
('json_object' and 'json_schema'), add it as an optional field on
ChatCompletionRequest, and let the Ollama provider read it directly
without defensive getattr fallbacks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:50:54 +02:00
Till JS
3ef095aaff fix(mana-llm/ollama): pass response_format to Ollama + strip markdown fences
The Ollama provider was completely ignoring `response_format` from the
incoming OpenAI-compatible request. Two consequences:

  1. Clients that asked for `{"type":"json_object"}` or
     `{"type":"json_schema",...}` got back JSON wrapped in
     ```json ... ``` markdown fences, because Ollama defaults to
     conversational output.
  2. Strict downstream parsers (Vercel AI SDK `generateObject`,
     manual `JSON.parse`) failed to decode the response and threw,
     even though the underlying JSON was valid inside the fences.

Fix: when response_format is set, translate it to Ollama's native
`format` field:

  - `{"type":"json_object"}` → `format: "json"`
  - `{"type":"json_schema","json_schema":{"schema":{...}}}`
    → `format: <the schema dict>` (Ollama 0.5+ supports full JSON
    schemas in the format field)

As a belt-and-suspenders safeguard, a small `_strip_json_fences` helper
runs after the Ollama response is decoded and removes any leftover
```json ... ``` wrapping. Some older vision models still wrap
output in fences even when `format` is set; this catches them.

Streaming path is unchanged because the nutriphi/planta refactor uses
non-streaming `generateObject`. Streaming structured output with
Ollama deserves its own pass when someone actually needs it.

Discovered during the AI SDK + Zod refactor smoke test — neither the
old nor the new vision routes ever returned validated JSON locally
because of this bug. Production uses Google Gemini directly via
fallback so the issue was masked there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:12:01 +02:00
Till JS
45063b88be feat(mana-llm): add Google Gemini fallback provider with auto-routing
Add Google Gemini as a fallback provider that activates automatically
when Ollama is overloaded or unavailable, ensuring LLM requests always
succeed even under load.

New provider (src/providers/google.py):
- Full LLMProvider implementation using google-genai SDK
- Chat completions (streaming + non-streaming)
- Vision/multimodal support (base64 images)
- Embeddings via text-embedding-004
- Model mapping: Ollama models → Gemini equivalents
  (gemma3:4b → gemini-2.0-flash, llava:7b → gemini-2.0-flash, etc.)

Auto-fallback routing (src/providers/router.py):
- Concurrent request tracking for Ollama (OLLAMA_MAX_CONCURRENT=3)
- When concurrent Ollama requests exceed the max: route to Google automatically
- When Ollama fails: retry on Google with model mapping
- Health check caching (5s TTL) to avoid hammering Ollama
- Non-Ollama providers (openrouter, groq, together) are never fallback-routed
- Fallback info included in /health endpoint response
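
The routing rules above can be illustrated with a simplified sketch. `FallbackRouter`, `Handler`, and the constructor parameters are hypothetical names; real providers take full request objects rather than a bare model string, and the model map holds only the two pairs named in this message.

```python
import time
from typing import Awaitable, Callable

Handler = Callable[[str], Awaitable[str]]  # simplified: model name in, text out

# Only the mappings named in the commit message; others use the default model.
MODEL_MAP = {"gemma3:4b": "gemini-2.0-flash", "llava:7b": "gemini-2.0-flash"}


class FallbackRouter:
    def __init__(
        self,
        ollama: Handler,
        google: Handler,
        health: Callable[[], Awaitable[bool]],
        max_concurrent: int = 3,
        health_ttl: float = 5.0,
        default_google_model: str = "gemini-2.0-flash",
    ) -> None:
        self._ollama, self._google, self._health = ollama, google, health
        self._max = max_concurrent
        self._health_ttl = health_ttl
        self._default_google_model = default_google_model
        self._active = 0                    # in-flight Ollama requests
        self._healthy = True
        self._checked_at = float("-inf")    # forces a check on first use

    async def _ollama_healthy(self) -> bool:
        # Reuse the cached result until the TTL expires.
        now = time.monotonic()
        if now - self._checked_at >= self._health_ttl:
            self._healthy = await self._health()
            self._checked_at = now
        return self._healthy

    async def complete(self, model: str) -> str:
        google_model = MODEL_MAP.get(model, self._default_google_model)
        # Admitting this request would exceed the limit, or Ollama is down.
        if self._active >= self._max or not await self._ollama_healthy():
            return await self._google(google_model)
        self._active += 1
        try:
            return await self._ollama(model)
        except Exception:
            pass  # any Ollama failure falls through to the Google retry
        finally:
            self._active -= 1
        return await self._google(google_model)
```

The counter is released before the Google retry so that a failed Ollama call does not keep occupying a concurrency slot.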

New config (src/config.py):
- GOOGLE_API_KEY: enables Google provider
- GOOGLE_DEFAULT_MODEL: default gemini-2.0-flash
- AUTO_FALLBACK_ENABLED: toggle fallback (default: true)
- OLLAMA_MAX_CONCURRENT: concurrent request threshold (default: 3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 22:44:09 +01:00
Till-JS
3edbd0cb26 chore: update dependencies and mana-llm improvements
- Update pnpm-lock.yaml with matrix bot dependencies
- Add environment variables to generate-env.mjs
- Improve mana-llm config and ollama provider

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:50:58 +01:00
Till-JS
1495dbe476 feat(mana-llm): add central LLM abstraction service
Python/FastAPI service providing unified OpenAI-compatible API for
Ollama and cloud LLM providers (OpenRouter, Groq, Together).

Features:
- Chat completions with streaming (SSE)
- Vision/multimodal support
- Embeddings generation
- Multi-provider routing (provider/model format)
- Prometheus metrics
- Optional Redis caching
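
As an illustration of the provider/model format listed above, a request's model string might be split like this. `split_model` is a hypothetical name, and treating a bare model name as an Ollama model is an assumption of this sketch.

```python
def split_model(model: str, default_provider: str = "ollama") -> tuple[str, str]:
    """Split "provider/model" on the first slash only."""
    provider, sep, name = model.partition("/")
    if not sep:
        # No prefix given: assume the default provider (an assumption here).
        return default_provider, model
    # Later slashes stay in the model name, e.g. "openrouter/org/model".
    return provider, name
```

Splitting on the first slash only keeps nested names intact, so `"openrouter/org/model"` resolves to provider `openrouter` and model `org/model`.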
2026-01-29 22:01:00 +01:00