google-genai >=1.70 changed Part.from_text() from positional to
keyword-only argument. The production container installed v1.73.1
and crashed on startup with "Part.from_text() takes 1 positional
argument but 2 were given".
Fix: Part.from_text(msg.content) → Part.from_text(text=msg.content)
Tested live: curl https://llm.mana.how/v1/chat/completions with
model=google/gemini-2.5-flash returns correct response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gemini-2.0-flash is deprecated June 1 2026. gemini-2.5-flash has been
stable since Q1 2026 with similar pricing ($0.15/$0.60 per 1M tokens
vs $0.10/$0.40 — pricing table already had the entry).
Three files touched:
- packages/shared-llm/src/backends/cloud.ts — client default
- services/mana-llm/src/config.py — server default
- services/mana-llm/src/providers/google.py — Ollama→Gemini fallback
map + constructor default + deduplicated model list
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first iteration of the Ollama response_format passthrough crashed
with 'ChatCompletionRequest object has no attribute response_format'
because the Pydantic request model didn't declare the field at all —
incoming response_format from OpenAI-compatible clients was being
silently dropped at the parsing layer before the provider could see it.
Fix: declare a typed ResponseFormat sub-model with the two OpenAI shapes
('json_object' and 'json_schema'), add it as an optional field on
ChatCompletionRequest, and let the Ollama provider read it directly
without defensive getattr fallbacks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Ollama provider was completely ignoring `response_format` from the
incoming OpenAI-compatible request. Two consequences:
1. Clients that asked for `{"type":"json_object"}` or
`{"type":"json_schema",...}` got back JSON wrapped in
```json ... ``` markdown fences, because Ollama defaults to
conversational output.
2. Strict downstream parsers (Vercel AI SDK `generateObject`,
manual `JSON.parse`) failed to decode the response and threw,
even though the underlying JSON was valid inside the fences.
Fix: when response_format is set, translate it to Ollama's native
`format` field:
- `{"type":"json_object"}` → `format: "json"`
- `{"type":"json_schema","json_schema":{"schema":{...}}}`
→ `format: <the schema dict>` (Ollama 0.5+ supports full JSON
schemas in the format field)
Defensive belt-and-suspenders: a small `_strip_json_fences` helper
runs after the Ollama response is decoded and removes any leftover
```json ... ``` wrapping. Some older vision models still wrap
output in fences even when `format` is set; this catches them.
Streaming path is unchanged because the nutriphi/planta refactor uses
non-streaming `generateObject`. Streaming structured output with
Ollama deserves its own pass when someone actually needs it.
Discovered during the AI SDK + Zod refactor smoke test — neither the
old nor the new vision routes ever returned validated JSON locally
because of this bug. Production uses Google Gemini directly via
fallback so the issue was masked there.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Google Gemini as a fallback provider that activates automatically
when Ollama is overloaded or unavailable, ensuring LLM requests always
succeed even under load.
New provider (src/providers/google.py):
- Full LLMProvider implementation using google-genai SDK
- Chat completions (streaming + non-streaming)
- Vision/multimodal support (base64 images)
- Embeddings via text-embedding-004
- Model mapping: Ollama models → Gemini equivalents
(gemma3:4b → gemini-2.0-flash, llava:7b → gemini-2.0-flash, etc.)
Auto-fallback routing (src/providers/router.py):
- Concurrent request tracking for Ollama (OLLAMA_MAX_CONCURRENT=3)
- When Ollama concurrent > max: route to Google automatically
- When Ollama fails: retry on Google with model mapping
- Health check caching (5s TTL) to avoid hammering Ollama
- Non-Ollama providers (openrouter, groq, together) are never fallback-routed
- Fallback info included in /health endpoint response
New config (src/config.py):
- GOOGLE_API_KEY: enables Google provider
- GOOGLE_DEFAULT_MODEL: default gemini-2.0-flash
- AUTO_FALLBACK_ENABLED: toggle fallback (default: true)
- OLLAMA_MAX_CONCURRENT: concurrent request threshold (default: 3)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update pnpm-lock.yaml with matrix bot dependencies
- Add environment variables to generate-env.mjs
- Improve mana-llm config and ollama provider
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>