Commit graph

21 commits

Author SHA1 Message Date
Till JS
164d5dab8b fix(mana-llm): copy aliases.yaml into Docker image
main.py's lifespan handler loads `Path(__file__).parent.parent /
'aliases.yaml'` (= /app/aliases.yaml) on startup. The Dockerfile only
copied `src/`, so prod containers always crashlooped on first start
with `AliasConfigError: alias config not found at /app/aliases.yaml`
— which is why mana-llm has been silently absent from prod. Surfaced
today after a manual `gh workflow run cd-macmini.yml -f service=mana-llm`
actually attempted to start the container instead of relying on a
long-stale image.
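
For reference, the path arithmetic behind the crash, as a tiny sketch
(the /app/src/main.py location is inferred from the Dockerfile copying
src/; this is illustrative, not the lifespan code itself):

```python
from pathlib import Path

# With main.py copied to /app/src/main.py, parent.parent is /app,
# so the image must also contain /app/aliases.yaml next to src/.
main_py = Path("/app/src/main.py")
print(main_py.parent.parent / "aliases.yaml")  # -> /app/aliases.yaml
```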

Tested locally: container now starts cleanly, /health returns 200,
and `/v1/aliases` lists the configured chains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:47:48 +02:00
Till JS
8a49e3ffd5 feat(mana-llm): M4 — observability, debug endpoints, SIGHUP reload
- `X-Mana-LLM-Resolved: <provider>/<model>` header on non-streaming
  responses. Streaming clients read the same info from each chunk's
  `model` field (SSE headers go out before the chain is walked).
- Three new Prometheus metrics: `mana_llm_alias_resolved_total{alias,
  target}` (which concrete model an alias resolved to per request),
  `mana_llm_fallback_total{from_model, to_model, reason}` (each
  fallback transition), `mana_llm_provider_healthy{provider}` (gauge,
  mirrors the circuit-breaker).
- New debug endpoints: `GET /v1/aliases` (registry inspection — chain
  + description per alias, useful for confirming SIGHUP reloads),
  `GET /v1/health` (full per-provider liveness snapshot — failure
  counter, last error, unhealthy-until backoff).
- `kill -HUP <pid>` reloads `aliases.yaml`. Parse errors leave the
  previous good state in memory and log the rejection.
- `ProviderHealthCache.add_listener()` for cache→metrics decoupling:
  the gauge is updated via a transition-only listener wired in main.py
  rather than the cache importing prometheus_client itself (see the
  sketch after this list).
- Request-side metrics now use the requested model string; success-side
  metrics use the resolved one. So
  `mana_llm_llm_requests_total{provider="ollama", model="gemma3:12b"}`
  reflects actual upstream load even when callers used `mana/long-form`
  aliases.
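
A minimal sketch of that listener wiring (the gauge name comes from this
commit; the listener signature and the `wire_health_metrics` helper are
assumptions for illustration):

```python
from prometheus_client import Gauge

# Gauge mirroring the circuit-breaker state, as named in this commit.
PROVIDER_HEALTHY = Gauge(
    "mana_llm_provider_healthy", "1 if the provider is healthy", ["provider"]
)

def wire_health_metrics(cache) -> None:
    """Hypothetical one-time wiring, called from main.py's lifespan."""
    def on_transition(provider: str, healthy: bool) -> None:
        # Assumed listener signature: fired only when health state changes.
        PROVIDER_HEALTHY.labels(provider=provider).set(1 if healthy else 0)

    cache.add_listener(on_transition)
```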

16 new observability tests (test_m4_observability.py): listener
fire-on-transition semantics, exception-isolation, multi-listener,
counter increments, gauge writes, end-to-end alias→metric flow,
v1/aliases + v1/health endpoint shape, response.model carries the
resolved target after fallback. Total suite: 115/115 in 1.6s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:52:28 +02:00
Till JS
3046da3b19 feat(mana-llm): M3 — health-aware router with alias + chain fallback
Replaces the old Ollama→Google special-case auto-fallback with the
unified pipeline: the caller passes either a direct provider/model or an
alias from the `mana/` namespace; the router resolves it to a chain and
walks it, skipping unhealthy providers (per ProviderHealthCache from M2),
trying each entry, marking a provider unhealthy on retryable errors and
falling through to the next.

Retryable: ConnectError, ReadTimeout, RemoteProtocolError, 5xx,
ProviderRateLimitError. Propagated (don't fall back, don't poison the
cache): ProviderCapabilityError, ProviderAuthError, ProviderBlockedError,
4xx, unknown exception types. The cache stays "what the network told us
about this provider's liveness" — caller errors don't muddy that signal.

Streaming: pre-first-byte fallback only. Once a chunk has been yielded
the provider is committed; mid-stream errors propagate as-is so we
don't splice two voices into one output.

`NoHealthyProviderError` (HTTP 503) carries a structured attempt log —
each chain entry shows up as `(model, reason)` so the cause of a 503
is visible in the response and metrics, not only in service logs.
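
For orientation, a condensed sketch of that walk (exception names come
from this commit and providers/errors.py; the import path, call
signatures, and the 5xx / unknown-exception / streaming handling are
simplified assumptions):

```python
import httpx

# Error types named in this commit; the exact import path is assumed.
from providers.errors import (
    NoHealthyProviderError, ProviderAuthError, ProviderBlockedError,
    ProviderCapabilityError, ProviderRateLimitError,
)

RETRYABLE = (httpx.ConnectError, httpx.ReadTimeout,
             httpx.RemoteProtocolError, ProviderRateLimitError)
PROPAGATED = (ProviderCapabilityError, ProviderAuthError, ProviderBlockedError)

async def walk_chain(chain, cache, call_provider, request):
    """Try each (provider, model) entry in order, skipping unhealthy ones."""
    attempts = []  # (model, reason) pairs carried by the eventual 503
    for provider, model in chain:
        if not cache.is_healthy(provider):
            attempts.append((model, "skipped: provider unhealthy"))
            continue
        try:
            result = await call_provider(provider, model, request)
        except PROPAGATED:
            raise  # caller error: no fallback, cache left untouched
        except RETRYABLE as exc:
            cache.mark_unhealthy(provider)  # liveness signal: mark + fall through
            attempts.append((model, type(exc).__name__))
            continue
        cache.mark_healthy(provider)
        return result
    raise NoHealthyProviderError(attempts)  # surfaced as HTTP 503
```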

main.py wires the lifespan: aliases.yaml is loaded, ProviderHealthCache
created, ProviderRouter takes both as constructor deps, HealthProbe
spawned with cheap HTTP probes per configured provider (Ollama
/api/tags, OpenAI-compat /v1/models with Bearer header). Google is
skipped — google-genai SDK has no obvious cheap probe; the call-site
fallback handles real errors.

22 new router tests (test_router_fallback.py): chain walking, capability
& auth propagation, 5xx vs 4xx differentiation, rate-limit retry,
all-fail → NoHealthyProviderError, direct provider strings bypass
aliases, streaming pre-first-byte fallback, mid-stream-failure does
NOT fall back, empty stream commits without retry, cache feedback on
success/failure/non-retryable. Existing test_providers.py updated for
the new constructor signature; all 99 service tests green via the dev
container (Python 3.12).

Legacy purged: `_ollama_concurrent`, `_ollama_health_cache`,
`_can_fallback_to_google`, `_should_use_ollama`, `_fallback_to_google`,
`_get_ollama_health_cached` all gone. The `auto_fallback_enabled` /
`ollama_max_concurrent` settings remain in config.py for now (M5 will
remove them along with the per-feature env-var overrides).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:44:16 +02:00
Till JS
59557e62d7 feat(mana-llm): M2 — ProviderHealthCache + background probe loop
Per-provider liveness with circuit-breaker semantics. The router (M3)
will read `is_healthy()` to skip dead providers in a chain; the probe
loop and the call-site fallback handler write state via
`mark_healthy` / `mark_unhealthy`.

State machine: 1st failure stays healthy (transient blips happen);
2nd consecutive failure trips the breaker and sets a 60s backoff
window during which `is_healthy → False`. After the window the
provider is half-open again — next call exercises it, success
resets, failure re-arms.
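
The breaker in sketch form (thresholds are the ones in this commit;
everything beyond `mark_healthy` / `mark_unhealthy` / `is_healthy` is
illustrative naming):

```python
import time

class ProviderHealthCacheSketch:
    """Illustrative breaker: 2 consecutive failures trip a 60s backoff."""

    FAILURE_THRESHOLD = 2
    BACKOFF_SECONDS = 60.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock                       # injectable, FakeClock-style
        self._failures: dict[str, int] = {}
        self._unhealthy_until: dict[str, float] = {}

    def mark_healthy(self, provider: str) -> None:
        self._failures[provider] = 0
        self._unhealthy_until.pop(provider, None)

    def mark_unhealthy(self, provider: str) -> None:
        self._failures[provider] = self._failures.get(provider, 0) + 1
        if self._failures[provider] >= self.FAILURE_THRESHOLD:
            self._unhealthy_until[provider] = self._clock() + self.BACKOFF_SECONDS

    def is_healthy(self, provider: str) -> bool:
        until = self._unhealthy_until.get(provider)
        if until is None:
            return True                # healthy, or only a single transient blip
        return self._clock() >= until  # half-open once the window has passed
```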

HealthProbe is the background asyncio.Task that pings every
registered provider every 30s with a 3s timeout. Probes run
concurrently per tick and one bad probe can't sink the loop. Probe
functions are injected (`{name: async-fn}`) so this module stays
decoupled from the provider classes — the wiring lives in main.py
where we already know which providers are configured.
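
And the probe loop, reduced to its core (interval and timeout from this
commit; the real HealthProbe adds the start/stop lifecycle this omits):

```python
import asyncio

async def probe_loop(probes, cache, interval=30.0, timeout=3.0):
    """Illustrative tick loop; `probes` is {provider_name: async probe fn}."""
    async def run_one(name, fn):
        try:
            await asyncio.wait_for(fn(), timeout)
            cache.mark_healthy(name)
        except Exception:
            cache.mark_unhealthy(name)  # a failing probe never escapes the tick

    while True:
        # All probes for one tick run concurrently.
        await asyncio.gather(*(run_one(n, f) for n, f in probes.items()))
        await asyncio.sleep(interval)
```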

32 new tests (FakeClock for deterministic backoff timing, slow-probe
helpers for parallelism + timeout, lifecycle tests for start/stop
idempotency and tick-after-error survival). 64/64 alias+health tests
green.

Not yet wired into the request path — that's M3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:29:57 +02:00
Till JS
dff8629e1d feat(mana-llm): M1 — AliasRegistry + aliases.yaml SSOT
First milestone of the LLM-fallback plan (docs/plans/llm-fallback-aliases.md).
Introduces the `mana/<class>` namespace; the registry parses + validates
aliases.yaml at startup and reloads on demand. Schema validation rejects
empty chains, missing provider prefixes, alias names outside the reserved
namespace, default→unknown references, etc.

Reload semantics: parse error keeps the previous good state in memory
so a typo + SIGHUP doesn't take the service down.

5 aliases ship with the initial config: fast-text, long-form, structured,
reasoning, vision. Each chain ends with a cloud provider so the system
keeps working when the GPU server is offline.
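
To make the shape concrete, a hypothetical aliases.yaml fragment plus
the kind of checks the registry applies (the real schema keys and
validation live in the M1 code; the chain entries and top-level layout
here are guesses for illustration):

```python
import yaml  # PyYAML

EXAMPLE = """
aliases:
  mana/fast-text:
    description: quick conversational replies
    chain:
      - ollama/gemma3:12b
      - google/gemini-2.5-flash   # chain ends on a cloud provider
"""

config = yaml.safe_load(EXAMPLE)
for name, spec in config["aliases"].items():
    # Namespace check: only mana/<class> names are accepted.
    assert name.startswith("mana/"), f"{name} outside reserved namespace"
    # Chain checks: non-empty, every entry carries a provider prefix.
    assert spec["chain"], f"{name} has an empty chain"
    assert all("/" in entry for entry in spec["chain"])
```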

32 unit tests covering happy path, schema validation, namespace check,
reload safety, and a guard that the shipped aliases.yaml itself parses.
M2 (health-cache + probe-loop) and M3 (router fallback execution) build
on this; aliases are not yet wired into the request path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:23:51 +02:00
Till JS
da373491b8 chore(mana-llm): thread GOOGLE_API_KEY + default model into local compose
Matches the macmini compose — Google Gemini was already wired in the
provider adapter (commit 2 of the function-calling migration), but the
local dev stack's compose never passed the env through, so the
container booted without the provider and every tool-calling request
fell back to Ollama (unreachable in local dev; it's a LAN-only GPU box).

With this in place the local mana-llm healthcheck reports both
`google` and `openrouter` as healthy and the webapp planner hits
Gemini Flash for real.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:42:21 +02:00
Till JS
e757470cb0 feat(mana-llm): add OpenAI-style tools + tool_calls passthrough
Extends the chat-completions surface so callers can ask any provider
to call named functions and get structured tool_calls back. Wired
through all three provider adapters so the planner and companion can
switch off the fragile JSON-parsing pathway.

- Request: tools[], tool_choice, assistant tool_calls, tool-role
  messages with tool_call_id.
- Response: MessageResponse.tool_calls, Choice.finish_reason adds
  "tool_calls", DeltaContent streams tool_calls.
- Google provider: Tool(function_declarations=...) build, result
  normalised (args dict → JSON string), function_response parts on
  a user turn for tool-role messages.
- OpenAI-compat: 1:1 passthrough of the OpenAI spec.
- Ollama: /api/chat passthrough; model-level capability check via a
  TOOL_CAPABLE_OLLAMA_PATTERNS whitelist (llama3.1+, qwen2.5+,
  mistral, command-r, …) — unsupported models rejected rather than
  silently falling back to prose.
- Router: model_supports_tools() check upfront for both streaming
  and non-streaming paths; ProviderCapabilityError bubbles as 400.

No silent downgrade. Missing tool support = explicit error.
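
As a usage illustration, the kind of request body this enables (the
payload values and the `get_plan` tool are made up; the field names
follow the OpenAI tools spec mirrored here):

```python
request_body = {
    "model": "google/gemini-2.5-flash",
    "messages": [
        {"role": "user", "content": "What's on the plan for tomorrow?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_plan",
                "description": "Fetch the plan for a given date",
                "parameters": {
                    "type": "object",
                    "properties": {"date": {"type": "string"}},
                    "required": ["date"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}
# A tool call comes back as choices[0].message.tool_calls with
# finish_reason == "tool_calls"; the caller then replies with a
# role="tool" message carrying the matching tool_call_id.
```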

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:22:48 +02:00
Till JS
4b8fede7fc fix(mana-llm): surface Gemini finish_reason errors instead of returning ""
The google provider called response.text after a chat completion and
passed the resulting string downstream unchanged. When Gemini's content
filter, recitation guard, or max_tokens ceiling fired, response.text
quietly returned "" — which the planner then reported as "no JSON block
found", masking the real cause. Empirically this failed in 45 ms on a
simple Quiz mission.

Introduces providers/errors.py with a small ProviderError hierarchy
(Blocked / Truncated / Auth / RateLimit / Capability). google.py now
inspects response.candidates[0].finish_reason and raises the matching
structured error; the non-streaming path maps it to 422/502/429 via a
new except-branch in main.py, and the streaming path surfaces the kind
as the SSE error type. Capability is wired but not yet used — it lands
with the tool-schema passthrough in the next commit.
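
The mapping in sketch form (Blocked/Truncated names mirror the new
hierarchy; the finish_reason handling shown is a simplification of what
google.py actually does):

```python
# Minimal stand-ins for the hierarchy introduced in providers/errors.py.
class ProviderBlockedError(Exception): ...
class ProviderTruncatedError(Exception): ...

def check_finish_reason(response):
    """Raise a structured error instead of letting response.text == '' flow on."""
    candidate = response.candidates[0]
    reason = getattr(candidate.finish_reason, "name", str(candidate.finish_reason))
    if reason in ("SAFETY", "RECITATION"):   # content filter / recitation guard
        raise ProviderBlockedError(f"Gemini blocked the response ({reason})")
    if reason == "MAX_TOKENS":               # output ceiling hit
        raise ProviderTruncatedError("Gemini stopped at the max_tokens ceiling")
    # Anything else (normal STOP, etc.) falls through as a regular completion.
```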

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:15:37 +02:00
Till JS
3be4612f04 fix(mana-llm): google-genai v1.73 keyword-only Part.from_text()
google-genai >=1.70 changed Part.from_text() from a positional to a
keyword-only argument. The production container installed v1.73.1
and crashed on startup with "Part.from_text() takes 1 positional
argument but 2 were given".

Fix: Part.from_text(msg.content) → Part.from_text(text=msg.content)

Tested live: curl https://llm.mana.how/v1/chat/completions with
model=google/gemini-2.5-flash returns correct response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:47:23 +02:00
Till JS
659a7d9774 fix(mana-llm): add google-genai to requirements.txt for Docker builds
google-genai was in pyproject.toml but missing from requirements.txt.
The Dockerfile uses pip install -r requirements.txt, so the Google
provider never loaded in production. Now that the key is set and the
cloud tier upgraded to gemini-2.5-flash, the import fires on startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:40:30 +02:00
Till JS
8a0bf93699 chore(cloud-tier): upgrade default model gemini-2.0-flash → gemini-2.5-flash
gemini-2.0-flash is deprecated effective June 1, 2026. gemini-2.5-flash
has been stable since Q1 2026 with similar pricing ($0.15/$0.60 per 1M
tokens vs $0.10/$0.40 — the pricing table already had the entry).

Three files touched:
- packages/shared-llm/src/backends/cloud.ts — client default
- services/mana-llm/src/config.py — server default
- services/mana-llm/src/providers/google.py — Ollama→Gemini fallback
  map + constructor default + deduplicated model list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:32:03 +02:00
Till JS
5520f1385e fix(mana-llm): add response_format to ChatCompletionRequest model
The first iteration of the Ollama response_format passthrough crashed
with 'ChatCompletionRequest object has no attribute response_format'
because the Pydantic request model didn't declare the field at all —
incoming response_format from OpenAI-compatible clients was being
silently dropped at the parsing layer before the provider could see it.

Fix: declare a typed ResponseFormat sub-model with the two OpenAI shapes
('json_object' and 'json_schema'), add it as an optional field on
ChatCompletionRequest, and let the Ollama provider read it directly
without defensive getattr fallbacks.
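
A sketch of the shape this adds (field names mirror the OpenAI
response_format spec; the real ChatCompletionRequest carries many more
fields than shown):

```python
from typing import Any, Literal, Optional
from pydantic import BaseModel

class ResponseFormat(BaseModel):
    type: Literal["json_object", "json_schema"]
    # For "json_schema", OpenAI nests the actual schema under
    # {"json_schema": {"name": ..., "schema": {...}}}.
    json_schema: Optional[dict[str, Any]] = None

class ChatCompletionRequest(BaseModel):
    model: str
    messages: list[dict[str, Any]]
    response_format: Optional[ResponseFormat] = None  # the missing field
```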

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:50:54 +02:00
Till JS
3ef095aaff fix(mana-llm/ollama): pass response_format to Ollama + strip markdown fences
The Ollama provider was completely ignoring `response_format` from the
incoming OpenAI-compatible request. Two consequences:

  1. Clients that asked for `{"type":"json_object"}` or
     `{"type":"json_schema",...}` got back JSON wrapped in
     ```json ... ``` markdown fences, because Ollama defaults to
     conversational output.
  2. Strict downstream parsers (Vercel AI SDK `generateObject`,
     manual `JSON.parse`) failed to decode the response and threw,
     even though the underlying JSON was valid inside the fences.

Fix: when response_format is set, translate it to Ollama's native
`format` field:

  - `{"type":"json_object"}` → `format: "json"`
  - `{"type":"json_schema","json_schema":{"schema":{...}}}`
    → `format: <the schema dict>` (Ollama 0.5+ supports full JSON
    schemas in the format field)

Defensive belt-and-suspenders: a small `_strip_json_fences` helper
runs after the Ollama response is decoded and removes any leftover
```json ... ``` wrapping. Some older vision models still wrap
output in fences even when `format` is set; this catches them.
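
In sketch form, assuming a plain-dict view of response_format (the real
provider reads the typed field added in the follow-up commit; helper
names match this one):

```python
import re

def to_ollama_format(response_format):
    """Map OpenAI response_format shapes onto Ollama's native `format` field."""
    if not response_format:
        return None
    if response_format.get("type") == "json_object":
        return "json"
    if response_format.get("type") == "json_schema":
        # Ollama 0.5+ accepts a full JSON Schema dict here.
        return response_format["json_schema"]["schema"]
    return None

FENCE = "`" * 3  # spelled this way to keep literal backticks out of this block
_FENCE_RE = re.compile(
    rf"^\s*{FENCE}(?:json)?\s*\n?(.*?)\n?\s*{FENCE}\s*$", re.DOTALL
)

def _strip_json_fences(text: str) -> str:
    """Belt-and-suspenders: drop a leftover markdown JSON fence, if present."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text
```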

Streaming path is unchanged because the nutriphi/planta refactor uses
non-streaming `generateObject`. Streaming structured output with
Ollama deserves its own pass when someone actually needs it.

Discovered during the AI SDK + Zod refactor smoke test — neither the
old nor the new vision routes ever returned validated JSON locally
because of this bug. Production uses Google Gemini directly via
fallback so the issue was masked there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:12:01 +02:00
Till JS
bfeeef7819 chore(matrix): final scrub of stale matrix references
A grep audit after the previous matrix removal commits found a handful
of stragglers in non-runtime files that the earlier sweeps missed:

- services/mana-llm/CLAUDE.md: removed matrix-ollama-bot from the
  consumer-apps diagram and from the related-services table
- services/mana-video-gen/CLAUDE.md: removed "Matrix Bots" integration
  bullet
- packages/notify-client/README.md: removed sendMatrix() doc entry
  (the method itself was already gone in the prior cleanup)
- docker/grafana/dashboards/logs-explorer.json: dropped the "Matrix
  Stack" log row that queried tier="matrix" (would show no data forever)
- docker/grafana/dashboards/master-overview.json: dropped the "Matrix
  Bots" stat panel that counted up{job=~"matrix-.*-bot"}
- apps/mana/apps/landing/src/data/ecosystem-health.json: regenerated via
  scripts/ecosystem-audit.mjs to drop matrix from the app list, icon
  counts, file analytics, top offenders and authGuard missing list
- .gitignore: removed services/matrix-stt-bot/data/ pattern (the
  service itself was deleted long ago)

Production-side stragglers also addressed (not in this commit):
- DROP USER synapse on prod Postgres (the parallel cleanup commit
  2514831a3 dropped DATABASE matrix + DATABASE synapse but left the
  role behind)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:47:54 +02:00
Till JS
b8e18b7f82 chore(ai-services): adopt Windows GPU as source of truth for llm/stt/tts
The Windows GPU server has been the actual production home for these
services for some time, and the running code there has drifted ahead of
the repo. This sync pulls the live versions back into the repo so the
Windows box is no longer the only place those changes exist.

Pulled from C:\mana\services\* on mana-server-gpu (192.168.178.11):

mana-llm:
- src/main.py, src/config.py — small fixes (auth wiring, config tweaks)
- src/api_auth.py — NEW (cross-service GPU_API_KEY validator)
- service.pyw — Windows runner used by the ManaLLM scheduled task
  (sets up logging redirect, loads .env, calls uvicorn)

mana-stt:
- app/main.py — substantial cleanup (684→392 lines), drops the
  whisperx-as-separate-backend branching now that whisper_service.py
  rolls whisperx in directly
- app/whisper_service.py — full CUDA + whisperx rewrite (158→358 lines)
- app/auth.py + external_auth.py — significantly expanded auth
- app/vram_manager.py — NEW (shared VRAM accounting helper)
- service.pyw — Windows runner with CUDA pre-init, FFmpeg PATH
  injection, .env loading
- removed: app/whisper_service_cuda.py (folded into whisper_service.py)
- removed: app/whisperx_service.py (folded into whisper_service.py)

mana-tts:
- app/auth.py, external_auth.py — same auth expansion as stt
- app/f5_service.py, kokoro_service.py — Windows tweaks
- app/vram_manager.py — NEW (same shared helper as stt)
- service.pyw — Windows runner

mana-video-gen:
- service.pyw — Windows runner (no other changes; the .py code on the
  GPU box is byte-identical to what's already in the repo)

The service.pyw files contain absolute Windows paths
(C:\mana\services\<svc>) and a hardcoded FFmpeg PATH for the tills user
profile. Kept as-is intentionally — they exist to be deployed to that
one machine and any abstraction layer would just hide what's actually
happening. Anyone redeploying to a different layout will need to edit
the path strings, which is a known and obvious change.

Mac-Mini infrastructure for these services (launchd plists, install
scripts, scripts/mac-mini/setup-{stt,tts}.sh, the Mac-flux2c image-gen
implementation) is still on disk and will be removed in a follow-up
commit, along with replacing mana-image-gen with the Windows
diffusers+CUDA implementation. This commit is just the live-code sync.
2026-04-08 12:46:03 +02:00
Till JS
45063b88be feat(mana-llm): add Google Gemini fallback provider with auto-routing
Add Google Gemini as a fallback provider that activates automatically
when Ollama is overloaded or unavailable, so LLM requests still succeed
when the local GPU is saturated or down.

New provider (src/providers/google.py):
- Full LLMProvider implementation using google-genai SDK
- Chat completions (streaming + non-streaming)
- Vision/multimodal support (base64 images)
- Embeddings via text-embedding-004
- Model mapping: Ollama models → Gemini equivalents
  (gemma3:4b → gemini-2.0-flash, llava:7b → gemini-2.0-flash, etc.)

Auto-fallback routing (src/providers/router.py):
- Concurrent request tracking for Ollama (OLLAMA_MAX_CONCURRENT=3)
- When Ollama concurrent > max: route to Google automatically
- When Ollama fails: retry on Google with model mapping
- Health check caching (5s TTL) to avoid hammering Ollama
- Non-Ollama providers (openrouter, groq, together) are never fallback-routed
- Fallback info included in /health endpoint response

New config (src/config.py):
- GOOGLE_API_KEY: enables Google provider
- GOOGLE_DEFAULT_MODEL: default gemini-2.0-flash
- AUTO_FALLBACK_ENABLED: toggle fallback (default: true)
- OLLAMA_MAX_CONCURRENT: concurrent request threshold (default: 3)
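
The routing decision in sketch form (helper shape and names are
illustrative; this is the pre-M3 behaviour that the later milestones
replace):

```python
OLLAMA_MAX_CONCURRENT = 3  # config default from this commit
OLLAMA_TO_GEMINI = {"gemma3:4b": "gemini-2.0-flash", "llava:7b": "gemini-2.0-flash"}

def pick_target(provider: str, model: str, in_flight: int,
                ollama_healthy: bool, auto_fallback: bool = True):
    """Return the (provider, model) pair a request should be sent to."""
    if provider != "ollama" or not auto_fallback:
        return provider, model                 # cloud providers: never rerouted
    if in_flight >= OLLAMA_MAX_CONCURRENT or not ollama_healthy:
        return "google", OLLAMA_TO_GEMINI.get(model, "gemini-2.0-flash")
    return "ollama", model  # normal path; failures retry on Google via the same map
```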

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 22:44:09 +01:00
Till-JS
aba79f5c16 fix(mana-llm): fix SSE double data prefix causing message parsing issues
EventSourceResponse from sse-starlette adds its own 'data:' prefix,
so we should yield dicts with a 'data' key instead of pre-formatted
SSE strings. This was causing 'data: data:' double prefixes and
backticks appearing in chat messages.
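
In sketch form (chunk contents are illustrative; the dict-yielding
contract is sse-starlette's documented behaviour):

```python
from sse_starlette.sse import EventSourceResponse

async def stream_chunks(chunks):
    for chunk_json in chunks:
        # Wrong (what this fixes): yield f"data: {chunk_json}\n\n"
        # sse-starlette adds its own "data:" -> "data: data: ..." on the wire.
        yield {"data": chunk_json}  # let EventSourceResponse do the SSE framing

# Inside a FastAPI route handler:
#   return EventSourceResponse(stream_chunks(chunk_iter))
```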

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:29:11 +01:00
Till-JS
d605366460 feat(llm-playground): add model comparison feature
- Add modality detection (text/vision/code) to models store
- Create comparison store for parallel multi-model streaming
- Add ModelModalityFilter and ModelComparisonSelector components
- Add ComparisonResponseCard with metrics (duration, tokens, t/s)
- Add ComparisonMessageBubble for side-by-side response view
- Integrate comparison mode into ChatInput, MessageList, Sidebar
- Add dev:full script to start mana-llm + playground together
- Add start.sh script for mana-llm Python service
2026-01-31 23:30:16 +01:00
Till-JS
fdba0e3425 feat(llm-playground): add production deployment with auth
- Add Dockerfile for multi-stage Docker build
- Add mana-core-auth integration with login/register pages
- Add auth store using Svelte 5 runes
- Add protected route layout with auth guard
- Add health endpoint for container health checks
- Add runtime URL injection via hooks.server.ts
- Add logout button to header
- Update docker-compose.macmini.yml with llm-playground service
- Update cloudflared-config.yml with playground.mana.how route
- Update mana-llm CORS config for playground domain
- Update generate-env.mjs with auth URL variable

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 18:15:02 +01:00
Till-JS
3edbd0cb26 chore: update dependencies and mana-llm improvements
- Update pnpm-lock.yaml with matrix bot dependencies
- Add environment variables to generate-env.mjs
- Improve mana-llm config and ollama provider

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:50:58 +01:00
Till-JS
1495dbe476 feat(mana-llm): add central LLM abstraction service
Python/FastAPI service providing unified OpenAI-compatible API for
Ollama and cloud LLM providers (OpenRouter, Groq, Together).

Features:
- Chat completions with streaming (SSE)
- Vision/multimodal support
- Embeddings generation
- Multi-provider routing (provider/model format)
- Prometheus metrics
- Optional Redis caching
2026-01-29 22:01:00 +01:00