Commit graph

5 commits

Till JS
8a49e3ffd5 feat(mana-llm): M4 — observability, debug endpoints, SIGHUP reload
- `X-Mana-LLM-Resolved: <provider>/<model>` header on non-streaming
  responses. Streaming clients read the same info from each chunk's
  `model` field (SSE headers go out before the chain is walked).
- Three new Prometheus metrics: `mana_llm_alias_resolved_total{alias,
  target}` (which concrete model an alias resolved to per request),
  `mana_llm_fallback_total{from_model, to_model, reason}` (each
  fallback transition), `mana_llm_provider_healthy{provider}` (gauge,
  mirrors the circuit-breaker).
- New debug endpoints: `GET /v1/aliases` (registry inspection — chain
  + description per alias, useful for confirming SIGHUP reloads),
  `GET /v1/health` (full per-provider liveness snapshot — failure
  counter, last error, unhealthy-until backoff).
- `kill -HUP <pid>` reloads `aliases.yaml`. Parse errors leave the
  previous good state in memory and log the rejection.
- `ProviderHealthCache.add_listener()` for cache→metrics decoupling:
  the gauge is updated via a transition-only listener wired in main.py
  rather than the cache importing prometheus_client itself (see the
  sketch after this list).
- Request-side metrics now use the requested model string; success-side
  metrics use the resolved one. So `mana_llm_llm_requests_total{provider="ollama",
  model="gemma3:12b"}` reflects actual upstream load even when callers
  used `mana/long-form` aliases.
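
A minimal sketch of that listener wiring; the callback signature
`(provider, healthy)` is a guess, the metric name is from the list
above:

```python
from prometheus_client import Gauge

PROVIDER_HEALTHY = Gauge(
    "mana_llm_provider_healthy",
    "1 while the circuit breaker considers the provider healthy",
    ["provider"],
)

def wire_health_gauge(cache) -> None:
    # The cache never imports prometheus_client; main.py owns the coupling.
    def on_transition(provider: str, healthy: bool) -> None:
        PROVIDER_HEALTHY.labels(provider=provider).set(1 if healthy else 0)

    cache.add_listener(on_transition)
```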

16 new observability tests (test_m4_observability.py): listener
fire-on-transition semantics, exception-isolation, multi-listener,
counter increments, gauge writes, end-to-end alias→metric flow,
v1/aliases + v1/health endpoint shape, response.model carries the
resolved target after fallback. Total suite: 115/115 in 1.6s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:52:28 +02:00
Till JS
3046da3b19 feat(mana-llm): M3 — health-aware router with alias + chain fallback
Replaces the old Ollama→Google special-case auto-fallback with the
unified pipeline: the caller passes either a direct provider/model
string or an alias from the `mana/` namespace; the router resolves it
to a chain and walks it, skipping providers the ProviderHealthCache
(from M2) marks unhealthy, trying each entry in turn, marking a
provider unhealthy on retryable errors, and falling through to the
next (sketched below).
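
A minimal sketch of that walk under the names this log already uses
(`resolve`, `is_healthy`, `mark_unhealthy`); `is_retryable` is the
classification from the next paragraph, and everything else here is
illustrative rather than the actual router:

```python
async def complete(self, model: str, request: dict):
    chain = self.registry.resolve(model)  # direct strings resolve to [model]
    attempts = []  # (model, reason) pairs for the 503 body
    for entry in chain:
        provider, _, target = entry.partition("/")
        if not self.health.is_healthy(provider):
            attempts.append((entry, "skipped: unhealthy"))
            continue
        try:
            response = await self.providers[provider].complete(target, request)
        except Exception as exc:
            if not is_retryable(exc):
                raise  # caller errors propagate and never touch the cache
            self.health.mark_unhealthy(provider)
            attempts.append((entry, type(exc).__name__))
            continue
        self.health.mark_healthy(provider)
        return response
    raise NoHealthyProviderError(attempts)
```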

Retryable: ConnectError, ReadTimeout, RemoteProtocolError, 5xx,
ProviderRateLimitError. Propagated (don't fall back, don't poison the
cache): ProviderCapabilityError, ProviderAuthError, ProviderBlockedError,
4xx, unknown exception types. The cache stays "what the network told us
about this provider's liveness" — caller errors don't muddy that signal.
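
One possible shape for that split. The `httpx` exception types are the
ones named above, the `Provider*` classes are the service's own, and
mapping 5xx/4xx onto `httpx.HTTPStatusError` is an assumption:

```python
import httpx

RETRYABLE = (httpx.ConnectError, httpx.ReadTimeout, httpx.RemoteProtocolError)

def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, RETRYABLE) or isinstance(exc, ProviderRateLimitError):
        return True
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code >= 500  # 5xx falls back, 4xx propagates
    return False  # capability/auth/blocked and unknown types propagate
```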

Streaming: pre-first-byte fallback only. Once a chunk has been yielded
the provider is committed; mid-stream errors propagate as-is so we
don't splice two voices into one output.
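
A sketch of the pre-first-byte rule, with the health-skip and attempt
log trimmed for brevity (`anext` with a default needs Python 3.10+,
which the 3.12 dev container satisfies):

```python
async def stream(self, chain, request):
    for entry in chain:
        provider, _, target = entry.partition("/")
        try:
            agen = self.providers[provider].stream(target, request)
            first = await anext(agen, None)  # failures before this fall through
        except Exception as exc:
            if not is_retryable(exc):
                raise
            self.health.mark_unhealthy(provider)
            continue
        if first is not None:
            yield first  # provider committed from here on
        async for chunk in agen:  # mid-stream errors propagate as-is
            yield chunk
        return  # an empty stream also commits, without retry
    raise NoHealthyProviderError([])
```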

`NoHealthyProviderError` (HTTP 503) carries a structured attempt log —
each chain entry shows up as `(model, reason)` so the cause of a 503
is visible in the response and metrics, not only in service logs.
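
A plausible shape for that error; field and parameter names are
guesses:

```python
class NoHealthyProviderError(Exception):
    """Every chain entry failed or was skipped; the API maps this to 503."""

    def __init__(self, attempts: list[tuple[str, str]]):
        self.attempts = attempts  # [(model, reason), ...] per chain entry
        super().__init__(f"no healthy provider; attempts: {attempts!r}")
```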

main.py wires the lifespan: aliases.yaml is loaded, ProviderHealthCache
created, ProviderRouter takes both as constructor deps, HealthProbe
spawned with cheap HTTP probes per configured provider (Ollama
/api/tags, OpenAI-compat /v1/models with Bearer header). Google is
skipped — google-genai SDK has no obvious cheap probe; the call-site
fallback handles real errors.
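
A sketch of that wiring; the URL constant, `from_file`, and the exact
constructor signatures are illustrative, not the actual main.py:

```python
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

OLLAMA_URL = "http://localhost:11434"  # placeholder; really from config

@asynccontextmanager
async def lifespan(app: FastAPI):
    registry = AliasRegistry.from_file("aliases.yaml")
    health = ProviderHealthCache()
    app.state.router = ProviderRouter(registry=registry, health=health)

    async def probe_ollama() -> None:
        async with httpx.AsyncClient(timeout=3.0) as client:
            (await client.get(f"{OLLAMA_URL}/api/tags")).raise_for_status()

    # OpenAI-compat providers get an analogous GET /v1/models probe with
    # an Authorization: Bearer header; Google gets none (see above).
    probe = HealthProbe(health, probes={"ollama": probe_ollama})
    await probe.start()
    try:
        yield
    finally:
        await probe.stop()
```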

22 new router tests (test_router_fallback.py): chain walking, capability
& auth propagation, 5xx vs 4xx differentiation, rate-limit retry,
all-fail → NoHealthyProviderError, direct provider strings bypass
aliases, streaming pre-first-byte fallback, mid-stream-failure does
NOT fall back, empty stream commits without retry, cache feedback on
success/failure/non-retryable. Existing test_providers.py updated for
the new constructor signature; all 99 service tests green via the dev
container (Python 3.12).

Legacy purged: `_ollama_concurrent`, `_ollama_health_cache`,
`_can_fallback_to_google`, `_should_use_ollama`, `_fallback_to_google`,
`_get_ollama_health_cached` all gone. The `auto_fallback_enabled` /
`ollama_max_concurrent` settings remain in config.py for now (M5 will
remove them along with the per-feature env-var overrides).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:44:16 +02:00
Till JS
59557e62d7 feat(mana-llm): M2 — ProviderHealthCache + background probe loop
Per-provider liveness with circuit-breaker semantics. The router (M3)
will read `is_healthy()` to skip dead providers in a chain; the probe
loop and the call-site fallback handler write state via
`mark_healthy` / `mark_unhealthy`.

State machine: 1st failure stays healthy (transient blips happen);
2nd consecutive failure trips the breaker and sets a 60s backoff
window during which `is_healthy → False`. After the window the
provider is half-open again — next call exercises it, success
resets, failure re-arms.
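
A sketch of that breaker, with the clock injectable (the FakeClock in
the tests below implies as much); the real cache also carries the
listener hook M4 adds:

```python
import time

TRIP_AFTER = 2    # consecutive failures that trip the breaker
BACKOFF_S = 60.0  # unhealthy window

class ProviderHealthCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures: dict[str, int] = {}
        self._unhealthy_until: dict[str, float] = {}

    def mark_healthy(self, provider: str) -> None:
        self._failures[provider] = 0
        self._unhealthy_until.pop(provider, None)

    def mark_unhealthy(self, provider: str) -> None:
        self._failures[provider] = self._failures.get(provider, 0) + 1
        if self._failures[provider] >= TRIP_AFTER:  # 1st failure stays healthy
            self._unhealthy_until[provider] = self._clock() + BACKOFF_S

    def is_healthy(self, provider: str) -> bool:
        # Past the window the breaker is half-open: this returns True and
        # the next real call decides (success resets, failure re-arms).
        return self._clock() >= self._unhealthy_until.get(provider, 0.0)
```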

HealthProbe is the background asyncio.Task that pings every
registered provider every 30s with a 3s timeout. Probes run
concurrently per tick and one bad probe can't sink the loop. Probe
functions are injected (`{name: async-fn}`) so this module stays
decoupled from the provider classes — the wiring lives in main.py
where we already know which providers are configured.
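
A sketch of that loop; the 30s/3s numbers mirror the text above, the
rest is illustrative:

```python
import asyncio

class HealthProbe:
    def __init__(self, cache, probes, interval=30.0, timeout=3.0):
        self._cache, self._probes = cache, probes
        self._interval, self._timeout = interval, timeout
        self._task: asyncio.Task | None = None

    async def _probe_one(self, name, fn) -> None:
        try:
            await asyncio.wait_for(fn(), self._timeout)
            self._cache.mark_healthy(name)
        except Exception:
            self._cache.mark_unhealthy(name)  # a bad probe never sinks the tick

    async def _run(self) -> None:
        while True:
            await asyncio.gather(
                *(self._probe_one(n, f) for n, f in self._probes.items())
            )
            await asyncio.sleep(self._interval)

    async def start(self) -> None:
        if self._task is None:  # idempotent
            self._task = asyncio.create_task(self._run())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None
```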

32 new tests (FakeClock for deterministic backoff timing, slow-probe
helpers for parallelism + timeout, lifecycle tests for start/stop
idempotency and tick-after-error survival). 64/64 alias+health tests
green.

Not yet wired into the request path — that's M3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:29:57 +02:00
Till JS
dff8629e1d feat(mana-llm): M1 — AliasRegistry + aliases.yaml SSOT
First milestone of the LLM-fallback plan (docs/plans/llm-fallback-aliases.md).
Introduces the `mana/<class>` namespace; the registry parses + validates
aliases.yaml at startup and reloads on demand. The schema rejects
empty chains, chain entries without a provider prefix, alias names
outside the reserved namespace, a `default` pointing at an unknown
alias, etc.

Reload semantics: parse error keeps the previous good state in memory
so a typo + SIGHUP doesn't take the service down.
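
The keep-last-good behavior is swap-on-success; a sketch, with
`_parse_and_validate` hypothetical:

```python
class AliasRegistry:
    def reload(self, path: str = "aliases.yaml") -> None:
        # Build a fresh, fully validated mapping first; any parse or
        # schema error raises before the assignment, so the previous
        # good state stays in memory untouched.
        self._aliases = _parse_and_validate(path)
```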

5 aliases ship with the initial config: fast-text, long-form, structured,
reasoning, vision. Each chain ends with a cloud provider so the system
keeps working when the GPU server is offline.
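
An illustrative aliases.yaml shape consistent with those rules; the
exact keys and the cloud-tail model are guesses, not the shipped
config:

```yaml
aliases:
  mana/fast-text:
    description: low-latency general text
    chain:
      - ollama/gemma3:12b             # GPU server first
      - groq/llama-3.3-70b-versatile  # cloud tail keeps it alive offline
default: mana/fast-text
```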

32 unit tests covering happy path, schema validation, namespace check,
reload safety, and a guard that the shipped aliases.yaml itself parses.
M2 (health-cache + probe-loop) and M3 (router fallback execution) build
on this; aliases are not yet wired into the request path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:23:51 +02:00
Till-JS
1495dbe476 feat(mana-llm): add central LLM abstraction service
Python/FastAPI service providing a unified OpenAI-compatible API for
Ollama and cloud LLM providers (OpenRouter, Groq, Together).

Features:
- Chat completions with streaming (SSE)
- Vision/multimodal support
- Embeddings generation
- Multi-provider routing (provider/model format; example below)
- Prometheus metrics
- Optional Redis caching
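
A hypothetical first request against the service; host, port, and
payload details are assumptions, the `provider/model` string is the
routing format named above:

```python
import httpx

resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",  # assumed bind address
    json={
        "model": "ollama/gemma3:12b",  # provider/model routing format
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": False,
    },
    timeout=30.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```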
2026-01-29 22:01:00 +01:00