managarten/services/mana-llm/src
Till JS 59557e62d7 feat(mana-llm): M2 — ProviderHealthCache + background probe loop
Per-provider liveness with circuit-breaker semantics. The router (M3)
will read `is_healthy()` to skip dead providers in a chain; the probe
loop and the call-site fallback handler write state via
`mark_healthy` / `mark_unhealthy`.
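
One way a call site could combine these three methods is sketched below. This is not the M3 router, which does not exist yet in this commit. `call_with_fallback`, its parameters, and the error it raises are hypothetical names chosen for illustration; only `is_healthy`, `mark_healthy` and `mark_unhealthy` come from the description above, and their exact signatures are assumed.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(chain: Sequence[str], health, call: Callable[[str], T]) -> T:
    """Try providers in order, skipping those the health cache reports as dead.

    `health` is any object exposing is_healthy / mark_healthy / mark_unhealthy
    (assumed to take a provider name). `call` performs the actual request.
    """
    last_error: Optional[BaseException] = None
    for name in chain:
        if not health.is_healthy(name):
            continue  # breaker is open: do not spend a request on this provider
        try:
            result = call(name)
        except Exception as exc:
            health.mark_unhealthy(name)  # the call-site handler writes failure state
            last_error = exc
            continue  # fall through to the next provider in the chain
        health.mark_healthy(name)  # a success resets the provider's state
        return result
    raise RuntimeError("no healthy provider in chain") from last_error
```

The sketch is synchronous to keep it short; an async variant would await `call(name)` in the same place.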

State machine: the 1st failure leaves the provider healthy (transient
blips happen); the 2nd consecutive failure trips the breaker and sets
a 60s backoff window during which `is_healthy()` returns `False`.
After the window the provider is half-open again: the next call
exercises it, a success resets the state, and a failure re-arms the
backoff window.
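
The state machine can be sketched in a few lines. This is a minimal illustration under assumptions, not the code in health.py: the constructor parameters, the `_State` dataclass, and the injectable `clock` are invented here, and unknown providers are assumed to count as healthy.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _State:
    consecutive_failures: int = 0
    unhealthy_until: float = 0.0  # clock value at which the backoff window ends


class ProviderHealthCache:
    """Illustrative sketch only; the real class may differ."""

    def __init__(
        self,
        failure_threshold: int = 2,      # 2nd consecutive failure trips the breaker
        backoff_seconds: float = 60.0,   # 60s backoff window
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self._threshold = failure_threshold
        self._backoff = backoff_seconds
        self._clock = clock
        self._states: Dict[str, _State] = {}

    def _state(self, name: str) -> _State:
        return self._states.setdefault(name, _State())

    def is_healthy(self, name: str) -> bool:
        # False only inside the backoff window; afterwards the provider is
        # half-open and reported healthy so the next call can exercise it.
        return self._clock() >= self._state(name).unhealthy_until

    def mark_healthy(self, name: str) -> None:
        self._states[name] = _State()  # success resets everything

    def mark_unhealthy(self, name: str) -> None:
        state = self._state(name)
        state.consecutive_failures += 1
        if state.consecutive_failures >= self._threshold:
            # Trip the breaker, or re-arm it after a failed half-open attempt.
            state.unhealthy_until = self._clock() + self._backoff
```

Because the failure counter is only cleared by `mark_healthy`, a failure during the half-open phase re-arms the window immediately, without needing two fresh failures.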

HealthProbe is the background asyncio.Task that pings every
registered provider every 30s with a 3s timeout. Probes run
concurrently per tick and one bad probe can't sink the loop. Probe
functions are injected (`{name: async-fn}`) so this module stays
decoupled from the provider classes — the wiring lives in main.py
where we already know which providers are configured.
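
A rough sketch of such a probe loop follows. The method names (`tick`, `start`, `stop`), the constructor signature, and the convention that a probe function signals failure by raising are all assumptions for illustration; only the 30s interval, the 3s timeout, the concurrent ticks, and the injected `{name: async-fn}` mapping come from the description above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

# Assumption: a probe is a zero-argument async function that raises on failure.
ProbeFn = Callable[[], Awaitable[None]]


class HealthProbe:
    """Illustrative sketch only; the real class may differ."""

    def __init__(self, cache, probes: Dict[str, ProbeFn],
                 interval: float = 30.0, timeout: float = 3.0) -> None:
        self._cache = cache      # needs mark_healthy(name) / mark_unhealthy(name)
        self._probes = probes
        self._interval = interval
        self._timeout = timeout
        self._task: Optional[asyncio.Task] = None

    async def _probe_one(self, name: str, fn: ProbeFn) -> None:
        try:
            await asyncio.wait_for(fn(), timeout=self._timeout)
        except Exception:
            # Timeouts and probe errors both count as a failure. CancelledError
            # is a BaseException, so cancellation still propagates.
            self._cache.mark_unhealthy(name)
        else:
            self._cache.mark_healthy(name)

    async def tick(self) -> None:
        # All probes of one tick run concurrently; each handles its own errors,
        # so one bad probe cannot abort the others or the loop.
        await asyncio.gather(*(self._probe_one(n, f) for n, f in self._probes.items()))

    async def _run(self) -> None:
        while True:
            await self.tick()
            await asyncio.sleep(self._interval)

    def start(self) -> None:
        # Idempotent: a second start() while running does nothing.
        if self._task is None or self._task.done():
            self._task = asyncio.get_running_loop().create_task(self._run())

    async def stop(self) -> None:
        # Idempotent: stop() without a running task returns immediately.
        if self._task is None:
            return
        task, self._task = self._task, None
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
```

Passing plain async callables keeps the class unaware of any provider type, which matches the decoupling described above: whatever builds the mapping decides what "ping" means for each provider.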

32 new tests (FakeClock for deterministic backoff timing, slow-probe
helpers for parallelism + timeout, lifecycle tests for start/stop
idempotency and tick-after-error survival). 64/64 alias+health tests
green.
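
For readers unfamiliar with the fake-clock technique, here is a small sketch of how it makes backoff timing deterministic. Only the name `FakeClock` appears in the commit message; its interface here (`__call__`, `advance`) is a guess, and `BackoffWindow` is a hypothetical stand-in for the time-dependent part of the cache.

```python
from typing import Callable


class FakeClock:
    """A clock that only moves when the test says so (interface assumed)."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class BackoffWindow:
    """Hypothetical stand-in for code that depends on elapsed time."""

    def __init__(self, seconds: float, clock: Callable[[], float]) -> None:
        self._seconds = seconds
        self._clock = clock
        self._until = float("-inf")  # not armed yet

    def arm(self) -> None:
        self._until = self._clock() + self._seconds

    def active(self) -> bool:
        return self._clock() < self._until


def test_backoff_expires_after_60s() -> None:
    clock = FakeClock(start=1000.0)
    window = BackoffWindow(60.0, clock)
    window.arm()
    clock.advance(59.0)
    assert window.active()       # still inside the window
    clock.advance(1.0)
    assert not window.active()   # exactly 60s later the window has ended


test_backoff_expires_after_60s()
```

No real time passes in the test, so it runs instantly and cannot flake on a slow machine.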

Not yet wired into the request path — that's M3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:29:57 +02:00
models feat(mana-llm): add OpenAI-style tools + tool_calls passthrough 2026-04-20 15:22:48 +02:00
providers feat(mana-llm): add OpenAI-style tools + tool_calls passthrough 2026-04-20 15:22:48 +02:00
streaming fix(mana-llm): surface Gemini finish_reason errors instead of returning "" 2026-04-20 15:15:37 +02:00
utils feat(mana-llm): add central LLM abstraction service 2026-01-29 22:01:00 +01:00
__init__.py feat(mana-llm): add central LLM abstraction service 2026-01-29 22:01:00 +01:00
aliases.py feat(mana-llm): M1 — AliasRegistry + aliases.yaml SSOT 2026-04-26 20:23:51 +02:00
api_auth.py chore(ai-services): adopt Windows GPU as source of truth for llm/stt/tts 2026-04-08 12:46:03 +02:00
config.py chore(cloud-tier): upgrade default model gemini-2.0-flash → gemini-2.5-flash 2026-04-16 12:32:03 +02:00
health.py feat(mana-llm): M2 — ProviderHealthCache + background probe loop 2026-04-26 20:29:57 +02:00
health_probe.py feat(mana-llm): M2 — ProviderHealthCache + background probe loop 2026-04-26 20:29:57 +02:00
main.py fix(mana-llm): surface Gemini finish_reason errors instead of returning "" 2026-04-20 15:15:37 +02:00