Commit graph

21 commits

Author SHA1 Message Date
Till JS
164d5dab8b fix(mana-llm): copy aliases.yaml into Docker image
main.py's lifespan handler loads `Path(__file__).parent.parent /
'aliases.yaml'` (= /app/aliases.yaml) on startup. The Dockerfile only
copied `src/`, so prod containers always crashlooped on first start
with `AliasConfigError: alias config not found at /app/aliases.yaml`
— which is why mana-llm has been silently absent from prod. Surfaced
today after a manual `gh workflow run cd-macmini.yml -f service=mana-llm`
actually attempted to start the container instead of relying on a
long-stale image.
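
For reference, the path arithmetic behind the crash, as a tiny sketch
(the /app/src/main.py location is inferred from the Dockerfile copying
src/; this is illustrative, not the lifespan code itself):

```python
from pathlib import Path

# With main.py copied to /app/src/main.py, parent.parent is /app,
# so the image must also contain /app/aliases.yaml next to src/.
main_py = Path("/app/src/main.py")
print(main_py.parent.parent / "aliases.yaml")  # -> /app/aliases.yaml
```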

Tested locally: container now starts cleanly, /health returns 200,
and `/v1/aliases` lists the configured chains.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:47:48 +02:00
Till JS
8a49e3ffd5 feat(mana-llm): M4 — observability, debug endpoints, SIGHUP reload
- `X-Mana-LLM-Resolved: <provider>/<model>` header on non-streaming
  responses. Streaming clients read the same info from each chunk's
  `model` field (SSE headers go out before the chain is walked).
- Three new Prometheus metrics: `mana_llm_alias_resolved_total{alias,
  target}` (which concrete model an alias resolved to per request),
  `mana_llm_fallback_total{from_model, to_model, reason}` (each
  fallback transition), `mana_llm_provider_healthy{provider}` (gauge,
  mirrors the circuit-breaker).
- New debug endpoints: `GET /v1/aliases` (registry inspection — chain
  + description per alias, useful for confirming SIGHUP reloads),
  `GET /v1/health` (full per-provider liveness snapshot — failure
  counter, last error, unhealthy-until backoff).
- `kill -HUP <pid>` reloads `aliases.yaml`. Parse errors leave the
  previous good state in memory and log the rejection.
- `ProviderHealthCache.add_listener()` for cache→metrics decoupling:
  the gauge is updated via a transition-only listener wired in main.py
  rather than the cache importing prometheus_client itself (see the
  sketch after this list).
- Request-side metrics now use the requested model string; success-side
  metrics use the resolved one. So
  `mana_llm_llm_requests_total{provider="ollama", model="gemma3:12b"}`
  reflects actual upstream load even when callers used `mana/long-form`
  aliases.
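
A minimal sketch of that listener wiring (the gauge name comes from this
commit; the listener signature and the `wire_health_metrics` helper are
assumptions for illustration):

```python
from prometheus_client import Gauge

# Gauge mirroring the circuit-breaker state, as named in this commit.
PROVIDER_HEALTHY = Gauge(
    "mana_llm_provider_healthy", "1 if the provider is healthy", ["provider"]
)

def wire_health_metrics(cache) -> None:
    """Hypothetical one-time wiring, called from main.py's lifespan."""
    def on_transition(provider: str, healthy: bool) -> None:
        # Assumed listener signature: fired only when health state changes.
        PROVIDER_HEALTHY.labels(provider=provider).set(1 if healthy else 0)

    cache.add_listener(on_transition)
```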

16 new observability tests (test_m4_observability.py): listener
fire-on-transition semantics, exception-isolation, multi-listener,
counter increments, gauge writes, end-to-end alias→metric flow,
v1/aliases + v1/health endpoint shape, response.model carries the
resolved target after fallback. Total suite: 115/115 in 1.6s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:52:28 +02:00
Till JS
3046da3b19 feat(mana-llm): M3 — health-aware router with alias + chain fallback
Replaces the old Ollama→Google special-case auto-fallback with the
unified pipeline: the caller passes either a direct provider/model or an
alias from the `mana/` namespace; the router resolves it to a chain and
walks it, skipping unhealthy providers (per ProviderHealthCache from M2),
trying each entry, marking a provider unhealthy on retryable errors and
falling through to the next.

Retryable: ConnectError, ReadTimeout, RemoteProtocolError, 5xx,
ProviderRateLimitError. Propagated (don't fall back, don't poison the
cache): ProviderCapabilityError, ProviderAuthError, ProviderBlockedError,
4xx, unknown exception types. The cache stays "what the network told us
about this provider's liveness" — caller errors don't muddy that signal.

Streaming: pre-first-byte fallback only. Once a chunk has been yielded
the provider is committed; mid-stream errors propagate as-is so we
don't splice two voices into one output.

`NoHealthyProviderError` (HTTP 503) carries a structured attempt log —
each chain entry shows up as `(model, reason)` so the cause of a 503
is visible in the response and metrics, not only in service logs.
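
For orientation, a condensed sketch of that walk (exception names come
from this commit and providers/errors.py; the import path, call
signatures, and the 5xx / unknown-exception / streaming handling are
simplified assumptions):

```python
import httpx

# Error types named in this commit; the exact import path is assumed.
from providers.errors import (
    NoHealthyProviderError, ProviderAuthError, ProviderBlockedError,
    ProviderCapabilityError, ProviderRateLimitError,
)

RETRYABLE = (httpx.ConnectError, httpx.ReadTimeout,
             httpx.RemoteProtocolError, ProviderRateLimitError)
PROPAGATED = (ProviderCapabilityError, ProviderAuthError, ProviderBlockedError)

async def walk_chain(chain, cache, call_provider, request):
    """Try each (provider, model) entry in order, skipping unhealthy ones."""
    attempts = []  # (model, reason) pairs carried by the eventual 503
    for provider, model in chain:
        if not cache.is_healthy(provider):
            attempts.append((model, "skipped: provider unhealthy"))
            continue
        try:
            result = await call_provider(provider, model, request)
        except PROPAGATED:
            raise  # caller error: no fallback, cache left untouched
        except RETRYABLE as exc:
            cache.mark_unhealthy(provider)  # liveness signal: mark + fall through
            attempts.append((model, type(exc).__name__))
            continue
        cache.mark_healthy(provider)
        return result
    raise NoHealthyProviderError(attempts)  # surfaced as HTTP 503
```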

main.py wires the lifespan: aliases.yaml is loaded, ProviderHealthCache
created, ProviderRouter takes both as constructor deps, HealthProbe
spawned with cheap HTTP probes per configured provider (Ollama
/api/tags, OpenAI-compat /v1/models with Bearer header). Google is
skipped — google-genai SDK has no obvious cheap probe; the call-site
fallback handles real errors.

22 new router tests (test_router_fallback.py): chain walking, capability
& auth propagation, 5xx vs 4xx differentiation, rate-limit retry,
all-fail → NoHealthyProviderError, direct provider strings bypass
aliases, streaming pre-first-byte fallback, mid-stream-failure does
NOT fall back, empty stream commits without retry, cache feedback on
success/failure/non-retryable. Existing test_providers.py updated for
the new constructor signature; all 99 service tests green via the dev
container (Python 3.12).

Legacy purged: `_ollama_concurrent`, `_ollama_health_cache`,
`_can_fallback_to_google`, `_should_use_ollama`, `_fallback_to_google`,
`_get_ollama_health_cached` all gone. The `auto_fallback_enabled` /
`ollama_max_concurrent` settings remain in config.py for now (M5 will
remove them along with the per-feature env-var overrides).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:44:16 +02:00
Till JS
59557e62d7 feat(mana-llm): M2 — ProviderHealthCache + background probe loop
Per-provider liveness with circuit-breaker semantics. The router (M3)
will read `is_healthy()` to skip dead providers in a chain; the probe
loop and the call-site fallback handler write state via
`mark_healthy` / `mark_unhealthy`.

State machine: 1st failure stays healthy (transient blips happen);
2nd consecutive failure trips the breaker and sets a 60s backoff
window during which `is_healthy → False`. After the window the
provider is half-open again — next call exercises it, success
resets, failure re-arms.
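
The breaker in sketch form (thresholds are the ones in this commit;
everything beyond `mark_healthy` / `mark_unhealthy` / `is_healthy` is
illustrative naming):

```python
import time

class ProviderHealthCacheSketch:
    """Illustrative breaker: 2 consecutive failures trip a 60s backoff."""

    FAILURE_THRESHOLD = 2
    BACKOFF_SECONDS = 60.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock                       # injectable, FakeClock-style
        self._failures: dict[str, int] = {}
        self._unhealthy_until: dict[str, float] = {}

    def mark_healthy(self, provider: str) -> None:
        self._failures[provider] = 0
        self._unhealthy_until.pop(provider, None)

    def mark_unhealthy(self, provider: str) -> None:
        self._failures[provider] = self._failures.get(provider, 0) + 1
        if self._failures[provider] >= self.FAILURE_THRESHOLD:
            self._unhealthy_until[provider] = self._clock() + self.BACKOFF_SECONDS

    def is_healthy(self, provider: str) -> bool:
        until = self._unhealthy_until.get(provider)
        if until is None:
            return True                # healthy, or only a single transient blip
        return self._clock() >= until  # half-open once the window has passed
```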

HealthProbe is the background asyncio.Task that pings every
registered provider every 30s with a 3s timeout. Probes run
concurrently per tick and one bad probe can't sink the loop. Probe
functions are injected (`{name: async-fn}`) so this module stays
decoupled from the provider classes — the wiring lives in main.py
where we already know which providers are configured.
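
And the probe loop, reduced to its core (interval and timeout from this
commit; the real HealthProbe adds the start/stop lifecycle this omits):

```python
import asyncio

async def probe_loop(probes, cache, interval=30.0, timeout=3.0):
    """Illustrative tick loop; `probes` is {provider_name: async probe fn}."""
    async def run_one(name, fn):
        try:
            await asyncio.wait_for(fn(), timeout)
            cache.mark_healthy(name)
        except Exception:
            cache.mark_unhealthy(name)  # a failing probe never escapes the tick

    while True:
        # All probes for one tick run concurrently.
        await asyncio.gather(*(run_one(n, f) for n, f in probes.items()))
        await asyncio.sleep(interval)
```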

32 new tests (FakeClock for deterministic backoff timing, slow-probe
helpers for parallelism + timeout, lifecycle tests for start/stop
idempotency and tick-after-error survival). 64/64 alias+health tests
green.

Not yet wired into the request path — that's M3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:29:57 +02:00
Till JS
dff8629e1d feat(mana-llm): M1 — AliasRegistry + aliases.yaml SSOT
First milestone of the LLM-fallback plan (docs/plans/llm-fallback-aliases.md).
Introduces the `mana/<class>` namespace; the registry parses + validates
aliases.yaml at startup and reloads on demand. Schema validation rejects
empty chains, missing provider prefixes, alias names outside the reserved
namespace, default→unknown references, etc.

Reload semantics: parse error keeps the previous good state in memory
so a typo + SIGHUP doesn't take the service down.

5 aliases ship with the initial config: fast-text, long-form, structured,
reasoning, vision. Each chain ends with a cloud provider so the system
keeps working when the GPU server is offline.
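
To make the shape concrete, a hypothetical aliases.yaml fragment plus
the kind of checks the registry applies (the real schema keys and
validation live in the M1 code; the chain entries and top-level layout
here are guesses for illustration):

```python
import yaml  # PyYAML

EXAMPLE = """
aliases:
  mana/fast-text:
    description: quick conversational replies
    chain:
      - ollama/gemma3:12b
      - google/gemini-2.5-flash   # chain ends on a cloud provider
"""

config = yaml.safe_load(EXAMPLE)
for name, spec in config["aliases"].items():
    # Namespace check: only mana/<class> names are accepted.
    assert name.startswith("mana/"), f"{name} outside reserved namespace"
    # Chain checks: non-empty, every entry carries a provider prefix.
    assert spec["chain"], f"{name} has an empty chain"
    assert all("/" in entry for entry in spec["chain"])
```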

32 unit tests covering happy path, schema validation, namespace check,
reload safety, and a guard that the shipped aliases.yaml itself parses.
M2 (health-cache + probe-loop) and M3 (router fallback execution) build
on this; aliases are not yet wired into the request path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:23:51 +02:00
Till JS
da373491b8 chore(mana-llm): thread GOOGLE_API_KEY + default model into local compose
Matches the macmini compose — Google Gemini was already wired in the
provider adapter (commit 2 of the function-calling migration), but the
local dev stack's compose never passed the env through, so the
container booted without the provider and every tool-calling request
fell back to Ollama (unreachable in local dev; it's a LAN-only GPU box).

With this in place the local mana-llm healthcheck reports both
`google` and `openrouter` as healthy and the webapp planner hits
Gemini Flash for real.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:42:21 +02:00
Till JS
e757470cb0 feat(mana-llm): add OpenAI-style tools + tool_calls passthrough
Extends the chat-completions surface so callers can ask any provider
to call named functions and get structured tool_calls back. Wired
through all three provider adapters so the planner and companion can
switch off the fragile JSON-parsing pathway.

- Request: tools[], tool_choice, assistant tool_calls, tool-role
  messages with tool_call_id.
- Response: MessageResponse.tool_calls, Choice.finish_reason adds
  "tool_calls", DeltaContent streams tool_calls.
- Google provider: Tool(function_declarations=...) build, result
  normalised (args dict → JSON string), function_response parts on
  a user turn for tool-role messages.
- OpenAI-compat: 1:1 passthrough of the OpenAI spec.
- Ollama: /api/chat passthrough; model-level capability check via a
  TOOL_CAPABLE_OLLAMA_PATTERNS whitelist (llama3.1+, qwen2.5+,
  mistral, command-r, …) — unsupported models rejected rather than
  silently falling back to prose.
- Router: model_supports_tools() check upfront for both streaming
  and non-streaming paths; ProviderCapabilityError bubbles as 400.

No silent downgrade. Missing tool support = explicit error.
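
As a usage illustration, the kind of request body this enables (the
payload values and the `get_plan` tool are made up; the field names
follow the OpenAI tools spec mirrored here):

```python
request_body = {
    "model": "google/gemini-2.5-flash",
    "messages": [
        {"role": "user", "content": "What's on the plan for tomorrow?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_plan",
                "description": "Fetch the plan for a given date",
                "parameters": {
                    "type": "object",
                    "properties": {"date": {"type": "string"}},
                    "required": ["date"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}
# A tool call comes back as choices[0].message.tool_calls with
# finish_reason == "tool_calls"; the caller then replies with a
# role="tool" message carrying the matching tool_call_id.
```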

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:22:48 +02:00
Till JS
4b8fede7fc fix(mana-llm): surface Gemini finish_reason errors instead of returning ""
The google provider called response.text after a chat completion and
passed the resulting string downstream unchanged. When Gemini's content
filter, recitation guard, or max_tokens ceiling fired, response.text
quietly returned "" — which the planner then reported as "no JSON block
found", masking the real cause. Empirically this failed in 45 ms on a
simple Quiz mission.

Introduces providers/errors.py with a small ProviderError hierarchy
(Blocked / Truncated / Auth / RateLimit / Capability). google.py now
inspects response.candidates[0].finish_reason and raises the matching
structured error; the non-streaming path maps it to 422/502/429 via a
new except-branch in main.py, and the streaming path surfaces the kind
as the SSE error type. Capability is wired but not yet used — it lands
with the tool-schema passthrough in the next commit.
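
The mapping in sketch form (Blocked/Truncated names mirror the new
hierarchy; the finish_reason handling shown is a simplification of what
google.py actually does):

```python
# Minimal stand-ins for the hierarchy introduced in providers/errors.py.
class ProviderBlockedError(Exception): ...
class ProviderTruncatedError(Exception): ...

def check_finish_reason(response):
    """Raise a structured error instead of letting response.text == '' flow on."""
    candidate = response.candidates[0]
    reason = getattr(candidate.finish_reason, "name", str(candidate.finish_reason))
    if reason in ("SAFETY", "RECITATION"):   # content filter / recitation guard
        raise ProviderBlockedError(f"Gemini blocked the response ({reason})")
    if reason == "MAX_TOKENS":               # output ceiling hit
        raise ProviderTruncatedError("Gemini stopped at the max_tokens ceiling")
    # Anything else (normal STOP, etc.) falls through as a regular completion.
```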

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:15:37 +02:00
Till JS
3be4612f04 fix(mana-llm): google-genai v1.73 keyword-only Part.from_text()
google-genai >=1.70 changed Part.from_text() from a positional to a
keyword-only argument. The production container installed v1.73.1
and crashed on startup with "Part.from_text() takes 1 positional
argument but 2 were given".

Fix: Part.from_text(msg.content) → Part.from_text(text=msg.content)

Tested live: curl https://llm.mana.how/v1/chat/completions with
model=google/gemini-2.5-flash returns correct response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:47:23 +02:00
Till JS
659a7d9774 fix(mana-llm): add google-genai to requirements.txt for Docker builds
google-genai was in pyproject.toml but missing from requirements.txt.
The Dockerfile uses pip install -r requirements.txt, so the Google
provider never loaded in production. Now that the key is set and the
cloud tier upgraded to gemini-2.5-flash, the import fires on startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:40:30 +02:00
Till JS
8a0bf93699 chore(cloud-tier): upgrade default model gemini-2.0-flash → gemini-2.5-flash
gemini-2.0-flash is deprecated effective June 1, 2026. gemini-2.5-flash
has been stable since Q1 2026 with similar pricing ($0.15/$0.60 per 1M
tokens vs $0.10/$0.40 — the pricing table already had the entry).

Three files touched:
- packages/shared-llm/src/backends/cloud.ts — client default
- services/mana-llm/src/config.py — server default
- services/mana-llm/src/providers/google.py — Ollama→Gemini fallback
  map + constructor default + deduplicated model list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:32:03 +02:00
Till JS
5520f1385e fix(mana-llm): add response_format to ChatCompletionRequest model
The first iteration of the Ollama response_format passthrough crashed
with 'ChatCompletionRequest object has no attribute response_format'
because the Pydantic request model didn't declare the field at all —
incoming response_format from OpenAI-compatible clients was being
silently dropped at the parsing layer before the provider could see it.

Fix: declare a typed ResponseFormat sub-model with the two OpenAI shapes
('json_object' and 'json_schema'), add it as an optional field on
ChatCompletionRequest, and let the Ollama provider read it directly
without defensive getattr fallbacks.
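
A sketch of the shape this adds (field names mirror the OpenAI
response_format spec; the real ChatCompletionRequest carries many more
fields than shown):

```python
from typing import Any, Literal, Optional
from pydantic import BaseModel

class ResponseFormat(BaseModel):
    type: Literal["json_object", "json_schema"]
    # For "json_schema", OpenAI nests the actual schema under
    # {"json_schema": {"name": ..., "schema": {...}}}.
    json_schema: Optional[dict[str, Any]] = None

class ChatCompletionRequest(BaseModel):
    model: str
    messages: list[dict[str, Any]]
    response_format: Optional[ResponseFormat] = None  # the missing field
```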

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:50:54 +02:00
Till JS
3ef095aaff fix(mana-llm/ollama): pass response_format to Ollama + strip markdown fences
The Ollama provider was completely ignoring `response_format` from the
incoming OpenAI-compatible request. Two consequences:

  1. Clients that asked for `{"type":"json_object"}` or
     `{"type":"json_schema",...}` got back JSON wrapped in
     ```json ... ``` markdown fences, because Ollama defaults to
     conversational output.
  2. Strict downstream parsers (Vercel AI SDK `generateObject`,
     manual `JSON.parse`) failed to decode the response and threw,
     even though the underlying JSON was valid inside the fences.

Fix: when response_format is set, translate it to Ollama's native
`format` field:

  - `{"type":"json_object"}` → `format: "json"`
  - `{"type":"json_schema","json_schema":{"schema":{...}}}`
    → `format: <the schema dict>` (Ollama 0.5+ supports full JSON
    schemas in the format field)

Defensive belt-and-suspenders: a small `_strip_json_fences` helper
runs after the Ollama response is decoded and removes any leftover
```json ... ``` wrapping. Some older vision models still wrap
output in fences even when `format` is set; this catches them.
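
In sketch form, assuming a plain-dict view of response_format (the real
provider reads the typed field added in the follow-up commit; helper
names match this one):

```python
import re

def to_ollama_format(response_format):
    """Map OpenAI response_format shapes onto Ollama's native `format` field."""
    if not response_format:
        return None
    if response_format.get("type") == "json_object":
        return "json"
    if response_format.get("type") == "json_schema":
        # Ollama 0.5+ accepts a full JSON Schema dict here.
        return response_format["json_schema"]["schema"]
    return None

FENCE = "`" * 3  # spelled this way to keep literal backticks out of this block
_FENCE_RE = re.compile(
    rf"^\s*{FENCE}(?:json)?\s*\n?(.*?)\n?\s*{FENCE}\s*$", re.DOTALL
)

def _strip_json_fences(text: str) -> str:
    """Belt-and-suspenders: drop a leftover markdown JSON fence, if present."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text
```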

Streaming path is unchanged because the nutriphi/planta refactor uses
non-streaming `generateObject`. Streaming structured output with
Ollama deserves its own pass when someone actually needs it.

Discovered during the AI SDK + Zod refactor smoke test — neither the
old nor the new vision routes ever returned validated JSON locally
because of this bug. Production uses Google Gemini directly via
fallback so the issue was masked there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 18:12:01 +02:00
Till JS
bfeeef7819 chore(matrix): final scrub of stale matrix references
A grep audit after the previous matrix removal commits found a handful
of stragglers in non-runtime files that the earlier sweeps missed:

- services/mana-llm/CLAUDE.md: removed matrix-ollama-bot from the
  consumer-apps diagram and from the related-services table
- services/mana-video-gen/CLAUDE.md: removed "Matrix Bots" integration
  bullet
- packages/notify-client/README.md: removed sendMatrix() doc entry
  (the method itself was already gone in the prior cleanup)
- docker/grafana/dashboards/logs-explorer.json: dropped the "Matrix
  Stack" log row that queried tier="matrix" (would show no data forever)
- docker/grafana/dashboards/master-overview.json: dropped the "Matrix
  Bots" stat panel that counted up{job=~"matrix-.*-bot"}
- apps/mana/apps/landing/src/data/ecosystem-health.json: regenerated via
  scripts/ecosystem-audit.mjs to drop matrix from the app list, icon
  counts, file analytics, top offenders and authGuard missing list
- .gitignore: removed services/matrix-stt-bot/data/ pattern (the
  service itself was deleted long ago)

Production-side stragglers also addressed (not in this commit):
- DROP USER synapse on prod Postgres (the parallel cleanup commit
  2514831a3 dropped DATABASE matrix + DATABASE synapse but left the
  role behind)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:47:54 +02:00
Till JS
b8e18b7f82 chore(ai-services): adopt Windows GPU as source of truth for llm/stt/tts
The Windows GPU server has been the actual production home for these
services for some time, and the running code there has drifted ahead of
the repo. This sync pulls the live versions back into the repo so the
Windows box is no longer the only place those changes exist.

Pulled from C:\mana\services\* on mana-server-gpu (192.168.178.11):

mana-llm:
- src/main.py, src/config.py — small fixes (auth wiring, config tweaks)
- src/api_auth.py — NEW (cross-service GPU_API_KEY validator)
- service.pyw — Windows runner used by the ManaLLM scheduled task
  (sets up logging redirect, loads .env, calls uvicorn)

mana-stt:
- app/main.py — substantial cleanup (684→392 lines), drops the
  whisperx-as-separate-backend branching now that whisper_service.py
  rolls whisperx in directly
- app/whisper_service.py — full CUDA + whisperx rewrite (158→358 lines)
- app/auth.py + external_auth.py — significantly expanded auth
- app/vram_manager.py — NEW (shared VRAM accounting helper)
- service.pyw — Windows runner with CUDA pre-init, FFmpeg PATH
  injection, .env loading
- removed: app/whisper_service_cuda.py (folded into whisper_service.py)
- removed: app/whisperx_service.py (folded into whisper_service.py)

mana-tts:
- app/auth.py, external_auth.py — same auth expansion as stt
- app/f5_service.py, kokoro_service.py — Windows tweaks
- app/vram_manager.py — NEW (same shared helper as stt)
- service.pyw — Windows runner

mana-video-gen:
- service.pyw — Windows runner (no other changes; the .py code on the
  GPU box is byte-identical to what's already in the repo)

The service.pyw files contain absolute Windows paths
(C:\mana\services\<svc>) and a hardcoded FFmpeg PATH for the tills user
profile. Kept as-is intentionally — they exist to be deployed to that
one machine and any abstraction layer would just hide what's actually
happening. Anyone redeploying to a different layout will need to edit
the path strings, which is a known and obvious change.

Mac-Mini infrastructure for these services (launchd plists, install
scripts, scripts/mac-mini/setup-{stt,tts}.sh, the Mac-flux2c image-gen
implementation) is still on disk and will be removed in a follow-up
commit, along with replacing mana-image-gen with the Windows
diffusers+CUDA implementation. This commit is just the live-code sync.
2026-04-08 12:46:03 +02:00
Till JS
45063b88be feat(mana-llm): add Google Gemini fallback provider with auto-routing
Add Google Gemini as a fallback provider that activates automatically
when Ollama is overloaded or unavailable, so LLM requests still succeed
when the local GPU is saturated or down.

New provider (src/providers/google.py):
- Full LLMProvider implementation using google-genai SDK
- Chat completions (streaming + non-streaming)
- Vision/multimodal support (base64 images)
- Embeddings via text-embedding-004
- Model mapping: Ollama models → Gemini equivalents
  (gemma3:4b → gemini-2.0-flash, llava:7b → gemini-2.0-flash, etc.)

Auto-fallback routing (src/providers/router.py):
- Concurrent request tracking for Ollama (OLLAMA_MAX_CONCURRENT=3)
- When Ollama concurrent > max: route to Google automatically
- When Ollama fails: retry on Google with model mapping
- Health check caching (5s TTL) to avoid hammering Ollama
- Non-Ollama providers (openrouter, groq, together) are never fallback-routed
- Fallback info included in /health endpoint response

New config (src/config.py):
- GOOGLE_API_KEY: enables Google provider
- GOOGLE_DEFAULT_MODEL: default gemini-2.0-flash
- AUTO_FALLBACK_ENABLED: toggle fallback (default: true)
- OLLAMA_MAX_CONCURRENT: concurrent request threshold (default: 3)
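
The routing decision in sketch form (helper shape and names are
illustrative; this is the pre-M3 behaviour that the later milestones
replace):

```python
OLLAMA_MAX_CONCURRENT = 3  # config default from this commit
OLLAMA_TO_GEMINI = {"gemma3:4b": "gemini-2.0-flash", "llava:7b": "gemini-2.0-flash"}

def pick_target(provider: str, model: str, in_flight: int,
                ollama_healthy: bool, auto_fallback: bool = True):
    """Return the (provider, model) pair a request should be sent to."""
    if provider != "ollama" or not auto_fallback:
        return provider, model                 # cloud providers: never rerouted
    if in_flight >= OLLAMA_MAX_CONCURRENT or not ollama_healthy:
        return "google", OLLAMA_TO_GEMINI.get(model, "gemini-2.0-flash")
    return "ollama", model  # normal path; failures retry on Google via the same map
```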

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 22:44:09 +01:00
Till-JS
aba79f5c16 fix(mana-llm): fix SSE double data prefix causing message parsing issues
EventSourceResponse from sse-starlette adds its own 'data:' prefix,
so we should yield dicts with a 'data' key instead of pre-formatted
SSE strings. This was causing 'data: data:' double prefixes and
backticks appearing in chat messages.
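
In sketch form (chunk contents are illustrative; the dict-yielding
contract is sse-starlette's documented behaviour):

```python
from sse_starlette.sse import EventSourceResponse

async def stream_chunks(chunks):
    for chunk_json in chunks:
        # Wrong (what this fixes): yield f"data: {chunk_json}\n\n"
        # sse-starlette adds its own "data:" -> "data: data: ..." on the wire.
        yield {"data": chunk_json}  # let EventSourceResponse do the SSE framing

# Inside a FastAPI route handler:
#   return EventSourceResponse(stream_chunks(chunk_iter))
```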

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:29:11 +01:00
Till-JS
d605366460 feat(llm-playground): add model comparison feature
- Add modality detection (text/vision/code) to models store
- Create comparison store for parallel multi-model streaming
- Add ModelModalityFilter and ModelComparisonSelector components
- Add ComparisonResponseCard with metrics (duration, tokens, t/s)
- Add ComparisonMessageBubble for side-by-side response view
- Integrate comparison mode into ChatInput, MessageList, Sidebar
- Add dev:full script to start mana-llm + playground together
- Add start.sh script for mana-llm Python service
2026-01-31 23:30:16 +01:00
Till-JS
fdba0e3425 feat(llm-playground): add production deployment with auth
- Add Dockerfile for multi-stage Docker build
- Add mana-core-auth integration with login/register pages
- Add auth store using Svelte 5 runes
- Add protected route layout with auth guard
- Add health endpoint for container health checks
- Add runtime URL injection via hooks.server.ts
- Add logout button to header
- Update docker-compose.macmini.yml with llm-playground service
- Update cloudflared-config.yml with playground.mana.how route
- Update mana-llm CORS config for playground domain
- Update generate-env.mjs with auth URL variable

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 18:15:02 +01:00
Till-JS
3edbd0cb26 chore: update dependencies and mana-llm improvements
- Update pnpm-lock.yaml with matrix bot dependencies
- Add environment variables to generate-env.mjs
- Improve mana-llm config and ollama provider

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 17:50:58 +01:00
Till-JS
1495dbe476 feat(mana-llm): add central LLM abstraction service
Python/FastAPI service providing unified OpenAI-compatible API for
Ollama and cloud LLM providers (OpenRouter, Groq, Together).

Features:
- Chat completions with streaming (SSE)
- Vision/multimodal support
- Embeddings generation
- Multi-provider routing (provider/model format)
- Prometheus metrics
- Optional Redis caching
2026-01-29 22:01:00 +01:00