Commit graph

2 commits

Author SHA1 Message Date
Till JS
92f8221bfd docs(shared-llm): correct the mana-server tier topology in code + CLAUDE.md
In commit c9e16243c (the gemma3:4b → gemma4:e4b switch) I sloppily
wrote in the ManaServerBackend docstring that mana-llm "routes them
to the local Ollama instance on the Mac Mini (running on the M4's
Metal GPU)". That is wrong AND it's the exact misconception I had
to debug-out-of earlier the same day.

The actual topology is already documented correctly in
docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md — I
just didn't read those before writing the docstring:

  mana-llm container's OLLAMA_URL points at host.docker.internal:13434
  → ~/gpu-proxy.py (Python TCP forwarder, LaunchAgent on Mac Mini)
  → 192.168.178.11:11434 (LAN)
  → Ollama on the Windows GPU server (RTX 3090, 24 GB VRAM)
  → Inference

The Mac Mini's brew-installed Ollama binary is NOT on the inference
path. It's just a CLI for inspecting the proxied daemon. Today's
"why does the Mac Mini still have Ollama 0.15.4" puzzle has the
answer "because nothing on the Mac Mini actually runs inference, the
binary version was never load-bearing".
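
For illustration, a docstring along these lines captures the real
path (a sketch only; the exact wording committed in mana-server.ts
may differ):

  /**
   * ManaServerBackend: the "mana-server" tier of @mana/shared-llm.
   *
   * Requests go to the services/mana-llm container, whose OLLAMA_URL
   * points at host.docker.internal:13434. That port is ~/gpu-proxy.py,
   * a LaunchAgent TCP forwarder on the Mac Mini, which forwards to
   * Ollama on the Windows GPU server at 192.168.178.11:11434
   * (RTX 3090, 24 GB VRAM). No inference runs on the Mac Mini itself;
   * its brew-installed Ollama binary is only a CLI for inspecting the
   * proxied daemon.
   *
   * See docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md.
   *
   * gemma4:e4b is a reasoning model and emits message.reasoning when
   * given enough tokens; see the fallback parser in remote.ts.
   */
  export class ManaServerBackend { /* ... */ }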

Two doc fixes:

1. packages/shared-llm/src/backends/mana-server.ts
   Replace the lying docstring with the real topology, including a
   pointer to the two MAC_MINI_SERVER.md / WINDOWS_GPU_SERVER_SETUP.md
   sections that document it. Also note that gemma4:e4b is a
   reasoning model that emits message.reasoning when given enough
   tokens (cross-reference to remote.ts's fallback parser).

2. packages/local-llm/CLAUDE.md
   Add a paragraph at the top explaining the difference between
   "@mana/local-llm" (browser tier, on-device) and the @mana/shared-llm
   "mana-server" / "cloud" tiers (services/mana-llm proxy → gpu-proxy.py
   → RTX 3090). The old text said only "not related to
   services/mana-llm", not where mana-server traffic actually
   goes. Future me reading the doc would still have to dig through
   the docker-compose env to find out.

No code changes — only docstring + markdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:40:34 +02:00
Till JS
7901a5df2f docs(local-llm): document browser-local LLM stack and CSP requirements
Add packages/local-llm/CLAUDE.md as the package-level reference for
browser-local LLM inference. The package went through a non-trivial
engine swap from WebLLM/Qwen to transformers.js/Gemma 4 E2B on
2026-04-08, and the bring-up surfaced enough sharp edges that the
next person (or AI agent) touching this code will save real time
by having them written down in one place rather than re-discovering
them error by error.

Captured topics:

- What the package is, what library/model is currently used, and
  the deliberate engine-agnostic API surface that lets future swaps
  stay contained to this package.

- Why we chose transformers.js + Gemma 4 over staying on WebLLM
  (MLC compilation lag for new model architectures) and what the
  return path looks like once MLC ships Gemma 4 builds.

- The seven CSP directives that browser-local inference needs and
  WHY each one is required (see the sketch after this list):
    * script-src: 'wasm-unsafe-eval', cdn.jsdelivr.net, blob:
    * connect-src: huggingface.co + *.huggingface.co + cdn-lfs-*,
                   *.hf.co + cas-bridge.xethub.hf.co (XET CDN),
                   cdn.jsdelivr.net (for the WASM preload fetch)
  Including the subtle "jsDelivr is needed in BOTH script-src and
  connect-src" trap that produces identical-looking error messages
  for two distinct underlying causes.

- The Vite SSR module-cache gotcha: CSP additions made in
  packages/shared-utils/security-headers.ts do NOT hot-reload across
  the workspace package boundary, while additions made directly in
  apps/mana/apps/web/src/hooks.server.ts do. Includes the diagnostic
  pattern (compare which additions show up in the next CSP error
  vs which don't) and the workaround (move them into hooks.server.ts
  via setSecurityHeaders options; sketched after this list).

- The two-step tokenization pattern that's mandatory for
  Gemma4Processor (sketched after this list):
  apply_chat_template(tokenize:false) → string, then
  processor.tokenizer(text, return_tensors:'pt'). The collapsed
  apply_chat_template(return_dict:true) path looks shorter but
  produces a malformed input shape and crashes model.generate() deep
  inside the forward pass with "Cannot read properties of null
  (reading 'dims')" — opaque from the call site.

- The transformers.js v4 quirk that model.generate() returns null
  (not a tensor) when a TextStreamer is attached. The streamer is
  the only stable text channel; the engine always attaches one and
  uses the streamer's collected text as the canonical output, with
  a chars/4 fallback for token counts.

- API surface (Svelte 5 example), how to add a new model to the
  registry, deploy notes (no base image rebuild needed for local-llm
  changes alone, but IS needed if shared-utils CSP defaults change),
  browser cache semantics, and hard browser support requirements
  (WebGPU, ~1.5–2 GB VRAM for E2B q4f16, no CPU/WASM fallback).
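
To make the CSP and hot-reload bullets concrete, a rough sketch of
the hooks.server.ts shape. Only setSecurityHeaders and the listed
origins come from this commit; the import path, option names
(extraScriptSrc / extraConnectSrc), and exact call signature are
illustrative assumptions; the real options API lives in
packages/shared-utils/security-headers.ts:

  // apps/mana/apps/web/src/hooks.server.ts
  import type { Handle } from '@sveltejs/kit';
  // Import path is illustrative; the helper lives in
  // packages/shared-utils/security-headers.ts.
  import { setSecurityHeaders } from '@mana/shared-utils/security-headers';

  export const handle: Handle = async ({ event, resolve }) => {
    const response = await resolve(event);
    // Added here (not in the shared-utils defaults) so CSP changes
    // hot-reload in dev; edits behind the workspace package boundary
    // don't invalidate Vite's SSR module cache.
    setSecurityHeaders(response.headers, {
      // Option names below are assumptions, not the real options API.
      extraScriptSrc: ["'wasm-unsafe-eval'", 'https://cdn.jsdelivr.net', 'blob:'],
      extraConnectSrc: [
        'https://huggingface.co',
        'https://*.huggingface.co',   // incl. cdn-lfs-* LFS buckets
        'https://*.hf.co',            // cas-bridge.xethub.hf.co (XET CDN)
        'https://cdn.jsdelivr.net',   // WASM preload fetch (needed here too)
      ],
    });
    return response;
  };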
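
And a condensed sketch of the generation path from the two
tokenization/streaming bullets. Identifiers and option spellings
follow this commit message (processor.tokenizer, tokenize:false,
return_tensors:'pt', TextStreamer); the model id, the
AutoModelForCausalLM class choice, and the device/dtype options are
illustrative assumptions, not the package's actual code:

  import {
    AutoProcessor,
    AutoModelForCausalLM,
    TextStreamer,
  } from '@huggingface/transformers';

  // Illustrative id/options; the real entries live in the model registry.
  const modelId = 'onnx-community/gemma-4-e2b-it';
  const processor = await AutoProcessor.from_pretrained(modelId);
  const model = await AutoModelForCausalLM.from_pretrained(modelId, {
    device: 'webgpu',
    dtype: 'q4f16',
  });

  const messages = [{ role: 'user', content: 'Hello!' }];

  // Step 1: chat template to a plain string (tokenize: false).
  const text = processor.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
  });

  // Step 2: tokenize the string separately. The collapsed
  // apply_chat_template(return_dict: true) shortcut yields a malformed
  // input shape and generate() later dies with "Cannot read properties
  // of null (reading 'dims')".
  const inputs = processor.tokenizer(text, { return_tensors: 'pt' });

  // The streamer is the only stable text channel: with a TextStreamer
  // attached, generate() resolves to null instead of an output tensor.
  let output = '';
  const streamer = new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    callback_function: (chunk: string) => { output += chunk; },
  });

  await model.generate({ ...inputs, max_new_tokens: 512, streamer });

  // No output tensor to count tokens from, so approximate with chars/4.
  const approxCompletionTokens = Math.ceil(output.length / 4);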

Also link to the new doc from the root CLAUDE.md Shared Packages
table so people land on it from the standard discovery path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:27:50 +02:00