In commit c9e16243c (the gemma3:4b → gemma4:e4b switch) I sloppily
wrote in the ManaServerBackend docstring that mana-llm "routes them
to the local Ollama instance on the Mac Mini (running on the M4's
Metal GPU)". That is wrong AND it's the exact misconception I had
to debug my way out of earlier the same day.
The actual topology (already documented correctly in
docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md; I
just didn't read those before writing the docstring):
mana-llm container's OLLAMA_URL points at host.docker.internal:13434
→ ~/gpu-proxy.py (Python TCP forwarder, LaunchAgent on Mac Mini)
→ 192.168.178.11:11434 (LAN)
→ Ollama on the Windows GPU server (RTX 3090, 24 GB VRAM)
→ Inference
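To be explicit about the middle hop: gpu-proxy.py is a dumb byte
forwarder, nothing more. A minimal Node/TypeScript transliteration,
for illustration only (the real script is Python and is not quoted
here):

    import net from 'node:net';

    // What the LaunchAgent forwarder on the Mac Mini does: listen
    // where mana-llm's OLLAMA_URL points, and pipe bytes unmodified
    // to Ollama on the Windows GPU server.
    const UPSTREAM_HOST = '192.168.178.11'; // Windows GPU server (LAN)
    const UPSTREAM_PORT = 11434;            // Ollama's default port

    net.createServer((client) => {
      const upstream = net.connect(UPSTREAM_PORT, UPSTREAM_HOST);
      client.pipe(upstream);
      upstream.pipe(client);
      client.on('error', () => upstream.destroy());
      upstream.on('error', () => client.destroy());
    }).listen(13434); // host.docker.internal:13434 from the container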
The Mac Mini's brew-installed Ollama binary is NOT on the inference
path. It's just a CLI for inspecting the proxied daemon. Today's
"why does the Mac Mini still have Ollama 0.15.4" puzzle has the
answer "because nothing on the Mac Mini actually runs inference, the
binary version was never load-bearing".
Two doc fixes:
1. packages/shared-llm/src/backends/mana-server.ts
Replace the lying docstring with the real topology, including a
pointer to the two MAC_MINI_SERVER.md / WINDOWS_GPU_SERVER_SETUP.md
sections that document it (a paraphrased sketch of the new
docstring appears at the end of this message). Also note that
gemma4:e4b is a
reasoning model that emits message.reasoning when given enough
tokens (cross-reference to remote.ts's fallback parser).
2. packages/local-llm/CLAUDE.md
Add a paragraph at the top explaining the difference between
"@mana/local-llm" (browser tier, on-device) and the @mana/shared-llm
"mana-server" / "cloud" tiers (services/mana-llm proxy → gpu-proxy.py
→ RTX 3090). The old doc only said "not related to
services/mana-llm" without saying where mana-server actually goes;
future me reading it would still have to dig through
the docker-compose env to find out.
No code changes — only docstring + markdown.
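For reference, the corrected docstring reads roughly like this
(paraphrased sketch, not the literal diff):

    /**
     * Backend for the "mana-server" tier.
     *
     * Requests do NOT run on the Mac Mini. The mana-llm container's
     * OLLAMA_URL points at host.docker.internal:13434, where
     * ~/gpu-proxy.py (a LaunchAgent TCP forwarder on the Mac Mini)
     * relays to Ollama on the Windows GPU server at
     * 192.168.178.11:11434 (RTX 3090, 24 GB VRAM). Topology docs:
     * docs/MAC_MINI_SERVER.md, docs/WINDOWS_GPU_SERVER_SETUP.md.
     *
     * gemma4:e4b is a reasoning model: given enough tokens it emits
     * message.reasoning, which remote.ts's fallback parser handles.
     */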
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add packages/local-llm/CLAUDE.md as the package-level reference for
browser-local LLM inference. The package went through a non-trivial
engine swap from WebLLM/Qwen to transformers.js/Gemma 4 E2B on
2026-04-08, and the bring-up surfaced enough sharp edges that the
next person (or AI agent) touching this code will save real time
having them written down in one place rather than re-discovering
them error by error.
Captured topics:
- What the package is, what library/model is currently used, and
the deliberate engine-agnostic API surface that lets future swaps
stay contained to this package.
- Why we chose transformers.js + Gemma 4 over staying on WebLLM
(MLC compilation lag for new model architectures) and what the
return path looks like once MLC ships Gemma 4 builds.
- The seven CSP additions that browser-local inference needs and
WHY each one is required (sketched in code after this list):
* script-src: 'wasm-unsafe-eval', cdn.jsdelivr.net, blob:
* connect-src: huggingface.co + *.huggingface.co + cdn-lfs-*,
*.hf.co + cas-bridge.xethub.hf.co (XET CDN),
cdn.jsdelivr.net (for the WASM preload fetch)
Including the subtle "jsDelivr is needed in BOTH script-src and
connect-src" trap that produces identical-looking error messages
for two distinct underlying causes.
- The Vite SSR module-cache gotcha: CSP additions made in
packages/shared-utils/security-headers.ts do NOT hot-reload across
the workspace package boundary, while additions made directly in
apps/mana/apps/web/src/hooks.server.ts do. Includes the diagnostic
pattern (compare which additions show up in the next CSP error
vs which don't) and the workaround (move them into hooks.server.ts
via setSecurityHeaders options; see the hooks.server.ts sketch
after this list).
- The two-step tokenization pattern that's mandatory for
Gemma4Processor (sketched after this list):
apply_chat_template(tokenize:false) → string, then
processor.tokenizer(text, return_tensors:'pt'). The collapsed
apply_chat_template(return_dict:true) path looks shorter but
produces a malformed input shape and crashes model.generate() deep
inside the forward pass with "Cannot read properties of null
(reading 'dims')" — opaque from the call site.
- The transformers.js v4 quirk that model.generate() returns null
(not a tensor) when a TextStreamer is attached (see the streamer
sketch after this list). The streamer is
the only stable text channel; the engine always attaches one and
uses the streamer's collected text as the canonical output, with
a chars/4 fallback for token counts.
- API surface (Svelte 5 example), how to add a new model to the
registry, deploy notes (no base image rebuild needed for local-llm
changes alone, but IS needed if shared-utils CSP defaults change),
browser cache semantics, and hard browser support requirements
(WebGPU, ~1.5–2 GB VRAM for E2B q4f16, no CPU/WASM fallback).
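Sketches for the trickier items above, all illustrative rather than
verbatim. First the CSP additions; the object shape is assumed (the
real option types live in shared-utils and aren't quoted here), the
source lists are the ones from the doc:

    // Shape is illustrative; only the sources themselves come from
    // the CLAUDE.md being added here.
    const localLlmCsp: Record<string, string[]> = {
      'script-src': ["'wasm-unsafe-eval'", 'cdn.jsdelivr.net', 'blob:'],
      'connect-src': [
        'huggingface.co',
        '*.huggingface.co',  // the cdn-lfs-* download hosts
        '*.hf.co',           // cas-bridge.xethub.hf.co (XET CDN)
        'cdn.jsdelivr.net',  // WASM preload fetch; jsDelivr must appear
                             // in BOTH script-src and connect-src
      ],
    };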
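Second, the SSR module-cache workaround. `extraCsp` is a
hypothetical option name standing in for whatever setSecurityHeaders
actually accepts; the point is only WHERE the additions live:

    // apps/mana/apps/web/src/hooks.server.ts (sketch). Additions made
    // here hot-reload; the same additions inside packages/shared-utils
    // do not cross the workspace package boundary until the Vite SSR
    // module cache is thrown away.
    import type { Handle } from '@sveltejs/kit';
    import { setSecurityHeaders } from '@mana/shared-utils'; // path assumed

    export const handle: Handle = async ({ event, resolve }) => {
      const response = await resolve(event);
      // Signature assumed; localLlmCsp is the object sketched above.
      setSecurityHeaders(response, { extraCsp: localLlmCsp });
      return response;
    };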
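Third, the two-step tokenization, written against the transformers.js
v4 API as the doc characterizes it (model id and import names are
assumptions, not the registry's actual values):

    import { AutoProcessor, AutoModelForCausalLM } from '@huggingface/transformers';

    const modelId = 'onnx-community/gemma-4-e2b'; // illustrative id
    const processor = await AutoProcessor.from_pretrained(modelId);
    const model = await AutoModelForCausalLM.from_pretrained(modelId);

    const messages = [{ role: 'user', content: 'Hello!' }];

    // Step 1: render the chat template to a plain STRING.
    const text = processor.apply_chat_template(messages, {
      tokenize: false,
      add_generation_prompt: true,
    });

    // Step 2: tokenize that string separately.
    const inputs = processor.tokenizer(text, { return_tensors: 'pt' });

    // Do NOT collapse the two steps into
    // apply_chat_template({ return_dict: true }): the input shape comes
    // out malformed and generate() dies deep in the forward pass with
    // "Cannot read properties of null (reading 'dims')".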
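Finally, the streamer-as-canonical-output pattern, continuing the
sketch above. Under the v4 quirk, generate() resolves to null
whenever a streamer is attached, so the return value is never read
for text:

    import { TextStreamer } from '@huggingface/transformers';

    let collected = '';
    const streamer = new TextStreamer(processor.tokenizer, {
      skip_prompt: true,
      callback_function: (chunk: string) => {
        collected += chunk; // the only stable text channel
      },
    });

    // Resolves to null while a streamer is attached; ignore it.
    await model.generate({ ...inputs, max_new_tokens: 256, streamer });

    const outputText = collected;                              // canonical output
    const approxTokenCount = Math.ceil(outputText.length / 4); // chars/4 fallback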
Also link to the new doc from the root CLAUDE.md Shared Packages
table so people land on it from the standard discovery path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>