mirror of https://github.com/Memo-2023/mana-monorepo.git (synced 2026-05-14 20:01:09 +02:00)
docs(shared-llm): correct the mana-server tier topology in code + CLAUDE.md
In commit c9e16243c (the gemma3:4b → gemma4:e4b switch) I sloppily
wrote in the ManaServerBackend docstring that mana-llm "routes them
to the local Ollama instance on the Mac Mini (running on the M4's
Metal GPU)". That is wrong AND it's the exact misconception I had
to debug-out-of earlier the same day.
The actual topology (already documented correctly in
docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md; I
just hadn't read those before writing the docstring) is:
mana-llm container's OLLAMA_URL points at host.docker.internal:13434
→ ~/gpu-proxy.py (Python TCP forwarder, LaunchAgent on Mac Mini)
→ 192.168.178.11:11434 (LAN)
→ Ollama on the Windows GPU server (RTX 3090, 24 GB VRAM)
→ Inference
The Mac Mini's brew-installed Ollama binary is NOT on the inference
path. It's just a CLI for inspecting the proxied daemon. Today's
"why does the Mac Mini still have Ollama 0.15.4" puzzle has the
answer "because nothing on the Mac Mini actually runs inference, the
binary version was never load-bearing".
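
For anyone tracing this later: the forwarder in the middle of that chain is
deliberately dumb. The real ~/gpu-proxy.py is a Python script and is not
touched by this commit; the sketch below only illustrates the same byte-level
forwarding, written in TypeScript/Node to match the rest of the monorepo, with
the listen port 13434 and the target 192.168.178.11:11434 taken from the chain
above.

```ts
// Illustration only: the actual forwarder is ~/gpu-proxy.py (Python,
// run as a LaunchAgent on the Mac Mini). This sketch shows the same
// behaviour: accept on 13434 and pipe raw bytes to the Windows box.
import { createServer, connect } from "node:net";

const LISTEN_PORT = 13434;            // what mana-llm's OLLAMA_URL resolves to
const TARGET_HOST = "192.168.178.11"; // Windows GPU server (RTX 3090)
const TARGET_PORT = 11434;            // Ollama's default port

createServer((client) => {
  const upstream = connect(TARGET_PORT, TARGET_HOST);
  // Raw TCP piping in both directions: no HTTP parsing, no buffering logic.
  client.pipe(upstream);
  upstream.pipe(client);
  // If either side drops, tear down the other.
  client.on("error", () => upstream.destroy());
  upstream.on("error", () => client.destroy());
}).listen(LISTEN_PORT, () => {
  console.log(`forwarding :${LISTEN_PORT} -> ${TARGET_HOST}:${TARGET_PORT}`);
});
```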
Two doc fixes:
1. packages/shared-llm/src/backends/mana-server.ts
Replace the lying docstring with the real topology, including a
pointer to the two MAC_MINI_SERVER.md / WINDOWS_GPU_SERVER_SETUP.md
sections that document it. Also note that gemma4:e4b is a
reasoning model that emits message.reasoning when given enough
tokens (cross-reference to remote.ts's fallback parser).
2. packages/local-llm/CLAUDE.md
Add a paragraph at the top explaining the difference between
"@mana/local-llm" (browser tier, on-device) and the @mana/shared-llm
"mana-server" / "cloud" tiers (services/mana-llm proxy → gpu-proxy.py
→ RTX 3090). The old text hinted at this ("not related to
services/mana-llm") but never said where mana-server actually
goes; future me reading the doc would still have to dig through
the docker-compose env to find out. (A quick tier-to-hardware
sketch follows below.)
No code changes — only docstring + markdown.
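
For orientation (not part of the commit): the tier-to-hardware mapping that
fix 2 spells out, as an illustrative snippet. The tier names follow the docs;
the real @mana/shared-llm types may be shaped differently.

```ts
// Purely illustrative; not copied from @mana/shared-llm.
type Tier = "browser" | "mana-server" | "cloud";

const runsOn: Record<Tier, string> = {
  // @mana/local-llm: WebGPU in the user's browser, nothing leaves the device
  browser: "user's own device",
  // services/mana-llm -> host.docker.internal:13434 -> ~/gpu-proxy.py -> 192.168.178.11:11434
  "mana-server": "Ollama on the Windows GPU server (RTX 3090), NOT the Mac Mini",
  // services/mana-llm falls back to Google Gemini
  cloud: "Google Gemini",
};

console.log(runsOn["mana-server"]);
```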
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 3993400013
commit 92f8221bfd

2 changed files with 26 additions and 5 deletions
packages/local-llm/CLAUDE.md
@@ -1,6 +1,8 @@
 # `@mana/local-llm` — Browser-Local LLM Inference
 
-Client-side LLM inference that runs **entirely in the user's browser** via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by `/llm-test` (developer tool) and the `playground` module in `apps/mana/apps/web`. Not related to `services/mana-llm` (which is the server-side LLM proxy that talks to Ollama, OpenAI, etc.).
+Client-side LLM inference that runs **entirely in the user's browser** via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by `/llm-test` (developer tool) and the `playground` module in `apps/mana/apps/web`.
+
+**Don't confuse this with the server-side LLM** (`services/mana-llm`). The server-side proxy is what backs the **`mana-server`** and **`cloud`** tiers in `@mana/shared-llm`'s tiered orchestrator — it speaks OpenAI-compatible HTTP and routes to a configured Ollama instance or to Gemini. The Ollama instance is **not** the Mac Mini's local Ollama: traffic goes via `~/gpu-proxy.py` (a Python TCP forwarder running as a LaunchAgent on the Mac Mini host) to the Windows GPU server's Ollama at `192.168.178.11:11434`, where inference runs on the **RTX 3090**. See `docs/MAC_MINI_SERVER.md` and `docs/WINDOWS_GPU_SERVER_SETUP.md` for the full topology. This package (`@mana/local-llm`) is the **only** path that uses the user's own device — `mana-server` and `cloud` both leave the device.
 
 ## What's currently in the box

packages/shared-llm/src/backends/mana-server.ts
@@ -1,19 +1,38 @@
 /**
  * Mana-server backend — calls services/mana-llm with an Ollama model
  * string. mana-llm's ProviderRouter recognizes plain Ollama model names
- * (no provider prefix) and routes them to the local Ollama instance on
- * the Mac Mini (running on the M4's Metal GPU), with automatic Gemini
- * fallback if Ollama is overloaded.
+ * (no provider prefix) and routes them to its configured Ollama
+ * instance, with automatic Google Gemini fallback if Ollama is
+ * overloaded.
  *
+ * Where the inference actually runs (subtle, easy to misread):
+ *
+ * mana-llm container's `OLLAMA_URL` points at
+ * `host.docker.internal:13434`. That is NOT the Mac Mini's local
+ * Ollama — it's a Python TCP forwarder (`~/gpu-proxy.py`, running
+ * as a LaunchAgent on the Mac Mini host) that pipes the traffic to
+ * `192.168.178.11:11434` over the LAN, where Ollama is running on
+ * the Windows GPU server with the RTX 3090 (24 GB VRAM). All
+ * inference happens there, not on the Mac Mini's M4 Metal GPU.
+ *
+ * See docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md
+ * (specifically the "Auf dem Mac Mini läuft ein TCP-Proxy" section)
+ * for the full topology. The Mac Mini's brew-installed Ollama
+ * binary is NOT on the inference path — it's just a local CLI for
+ * inspecting the proxied daemon.
+ *
  * The default model is gemma4:e4b — Google's Gemma 4 "Effective 4B"
  * variant, released 2026-04-02. Same family as @mana/local-llm's
  * browser tier model (Gemma 4 E2B is the smaller sibling) so prompts
  * behave consistently when a task auto-falls between tiers. e4b is
  * the right Mana-Server default because:
- * - 9.6 GB on disk fits comfortably on the M4's 16 GB unified memory
+ * - 9.6 GB on disk fits comfortably on the 3090's 24 GB VRAM
  * - 128K context window covers all current title/summarize tasks
  * - The "Effective 4B" architecture punches well above its weight
  *   class (better than gemma3:4b on most German prompts)
+ * - It's a reasoning model — uses message.reasoning for chain-of-
+ *   thought when given enough max_tokens budget; remote.ts has a
+ *   fallback parser for that field
  * - The tier name we surface in the source label stays "Gemma 4"
  *   family for both browser and mana-server, so the UX is coherent
  */
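
For context on that reasoning bullet: remote.ts is untouched by this commit,
so its fallback parser isn't shown here. Roughly, handling the field looks
like the hypothetical sketch below; the ChatMessage / ChatResponse shapes and
the splitReasoning helper are assumptions for illustration, not code from
remote.ts.

```ts
// Hypothetical sketch, NOT the actual remote.ts code. Assumes an
// Ollama-style chat response where reasoning models like gemma4:e4b
// may (or may not) populate `message.reasoning`.
interface ChatMessage {
  role: "assistant" | "user" | "system";
  content: string;
  reasoning?: string; // present only when the model had token budget to think
}

interface ChatResponse {
  message: ChatMessage;
}

// Split the visible answer from the optional chain-of-thought, so callers
// can surface or ignore the reasoning independently of the answer text.
function splitReasoning(res: ChatResponse): { answer: string; reasoning: string | null } {
  const { content, reasoning } = res.message;
  return {
    answer: content,
    reasoning: reasoning && reasoning.trim() !== "" ? reasoning : null,
  };
}
```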