diff --git a/packages/local-llm/CLAUDE.md b/packages/local-llm/CLAUDE.md
index 56383a671..9d40b68d0 100644
--- a/packages/local-llm/CLAUDE.md
+++ b/packages/local-llm/CLAUDE.md
@@ -1,6 +1,8 @@
 # `@mana/local-llm` — Browser-Local LLM Inference
 
-Client-side LLM inference that runs **entirely in the user's browser** via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by `/llm-test` (developer tool) and the `playground` module in `apps/mana/apps/web`. Not related to `services/mana-llm` (which is the server-side LLM proxy that talks to Ollama, OpenAI, etc.).
+Client-side LLM inference that runs **entirely in the user's browser** via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by `/llm-test` (developer tool) and the `playground` module in `apps/mana/apps/web`.
+
+**Don't confuse this with the server-side LLM** (`services/mana-llm`). The server-side proxy is what backs the **`mana-server`** and **`cloud`** tiers in `@mana/shared-llm`'s tiered orchestrator — it speaks OpenAI-compatible HTTP and routes to a configured Ollama instance or to Gemini. The Ollama instance is **not** the Mac Mini's local Ollama: traffic goes via `~/gpu-proxy.py` (a Python TCP forwarder running as a LaunchAgent on the Mac Mini host) to the Windows GPU server's Ollama at `192.168.178.11:11434`, where inference runs on the **RTX 3090**. See `docs/MAC_MINI_SERVER.md` and `docs/WINDOWS_GPU_SERVER_SETUP.md` for the full topology. This package (`@mana/local-llm`) is the **only** path that uses the user's own device — `mana-server` and `cloud` both leave the device.
 
 ## What's currently in the box
 
diff --git a/packages/shared-llm/src/backends/mana-server.ts b/packages/shared-llm/src/backends/mana-server.ts
index 474279e54..4ee6dd47e 100644
--- a/packages/shared-llm/src/backends/mana-server.ts
+++ b/packages/shared-llm/src/backends/mana-server.ts
@@ -1,19 +1,38 @@
 /**
  * Mana-server backend — calls services/mana-llm with an Ollama model
  * string. mana-llm's ProviderRouter recognizes plain Ollama model names
- * (no provider prefix) and routes them to the local Ollama instance on
- * the Mac Mini (running on the M4's Metal GPU), with automatic Gemini
- * fallback if Ollama is overloaded.
+ * (no provider prefix) and routes them to its configured Ollama
+ * instance, with automatic Google Gemini fallback if Ollama is
+ * overloaded.
+ *
+ * Where the inference actually runs (subtle, easy to misread):
+ *
+ * The mana-llm container's `OLLAMA_URL` points at
+ * `host.docker.internal:13434`. That is NOT the Mac Mini's local
+ * Ollama — it's a Python TCP forwarder (`~/gpu-proxy.py`, running
+ * as a LaunchAgent on the Mac Mini host) that pipes the traffic to
+ * `192.168.178.11:11434` over the LAN, where Ollama is running on
+ * the Windows GPU server with the RTX 3090 (24 GB VRAM). All
+ * inference happens there, not on the Mac Mini's M4 Metal GPU.
+ *
+ * See docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md
+ * (specifically the "Auf dem Mac Mini läuft ein TCP-Proxy" section)
+ * for the full topology. The Mac Mini's brew-installed Ollama
+ * binary is NOT on the inference path — it's just a local CLI for
+ * inspecting the proxied daemon.
  *
  * The default model is gemma4:e4b — Google's Gemma 4 "Effective 4B"
  * variant, released 2026-04-02. Same family as @mana/local-llm's
  * browser tier model (Gemma 4 E2B is the smaller sibling) so prompts
  * behave consistently when a task auto-falls between tiers. e4b is
  * the right Mana-Server default because:
- * - 9.6 GB on disk fits comfortably on the M4's 16 GB unified memory
+ * - 9.6 GB on disk fits comfortably on the 3090's 24 GB VRAM
  * - 128K context window covers all current title/summarize tasks
  * - The "Effective 4B" architecture punches well above its weight
  *   class (better than gemma3:4b on most German prompts)
+ * - It's a reasoning model — uses message.reasoning for chain-of-
+ *   thought when given enough max_tokens budget; remote.ts has a
+ *   fallback parser for that field
  * - The tier name we surface in the source label stays "Gemma 4"
  *   family for both browser and mana-server, so the UX is coherent
  */
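
Reviewer note on the `gpu-proxy.py` hop both hunks describe, since it is easy to picture wrong: below is a minimal sketch of what such a forwarder does, written in TypeScript (Node) purely for illustration. The real forwarder is the Python LaunchAgent script on the Mac Mini; only the two addresses (`host.docker.internal:13434` on the Mini, `192.168.178.11:11434` on the Windows box) come from the diff, everything else is assumed.

```ts
// Illustrative only: the real hop is ~/gpu-proxy.py (Python) on the Mac
// Mini. Same idea in Node: accept connections on the port the mana-llm
// container dials via host.docker.internal:13434, and pipe the raw bytes
// to the Windows GPU server's Ollama.
import * as net from "node:net";

const UPSTREAM_HOST = "192.168.178.11"; // Windows GPU server (RTX 3090)
const UPSTREAM_PORT = 11434;            // Ollama on the Windows box
const LISTEN_PORT = 13434;              // what the mana-llm container dials

net
  .createServer((client) => {
    const upstream = net.connect(UPSTREAM_PORT, UPSTREAM_HOST);
    client.pipe(upstream); // request bytes out to the GPU server
    upstream.pipe(client); // response bytes back to the container
    client.on("error", () => upstream.destroy());
    upstream.on("error", () => client.destroy());
  })
  .listen(LISTEN_PORT);
```

Because the proxy is a dumb byte pipe with no HTTP parsing, nothing in mana-llm had to change when inference moved off the Mac Mini's Metal GPU; only the comments above were stale.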
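The routing claim in the comment (plain Ollama model names go to Ollama, with automatic Gemini fallback on overload) lives in mana-llm's ProviderRouter, which this diff does not touch. A hedged sketch of that shape, where `callOllama`, `callGemini`, and the overload test are hypothetical stand-ins rather than actual mana-llm code:

```ts
// Hypothetical sketch of the described routing rule, not mana-llm's code.
// A plain Ollama model name such as "gemma4:e4b" carries no provider
// prefix; a prefixed name would be dispatched to the named provider.
type Chat = { model: string; messages: { role: string; content: string }[] };

async function route(
  req: Chat,
  callOllama: (r: Chat) => Promise<string>,
  callGemini: (r: Chat) => Promise<string>,
): Promise<string> {
  const plainOllamaName = !req.model.includes("/"); // e.g. "gemma4:e4b"
  if (plainOllamaName) {
    try {
      return await callOllama(req);
    } catch (err) {
      // Assumed overload signal; the real check in ProviderRouter may
      // look at HTTP 503s, queue depth, or timeouts instead.
      return callGemini(req);
    }
  }
  return callGemini(req);
}
```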
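The new reasoning-model bullet leans on `message.reasoning` and notes that remote.ts has a fallback parser for it. A minimal sketch of what consuming that field can look like, assuming the OpenAI-compatible message shape the proxy speaks; the inline `<think>` tag branch is an illustrative assumption, not necessarily what remote.ts implements:

```ts
// Assistant message in an OpenAI-compatible chat response. Reasoning
// models such as gemma4:e4b may fill `reasoning` with chain-of-thought
// when the max_tokens budget leaves room for it.
interface AssistantMessage {
  role: "assistant";
  content: string | null;
  reasoning?: string; // dedicated chain-of-thought field, may be absent
}

// Split a completion into the user-visible answer and an optional trace.
// The <think>...</think> branch is illustrative: some runtimes inline the
// trace in the content instead of using a dedicated field.
function splitReasoning(
  msg: AssistantMessage,
): { answer: string; reasoning?: string } {
  const text = msg.content ?? "";
  if (msg.reasoning) {
    return { answer: text, reasoning: msg.reasoning };
  }
  const inline = /^<think>([\s\S]*?)<\/think>\s*/.exec(text);
  if (inline) {
    return { answer: text.slice(inline[0].length), reasoning: inline[1].trim() };
  }
  return { answer: text };
}
```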