Till JS 92f8221bfd docs(shared-llm): correct the mana-server tier topology in code + CLAUDE.md
In commit c9e16243c (the gemma3:4b → gemma4:e4b switch) I sloppily
wrote in the ManaServerBackend docstring that mana-llm "routes them
to the local Ollama instance on the Mac Mini (running on the M4's
Metal GPU)". That is wrong AND it's the exact misconception I had
to debug my way out of earlier the same day.

The actual topology (already documented correctly in
docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md; I
just didn't read those before writing the docstring):

  mana-llm container's OLLAMA_URL points at host.docker.internal:13434
  → ~/gpu-proxy.py (Python TCP forwarder, LaunchAgent on Mac Mini)
  → 192.168.178.11:11434 (LAN)
  → Ollama on the Windows GPU server (RTX 3090, 24 GB VRAM)
  → Inference

The Mac Mini's brew-installed Ollama binary is NOT on the inference
path. It's just a CLI for inspecting the proxied daemon. Today's
"why does the Mac Mini still have Ollama 0.15.4" puzzle has the
answer "because nothing on the Mac Mini actually runs inference, the
binary version was never load-bearing".

Two doc fixes:

1. packages/shared-llm/src/backends/mana-server.ts
   Replace the lying docstring with the real topology, including a
   pointer to the two MAC_MINI_SERVER.md / WINDOWS_GPU_SERVER_SETUP.md
   sections that document it. Also note that gemma4:e4b is a
   reasoning model that emits message.reasoning when given enough
   tokens (cross-reference to remote.ts's fallback parser).

2. packages/local-llm/CLAUDE.md
   Add a paragraph at the top explaining the difference between
   "@mana/local-llm" (browser tier, on-device) and the @mana/shared-llm
   "mana-server" / "cloud" tiers (services/mana-llm proxy → gpu-proxy.py
   → RTX 3090). This was implicit before — "not related to
   services/mana-llm" — but didn't say where mana-server actually
   goes. Future me reading the doc would still have to dig through
   the docker-compose env to find out.

No code changes — only docstring + markdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:40:34 +02:00


@mana/local-llm — Browser-Local LLM Inference

Client-side LLM inference that runs entirely in the user's browser via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by /llm-test (developer tool) and the playground module in apps/mana/apps/web.

Don't confuse this with the server-side LLM (services/mana-llm). The server-side proxy is what backs the mana-server and cloud tiers in @mana/shared-llm's tiered orchestrator — it speaks OpenAI-compatible HTTP and routes to a configured Ollama instance or to Gemini. The Ollama instance is not the Mac Mini's local Ollama: traffic goes via ~/gpu-proxy.py (a Python TCP forwarder running as a LaunchAgent on the Mac Mini host) to the Windows GPU server's Ollama at 192.168.178.11:11434, where inference runs on the RTX 3090. See docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md for the full topology. This package (@mana/local-llm) is the only path that uses the user's own device — mana-server and cloud both leave the device.

What's currently in the box

Field            Value
Engine library   @huggingface/transformers v4 (transformers.js)
Backend          WebGPU (mandatory; no WASM/CPU fallback enabled)
Default model    onnx-community/gemma-4-E2B-it-ONNX (Google Gemma 4 E2B, q4f16, ~500 MB on disk)
Quantization     4-bit weights, fp16 activations (q4f16)
API surface      LocalLLMEngine class + Svelte 5 reactive bindings (getLocalLlmStatus, loadLocalLlm, generate, extractJson, classify)

The exposed LocalLLMEngine API is intentionally engine-agnostic — load(), generate(), prompt(), extractJson(), classify(), onStatusChange(), isSupported(). Caller code (the test page, the playground module) does not know whether the underlying engine is WebLLM, transformers.js, or anything else. This was deliberate so future engine swaps don't ripple outward.
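
For orientation, here is that surface as a TypeScript interface sketch; the option and return shapes below are assumptions, and the authoritative definitions live in src/types.ts and engine.ts.

// Illustrative sketch only; not the real type definitions.
interface GenerateOptions {
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
  onToken?: (token: string) => void;
  temperature?: number;
  maxTokens?: number;
}

interface LocalLLMEngineLike {
  isSupported(): boolean;
  load(modelKey?: string): Promise<void>;               // idempotent, safe to call repeatedly
  generate(options: GenerateOptions): Promise<string>;  // resolves with the full completion
  prompt(text: string): Promise<string>;                // single-turn convenience wrapper
  extractJson<T>(text: string, hint?: string): Promise<T>;
  classify(text: string, labels: string[]): Promise<string>;
  onStatusChange(listener: (status: unknown) => void): () => void;  // returns an unsubscribe fn
}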

Why transformers.js and not WebLLM

The package was originally built on @mlc-ai/web-llm with Qwen 2.5 1.5B. We swapped to transformers.js + Gemma 4 on 2026-04-08 because:

  1. MLC compilation lag. WebLLM only runs models that have been pre-compiled to WGSL shaders by the MLC team. Gemma 4 was released 2026-04-02 and MLC had not (and as of the swap still hadn't) published Gemma 4 builds. Waiting was the alternative — we chose not to.
  2. transformers.js Gemma 4 is shipped. HuggingFace's onnx-community org published gemma-4-E2B-it-ONNX six days after release, with a Gemma4ForConditionalGeneration class in transformers.js v4.0.0.
  3. Future flexibility. ONNX is a much broader catalog than MLC's pre-compiled list. Switching to transformers.js opens the door to thousands of community-converted models without a per-model wait.

The trade-off accepted: transformers.js is a generic ONNX runtime, so per-token throughput is roughly 20–40% lower than WebLLM would deliver for the same model size. For a 2B model on a modern WebGPU device that still keeps generation comfortably interactive. The JS bundle grows by ~2–3 MB (the ONNX runtime), negligible against the 500 MB model download.

If MLC ships Gemma 4 builds in the future and you want to swap back: rewrite engine.ts to use the WebLLM API again (which is OpenAI-compatible, simpler), keep the same LocalLLMEngine class shape, and update models.ts with the MLC model id. The svelte bindings, the /llm-test page, and the playground module need no changes.
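
For reference, the WebLLM side of such a swap would look roughly like this. This is a sketch against @mlc-ai/web-llm's public API; the model id is a placeholder, since MLC has not published a Gemma 4 build.

// Sketch only: 'gemma-4-E2B-it-q4f16_1-MLC' is a hypothetical model id.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine('gemma-4-E2B-it-q4f16_1-MLC', {
  initProgressCallback: (report) => console.log(report.text),
});

// OpenAI-compatible chat surface; stream deltas into the existing onToken callback.
const stream = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
});
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content ?? '';
  // forward `token` to onToken(...)
}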

CSP requirements (the seven layers)

Browser-local inference is unusually CSP-hungry. Every Mana web app that bundles @mana/local-llm needs all of the following directives or model loading silently breaks at one of seven different points. Discovered the hard way through seven sequential console errors during the Gemma 4 bring-up.

script-src

'wasm-unsafe-eval'    — instantiate WASM modules (ONNX runtime)
https://cdn.jsdelivr.net    — dynamic import() of onnxruntime-web loader .mjs
blob:                 — Web Worker spawned via URL.createObjectURL

wasm-unsafe-eval is in the shared default in packages/shared-utils/src/security-headers.ts. The other two are added per-app in apps/mana/apps/web/src/hooks.server.ts via the scriptSrc option (see "Vite cache gotcha" below for why they live in the per-app hook, not the shared default).

connect-src

https://huggingface.co              — repo metadata
https://*.huggingface.co            — legacy CDN hosts (cdn-lfs-*.huggingface.co etc.)
https://cdn-lfs.huggingface.co      — explicit fallback for older CSP-strict browsers
https://cdn-lfs-us-1.huggingface.co — same
https://*.hf.co                     — new XET-backed CDN host family
https://cas-bridge.xethub.hf.co     — explicit XET CAS bridge
https://cdn.jsdelivr.net            — fetch() of onnxruntime-web .wasm + .mjs

HF rotates exact CDN hostnames every few months as they migrate from LFS to XET. The wildcards (*.huggingface.co, *.hf.co) catch most rotations; the explicit entries are belt-and-suspenders for browsers that prefer narrow matches.

Why jsDelivr is needed twice

@huggingface/transformers lazy-loads the onnxruntime-web WASM-loader shim from jsDelivr at backend selection time. There are two separate fetches with different CSP semantics:

  1. import('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.mjs') — dynamic ESM import → routed through script-src.
  2. fetch('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.wasm') — pre-load of the WASM binary → routed through connect-src.

Allowlisting only one of the two produces what looks like the identical error message ("no available backend found / Failed to fetch dynamically imported module") because the second failure is masked behind the first. Both directives have to include cdn.jsdelivr.net.

Why blob: is needed

After successfully fetching the loader .mjs from jsDelivr, onnxruntime-web wraps it in URL.createObjectURL(new Blob([...])) and instantiates the result as a multi-threaded Web Worker. The blob: URL scheme is treated as its own CSP source by browsers — neither 'self' nor https://cdn.jsdelivr.net matches it. Adding blob: to script-src allowlists workers spawned from blob URLs scoped to our document origin (you cannot URL.createObjectURL a Blob from another origin, so this does not loosen remote-script protection).
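
Schematically, the pattern that forces the blob: entry looks like this (an illustration in plain Web APIs, not onnxruntime-web's actual source):

// Schematic reproduction of the pattern, not the library's literal code.
const loaderSource = '/* ESM loader text already fetched from cdn.jsdelivr.net */';
const blobUrl = URL.createObjectURL(new Blob([loaderSource], { type: 'text/javascript' }));
// The Worker constructor checks this URL against script-src (via the worker-src
// fallback chain); neither 'self' nor https://cdn.jsdelivr.net matches a blob: URL.
const worker = new Worker(blobUrl, { type: 'module' });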

Vite cache gotcha — keep CSP additions in hooks.server.ts

When changing CSP additions, always edit them directly in apps/mana/apps/web/src/hooks.server.ts, not in packages/shared-utils/src/security-headers.ts. The shared-utils file holds the default CSP that applies to every Mana web app, but adding to it from a workspace package boundary triggers a Vite SSR module-cache pitfall:

  • hooks.server.ts is in the SvelteKit app's own source tree → Vite hot-reloads it on every file change. CSP additions made here take effect immediately.
  • packages/shared-utils/src/security-headers.ts is imported as @mana/shared-utils/security-headers from a different workspace package → Vite's SSR module cache holds the OLD compiled version even after a source edit. The dev server has to be restarted (or apps/mana/apps/web/node_modules/.vite deleted) before the change takes effect.

Diagnostic: if you add a CSP entry and the next browser console error still shows the old CSP without your addition, you got bitten by this. The fix is to move the addition into hooks.server.ts via setSecurityHeaders(response, { scriptSrc: [...], connectSrc: [...] }).
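
A sketch of what that call can look like in the app hook; the option names follow the description above, but treat the exact setSecurityHeaders shape as an assumption and check packages/shared-utils for the real signature:

// apps/mana/apps/web/src/hooks.server.ts (sketch)
import type { Handle } from '@sveltejs/kit';
import { setSecurityHeaders } from '@mana/shared-utils/security-headers';

export const handle: Handle = async ({ event, resolve }) => {
  const response = await resolve(event);
  setSecurityHeaders(response, {
    // script-src additions for @mana/local-llm (loader import + blob worker)
    scriptSrc: ['https://cdn.jsdelivr.net', 'blob:'],
    // connect-src additions: HF model hosts + jsDelivr WASM fetch
    connectSrc: [
      'https://huggingface.co',
      'https://*.huggingface.co',
      'https://cdn-lfs.huggingface.co',
      'https://cdn-lfs-us-1.huggingface.co',
      'https://*.hf.co',
      'https://cas-bridge.xethub.hf.co',
      'https://cdn.jsdelivr.net',
    ],
  });
  return response;
};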

API surface (Svelte 5 usage)

<script lang="ts">
  import {
    getLocalLlmStatus,
    loadLocalLlm,
    generate,
    extractJson,
    classify,
    isLocalLlmSupported,
    MODELS,
    DEFAULT_MODEL,
  } from '@mana/local-llm';

  const status = getLocalLlmStatus();
  const supported = isLocalLlmSupported();

  // Load on-demand (idempotent — safe to call repeatedly)
  async function start() {
    await loadLocalLlm(DEFAULT_MODEL);
  }

  // Streaming chat
  let response = $state('');
  async function chat(prompt: string) {
    response = '';
    await generate({
      messages: [{ role: 'user', content: prompt }],
      onToken: (t) => { response += t; },
      temperature: 0.7,
      maxTokens: 1024,
    });
  }
</script>

{#if !supported}
  <p>WebGPU not available — Chrome/Edge 113+ or Safari 18+ required.</p>
{:else if status.current.state === 'downloading'}
  <p>Downloading model: {(status.current.progress * 100).toFixed(0)}%</p>
{:else if status.current.state === 'ready'}
  <button onclick={() => chat('Hello')}>Chat</button>
  <pre>{response}</pre>
{/if}

The full status union is in src/types.ts: idle | checking | downloading | loading | ready | error. The downloading state carries progress: number (0..1) and text: string (human-readable summary including byte counts).
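
Approximately, as a TypeScript union (the definition in src/types.ts is authoritative; any field not mentioned above is an assumption here):

// Sketch of the status union described above.
type LocalLlmStatus =
  | { state: 'idle' }
  | { state: 'checking' }
  | { state: 'downloading'; progress: number; text: string }  // progress is 0..1
  | { state: 'loading' }
  | { state: 'ready' }
  | { state: 'error'; message: string };  // error payload shape is an assumption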

Adding a new model

Append an entry to src/models.ts:

'phi-4-mini': {
  modelId: 'onnx-community/Phi-4-mini-instruct-ONNX',
  displayName: 'Phi-4 Mini',
  dtype: 'q4f16',
  downloadSizeMb: 850,
  ramUsageMb: 2200,
},

The /llm-test picker reads MODELS dynamically so new entries appear without UI changes. Constraints:

  • Model must be ONNX-converted (look on huggingface.co/onnx-community for community converts, or huggingface.co/{org}/{repo}-ONNX for first-party builds).
  • The q4f16 quantization should exist in the repo's onnx/ subdirectory. Other valid dtype values: fp32, fp16, q8, q4, q4f16.
  • Architecture-specific model classes: Gemma 4 needs Gemma4ForConditionalGeneration. For other architectures you may need a different class import in engine.ts. Most LLMs work with AutoModelForCausalLM if they're text-only; multimodal-capable models (like Gemma 4) require their architecture-specific class.
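
To illustrate that last point, a hedged sketch of how the class choice might look inside engine.ts; the branching and the helper name are assumptions, and Gemma4ForConditionalGeneration is simply the class named above, not a verified import.

// Sketch only; the real engine.ts may structure this differently.
import {
  AutoModelForCausalLM,
  AutoProcessor,
  Gemma4ForConditionalGeneration,
} from '@huggingface/transformers';

async function loadModel(modelId: string, dtype: 'q4f16' | 'q4' | 'q8' | 'fp16' | 'fp32') {
  // Multimodal-capable architectures need their specific class; plain text-only
  // models usually load through AutoModelForCausalLM.
  const ModelClass = modelId.toLowerCase().includes('gemma-4')
    ? Gemma4ForConditionalGeneration
    : AutoModelForCausalLM;
  const processor = await AutoProcessor.from_pretrained(modelId);
  const model = await ModelClass.from_pretrained(modelId, { dtype, device: 'webgpu' });
  return { processor, model };
}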

Two-step tokenization gotcha

Inside engine.ts, input prep is two-step:

const promptText = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
  tokenize: false,
});
const inputs = processor.tokenizer(promptText, { return_tensors: 'pt' });

Do not be tempted to collapse this into a single apply_chat_template(messages, { return_dict: true }) call. For multimodal-capable processors (Gemma4Processor in particular), the all-in-one mode does not produce a Tensor-backed { input_ids, attention_mask } pair — it returns a shape that has no .dims on input_ids, and model.generate() then crashes deep inside the forward pass with Cannot read properties of null (reading 'dims'). The two-step pattern is what every transformers.js example for multimodal-capable processors uses.

Note the API name collision: apply_chat_template's tensor option is return_tensor (singular, boolean), while tokenizer()'s tensor option is return_tensors (plural, accepts Python-style 'pt' string). The JS port intentionally diverges from Python here.

model.generate() returns null when streaming

In transformers.js v4, model.generate({ ..., streamer }) returns null instead of a tensor when a TextStreamer is attached — the streamer is the only output channel. The engine always attaches a streamer (even when no caller onToken is provided) so we have one stable text channel that works on every version. Token counts are computed from the tensor return value when present, and fall back to a chars/4 estimate when it isn't.
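
A hedged sketch of that pattern, continuing the two-step tokenization snippet above. TextStreamer and its options are standard transformers.js; the surrounding names (processor, inputs, onToken, maxTokens) are the caller's, not verified engine.ts code.

// Continuing the snippet above; onToken and maxTokens come from the generate() options.
import { TextStreamer } from '@huggingface/transformers';

let fullText = '';
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  callback_function: (text: string) => {
    fullText += text;
    onToken?.(text);
  },
});

const output = await model.generate({ ...inputs, max_new_tokens: maxTokens, streamer });

// In v4, output is null when a streamer is attached; estimate tokens from chars/4.
const tokenCount = output ? output.dims.at(-1) : Math.ceil(fullText.length / 4);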

Deploy notes — base image rebuild required

@mana/local-llm itself is not baked into sveltekit-base:local; it's COPYed fresh by apps/mana/apps/web/Dockerfile on every build. So a change to this package alone does not require a base image rebuild — just ./scripts/mac-mini/build-app.sh mana-web.

However, the CSP additions live in apps/mana/apps/web/src/hooks.server.ts (in the app's own source tree, also COPYed fresh) — those also propagate via a normal mana-web build.

The only time you need a base image rebuild for local-llm-related work is if you change packages/shared-utils/src/security-headers.ts (because shared-utils IS baked into the base). The is_base_image_stale helper in build-app.sh (added 2026-04-08) detects this automatically and triggers a base rebuild before the per-app build.

Browser cache and download size

Models are cached in the standard browser Cache API under https://huggingface.co/{model_id}/resolve/main/... URLs. The package's hasModelInCache(modelId) helper probes for the model's tokenizer.json (always present, downloaded first) as a reliable proxy for "this model has been loaded before". If the user clears site data or the browser evicts the cache under quota pressure, the model has to re-download.
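
A sketch of what such a probe can look like; caches.match searches every Cache API cache, so no cache name has to be assumed, but the real helper in this package may differ.

// Probe for tokenizer.json as a proxy for "this model was downloaded before".
export async function hasModelInCache(modelId: string): Promise<boolean> {
  if (typeof caches === 'undefined') return false;  // Cache API unavailable (e.g. SSR)
  const url = `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`;
  const hit = await caches.match(url);
  return hit !== undefined;
}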

The default Gemma 4 E2B model is ~500 MB on first load. Show the download size in any UI that triggers a model load — users will assume the app is broken if a 500 MB silent download starts.

Browser support

Hard requirements:

  • WebGPU available — navigator.gpu must exist. Chrome/Edge 113+, Safari 18+. Firefox is still gated behind a flag and not supported.
  • Cache API available — present in all modern browsers; no fallback path.
  • Sufficient VRAM — Gemma 4 E2B with q4f16 needs roughly 1.5–2 GB of WebGPU device memory. On low-end devices the WebGPU adapter request will fail with RequestDeviceFailed or the inference will OOM. The user-facing fallback is "use the server-side LLM proxy via services/mana-llm" — there is no smaller browser-local model in the registry right now.

LocalLLMEngine.isSupported() only checks for WebGPU presence. It does not probe VRAM — there's no reliable WebGPU API for that.
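
That check amounts to roughly the following (sketch; the real implementation lives in engine.ts):

// WebGPU presence check only; VRAM capacity cannot be queried reliably.
export function isSupported(): boolean {
  return typeof navigator !== 'undefined' && 'gpu' in navigator;
}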