From 7901a5df2fe147438f4b2adbc1f37e444d3abc66 Mon Sep 17 00:00:00 2001
From: Till JS <tills95@gmail.com>
Date: Wed, 8 Apr 2026 23:27:50 +0200
Subject: [PATCH] docs(local-llm): document browser-local LLM stack and CSP
 requirements
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add packages/local-llm/CLAUDE.md as the package-level reference for
browser-local LLM inference. The package went through a non-trivial
engine swap from WebLLM/Qwen to transformers.js/Gemma 4 E2B on
2026-04-08, and the bring-up surfaced enough sharp edges that the
next person (or AI agent) touching this code will save real time
having them written down in one place rather than re-discovering
them error by error.

Captured topics:

- What the package is, what library/model is currently used, and
  the deliberate engine-agnostic API surface that lets future swaps
  stay contained to this package.

- Why we chose transformers.js + Gemma 4 over staying on WebLLM
  (MLC compilation lag for new model architectures) and what the
  return path looks like once MLC ships Gemma 4 builds.

- The seven CSP directives that browser-local inference needs and
  WHY each one is required:
    * script-src: 'wasm-unsafe-eval', cdn.jsdelivr.net, blob:
    * connect-src: huggingface.co + *.huggingface.co + cdn-lfs-*,
                   *.hf.co + cas-bridge.xethub.hf.co (XET CDN),
                   cdn.jsdelivr.net (for the WASM preload fetch)
  Including the subtle "jsDelivr is needed in BOTH script-src and
  connect-src" trap that produces identical-looking error messages
  for two distinct underlying causes.

- The Vite SSR module-cache gotcha: CSP additions made in
  packages/shared-utils/src/security-headers.ts do NOT hot-reload across
  the workspace package boundary, while additions made directly in
  apps/mana/apps/web/src/hooks.server.ts do. Includes the diagnostic
  pattern (compare which additions show up in the next CSP error
  vs which don't) and the workaround (move them into hooks.server.ts
  via setSecurityHeaders options).

- The two-step tokenization pattern that's mandatory for
  Gemma4Processor: apply_chat_template(tokenize:false) → string, then
  processor.tokenizer(text, return_tensors:'pt'). The collapsed
  apply_chat_template(return_dict:true) path looks shorter but
  produces a malformed input shape and crashes model.generate() deep
  inside the forward pass with "Cannot read properties of null
  (reading 'dims')" — opaque from the call site.

- The transformers.js v4 quirk that model.generate() returns null
  (not a tensor) when a TextStreamer is attached. The streamer is
  the only stable text channel; the engine always attaches one and
  uses the streamer's collected text as the canonical output, with
  a chars/4 fallback for token counts.

- API surface (Svelte 5 example), how to add a new model to the
  registry, deploy notes (no base image rebuild needed for local-llm
  changes alone, but IS needed if shared-utils CSP defaults change),
  browser cache semantics, and hard browser support requirements
  (WebGPU, ~1.5–2 GB VRAM for E2B q4f16, no CPU/WASM fallback).

Also link to the new doc from the root CLAUDE.md Shared Packages
table so people land on it from the standard discovery path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 CLAUDE.md                    |   1 +
 packages/local-llm/CLAUDE.md | 189 +++++++++++++++++++++++++++++++++++
 2 files changed, 190 insertions(+)
 create mode 100644 packages/local-llm/CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 6dd80ea0c..93b208e1a 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -145,6 +145,7 @@ MinIO (Docker, S3-compatible) in both local and prod. Console: http://localhost:
 | `@mana/shared-theme` | Theme config |
 | `@mana/shared-i18n` | i18n |
 | `@mana/local-store` | Local-first store primitives — used by unified Mana, manavoxel, arcade, and shared-uload/-stores/-links |
+| `@mana/local-llm` | Browser-local LLM inference (transformers.js + Gemma 4 E2B, WebGPU). Powers `/llm-test` and the playground module. See [`packages/local-llm/CLAUDE.md`](packages/local-llm/CLAUDE.md) for the CSP requirements and the transformers.js v4 gotchas. |
 
 ## Adding Dependencies
 
diff --git a/packages/local-llm/CLAUDE.md b/packages/local-llm/CLAUDE.md
new file mode 100644
index 000000000..56383a671
--- /dev/null
+++ b/packages/local-llm/CLAUDE.md
@@ -0,0 +1,189 @@
+# `@mana/local-llm` — Browser-Local LLM Inference
+
+Client-side LLM inference that runs **entirely in the user's browser** via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by `/llm-test` (developer tool) and the `playground` module in `apps/mana/apps/web`. Not related to `services/mana-llm` (which is the server-side LLM proxy that talks to Ollama, OpenAI, etc.).
+
+## What's currently in the box
+
+| Field | Value |
+|---|---|
+| Engine library | [`@huggingface/transformers`](https://huggingface.co/docs/transformers.js/index) v4 (transformers.js) |
+| Backend | WebGPU (mandatory; no WASM/CPU fallback enabled) |
+| Default model | `onnx-community/gemma-4-E2B-it-ONNX` (Google Gemma 4 E2B, q4f16, ~500 MB on disk) |
+| Quantization | 4-bit weights, fp16 activations (`q4f16`) |
+| API surface | `LocalLLMEngine` class + Svelte 5 reactive bindings (`getLocalLlmStatus`, `loadLocalLlm`, `generate`, `extractJson`, `classify`) |
+
+The exposed `LocalLLMEngine` API is intentionally engine-agnostic — `load()`, `generate()`, `prompt()`, `extractJson()`, `classify()`, `onStatusChange()`, `isSupported()`. Caller code (the test page, the playground module) **does not know** whether the underlying engine is WebLLM, transformers.js, or anything else. This was deliberate so future engine swaps don't ripple outward.
+
+## Why transformers.js and not WebLLM
+
+The package was originally built on `@mlc-ai/web-llm` with Qwen 2.5 1.5B. We swapped to transformers.js + Gemma 4 on **2026-04-08** because:
+
+1. **MLC compilation lag.** WebLLM only runs models that have been pre-compiled to WGSL shaders by the MLC team. Gemma 4 was released 2026-04-02, and as of the swap MLC had not published Gemma 4 builds. The alternative was to wait; we chose not to.
+2. **transformers.js Gemma 4 is shipped.** HuggingFace's `onnx-community` org published `gemma-4-E2B-it-ONNX` six days after release, with a `Gemma4ForConditionalGeneration` class in transformers.js v4.0.0.
+3. **Future flexibility.** ONNX is a much broader catalog than MLC's pre-compiled list. Switching to transformers.js opens the door to thousands of community-converted models without a per-model wait.
+
+The trade-off accepted: transformers.js is a generic ONNX runtime, so per-token throughput is **~20–40% lower** than WebLLM would deliver for the same model size. For a 2B model on a modern WebGPU device that is still fast enough for interactive use. The JS bundle gains ~2–3 MB (the ONNX runtime), negligible against the ~500 MB model download.
+
+**If MLC ships Gemma 4 builds in the future** and you want to swap back: rewrite `engine.ts` to use the WebLLM API again (which is OpenAI-compatible and simpler), keep the same `LocalLLMEngine` class shape, and update `models.ts` with the MLC model id. The Svelte bindings, the `/llm-test` page, and the playground module need no changes.
+
+## CSP requirements (the seven layers)
+
+Browser-local inference is unusually CSP-hungry. Every Mana web app that bundles `@mana/local-llm` needs **all** of the following `script-src` and `connect-src` entries, or model loading breaks at one of seven different points. These were discovered the hard way, through seven sequential console errors during the Gemma 4 bring-up.
+
+### `script-src`
+
+```
+'wasm-unsafe-eval'    — instantiate WASM modules (ONNX runtime)
+https://cdn.jsdelivr.net    — dynamic import() of onnxruntime-web loader .mjs
+blob:                 — Web Worker spawned via URL.createObjectURL
+```
+
+`wasm-unsafe-eval` is in the shared default in `packages/shared-utils/src/security-headers.ts`. The other two are added per-app in `apps/mana/apps/web/src/hooks.server.ts` via the `scriptSrc` option (see "Vite cache gotcha" below for why they live in the per-app hook, not the shared default).
+
+### `connect-src`
+
+```
+https://huggingface.co              — repo metadata
+https://*.huggingface.co            — legacy CDN hosts (cdn-lfs-*.huggingface.co etc.)
+https://cdn-lfs.huggingface.co      — explicit fallback for older CSP-strict browsers
+https://cdn-lfs-us-1.huggingface.co — same
+https://*.hf.co                     — new XET-backed CDN host family
+https://cas-bridge.xethub.hf.co     — explicit XET CAS bridge
+https://cdn.jsdelivr.net            — fetch() of onnxruntime-web .wasm + .mjs
+```
+
+HF rotates exact CDN hostnames every few months as they migrate from LFS to XET. The wildcards (`*.huggingface.co`, `*.hf.co`) catch most rotations; the explicit entries are belt-and-suspenders for browsers that prefer narrow matches.
+
+### Why jsDelivr is needed twice
+
+`@huggingface/transformers` lazy-loads the `onnxruntime-web` WASM-loader shim from jsDelivr at backend selection time. There are **two** separate fetches with different CSP semantics:
+
+1. `import('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.mjs')` — dynamic ESM import → routed through `script-src`.
+2. `fetch('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.wasm')` — pre-load of the WASM binary → routed through `connect-src`.
+
+Allowlisting only one of the two still produces the same error message ("no available backend found / Failed to fetch dynamically imported module"), because the second failure is masked behind the first. Both directives have to include `cdn.jsdelivr.net`.
+
+### Why `blob:` is needed
+
+After successfully fetching the loader .mjs from jsDelivr, onnxruntime-web wraps it in `URL.createObjectURL(new Blob([...]))` and instantiates the result as a multi-threaded Web Worker. The `blob:` URL scheme is treated as its own CSP source by browsers — neither `'self'` nor `https://cdn.jsdelivr.net` matches it. Adding `blob:` to `script-src` allowlists workers spawned from blob URLs scoped to our document origin (you cannot `URL.createObjectURL` a Blob from another origin, so this does not loosen remote-script protection).
+
+## Vite cache gotcha — keep CSP additions in `hooks.server.ts`
+
+When changing CSP additions, **always edit them directly in `apps/mana/apps/web/src/hooks.server.ts`**, not in `packages/shared-utils/src/security-headers.ts`. The shared-utils file holds the *default* CSP that applies to every Mana web app, but because it is imported across a workspace package boundary, edits to it run into a Vite SSR module-cache pitfall:
+
+- `hooks.server.ts` is in the SvelteKit app's own source tree → Vite hot-reloads it on every file change. CSP additions made here take effect immediately.
+- `packages/shared-utils/src/security-headers.ts` is imported as `@mana/shared-utils/security-headers` from a different workspace package → Vite's SSR module cache holds the OLD compiled version even after a source edit. The dev server has to be restarted (or `apps/mana/apps/web/node_modules/.vite` deleted) before the change takes effect.
+
+**Diagnostic:** if you add a CSP entry and the next browser console error still shows the old CSP without your addition, you got bitten by this. The fix is to move the addition into `hooks.server.ts` via `setSecurityHeaders(response, { scriptSrc: [...], connectSrc: [...] })`.
+
+## API surface (Svelte 5 usage)
+
+```svelte
+<script lang="ts">
+  import {
+    getLocalLlmStatus,
+    loadLocalLlm,
+    generate,
+    extractJson,
+    classify,
+    isLocalLlmSupported,
+    MODELS,
+    DEFAULT_MODEL,
+  } from '@mana/local-llm';
+
+  const status = getLocalLlmStatus();
+  const supported = isLocalLlmSupported();
+
+  // Load on-demand (idempotent — safe to call repeatedly)
+  async function start() {
+    await loadLocalLlm(DEFAULT_MODEL);
+  }
+
+  // Streaming chat
+  let response = $state('');
+  async function chat(prompt: string) {
+    response = '';
+    await generate({
+      messages: [{ role: 'user', content: prompt }],
+      onToken: (t) => { response += t; },
+      temperature: 0.7,
+      maxTokens: 1024,
+    });
+  }
+</script>
+
+{#if !supported}
+  <p>WebGPU not available — Chrome/Edge 113+ or Safari 18+ required.</p>
+{:else if status.current.state === 'downloading'}
+  <p>Downloading model: {(status.current.progress * 100).toFixed(0)}%</p>
+{:else if status.current.state === 'ready'}
+  <button onclick={() => chat('Hello')}>Chat</button>
+  <pre>{response}</pre>
+{/if}
+```
+
+The full status union is in `src/types.ts`: `idle | checking | downloading | loading | ready | error`. The downloading state carries `progress: number` (0..1) and `text: string` (human-readable summary including byte counts).
+
+## Adding a new model
+
+Append an entry to `src/models.ts`:
+
+```ts
+'phi-4-mini': {
+  modelId: 'onnx-community/Phi-4-mini-instruct-ONNX',
+  displayName: 'Phi-4 Mini',
+  dtype: 'q4f16',
+  downloadSizeMb: 850,
+  ramUsageMb: 2200,
+},
+```
+
+The `/llm-test` picker reads `MODELS` dynamically so new entries appear without UI changes. Constraints:
+
+- Model must be ONNX-converted (look on `huggingface.co/onnx-community` for community conversions, or `huggingface.co/{org}/{repo}-ONNX` for first-party builds).
+- The `q4f16` quantization should exist in the repo's `onnx/` subdirectory. Valid `dtype` values include `fp32`, `fp16`, `q8`, `q4`, and `q4f16`.
+- Architecture-specific model classes: Gemma 4 needs `Gemma4ForConditionalGeneration`. For other architectures you may need a different class import in `engine.ts`. Most LLMs work with `AutoModelForCausalLM` if they're text-only; multimodal-capable models (like Gemma 4) require their architecture-specific class.
+
+## Two-step tokenization gotcha
+
+Inside `engine.ts`, input prep is **two-step**:
+
+```ts
+const promptText = processor.apply_chat_template(messages, {
+  add_generation_prompt: true,
+  tokenize: false,
+});
+const inputs = processor.tokenizer(promptText, { return_tensors: 'pt' });
+```
+
+Do **not** be tempted to collapse this into a single `apply_chat_template(messages, { return_dict: true })` call. For multimodal-capable processors (Gemma4Processor in particular), the all-in-one mode does not produce a Tensor-backed `{ input_ids, attention_mask }` pair — it returns a shape that has no `.dims` on `input_ids`, and `model.generate()` then crashes deep inside the forward pass with `Cannot read properties of null (reading 'dims')`. The two-step pattern is what every transformers.js example for multimodal-capable processors uses.
+
+Note the API name collision: `apply_chat_template`'s tensor option is `return_tensor` (singular, boolean), while `tokenizer()`'s tensor option is `return_tensors` (plural, accepts Python-style `'pt'` string). The JS port intentionally diverges from Python here.
+
+## `model.generate()` returns null when streaming
+
+In transformers.js v4, `model.generate({ ..., streamer })` returns `null` instead of a tensor when a `TextStreamer` is attached — the streamer is the only output channel. The engine always attaches a streamer (even when no caller `onToken` is provided) so we have one stable text channel that works on every version. Token counts are computed from the tensor return value when present, and fall back to a `chars/4` estimate when it isn't.
+
+## Deploy notes: when a base image rebuild is required
+
+`@mana/local-llm` itself is **not** baked into `sveltekit-base:local`; it's COPYed fresh by `apps/mana/apps/web/Dockerfile` on every build. So a change to this package alone does not require a base image rebuild — just `./scripts/mac-mini/build-app.sh mana-web`.
+
+The CSP additions likewise live in `apps/mana/apps/web/src/hooks.server.ts` (in the app's own source tree, also COPYed fresh), so they too propagate via a normal `mana-web` build.
+
+The only time you need a base image rebuild for local-llm-related work is if you change `packages/shared-utils/src/security-headers.ts` (because shared-utils IS baked into the base). The `is_base_image_stale` helper in `build-app.sh` (added 2026-04-08) detects this automatically and triggers a base rebuild before the per-app build.
+
+## Browser cache and download size
+
+Models are cached in the standard browser Cache API under `https://huggingface.co/{model_id}/resolve/main/...` URLs. The package's `hasModelInCache(modelId)` helper probes for the model's `tokenizer.json` (always present, downloaded first) as a reliable proxy for "this model has been loaded before". If the user clears site data or the browser evicts the cache under quota pressure, the model has to re-download.
+
+The default Gemma 4 E2B model is **~500 MB on first load**. Show the download size in any UI that triggers a model load; users will assume the app is broken if a 500 MB download starts silently.
+
+## Browser support
+
+Hard requirements:
+
+- WebGPU available — `navigator.gpu` must exist. Chrome/Edge 113+, Safari 18+. Firefox is still gated behind a flag and not supported.
+- Cache API available — present in all modern browsers; no fallback path.
+- Sufficient VRAM — Gemma 4 E2B with q4f16 needs roughly 1.5–2 GB of WebGPU device memory. On low-end devices the WebGPU adapter request will fail with `RequestDeviceFailed` or the inference will OOM. The user-facing fallback is "use the server-side LLM proxy via `services/mana-llm`" — there is no smaller browser-local model in the registry right now.
+
+`LocalLLMEngine.isSupported()` only checks for WebGPU presence. It does not probe VRAM — there's no reliable WebGPU API for that.