managarten/packages/local-llm/CLAUDE.md
Till JS 7901a5df2f docs(local-llm): document browser-local LLM stack and CSP requirements
Add packages/local-llm/CLAUDE.md as the package-level reference for
browser-local LLM inference. The package went through a non-trivial
engine swap from WebLLM/Qwen to transformers.js/Gemma 4 E2B on
2026-04-08, and the bring-up surfaced enough sharp edges that the
next person (or AI agent) touching this code will save real time
having them written down in one place rather than re-discovering
them error by error.

Captured topics:

- What the package is, what library/model is currently used, and
  the deliberate engine-agnostic API surface that lets future swaps
  stay contained to this package.

- Why we chose transformers.js + Gemma 4 over staying on WebLLM
  (MLC compilation lag for new model architectures) and what the
  return path looks like once MLC ships Gemma 4 builds.

- The seven CSP directives that browser-local inference needs and
  WHY each one is required:
    * script-src: 'wasm-unsafe-eval', cdn.jsdelivr.net, blob:
    * connect-src: huggingface.co + *.huggingface.co + cdn-lfs-*,
                   *.hf.co + cas-bridge.xethub.hf.co (XET CDN),
                   cdn.jsdelivr.net (for the WASM preload fetch)
  Including the subtle "jsDelivr is needed in BOTH script-src and
  connect-src" trap that produces identical-looking error messages
  for two distinct underlying causes.

- The Vite SSR module-cache gotcha: CSP additions made in
  packages/shared-utils/security-headers.ts do NOT hot-reload across
  the workspace package boundary, while additions made directly in
  apps/mana/apps/web/src/hooks.server.ts do. Includes the diagnostic
  pattern (compare which additions show up in the next CSP error
  vs which don't) and the workaround (move them into hooks.server.ts
  via setSecurityHeaders options).

- The two-step tokenization pattern that's mandatory for
  Gemma4Processor: apply_chat_template(tokenize:false) → string, then
  processor.tokenizer(text, return_tensors:'pt'). The collapsed
  apply_chat_template(return_dict:true) path looks shorter but
  produces a malformed input shape and crashes model.generate() deep
  inside the forward pass with "Cannot read properties of null
  (reading 'dims')" — opaque from the call site.

- The transformers.js v4 quirk that model.generate() returns null
  (not a tensor) when a TextStreamer is attached. The streamer is
  the only stable text channel; the engine always attaches one and
  uses the streamer's collected text as the canonical output, with
  a chars/4 fallback for token counts.

- API surface (Svelte 5 example), how to add a new model to the
  registry, deploy notes (no base image rebuild needed for local-llm
  changes alone, but IS needed if shared-utils CSP defaults change),
  browser cache semantics, and hard browser support requirements
  (WebGPU, ~1.5–2 GB VRAM for E2B q4f16, no CPU/WASM fallback).

Also link to the new doc from the root CLAUDE.md Shared Packages
table so people land on it from the standard discovery path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:27:50 +02:00

189 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# `@mana/local-llm` — Browser-Local LLM Inference
Client-side LLM inference that runs **entirely in the user's browser** via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by `/llm-test` (developer tool) and the `playground` module in `apps/mana/apps/web`. Not related to `services/mana-llm` (which is the server-side LLM proxy that talks to Ollama, OpenAI, etc.).
## What's currently in the box
| Field | Value |
|---|---|
| Engine library | [`@huggingface/transformers`](https://huggingface.co/docs/transformers.js/index) v4 (transformers.js) |
| Backend | WebGPU (mandatory; no WASM/CPU fallback enabled) |
| Default model | `onnx-community/gemma-4-E2B-it-ONNX` (Google Gemma 4 E2B, q4f16, ~500 MB on disk) |
| Quantization | 4-bit weights, fp16 activations (`q4f16`) |
| API surface | `LocalLLMEngine` class + Svelte 5 reactive bindings (`getLocalLlmStatus`, `loadLocalLlm`, `generate`, `extractJson`, `classify`) |
The exposed `LocalLLMEngine` API is intentionally engine-agnostic — `load()`, `generate()`, `prompt()`, `extractJson()`, `classify()`, `onStatusChange()`, `isSupported()`. Caller code (the test page, the playground module) **does not know** whether the underlying engine is WebLLM, transformers.js, or anything else. This was deliberate so future engine swaps don't ripple outward.
## Why transformers.js and not WebLLM
The package was originally built on `@mlc-ai/web-llm` with Qwen 2.5 1.5B. We swapped to transformers.js + Gemma 4 on **2026-04-08** because:
1. **MLC compilation lag.** WebLLM only runs models that have been pre-compiled to WGSL shaders by the MLC team. Gemma 4 was released 2026-04-02 and MLC had not (and as of the swap still hadn't) published Gemma 4 builds. Waiting was the alternative — we chose not to.
2. **transformers.js Gemma 4 is shipped.** HuggingFace's `onnx-community` org published `gemma-4-E2B-it-ONNX` six days after release, with a `Gemma4ForConditionalGeneration` class in transformers.js v4.0.0.
3. **Future flexibility.** ONNX is a much broader catalog than MLC's pre-compiled list. Switching to transformers.js opens the door to thousands of community-converted models without a per-model wait.
The trade-off accepted: transformers.js is a generic ONNX runtime, so per-token throughput is **~2040% lower** than WebLLM would deliver for the same model size. For a 2B model on a modern WebGPU device that's still well above interactive latency. The JS bundle gains ~23 MB (the ONNX runtime), negligible against the 500 MB model download.
**If MLC ships Gemma 4 builds in the future** and you want to swap back: rewrite `engine.ts` to use the WebLLM API again (which is OpenAI-compatible, simpler), keep the same `LocalLLMEngine` class shape, and update `models.ts` with the MLC model id. The svelte bindings, the `/llm-test` page, and the playground module need no changes.
## CSP requirements (the seven layers)
Browser-local inference is unusually CSP-hungry. Every Mana web app that bundles `@mana/local-llm` needs **all** of the following directives or model loading silently breaks at one of seven different points. Discovered the hard way through seven sequential console errors during the Gemma 4 bring-up.
### `script-src`
```
'wasm-unsafe-eval' — instantiate WASM modules (ONNX runtime)
https://cdn.jsdelivr.net — dynamic import() of onnxruntime-web loader .mjs
blob: — Web Worker spawned via URL.createObjectURL
```
`wasm-unsafe-eval` is in the shared default in `packages/shared-utils/src/security-headers.ts`. The other two are added per-app in `apps/mana/apps/web/src/hooks.server.ts` via the `scriptSrc` option (see "Vite cache gotcha" below for why they live in the per-app hook, not the shared default).
### `connect-src`
```
https://huggingface.co — repo metadata
https://*.huggingface.co — legacy CDN hosts (cdn-lfs-*.huggingface.co etc.)
https://cdn-lfs.huggingface.co — explicit fallback for older CSP-strict browsers
https://cdn-lfs-us-1.huggingface.co — same
https://*.hf.co — new XET-backed CDN host family
https://cas-bridge.xethub.hf.co — explicit XET CAS bridge
https://cdn.jsdelivr.net — fetch() of onnxruntime-web .wasm + .mjs
```
HF rotates exact CDN hostnames every few months as they migrate from LFS to XET. The wildcards (`*.huggingface.co`, `*.hf.co`) catch most rotations; the explicit entries are belt-and-suspenders for browsers that prefer narrow matches.
### Why jsDelivr is needed twice
`@huggingface/transformers` lazy-loads the `onnxruntime-web` WASM-loader shim from jsDelivr at backend selection time. There are **two** separate fetches with different CSP semantics:
1. `import('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.mjs')` — dynamic ESM import → routed through `script-src`.
2. `fetch('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.wasm')` — pre-load of the WASM binary → routed through `connect-src`.
Allowlisting only one of the two looks like the same identical error message ("no available backend found / Failed to fetch dynamically imported module") because the second failure is masked behind the first. Both directives have to include `cdn.jsdelivr.net`.
### Why `blob:` is needed
After successfully fetching the loader .mjs from jsDelivr, onnxruntime-web wraps it in `URL.createObjectURL(new Blob([...]))` and instantiates the result as a multi-threaded Web Worker. The `blob:` URL scheme is treated as its own CSP source by browsers — neither `'self'` nor `https://cdn.jsdelivr.net` matches it. Adding `blob:` to `script-src` allowlists workers spawned from blob URLs scoped to our document origin (you cannot `URL.createObjectURL` a Blob from another origin, so this does not loosen remote-script protection).
## Vite cache gotcha — keep CSP additions in `hooks.server.ts`
When changing CSP additions, **always edit them directly in `apps/mana/apps/web/src/hooks.server.ts`**, not in `packages/shared-utils/src/security-headers.ts`. The shared-utils file holds the *default* CSP that applies to every Mana web app, but adding to it from a workspace package boundary triggers a Vite SSR module-cache pitfall:
- `hooks.server.ts` is in the SvelteKit app's own source tree → Vite hot-reloads it on every file change. CSP additions made here take effect immediately.
- `packages/shared-utils/src/security-headers.ts` is imported as `@mana/shared-utils/security-headers` from a different workspace package → Vite's SSR module cache holds the OLD compiled version even after a source edit. The dev server has to be restarted (or `apps/mana/apps/web/node_modules/.vite` deleted) before the change takes effect.
**Diagnostic:** if you add a CSP entry and the next browser console error still shows the old CSP without your addition, you got bitten by this. The fix is to move the addition into `hooks.server.ts` via `setSecurityHeaders(response, { scriptSrc: [...], connectSrc: [...] })`.
## API surface (Svelte 5 usage)
```svelte
<script lang="ts">
import {
getLocalLlmStatus,
loadLocalLlm,
generate,
extractJson,
classify,
isLocalLlmSupported,
MODELS,
DEFAULT_MODEL,
} from '@mana/local-llm';
const status = getLocalLlmStatus();
const supported = isLocalLlmSupported();
// Load on-demand (idempotent — safe to call repeatedly)
async function start() {
await loadLocalLlm(DEFAULT_MODEL);
}
// Streaming chat
let response = $state('');
async function chat(prompt: string) {
response = '';
await generate({
messages: [{ role: 'user', content: prompt }],
onToken: (t) => { response += t; },
temperature: 0.7,
maxTokens: 1024,
});
}
</script>
{#if !supported}
<p>WebGPU not available — Chrome/Edge 113+ or Safari 18+ required.</p>
{:else if status.current.state === 'downloading'}
<p>Downloading model: {(status.current.progress * 100).toFixed(0)}%</p>
{:else if status.current.state === 'ready'}
<button onclick={() => chat('Hello')}>Chat</button>
<pre>{response}</pre>
{/if}
```
The full status union is in `src/types.ts`: `idle | checking | downloading | loading | ready | error`. The downloading state carries `progress: number` (0..1) and `text: string` (human-readable summary including byte counts).
## Adding a new model
Append an entry to `src/models.ts`:
```ts
'phi-4-mini': {
modelId: 'onnx-community/Phi-4-mini-instruct-ONNX',
displayName: 'Phi-4 Mini',
dtype: 'q4f16',
downloadSizeMb: 850,
ramUsageMb: 2200,
},
```
The `/llm-test` picker reads `MODELS` dynamically so new entries appear without UI changes. Constraints:
- Model must be ONNX-converted (look on `huggingface.co/onnx-community` for community converts, or `huggingface.co/{org}/{repo}-ONNX` for first-party builds).
- The `q4f16` quantization should exist in the repo's `onnx/` subdirectory. Other valid `dtype` values: `fp32`, `fp16`, `q8`, `q4`, `q4f16`.
- Architecture-specific model classes: Gemma 4 needs `Gemma4ForConditionalGeneration`. For other architectures you may need a different class import in `engine.ts`. Most LLMs work with `AutoModelForCausalLM` if they're text-only; multimodal-capable models (like Gemma 4) require their architecture-specific class.
## Two-step tokenization gotcha
Inside `engine.ts`, input prep is **two-step**:
```ts
const promptText = processor.apply_chat_template(messages, {
add_generation_prompt: true,
tokenize: false,
});
const inputs = processor.tokenizer(promptText, { return_tensors: 'pt' });
```
Do **not** be tempted to collapse this into a single `apply_chat_template(messages, { return_dict: true })` call. For multimodal-capable processors (Gemma4Processor in particular), the all-in-one mode does not produce a Tensor-backed `{ input_ids, attention_mask }` pair — it returns a shape that has no `.dims` on `input_ids`, and `model.generate()` then crashes deep inside the forward pass with `Cannot read properties of null (reading 'dims')`. The two-step pattern is what every transformers.js example for multimodal-capable processors uses.
Note the API name collision: `apply_chat_template`'s tensor option is `return_tensor` (singular, boolean), while `tokenizer()`'s tensor option is `return_tensors` (plural, accepts Python-style `'pt'` string). The JS port intentionally diverges from Python here.
## `model.generate()` returns null when streaming
In transformers.js v4, `model.generate({ ..., streamer })` returns `null` instead of a tensor when a `TextStreamer` is attached — the streamer is the only output channel. The engine always attaches a streamer (even when no caller `onToken` is provided) so we have one stable text channel that works on every version. Token counts are computed from the tensor return value when present, and fall back to a `chars/4` estimate when it isn't.
## Deploy notes — base image rebuild required
`@mana/local-llm` itself is **not** baked into `sveltekit-base:local`; it's COPYed fresh by `apps/mana/apps/web/Dockerfile` on every build. So a change to this package alone does not require a base image rebuild — just `./scripts/mac-mini/build-app.sh mana-web`.
**However**, the CSP additions live in `apps/mana/apps/web/src/hooks.server.ts` (in the app's own source tree, also COPYed fresh) — those also propagate via a normal `mana-web` build.
The only time you need a base image rebuild for local-llm-related work is if you change `packages/shared-utils/src/security-headers.ts` (because shared-utils IS baked into the base). The `is_base_image_stale` helper in `build-app.sh` (added 2026-04-08) detects this automatically and triggers a base rebuild before the per-app build.
## Browser cache and download size
Models are cached in the standard browser Cache API under `https://huggingface.co/{model_id}/resolve/main/...` URLs. The package's `hasModelInCache(modelId)` helper probes for the model's `tokenizer.json` (always present, downloaded first) as a reliable proxy for "this model has been loaded before". If the user clears site data or the browser evicts the cache under quota pressure, the model has to re-download.
The default Gemma 4 E2B model is **~500 MB on first load**. Show the download size in any UI that triggers a model load — users will assume the app is broken if a 500 MB silent download starts.
## Browser support
Hard requirements:
- WebGPU available — `navigator.gpu` must exist. Chrome/Edge 113+, Safari 18+. Firefox is still gated behind a flag and not supported.
- Cache API available — present in all modern browsers; no fallback path.
- Sufficient VRAM — Gemma 4 E2B with q4f16 needs roughly 1.52 GB of WebGPU device memory. On low-end devices the WebGPU adapter request will fail with `RequestDeviceFailed` or the inference will OOM. The user-facing fallback is "use the server-side LLM proxy via `services/mana-llm`" — there is no smaller browser-local model in the registry right now.
`LocalLLMEngine.isSupported()` only checks for WebGPU presence. It does not probe VRAM — there's no reliable WebGPU API for that.