In commit c9e16243c (the gemma3:4b → gemma4:e4b switch) I sloppily
wrote in the ManaServerBackend docstring that mana-llm "routes them
to the local Ollama instance on the Mac Mini (running on the M4's
Metal GPU)". That is wrong AND it's the exact misconception I had
to debug-out-of earlier the same day.
The actual topology (already documented correctly in
docs/MAC_MINI_SERVER.md and docs/WINDOWS_GPU_SERVER_SETUP.md — I
just didn't read them before writing the docstring):
mana-llm container's OLLAMA_URL points at host.docker.internal:13434
→ ~/gpu-proxy.py (Python TCP forwarder, LaunchAgent on Mac Mini)
→ 192.168.178.11:11434 (LAN)
→ Ollama on the Windows GPU server (RTX 3090, 24 GB VRAM)
→ Inference
The Mac Mini's brew-installed Ollama binary is NOT on the inference
path. It's just a CLI for inspecting the proxied daemon. Today's
"why does the Mac Mini still have Ollama 0.15.4" puzzle has the
answer "because nothing on the Mac Mini actually runs inference, the
binary version was never load-bearing".
Two doc fixes:
1. packages/shared-llm/src/backends/mana-server.ts
Replace the lying docstring with the real topology, including a
pointer to the two MAC_MINI_SERVER.md / WINDOWS_GPU_SERVER_SETUP.md
sections that document it. Also note that gemma4:e4b is a
reasoning model that emits message.reasoning when given enough
tokens (cross-reference to remote.ts's fallback parser).
2. packages/local-llm/CLAUDE.md
Add a paragraph at the top explaining the difference between
"@mana/local-llm" (browser tier, on-device) and the @mana/shared-llm
"mana-server" / "cloud" tiers (services/mana-llm proxy → gpu-proxy.py
→ RTX 3090). The distinction was implicit before — "not related to
services/mana-llm" — but the doc never said where mana-server actually
goes. Future me reading it would still have to dig through
the docker-compose env to find out.
No code changes — only docstring + markdown.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# `@mana/local-llm` — Browser-Local LLM Inference
Client-side LLM inference that runs **entirely in the user's browser** via WebGPU. No server roundtrips, no API keys, no data leaving the device. Used by `/llm-test` (developer tool) and the `playground` module in `apps/mana/apps/web`.

**Don't confuse this with the server-side LLM** (`services/mana-llm`). The server-side proxy is what backs the **`mana-server`** and **`cloud`** tiers in `@mana/shared-llm`'s tiered orchestrator — it speaks OpenAI-compatible HTTP and routes to a configured Ollama instance or to Gemini. The Ollama instance is **not** the Mac Mini's local Ollama: traffic goes via `~/gpu-proxy.py` (a Python TCP forwarder running as a LaunchAgent on the Mac Mini host) to the Windows GPU server's Ollama at `192.168.178.11:11434`, where inference runs on the **RTX 3090**. See `docs/MAC_MINI_SERVER.md` and `docs/WINDOWS_GPU_SERVER_SETUP.md` for the full topology. This package (`@mana/local-llm`) is the **only** path that uses the user's own device — `mana-server` and `cloud` both leave the device.
## What's currently in the box
| Field | Value |
|---|---|
| Engine library | [`@huggingface/transformers`](https://huggingface.co/docs/transformers.js/index) v4 (transformers.js) |
| Backend | WebGPU (mandatory; no WASM/CPU fallback enabled) |
| Default model | `onnx-community/gemma-4-E2B-it-ONNX` (Google Gemma 4 E2B, q4f16, ~500 MB on disk) |
| Quantization | 4-bit weights, fp16 activations (`q4f16`) |
| API surface | `LocalLLMEngine` class + Svelte 5 reactive bindings (`getLocalLlmStatus`, `loadLocalLlm`, `generate`, `extractJson`, `classify`) |

The exposed `LocalLLMEngine` API is intentionally engine-agnostic — `load()`, `generate()`, `prompt()`, `extractJson()`, `classify()`, `onStatusChange()`, `isSupported()`. Caller code (the test page, the playground module) **does not know** whether the underlying engine is WebLLM, transformers.js, or anything else. This was deliberate so future engine swaps don't ripple outward.
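
For orientation, the surface looks roughly like this — the authoritative types live in `src/types.ts` and `src/engine.ts`; the option shapes and return types below are illustrative, not exact:

```ts
// Rough sketch of the engine-agnostic surface. Signatures are approximations —
// check src/types.ts / src/engine.ts for the real ones.
interface GenerateOptions {
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
  onToken?: (token: string) => void; // streaming callback (see the Svelte example below)
  temperature?: number;
  maxTokens?: number;
}

interface LocalLLMEngineShape {
  isSupported(): boolean;                // WebGPU presence check only
  load(model?: string): Promise<void>;   // idempotent; first call triggers the download
  generate(opts: GenerateOptions): Promise<string>;
  prompt(text: string): Promise<string>;                      // signature is a guess
  extractJson(text: string): Promise<unknown>;                 // signature is a guess
  classify(text: string, labels: string[]): Promise<string>;   // signature is a guess
  onStatusChange(cb: (status: unknown) => void): () => void;   // assumed to return an unsubscribe fn
}
```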
## Why transformers.js and not WebLLM
The package was originally built on `@mlc-ai/web-llm` with Qwen 2.5 1.5B. We swapped to transformers.js + Gemma 4 on **2026-04-08** because:

1. **MLC compilation lag.** WebLLM only runs models that have been pre-compiled to WGSL shaders by the MLC team. Gemma 4 was released 2026-04-02 and MLC had not (and as of the swap still hadn't) published Gemma 4 builds. Waiting was the alternative — we chose not to.
2. **transformers.js Gemma 4 is shipped.** HuggingFace's `onnx-community` org published `gemma-4-E2B-it-ONNX` six days after release, with a `Gemma4ForConditionalGeneration` class in transformers.js v4.0.0.
3. **Future flexibility.** ONNX is a much broader catalog than MLC's pre-compiled list. Switching to transformers.js opens the door to thousands of community-converted models without a per-model wait.

The trade-off accepted: transformers.js is a generic ONNX runtime, so per-token throughput is **~20–40% lower** than WebLLM would deliver for the same model size. For a 2B model on a modern WebGPU device that's still well above interactive latency. The JS bundle gains ~2–3 MB (the ONNX runtime), negligible against the 500 MB model download.

**If MLC ships Gemma 4 builds in the future** and you want to swap back: rewrite `engine.ts` to use the WebLLM API again (which is OpenAI-compatible, simpler), keep the same `LocalLLMEngine` class shape, and update `models.ts` with the MLC model id. The svelte bindings, the `/llm-test` page, and the playground module need no changes.
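
For reference, a swap-back would look roughly like the sketch below, using WebLLM's OpenAI-style API. The model id is hypothetical (MLC has not published Gemma 4 builds), and the option names should be re-checked against `@mlc-ai/web-llm` at that point:

```ts
// Hypothetical sketch only — invented model id; verify the WebLLM API before swapping back.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine('gemma-4-E2B-it-q4f16_1-MLC' /* hypothetical id */, {
  initProgressCallback: (p) => console.log(p.text), // would feed the 'downloading' status
});

// OpenAI-compatible chat surface, streamed:
const stream = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
});
for await (const chunk of stream) {
  const token = chunk.choices[0]?.delta?.content ?? '';
  // feed token into the same onToken path the current engine exposes
}
```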
## CSP requirements (the seven layers)
Browser-local inference is unusually CSP-hungry. Every Mana web app that bundles `@mana/local-llm` needs **all** of the following directives or model loading silently breaks at one of seven different points. Discovered the hard way through seven sequential console errors during the Gemma 4 bring-up.
### `script-src`
```
'wasm-unsafe-eval' — instantiate WASM modules (ONNX runtime)
https://cdn.jsdelivr.net — dynamic import() of onnxruntime-web loader .mjs
blob: — Web Worker spawned via URL.createObjectURL
```

`wasm-unsafe-eval` is in the shared default in `packages/shared-utils/src/security-headers.ts`. The other two are added per-app in `apps/mana/apps/web/src/hooks.server.ts` via the `scriptSrc` option (see "Vite cache gotcha" below for why they live in the per-app hook, not the shared default).
### `connect-src`
```
https://huggingface.co — repo metadata
https://*.huggingface.co — legacy CDN hosts (cdn-lfs-*.huggingface.co etc.)
https://cdn-lfs.huggingface.co — explicit fallback for older CSP-strict browsers
https://cdn-lfs-us-1.huggingface.co — same
https://*.hf.co — new XET-backed CDN host family
https://cas-bridge.xethub.hf.co — explicit XET CAS bridge
https://cdn.jsdelivr.net — fetch() of onnxruntime-web .wasm + .mjs
```

HF rotates exact CDN hostnames every few months as they migrate from LFS to XET. The wildcards (`*.huggingface.co`, `*.hf.co`) catch most rotations; the explicit entries are belt-and-suspenders for browsers that prefer narrow matches.
### Why jsDelivr is needed twice
`@huggingface/transformers` lazy-loads the `onnxruntime-web` WASM-loader shim from jsDelivr at backend selection time. There are **two** separate fetches with different CSP semantics:

1. `import('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.mjs')` — dynamic ESM import → routed through `script-src`.
2. `fetch('https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.wasm')` — pre-load of the WASM binary → routed through `connect-src`.

Allowlisting only one of the two produces the same error message ("no available backend found / Failed to fetch dynamically imported module") because the second failure is masked behind the first. Both directives have to include `cdn.jsdelivr.net`.
### Why `blob:` is needed
After successfully fetching the loader .mjs from jsDelivr, onnxruntime-web wraps it in `URL.createObjectURL(new Blob([...]))` and instantiates the result as a multi-threaded Web Worker. The `blob:` URL scheme is treated as its own CSP source by browsers — neither `'self'` nor `https://cdn.jsdelivr.net` matches it. Adding `blob:` to `script-src` allowlists workers spawned from blob URLs scoped to our document origin (you cannot `URL.createObjectURL` a Blob from another origin, so this does not loosen remote-script protection).
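
Mechanically, the pattern is roughly the following — an illustration of what onnxruntime-web does internally, not code in this package; the loader URL path is elided as in the section above:

```ts
// Illustration of the blob-worker pattern described above — not our code.
const loaderUrl = 'https://cdn.jsdelivr.net/.../ort-wasm-simd-threaded.asyncify.mjs'; // path elided
const loaderSource = await (await fetch(loaderUrl)).text();

// blob: must be allowed in script-src (worker-src falls back to script-src when unset)
const blobUrl = URL.createObjectURL(new Blob([loaderSource], { type: 'text/javascript' }));
const worker = new Worker(blobUrl, { type: 'module' });
```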
## Vite cache gotcha — keep CSP additions in `hooks.server.ts`
When changing CSP additions, **always edit them directly in `apps/mana/apps/web/src/hooks.server.ts`**, not in `packages/shared-utils/src/security-headers.ts`. The shared-utils file holds the *default* CSP that applies to every Mana web app, but editing it during development trips a Vite SSR module-cache pitfall because it is imported across a workspace package boundary:

- `hooks.server.ts` is in the SvelteKit app's own source tree → Vite hot-reloads it on every file change. CSP additions made here take effect immediately.
- `packages/shared-utils/src/security-headers.ts` is imported as `@mana/shared-utils/security-headers` from a different workspace package → Vite's SSR module cache holds the OLD compiled version even after a source edit. The dev server has to be restarted (or `apps/mana/apps/web/node_modules/.vite` deleted) before the change takes effect.

**Diagnostic:** if you add a CSP entry and the next browser console error still shows the old CSP without your addition, you got bitten by this. The fix is to move the addition into `hooks.server.ts` via `setSecurityHeaders(response, { scriptSrc: [...], connectSrc: [...] })`.
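
A sketch of the hook-side shape — the `setSecurityHeaders` call matches how it's used today, while the surrounding `handle` wiring and the exact entry lists are illustrative (see the CSP sections above for the full lists):

```ts
// apps/mana/apps/web/src/hooks.server.ts — sketch; the real hook does more than this.
import type { Handle } from '@sveltejs/kit';
import { setSecurityHeaders } from '@mana/shared-utils/security-headers';

export const handle: Handle = async ({ event, resolve }) => {
  const response = await resolve(event);
  // local-llm CSP additions live HERE (hot-reloaded), not in the shared default
  setSecurityHeaders(response, {
    scriptSrc: ['https://cdn.jsdelivr.net', 'blob:'],
    connectSrc: ['https://huggingface.co', 'https://*.hf.co', 'https://cdn.jsdelivr.net' /* …abridged */],
  });
  return response;
};
```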
## API surface (Svelte 5 usage)
```svelte
<script lang="ts">
  import {
    getLocalLlmStatus,
    loadLocalLlm,
    generate,
    extractJson,
    classify,
    isLocalLlmSupported,
    MODELS,
    DEFAULT_MODEL,
  } from '@mana/local-llm';

  const status = getLocalLlmStatus();
  const supported = isLocalLlmSupported();

  // Load on-demand (idempotent — safe to call repeatedly)
  async function start() {
    await loadLocalLlm(DEFAULT_MODEL);
  }

  // Streaming chat
  let response = $state('');
  async function chat(prompt: string) {
    response = '';
    await generate({
      messages: [{ role: 'user', content: prompt }],
      onToken: (t) => { response += t; },
      temperature: 0.7,
      maxTokens: 1024,
    });
  }
</script>

{#if !supported}
  <p>WebGPU not available — Chrome/Edge 113+ or Safari 18+ required.</p>
{:else if status.current.state === 'downloading'}
  <p>Downloading model: {(status.current.progress * 100).toFixed(0)}%</p>
{:else if status.current.state === 'ready'}
  <button onclick={() => chat('Hello')}>Chat</button>
  <pre>{response}</pre>
{/if}
```
The full status union is in `src/types.ts`: `idle | checking | downloading | loading | ready | error`. The downloading state carries `progress: number` (0..1) and `text: string` (human-readable summary including byte counts).
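
As a rough shape (the authoritative definition is in `src/types.ts`; the error payload below is a guess):

```ts
// Approximate shape of the status union — see src/types.ts for the real definition.
type LocalLlmStatus =
  | { state: 'idle' }
  | { state: 'checking' }
  | { state: 'downloading'; progress: number; text: string } // progress 0..1, text includes byte counts
  | { state: 'loading' }
  | { state: 'ready' }
  | { state: 'error'; message: string }; // payload shape is a guess — check types.ts
```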
## Adding a new model
Append an entry to `src/models.ts`:

```ts
'phi-4-mini': {
  modelId: 'onnx-community/Phi-4-mini-instruct-ONNX',
  displayName: 'Phi-4 Mini',
  dtype: 'q4f16',
  downloadSizeMb: 850,
  ramUsageMb: 2200,
},
```
The `/llm-test` picker reads `MODELS` dynamically so new entries appear without UI changes. Constraints:

- Model must be ONNX-converted (look on `huggingface.co/onnx-community` for community converts, or `huggingface.co/{org}/{repo}-ONNX` for first-party builds).
- The `q4f16` quantization should exist in the repo's `onnx/` subdirectory. Other valid `dtype` values: `fp32`, `fp16`, `q8`, `q4`, `q4f16`.
- Architecture-specific model classes: Gemma 4 needs `Gemma4ForConditionalGeneration`. For other architectures you may need a different class import in `engine.ts`. Most LLMs work with `AutoModelForCausalLM` if they're text-only; multimodal-capable models (like Gemma 4) require their architecture-specific class (see the load-time sketch below).
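
As a rough illustration of the class choice at load time (sketch only — the real loading code is in `engine.ts`, and the exact transformers.js option names should be checked there):

```ts
// Sketch, assuming transformers.js-style from_pretrained options (dtype / device / progress_callback).
import { AutoTokenizer, AutoModelForCausalLM } from '@huggingface/transformers';

const modelId = 'onnx-community/Phi-4-mini-instruct-ONNX'; // the hypothetical entry from above
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  dtype: 'q4f16',
  device: 'webgpu',
  progress_callback: (p) => console.log(p), // would feed the 'downloading' status
});
// Multimodal-capable architectures (e.g. Gemma 4) instead need their specific class
// (Gemma4ForConditionalGeneration) plus a processor rather than a plain tokenizer.
```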
## Two-step tokenization gotcha
Inside `engine.ts`, input prep is **two-step**:

```ts
const promptText = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
  tokenize: false,
});
const inputs = processor.tokenizer(promptText, { return_tensors: 'pt' });
```

Do **not** be tempted to collapse this into a single `apply_chat_template(messages, { return_dict: true })` call. For multimodal-capable processors (Gemma4Processor in particular), the all-in-one mode does not produce a Tensor-backed `{ input_ids, attention_mask }` pair — it returns a shape that has no `.dims` on `input_ids`, and `model.generate()` then crashes deep inside the forward pass with `Cannot read properties of null (reading 'dims')`. The two-step pattern is what every transformers.js example for multimodal-capable processors uses.

Note the near-collision in option names: `apply_chat_template`'s tensor option is `return_tensor` (singular, boolean), while `tokenizer()`'s tensor option is `return_tensors` (plural, accepts the Python-style `'pt'` string). The JS port intentionally diverges from Python here.
## `model.generate()` returns null when streaming
In transformers.js v4, `model.generate({ ..., streamer })` returns `null` instead of a tensor when a `TextStreamer` is attached — the streamer is the only output channel. The engine always attaches a streamer (even when no caller `onToken` is provided) so we have one stable text channel that works on every version. Token counts are computed from the tensor return value when present, and fall back to a `chars/4` estimate when it isn't.
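
A sketch of that arrangement, assuming transformers.js's `TextStreamer` and the variable names from the tokenization snippet above (exact details live in `engine.ts`):

```ts
// Sketch: always attach a streamer; treat the tensor return value as optional.
// processor / model / inputs / onToken / maxTokens come from the surrounding engine code.
import { TextStreamer } from '@huggingface/transformers';

let generatedText = '';
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true, // don't echo the prompt
  callback_function: (t: string) => {
    generatedText += t;
    onToken?.(t); // single stable text channel, whether or not the caller streams
  },
});

const output = await model.generate({ ...inputs, max_new_tokens: maxTokens, streamer });

// v4 returns null here when a streamer is attached — fall back to a chars/4 estimate.
const outputTokens = output?.dims?.at(-1) ?? Math.round(generatedText.length / 4);
```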
## Deploy notes — base image rebuild required
`@mana/local-llm` itself is **not** baked into `sveltekit-base:local`; it's COPYed fresh by `apps/mana/apps/web/Dockerfile` on every build. So a change to this package alone does not require a base image rebuild — just `./scripts/mac-mini/build-app.sh mana-web`.

**Likewise**, the CSP additions live in `apps/mana/apps/web/src/hooks.server.ts` (in the app's own source tree, also COPYed fresh) — those too propagate via a normal `mana-web` build.

The only time you need a base image rebuild for local-llm-related work is if you change `packages/shared-utils/src/security-headers.ts` (because shared-utils IS baked into the base). The `is_base_image_stale` helper in `build-app.sh` (added 2026-04-08) detects this automatically and triggers a base rebuild before the per-app build.
## Browser cache and download size
Models are cached in the standard browser Cache API under `https://huggingface.co/{model_id}/resolve/main/...` URLs. The package's `hasModelInCache(modelId)` helper probes for the model's `tokenizer.json` (always present, downloaded first) as a reliable proxy for "this model has been loaded before". If the user clears site data or the browser evicts the cache under quota pressure, the model has to re-download.

The default Gemma 4 E2B model is **~500 MB on first load**. Show the download size in any UI that triggers a model load — users will assume the app is broken if a 500 MB silent download starts.
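
A sketch of how a load trigger might combine the two — `hasModelInCache`, `MODELS`, `DEFAULT_MODEL`, `loadLocalLlm`, and `downloadSizeMb` are the package's own names, but whether `DEFAULT_MODEL` is a key into `MODELS` is an assumption here (check `src/models.ts`):

```ts
// Sketch: warn about the download before triggering a load.
import { hasModelInCache, loadLocalLlm, MODELS, DEFAULT_MODEL } from '@mana/local-llm';

async function confirmAndLoad(modelKey: string = DEFAULT_MODEL) {
  const entry = MODELS[modelKey]; // assumes MODELS is keyed by model key — check src/models.ts
  const cached = await hasModelInCache(entry.modelId);
  if (!cached && !confirm(`This will download ~${entry.downloadSizeMb} MB of model weights. Continue?`)) {
    return;
  }
  await loadLocalLlm(modelKey);
}
```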
## Browser support
Hard requirements:

- WebGPU available — `navigator.gpu` must exist. Chrome/Edge 113+, Safari 18+. Firefox is still gated behind a flag and not supported.
- Cache API available — present in all modern browsers; no fallback path.
- Sufficient VRAM — Gemma 4 E2B with q4f16 needs roughly 1.5–2 GB of WebGPU device memory. On low-end devices the WebGPU adapter request will fail with `RequestDeviceFailed` or the inference will OOM. The user-facing fallback is "use the server-side LLM proxy via `services/mana-llm`" — there is no smaller browser-local model in the registry right now.

`LocalLLMEngine.isSupported()` only checks for WebGPU presence. It does not probe VRAM — there's no reliable WebGPU API for that.
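
Roughly (sketch — the real check lives in `engine.ts`):

```ts
// Approximation of the support check: WebGPU presence only, no VRAM probe.
function isSupported(): boolean {
  return typeof navigator !== 'undefined' && 'gpu' in navigator;
}
```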