managarten

mirror of https://github.com/Memo-2023/mana-monorepo.git synced 2026-05-14 20:21:09 +02:00


Till JS 45f368f471 feat(local-llm): Phase 3 — move inference into a Web Worker

The browser tier of @mana/local-llm was running entirely in the main JS thread. With Gemma 4 E2B that meant ~50-200 ms of synchronous tensor work per forward pass × ~150 forward passes per generation = the UI froze for 10-30 seconds during a single chat reply. Scrolling, clicks, animations all stopped.

Move the actual inference into a Dedicated Web Worker. The main thread keeps a thin LocalLLMEngine proxy with the same public API (load / unload / generate / prompt / extractJson / classify / onStatusChange / isSupported), so existing callers (the /llm-test page, the playground module, @mana/shared-llm's BrowserBackend, the Svelte 5 reactive bindings) need NO changes.

File layout after the split:

  src/engine.ts
    main-thread proxy (lazy worker init, postMessage protocol, pending request map, status broadcast handling, convenience wrappers for prompt/extractJson/classify)
  src/worker.ts
    Web Worker entry point (typed message protocol, single LocalLLMEngineImpl instance, forwards status changes back to main thread)
  src/engine-impl.ts
    the actual transformers.js engine (renamed from the previous engine.ts contents). NOT exported from index.ts; only the worker imports it. Same two-step tokenization, aggregated progress reporting, streaming token handling as before; just running in a different thread now.

Worker construction uses Vite's documented `new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })` pattern, which makes Vite split worker.ts (and its transformers.js dep) into its own bundle chunk at build time. The proxy is lazy-init: the Worker constructor is never touched at module-import time, so SSR stays clean (Worker doesn't exist on Node).
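The lazy-init behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/engine.ts: `LazyWorkerProxy` and `WorkerLike` are hypothetical names, the worker is injected through a factory so the sketch runs without a browser, and the posted message is simplified (no request ids).

```typescript
// Structural stand-in for the parts of Worker used here (assumption);
// a real browser Worker satisfies this shape.
interface WorkerLike {
  postMessage(message: unknown): void;
  terminate(): void;
}

// Hypothetical class name; the commit's proxy is LocalLLMEngine in src/engine.ts.
class LazyWorkerProxy {
  private worker: WorkerLike | null = null;
  private readonly createWorker: () => WorkerLike;

  // Only stores the factory. Nothing touches the Worker constructor yet, so
  // constructing (or importing) the proxy is safe where Worker is undefined.
  constructor(createWorker: () => WorkerLike) {
    this.createWorker = createWorker;
  }

  // Creates the worker on first use and reuses it afterwards.
  private ensureWorker(): WorkerLike {
    if (this.worker === null) this.worker = this.createWorker();
    return this.worker;
  }

  // Simplified message: the real protocol also carries a request id.
  load(modelKey: string): void {
    this.ensureWorker().postMessage({ type: 'load', modelKey });
  }

  // Drops the worker; the next load() spawns a fresh one via the same lazy path.
  terminate(): void {
    this.worker?.terminate();
    this.worker = null;
  }
}

// In a Vite app the factory would be the pattern quoted in the commit message:
//   () => new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
```

Injecting the factory is a choice made for this sketch only; it keeps the example runnable under Node and makes the "never touched at import time" property easy to check.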

Message protocol (typed end-to-end):

  Main → Worker:
    { id, type: 'load', modelKey: ModelKey }
    { id, type: 'unload' }
    { id, type: 'generate', opts: SerializableGenerateOptions }
    { id, type: 'isReady' }

  Worker → Main:
    { id, type: 'result', data?: unknown }
    { id, type: 'error', message: string }
    { id, type: 'token', token: string }        (streaming chunk)
    { type: 'status', status: LoadingStatus }   (broadcast)

The proxy assigns a unique id per request, stores the resolve/reject + optional onToken callback in a Map<id, PendingRequest>, and routes incoming responses by id. Status messages have no id and fire every registered status listener; same UX as before, just one extra hop.

Streaming: the worker re-attaches the streaming callback on its side. Each emitted token gets posted back as `{ id, type: 'token', token }` and the proxy invokes the original `onToken` callback. The final `result` arrives as a normal response and resolves the Promise. From the caller's perspective generate() still feels identical: same incremental delivery via onToken, same return value.

Worker termination on unload: transformers.js doesn't expose a dispose API, so we terminate the worker after unload and create a fresh one on the next load. This is the only reliable way to release VRAM between model swaps.

CSP: no header changes needed. The worker is loaded from a same-origin URL (Vite emits it as /_app/immutable/workers/worker.[hash].js), so 'self' in script-src already covers it. The blob: + cdn.jsdelivr.net + wasm-unsafe-eval allowlists we added during the original WebLLM/transformers.js bring-up still apply because the worker still runs the same ONNX runtime that needed them.

DistributiveOmit type helper: TS's plain `Omit<Union, K>` does not distribute over a discriminated union; it collapses the union to a single object type that keeps only the keys shared by every member, which broke the type narrowing at the postRequest call sites for each request variant. Adding a tiny `DistributiveOmit<T, K>` helper fixes the type-check without restructuring the protocol.
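The protocol, the pending-request map and the `DistributiveOmit` helper described above might fit together along these lines. This is a minimal sketch, not the commit's code: `RequestRouter` is a hypothetical name, the field shapes of `SerializableGenerateOptions` and `LoadingStatus` are placeholders, numeric ids are an assumption, and `DistributiveOmit` is written in the common conditional-type form, which may differ from the helper in the package.

```typescript
// Placeholder shapes (assumptions); the real definitions live in the package.
type ModelKey = string;
interface SerializableGenerateOptions { prompt: string; maxTokens?: number }
interface LoadingStatus { state: string }

// Main -> Worker, variants as listed in the commit message.
type WorkerRequest =
  | { id: number; type: 'load'; modelKey: ModelKey }
  | { id: number; type: 'unload' }
  | { id: number; type: 'generate'; opts: SerializableGenerateOptions }
  | { id: number; type: 'isReady' };

// Worker -> Main. Status broadcasts carry no id.
type WorkerResponse =
  | { id: number; type: 'result'; data?: unknown }
  | { id: number; type: 'error'; message: string }
  | { id: number; type: 'token'; token: string }
  | { type: 'status'; status: LoadingStatus };

// Applies Omit to each union member separately, so variant-specific fields survive.
type DistributiveOmit<T, K extends PropertyKey> = T extends unknown ? Omit<T, K> : never;

interface PendingRequest {
  resolve: (data: unknown) => void;
  reject: (error: Error) => void;
  onToken: ((token: string) => void) | undefined;
}

// Hypothetical helper; in the commit this logic sits inside the LocalLLMEngine proxy.
class RequestRouter {
  private nextId = 1;
  private readonly pending = new Map<number, PendingRequest>();
  private readonly statusListeners = new Set<(status: LoadingStatus) => void>();
  private readonly send: (request: WorkerRequest) => void;

  // `send` stands in for worker.postMessage so the sketch runs without a Worker.
  constructor(send: (request: WorkerRequest) => void) {
    this.send = send;
  }

  get pendingCount(): number {
    return this.pending.size;
  }

  // Registers a status listener and returns an unsubscribe function.
  onStatusChange(listener: (status: LoadingStatus) => void): () => void {
    this.statusListeners.add(listener);
    return () => { this.statusListeners.delete(listener); };
  }

  // Assigns a fresh id, remembers the callbacks, then sends the request.
  postRequest(
    request: DistributiveOmit<WorkerRequest, 'id'>,
    onToken?: (token: string) => void,
  ): Promise<unknown> {
    const id = this.nextId++;
    return new Promise<unknown>((resolve, reject) => {
      this.pending.set(id, { resolve, reject, onToken });
      this.send({ ...request, id } as WorkerRequest);
    });
  }

  // Would be wired to worker.onmessage: (event) => router.handleMessage(event.data).
  handleMessage(message: WorkerResponse): void {
    if (message.type === 'status') {
      this.statusListeners.forEach((listener) => listener(message.status));
      return;
    }
    const entry = this.pending.get(message.id);
    if (entry === undefined) return; // unknown or already settled id
    if (message.type === 'token') {
      entry.onToken?.(message.token); // streaming chunk; the request stays pending
      return;
    }
    this.pending.delete(message.id);
    if (message.type === 'result') entry.resolve(message.data);
    else entry.reject(new Error(message.message));
  }
}
```

With the distributive helper, a call such as `postRequest({ type: 'generate', opts })` is checked against the `generate` variant alone, so a missing `opts` or a stray `modelKey` is a compile-time error.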

What this commit deliberately does NOT do:

- Change the public API surface. The whole point is that callers remain untouched.
- Add multi-tab worker coordination via SharedWorker or BroadcastChannel. Each tab still spawns its own dedicated worker with its own copy of the model in VRAM. Multi-tab dedup is Phase 2.5/Phase 4 work; see the design doc summary in the previous Phase 1 commit message.
- Add a persistent task queue. Fire-and-forget background tasks are Phase 4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-09 01:27:10 +02:00
src	feat(local-llm): Phase 3 — move inference into a Web Worker	2026-04-09 01:27:10 +02:00
CLAUDE.md	docs(local-llm): document browser-local LLM stack and CSP requirements	2026-04-08 23:27:50 +02:00
package.json	feat(local-llm): swap WebLLM/Qwen for transformers.js + Gemma 4 E2B	2026-04-08 22:22:32 +02:00
tsconfig.json	feat(local-llm): add client-side LLM inference package with WebLLM	2026-04-02 01:53:54 +02:00