managarten/packages/local-llm/src/worker.ts
Till JS 45f368f471 feat(local-llm): Phase 3 — move inference into a Web Worker
The browser tier of @mana/local-llm was running entirely on the main
JS thread. With Gemma 4 E2B that meant ~50-200 ms of synchronous
tensor work per forward pass × ~150 forward passes per generation,
so the UI froze for 10-30 seconds during a single chat reply.
Scrolling, clicks, and animations all stopped.

Move the actual inference into a Dedicated Web Worker. The main
thread keeps a thin LocalLLMEngine proxy with the same public API
(load / unload / generate / prompt / extractJson / classify /
onStatusChange / isSupported), so existing callers — the /llm-test
page, the playground module, @mana/shared-llm's BrowserBackend, the
Svelte 5 reactive bindings — need NO changes.
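
For reference, the preserved surface looks roughly like this (a
sketch assembled from the method list above; the parameter and
return types are assumptions, the real ones live in types.ts):

  interface LocalLLMEngine {
    load(modelKey: ModelKey): Promise<void>;
    unload(): Promise<void>;
    generate(opts: GenerateOptions): Promise<string>;
    prompt(text: string): Promise<string>;
    extractJson(text: string): Promise<unknown>;
    classify(text: string, labels: string[]): Promise<string>;
    onStatusChange(listener: (s: LoadingStatus) => void): () => void;
    isSupported(): boolean;
  }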

File layout after the split:

  src/engine.ts       — main-thread proxy (lazy worker init,
                        postMessage protocol, pending request map,
                        status broadcast handling, convenience
                        wrappers for prompt/extractJson/classify)
  src/worker.ts       — Web Worker entry point (typed message
                        protocol, single LocalLLMEngineImpl instance,
                        forwards status changes back to main thread)
  src/engine-impl.ts  — the actual transformers.js engine (renamed
                        from the previous engine.ts contents). NOT
                        exported from index.ts — only the worker
                        imports it. Same two-step tokenization,
                        aggregated progress reporting, streaming
                        token handling as before; just running in
                        a different thread now.

Worker construction uses Vite's documented `new Worker(new URL(
'./worker.ts', import.meta.url), { type: 'module' })` pattern, which
makes Vite split worker.ts (and its transformers.js dep) into its
own bundle chunk at build time. The proxy is lazy-init: the Worker
constructor is never touched at module-import time, so SSR stays
clean (Worker doesn't exist on Node).
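
A minimal sketch of that lazy init (engine.ts side; variable names
are illustrative):

  let worker: Worker | null = null;

  function getWorker(): Worker {
    // Only reached from the async API methods, never at module
    // scope, so Node/SSR (where Worker is undefined) never hits it.
    worker ??= new Worker(new URL('./worker.ts', import.meta.url), {
      type: 'module',
    });
    return worker;
  }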

Message protocol (typed end-to-end):

  Main → Worker:
    { id, type: 'load',     modelKey: ModelKey }
    { id, type: 'unload' }
    { id, type: 'generate', opts: SerializableGenerateOptions }
    { id, type: 'isReady' }

  Worker → Main:
    { id, type: 'result',  data?: unknown }
    { id, type: 'error',   message: string }
    { id, type: 'token',   token: string }       — streaming chunk
    {     type: 'status',  status: LoadingStatus }  — broadcast

The proxy assigns a unique id per request, stores the resolve/reject
+ optional onToken callback in a Map<id, PendingRequest>, and routes
incoming responses by id. Status messages have no id and fire every
registered status listener — same UX as before, just one extra hop.
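
The routing logic, sketched (continuing the lazy-init sketch above;
PendingRequest and the exact signatures are assumptions, not the
verbatim implementation):

  import type { WorkerRequest, WorkerResponse } from './worker';
  import type { LoadingStatus } from './types';

  type PendingRequest = {
    resolve: (data: unknown) => void;
    reject: (err: Error) => void;
    onToken?: (token: string) => void;
  };

  const pending = new Map<string, PendingRequest>();
  const statusListeners = new Set<(s: LoadingStatus) => void>();
  let nextId = 0;
  const newId = () => String(nextId++);

  function postRequest(
    req: DistributiveOmit<WorkerRequest, 'id'>, // helper: see below
    onToken?: (token: string) => void,
  ): Promise<unknown> {
    const id = newId();
    return new Promise((resolve, reject) => {
      pending.set(id, { resolve, reject, onToken });
      getWorker().postMessage({ ...req, id });
    });
  }

  // Attached to the worker as its 'message' listener when
  // getWorker() first creates it.
  function handleMessage(event: MessageEvent<WorkerResponse>) {
    const msg = event.data;
    if (msg.type === 'status') {
      for (const listener of statusListeners) listener(msg.status);
      return;
    }
    const entry = pending.get(msg.id);
    if (!entry) return;
    if (msg.type === 'token') {
      entry.onToken?.(msg.token); // streaming; request stays pending
    } else if (msg.type === 'result') {
      pending.delete(msg.id);
      entry.resolve(msg.data);
    } else {
      pending.delete(msg.id);
      entry.reject(new Error(msg.message));
    }
  }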

Streaming: the worker re-attaches the streaming callback on its
side. Each emitted token gets posted back as `{ id, type: 'token',
token }` and the proxy invokes the original `onToken` callback. The
final `result` arrives as a normal response and resolves the
Promise. From the caller's perspective generate() still behaves
identically: same streaming via onToken, same return value.

Worker termination on unload: transformers.js doesn't expose a
dispose API, so we terminate the worker after unload and create a
fresh one on the next load. This is the only reliable way to
release VRAM between model swaps.
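
Continuing the same sketch, the unload path:

  async function unload(): Promise<void> {
    await postRequest({ type: 'unload' }); // let the engine clean up
    worker?.terminate(); // terminating the thread is what frees VRAM
    worker = null;       // the next load() lazily spawns a fresh one
  }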

CSP: no header changes needed. The worker is loaded from a
same-origin URL (Vite emits it as
/_app/immutable/workers/worker.[hash].js), so 'self' in script-src
already covers it. The blob: + cdn.jsdelivr.net + wasm-unsafe-eval
allowlists we added during the original WebLLM/transformers.js
bring-up still apply because the worker still runs the same ONNX
runtime that needed them.
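
For illustration only (the real header lives in the app config and
is untouched by this commit), the directive keeps roughly this shape:

  script-src 'self' 'wasm-unsafe-eval' blob: https://cdn.jsdelivr.net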

DistributiveOmit type helper: TS's plain `Omit<Union, K>` does not
distribute over union members; it keeps only the keys shared by every
variant, collapsing the discriminated union. That broke the type
narrowing at the postRequest call sites for each request variant.
Adding a tiny `DistributiveOmit<T, K>` helper (see below) fixes the
type-check without restructuring the protocol.
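
The helper is the standard distributive conditional-type idiom
(exact placement in the codebase assumed):

  // A naked type parameter in a conditional type distributes over
  // union members, so each WorkerRequest variant is Omit-ed on its
  // own and the discriminated union survives intact.
  type DistributiveOmit<T, K extends PropertyKey> = T extends unknown
    ? Omit<T, K>
    : never;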

What this commit deliberately does NOT do:

- Change the public API surface. The whole point is that callers
  remain untouched.
- Add multi-tab worker coordination via SharedWorker or
  BroadcastChannel. Each tab still spawns its own dedicated worker
  with its own copy of the model in VRAM. Multi-tab dedup is
  Phase 2.5/Phase 4 work — see the design doc summary in the
  previous Phase 1 commit message.
- Add a persistent task queue. Fire-and-forget background tasks
  are Phase 4.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 01:27:10 +02:00

/**
 * Web Worker entry point for @mana/local-llm.
 *
 * Runs in a Dedicated Worker context, owns a single LocalLLMEngineImpl
 * instance, and exchanges messages with the main thread proxy
 * (engine.ts) over postMessage. The protocol is small and typed:
 *
 * Main → Worker (WorkerRequest):
 *   { id, type: 'load',     modelKey: ModelKey }
 *   { id, type: 'unload' }
 *   { id, type: 'generate', opts: SerializableGenerateOptions }
 *   { id, type: 'isReady' }   — readiness probe; resolves with a boolean
 *
 * Worker → Main (WorkerResponse):
 *   { id, type: 'result',  data?: unknown }        — request fulfilled
 *   { id, type: 'error',   message: string }       — request rejected
 *   { id, type: 'token',   token: string }         — streaming token chunk
 *   {     type: 'status',  status: LoadingStatus } — broadcast, no id
 *
 * Each request has a unique id chosen by the proxy. The worker echoes
 * the id on its result/error/token responses so the proxy can route
 * them back to the right pending Promise + onToken callback. Status
 * messages are broadcast (no id) and trigger every registered status
 * listener on the proxy.
 *
 * Note: this file does NOT import @mana/local-llm's index — it imports
 * engine-impl directly. The package's public surface is the proxy in
 * engine.ts; this file is the worker side of that proxy and lives in
 * its own bundle chunk.
 */
import { LocalLLMEngineImpl } from './engine-impl';
import type { GenerateOptions, LoadingStatus } from './types';
import type { ModelKey } from './models';

// ─── Protocol types (mirrored in engine.ts) ────────────────────

export type SerializableGenerateOptions = Omit<GenerateOptions, 'onToken'>;

export type WorkerRequest =
  | { id: string; type: 'load'; modelKey: ModelKey }
  | { id: string; type: 'unload' }
  | { id: string; type: 'generate'; opts: SerializableGenerateOptions }
  | { id: string; type: 'isReady' };

export type WorkerResponse =
  | { id: string; type: 'result'; data?: unknown }
  | { id: string; type: 'error'; message: string }
  | { id: string; type: 'token'; token: string }
  | { type: 'status'; status: LoadingStatus };

// ─── Worker setup ──────────────────────────────────────────────

const engine = new LocalLLMEngineImpl();

// Forward all status changes to the main thread as broadcast messages.
engine.onStatusChange((status) => {
  postMessage({ type: 'status', status } satisfies WorkerResponse);
});

self.addEventListener('message', async (event: MessageEvent<WorkerRequest>) => {
  const req = event.data;
  try {
    switch (req.type) {
      case 'load': {
        await engine.load(req.modelKey);
        postMessage({ id: req.id, type: 'result' } satisfies WorkerResponse);
        break;
      }
      case 'unload': {
        await engine.unload();
        postMessage({ id: req.id, type: 'result' } satisfies WorkerResponse);
        break;
      }
      case 'isReady': {
        postMessage({
          id: req.id,
          type: 'result',
          data: engine.isReady,
        } satisfies WorkerResponse);
        break;
      }
      case 'generate': {
        // Re-attach the streaming callback on the worker side. Each
        // emitted token gets posted back to the main thread tagged
        // with the originating request id, so the proxy can route
        // it to the right onToken callback.
        const result = await engine.generate({
          ...req.opts,
          onToken: (token) => {
            postMessage({ id: req.id, type: 'token', token } satisfies WorkerResponse);
          },
        });
        postMessage({
          id: req.id,
          type: 'result',
          data: result,
        } satisfies WorkerResponse);
        break;
      }
    }
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    postMessage({ id: req.id, type: 'error', message } satisfies WorkerResponse);
  }
});