The browser tier of @mana/local-llm was running entirely in the main
JS thread. With Gemma 4 E2B that meant ~50-200 ms of synchronous
tensor work per forward pass, times ~150 forward passes per
generation: the UI froze for 10-30 seconds during a single chat
reply. Scrolling, clicks, and animations all stopped.
Move the actual inference into a Dedicated Web Worker. The main
thread keeps a thin LocalLLMEngine proxy with the same public API
(load / unload / generate / prompt / extractJson / classify /
onStatusChange / isSupported), so existing callers — the /llm-test
page, the playground module, @mana/shared-llm's BrowserBackend, the
Svelte 5 reactive bindings — need NO changes.
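A caller therefore still reads exactly as before. Illustrative
snippet (assumes the package index re-exports the proxy class and
singleton as it did pre-split):

  import { localLLM, LocalLLMEngine } from '@mana/local-llm';

  if (LocalLLMEngine.isSupported()) {
    await localLLM.load(); // first call lazily spawns the worker
    let streamed = '';
    const result = await localLLM.generate({
      messages: [{ role: 'user', content: 'Say hello in five words.' }],
      onToken: (token) => { streamed += token; }, // fires per streamed chunk
    });
    console.log(result.content);
    await localLLM.unload(); // terminates the worker, freeing VRAM
  }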
File layout after the split:

  src/engine.ts      — main-thread proxy (lazy worker init, postMessage
                       protocol, pending request map, status broadcast
                       handling, convenience wrappers for
                       prompt/extractJson/classify)
  src/worker.ts      — Web Worker entry point (typed message protocol,
                       single LocalLLMEngineImpl instance, forwards
                       status changes back to the main thread)
  src/engine-impl.ts — the actual transformers.js engine (renamed from
                       the previous engine.ts contents). NOT exported
                       from index.ts — only the worker imports it. Same
                       two-step tokenization, aggregated progress
                       reporting, streaming token handling as before;
                       just running in a different thread now.
Worker construction uses Vite's documented `new Worker(new URL(
'./worker.ts', import.meta.url), { type: 'module' })` pattern, which
makes Vite split worker.ts (and its transformers.js dep) into its
own bundle chunk at build time. The proxy is lazy-init: the Worker
constructor is never touched at module-import time, so SSR stays
clean (Worker doesn't exist on Node).
Message protocol (typed end-to-end):
Main → Worker:
{ id, type: 'load', modelKey: ModelKey }
{ id, type: 'unload' }
{ id, type: 'generate', opts: SerializableGenerateOptions }
{ id, type: 'isReady' }
Worker → Main:
{ id, type: 'result', data?: unknown }
{ id, type: 'error', message: string }
{ id, type: 'token', token: string } — streaming chunk
{ type: 'status', status: LoadingStatus } — broadcast
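Spelled out as TypeScript unions, the protocol looks roughly like
this (a sketch — the real definitions live in worker.ts; the exact
shape of SerializableGenerateOptions is an assumption, derived from
how engine.ts strips onToken before posting):

  import type { GenerateOptions, LoadingStatus } from './types';
  import type { ModelKey } from './models';

  // GenerateOptions minus the function-valued onToken callback,
  // which cannot cross postMessage (structured clone rejects functions).
  export type SerializableGenerateOptions = Omit<GenerateOptions, 'onToken'>;

  export type WorkerRequest =
    | { id: string; type: 'load'; modelKey: ModelKey }
    | { id: string; type: 'unload' }
    | { id: string; type: 'generate'; opts: SerializableGenerateOptions }
    | { id: string; type: 'isReady' };

  export type WorkerResponse =
    | { id: string; type: 'result'; data?: unknown }
    | { id: string; type: 'error'; message: string }
    | { id: string; type: 'token'; token: string }
    | { type: 'status'; status: LoadingStatus };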
The proxy assigns a unique id per request, stores the resolve/reject
+ optional onToken callback in a Map<id, PendingRequest>, and routes
incoming responses by id. Status messages have no id and fire every
registered status listener — same UX as before, just one extra hop.
Streaming: the worker re-attaches the streaming callback on its
side. Each emitted token gets posted back as `{ id, type: 'token',
token }` and the proxy invokes the original `onToken` callback. The
final `result` arrives as a normal response and resolves the
Promise. From the caller's perspective generate() still behaves
identically — the same onToken streaming callback, the same return value.
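On the worker side, the generate handling looks roughly like this
(illustrative sketch only — the real worker.ts also routes
load/unload/isReady, and the exact control flow is an assumption):

  import { LocalLLMEngineImpl } from './engine-impl';

  const engine = new LocalLLMEngineImpl();

  // Status changes are broadcast without an id.
  engine.onStatusChange((status) => self.postMessage({ type: 'status', status }));

  // WorkerRequest is the union type defined alongside this handler.
  self.addEventListener('message', async (event: MessageEvent<WorkerRequest>) => {
    const msg = event.data;
    try {
      if (msg.type === 'generate') {
        const result = await engine.generate({
          ...msg.opts,
          // Re-attach streaming on this side: every emitted token goes
          // back tagged with the request id so the proxy can route it.
          onToken: (token) => self.postMessage({ id: msg.id, type: 'token', token }),
        });
        self.postMessage({ id: msg.id, type: 'result', data: result });
      }
      // load / unload / isReady are handled the same way.
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      self.postMessage({ id: msg.id, type: 'error', message });
    }
  });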
Worker termination on unload: transformers.js doesn't expose a
dispose API, so we terminate the worker after unload and create a
fresh one on the next load. This is the only reliable way to
release VRAM between model swaps.
CSP: no header changes needed. The worker is loaded from a
same-origin URL (Vite emits it as
/_app/immutable/workers/worker.[hash].js), so 'self' in script-src
already covers it. The blob: + cdn.jsdelivr.net + wasm-unsafe-eval
allowlists we added during the original WebLLM/transformers.js
bring-up still apply because the worker still runs the same ONNX
runtime that needed them.
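For reference, the relevant directive ends up roughly like this
(illustrative only — the real header lives in the app's server
config and may carry more sources):

  script-src 'self' blob: https://cdn.jsdelivr.net 'wasm-unsafe-eval';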
DistributiveOmit type helper: TS's plain `Omit<Union, K>` is not
distributive — it collapses a discriminated union into a single type
with only the members common to every variant, which broke the type
narrowing at the postRequest call sites for each request variant.
Adding a tiny `DistributiveOmit<T, K>` helper fixes the type-check
without restructuring the protocol.
What this commit deliberately does NOT do:
- Change the public API surface. The whole point is that callers
remain untouched.
- Add multi-tab worker coordination via SharedWorker or
BroadcastChannel. Each tab still spawns its own dedicated worker
with its own copy of the model in VRAM. Multi-tab dedup is
Phase 2.5/Phase 4 work — see the design doc summary in the
previous Phase 1 commit message.
- Add a persistent task queue. Fire-and-forget background tasks
are Phase 4.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
src/engine.ts
/**
 * LocalLLMEngine — main-thread proxy for the worker-hosted engine.
 *
 * Public API is intentionally identical to the previous in-thread
 * implementation so existing callers (the /llm-test page, the
 * playground module, @mana/shared-llm's BrowserBackend, the Svelte 5
 * reactive bindings in svelte.svelte.ts) need no changes. Internally
 * every call now goes through a Web Worker — see worker.ts for the
 * other side of the protocol.
 *
 * Why a worker: a 2B-parameter LLM does heavy synchronous tensor work
 * for ~50-200 ms per forward pass. With ~150 forward passes per
 * generation, the main thread would freeze for ~10-30 seconds during
 * a single chat reply. Web Workers run on their own thread, so the
 * main UI stays responsive throughout.
 *
 * The proxy is constructed lazily — the Worker is only instantiated
 * on first method call. This matters for SSR: importing this module
 * during a server render must NOT touch the Worker constructor (which
 * doesn't exist in Node), and lazy construction is the cleanest way
 * to keep import-time side effects to zero.
 */

import { MODELS, DEFAULT_MODEL, type ModelKey } from './models';
import type {
  ChatMessage,
  GenerateOptions,
  GenerateResult,
  LoadingStatus,
  ModelConfig,
} from './types';
import type { SerializableGenerateOptions, WorkerRequest, WorkerResponse } from './worker';

/** Tracking entry for an in-flight worker request. */
interface PendingRequest {
  resolve: (data: unknown) => void;
  reject: (err: Error) => void;
  onToken?: (token: string) => void;
}

/**
 * Distributive Omit — preserves the discriminated union when stripping
 * a key. Plain `Omit<Union, K>` is not distributive: it collapses the
 * union into a single type with only the shared members and loses the
 * type narrowing on `req.type`. This helper distributes the Omit
 * across each member of the union so postRequest still type-checks at
 * the call sites.
 */
type DistributiveOmit<T, K extends keyof T> = T extends unknown ? Omit<T, K> : never;
type WorkerRequestPayload = DistributiveOmit<WorkerRequest, 'id'>;

export class LocalLLMEngine {
  private worker: Worker | null = null;
  private pending = new Map<string, PendingRequest>();
  private nextId = 0;
  private currentModel: ModelKey | null = null;
  private _status: LoadingStatus = { state: 'idle' };
  private statusListeners: Set<(status: LoadingStatus) => void> = new Set();

  get status(): LoadingStatus {
    return this._status;
  }

  get isReady(): boolean {
    return this._status.state === 'ready';
  }

  get modelConfig(): ModelConfig | null {
    return this.currentModel ? MODELS[this.currentModel] : null;
  }

  onStatusChange(listener: (status: LoadingStatus) => void): () => void {
    this.statusListeners.add(listener);
    return () => this.statusListeners.delete(listener);
  }

  private setStatus(status: LoadingStatus) {
    this._status = status;
    for (const listener of this.statusListeners) {
      listener(status);
    }
  }

  /** Check if WebGPU is available. Synchronous and SSR-safe. */
  static isSupported(): boolean {
    return typeof navigator !== 'undefined' && 'gpu' in navigator;
  }

  // ─── Worker management ──────────────────────────────────

  private getWorker(): Worker {
    if (this.worker) return this.worker;

    if (typeof Worker === 'undefined') {
      throw new Error('@mana/local-llm requires a browser environment (Worker is not defined)');
    }

    // `new URL('./worker.ts', import.meta.url)` is Vite's documented
    // pattern for declaring a Web Worker entry. Vite picks this up at
    // build time, splits worker.ts (and its transformers.js dep) into
    // its own chunk, and rewrites the URL to the chunk's hashed path.
    // Outside Vite (raw esbuild, plain Node, etc.) this would fail —
    // but the only consumer of this package is the SvelteKit web app
    // where Vite handles the bundling.
    this.worker = new Worker(new URL('./worker.ts', import.meta.url), {
      type: 'module',
      name: 'mana-local-llm',
    });

    this.worker.addEventListener('message', this.handleWorkerMessage);
    this.worker.addEventListener('error', (e) => {
      // Worker-level fatal error — reject all pending requests.
      const message = e.message || 'Worker crashed';
      for (const [id, p] of this.pending) {
        p.reject(new Error(`Worker error: ${message}`));
        this.pending.delete(id);
      }
      this.setStatus({ state: 'error', error: message });
    });

    return this.worker;
  }

  private handleWorkerMessage = (event: MessageEvent<WorkerResponse>) => {
    const msg = event.data;

    // Status broadcasts have no id and target every listener.
    if (msg.type === 'status') {
      this.setStatus(msg.status);
      return;
    }

    // Streaming token: route to the matching request's onToken
    // callback if one was registered.
    if (msg.type === 'token') {
      const pending = this.pending.get(msg.id);
      pending?.onToken?.(msg.token);
      return;
    }

    // Result/error: resolve or reject the matching pending Promise.
    const pending = this.pending.get(msg.id);
    if (!pending) return;
    this.pending.delete(msg.id);

    if (msg.type === 'result') {
      pending.resolve(msg.data);
    } else {
      pending.reject(new Error(msg.message));
    }
  };

  private postRequest<T>(req: WorkerRequestPayload, onToken?: (token: string) => void): Promise<T> {
    const id = `${++this.nextId}`;
    const worker = this.getWorker();

    return new Promise<T>((resolve, reject) => {
      this.pending.set(id, {
        resolve: (data) => resolve(data as T),
        reject,
        onToken,
      });
      worker.postMessage({ ...req, id } as WorkerRequest);
    });
  }

  // ─── Public API ──────────────────────────────────────────

  async load(model: ModelKey = DEFAULT_MODEL): Promise<void> {
    if (this.currentModel === model && this.isReady) return;
    this.currentModel = model;
    await this.postRequest<void>({ type: 'load', modelKey: model });
  }

  async unload(): Promise<void> {
    if (!this.worker) return; // never loaded, nothing to do
    await this.postRequest<void>({ type: 'unload' });
    this.currentModel = null;
    // Tear down the worker so a future load() starts a fresh one
    // with cleared GPU buffers. transformers.js doesn't expose an
    // explicit dispose, so terminating the worker is the only way
    // to reliably reclaim VRAM.
    this.worker.terminate();
    this.worker = null;
    this.pending.clear();
  }

  async generate(options: GenerateOptions): Promise<GenerateResult> {
    const { onToken, ...rest } = options;
    const opts: SerializableGenerateOptions = rest;
    return this.postRequest<GenerateResult>({ type: 'generate', opts }, onToken);
  }

  // ─── Convenience wrappers (main thread, build on top of generate) ──

  async prompt(
    text: string,
    opts?: { systemPrompt?: string; temperature?: number; maxTokens?: number }
  ): Promise<string> {
    const messages: ChatMessage[] = [];
    if (opts?.systemPrompt) {
      messages.push({ role: 'system', content: opts.systemPrompt });
    }
    messages.push({ role: 'user', content: text });

    const result = await this.generate({
      messages,
      temperature: opts?.temperature,
      maxTokens: opts?.maxTokens,
    });
    return result.content;
  }

  async extractJson<T = unknown>(
    text: string,
    instruction: string,
    opts?: { temperature?: number }
  ): Promise<T> {
    const result = await this.generate({
      messages: [
        {
          role: 'system',
          content:
            'You are a JSON extraction assistant. Always respond with valid JSON only, no markdown, no explanation.',
        },
        {
          role: 'user',
          content: `${instruction}\n\nText:\n${text}`,
        },
      ],
      temperature: opts?.temperature ?? 0.1,
      maxTokens: 2048,
    });

    return JSON.parse(result.content) as T;
  }

  async classify(text: string, categories: string[], opts?: { context?: string }): Promise<string> {
    const categoryList = categories.map((c) => `"${c}"`).join(', ');
    const result = await this.generate({
      messages: [
        {
          role: 'system',
          content: `Classify the text into exactly one of these categories: ${categoryList}. Respond with only the category name, nothing else.${opts?.context ? ` Context: ${opts.context}` : ''}`,
        },
        { role: 'user', content: text },
      ],
      temperature: 0,
      maxTokens: 50,
    });

    const normalized = result.content.trim().replace(/^["']|["']$/g, '');
    const match = categories.find((c) => c.toLowerCase() === normalized.toLowerCase());
    return match ?? normalized;
  }
}

/** Singleton instance for app-wide use */
export const localLLM = new LocalLLMEngine();