# @mana/local-stt — Browser-Local Speech-to-Text
Client-side speech-to-text that runs entirely in the user's browser via WebGPU (with a WASM fallback). No server roundtrips, no API keys, no audio leaving the device. Uses OpenAI's Whisper models through `@huggingface/transformers` v4 — the same library that powers `@mana/local-llm`.

Don't confuse this with the server-side STT (`services/mana-stt`), which runs Whisper on the GPU server (RTX 3090). This package is the only STT path that keeps audio on the user's device.
## What's in the box

| Field | Value |
|---|---|
| Engine library | `@huggingface/transformers` v4 (transformers.js) |
| Backend | WebGPU (primary), WASM (fallback) |
| Default model | `onnx-community/whisper-tiny` (~150 MB, multilingual) |
| Pipeline | `automatic-speech-recognition` (Whisper encoder-decoder) |
| Audio input | `Float32Array`, 16 kHz mono PCM |
| Chunking | 30 s chunks with 5 s stride overlap (handled by the pipeline) |
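Under the hood this maps to a single transformers.js `pipeline()` call. A minimal sketch of the engine setup, assuming the values from the table above (the actual `engine-impl.ts` wiring may differ):

```ts
import { pipeline } from '@huggingface/transformers';

// Illustrative sketch, not the package's actual engine-impl.ts:
// try WebGPU first, fall back to WASM if it is unavailable or fails to init.
async function createTranscriber(modelId = 'onnx-community/whisper-tiny') {
  try {
    return await pipeline('automatic-speech-recognition', modelId, { device: 'webgpu' });
  } catch {
    return await pipeline('automatic-speech-recognition', modelId, { device: 'wasm' });
  }
}

// pcm16k: Float32Array of 16 kHz mono samples (see the "Audio input" row above).
async function run(pcm16k: Float32Array) {
  const transcriber = await createTranscriber();
  // 30 s windows with 5 s stride overlap, handled by the pipeline itself.
  return transcriber(pcm16k, { chunk_length_s: 30, stride_length_s: 5 });
}
```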
## Available models

| Key | Model | Size | English WER | Multilingual |
|---|---|---|---|---|
| `whisper-tiny` | Whisper Tiny | ~150 MB | ~5.6% | Yes (auto-detect) |
| `whisper-tiny.en` | Whisper Tiny EN | ~150 MB | ~5.6% | No (English only) |
| `whisper-base` | Whisper Base | ~290 MB | ~4.3% | Yes |
| `whisper-base.en` | Whisper Base EN | ~290 MB | ~4.3% | No |
| `whisper-small` | Whisper Small | ~950 MB | ~3.4% | Yes |

Default is `whisper-tiny` — smallest, fastest, multilingual. Users can switch in settings.
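For orientation, the registry behind these keys plausibly looks like the sketch below; the field names match the "Adding a new model" example later in this doc, but the type name and the `ramUsageMb` value are invented here.

```ts
// Hypothetical shape of src/models.ts; modelId/displayName/dtype/downloadSizeMb
// are grounded in this doc, the rest is labeled illustrative.
export interface SttModelConfig {
  modelId: string;        // HuggingFace repo, e.g. 'onnx-community/whisper-tiny'
  displayName: string;
  dtype: string;          // e.g. 'fp32', passed through to transformers.js
  downloadSizeMb: number;
  ramUsageMb: number;
}

export const MODELS: Record<string, SttModelConfig> = {
  'whisper-tiny': {
    modelId: 'onnx-community/whisper-tiny',
    displayName: 'Whisper Tiny',
    dtype: 'fp32',
    downloadSizeMb: 150,
    ramUsageMb: 400, // illustrative value, not from the docs
  },
  // ...whisper-tiny.en, whisper-base, whisper-base.en, whisper-small
};

export const DEFAULT_MODEL = 'whisper-tiny';
```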
## Architecture

Mirrors `@mana/local-llm` exactly:

```
Consumer (Svelte component)
        │
        ▼
svelte.svelte.ts — reactive status ($state), loadLocalStt(), transcribe()
        │
        ▼
engine.ts — main-thread proxy (LocalSttEngine singleton)
        │  postMessage / onmessage
        ▼
worker.ts — Web Worker entry point
        │
        ▼
engine-impl.ts — transformers.js pipeline('automatic-speech-recognition')
        │
        ▼
@huggingface/transformers — ONNX runtime (WebGPU or WASM)
```
The Web Worker isolates the heavy Whisper inference (~3-5s for 60s audio on WebGPU) from the main thread. Audio processing never blocks the UI.
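The postMessage protocol across that boundary might look like the following sketch (message and field names are hypothetical, not the package's actual types):

```ts
// Hypothetical worker protocol; the real engine.ts/worker.ts names may differ.
type ToWorker =
  | { type: 'load'; modelKey: string }
  | { type: 'transcribe'; id: number; audio: Float32Array; language?: string };

type FromWorker =
  | { type: 'progress'; progress: number }      // download progress, 0..1
  | { type: 'ready' }
  | { type: 'chunk'; id: number; text: string } // streaming partial text
  | { type: 'result'; id: number; text: string }
  | { type: 'error'; message: string };

const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' });

worker.postMessage({ type: 'load', modelKey: 'whisper-tiny' } satisfies ToWorker);
worker.onmessage = (e: MessageEvent<FromWorker>) => {
  if (e.data.type === 'progress') {
    console.log(`download ${(e.data.progress * 100).toFixed(0)}%`);
  }
};
```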
## API surface (Svelte 5 usage)

```svelte
<script lang="ts">
  import {
    getLocalSttStatus,
    loadLocalStt,
    transcribe,
    isLocalSttSupported,
    MODELS,
    DEFAULT_MODEL,
  } from '@mana/local-stt';

  const status = getLocalSttStatus();
  const supported = isLocalSttSupported();

  // Load on-demand (idempotent)
  async function start() {
    await loadLocalStt(DEFAULT_MODEL);
  }

  // Transcribe audio
  let result = $state('');
  async function handleAudio(pcm16k: Float32Array) {
    const out = await transcribe({
      audio: pcm16k,
      language: 'de',
      onChunk: (text) => { result += text; },
    });
    result = out.text;
  }
</script>

{#if !supported}
  <p>WebGPU not available.</p>
{:else if status.current.state === 'downloading'}
  <p>Downloading: {(status.current.progress * 100).toFixed(0)}%</p>
{:else if status.current.state === 'ready'}
  <button onclick={start}>Ready</button>
{/if}
```
Status union: `idle | checking | downloading | loading | ready | error` (same as `@mana/local-llm`).
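Spelled out as a TypeScript sketch (the `progress` and `message` fields are inferred from the usage above and from the error state, respectively):

```ts
// Sketch of the status union; mirrors @mana/local-llm per the line above.
type LocalSttStatus =
  | { state: 'idle' }
  | { state: 'checking' }
  | { state: 'downloading'; progress: number } // 0..1, rendered as % in the UI
  | { state: 'loading' }
  | { state: 'ready' }
  | { state: 'error'; message: string }; // field name assumed
```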
## Audio input format

The `transcribe()` function expects a `Float32Array` of 16 kHz mono PCM samples (values -1.0 to 1.0). The consumer is responsible for:

- Capturing audio (e.g. `navigator.mediaDevices.getUserMedia`)
- Extracting raw PCM from the `AudioContext`
- Resampling to 16 kHz if the mic runs at a different rate (typically 44.1/48 kHz); see the sketch below

The high-level `useLocalStt()` composable in `apps/mana/apps/web/src/lib/components/voice/use-local-stt.svelte.ts` handles all of this automatically.
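For the resampling step, a minimal linear-interpolation resampler looks like this (a sketch assuming mono input; the composable's actual implementation may differ):

```ts
// Resample mono Float32 PCM from inputRate (e.g. 44100 or 48000) to 16 kHz
// using linear interpolation between neighboring samples.
function resampleTo16k(input: Float32Array, inputRate: number): Float32Array {
  const targetRate = 16000;
  if (inputRate === targetRate) return input;
  const outLength = Math.floor((input.length * targetRate) / inputRate);
  const out = new Float32Array(outLength);
  const ratio = inputRate / targetRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```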
## High-level composable: `useLocalStt()`

Located at `apps/mana/apps/web/src/lib/components/voice/use-local-stt.svelte.ts`. Combines mic capture + resampling + transcription in one call:

```svelte
<script lang="ts">
  import { useLocalStt } from '$lib/components/voice/use-local-stt.svelte';

  const stt = useLocalStt({ language: 'de' });
  // stt.state    — 'idle' | 'loading' | 'recording' | 'transcribing'
  // stt.text     — final transcribed text
  // stt.partial  — streaming partial text (per chunk)
  // stt.error    — error message or null
  // stt.toggle() — start recording, or stop + transcribe
  // stt.cancel() — abort without transcribing
</script>

<button onclick={() => stt.toggle()}>
  {stt.state === 'recording' ? 'Stop' : 'Record'}
</button>
<p>{stt.text}</p>
```
Audio pipeline inside the composable:

```
getUserMedia (native sample rate, e.g. 48 kHz)
  → AudioContext + ScriptProcessorNode → collect Float32 chunks
  → on stop: merge all chunks + linear resample to 16 kHz mono
  → transcribe() via @mana/local-stt worker
  → text result
```
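The capture stage of that pipeline can be sketched with standard Web Audio APIs (illustrative, not the composable's exact code):

```ts
// Capture mono Float32 chunks from the mic at the device's native rate.
// ScriptProcessorNode is deprecated but still widely supported.
async function startCapture(onChunk: (chunk: Float32Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext(); // native sample rate, e.g. 48 kHz
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1); // mono in/out

  processor.onaudioprocess = (e) => {
    // Copy the buffer: the underlying memory is reused between callbacks.
    onChunk(new Float32Array(e.inputBuffer.getChannelData(0)));
  };
  source.connect(processor);
  processor.connect(ctx.destination); // some browsers need this for onaudioprocess to fire

  return {
    sampleRate: ctx.sampleRate, // needed later for the 16 kHz resample
    stop: () => {
      processor.disconnect();
      source.disconnect();
      stream.getTracks().forEach((t) => t.stop());
      void ctx.close();
    },
  };
}
```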
## UI integration

The QuickInputBar in `(app)/+layout.svelte` has a mic button (left slot) that uses `useLocalStt()`:
- Idle: Microphone icon
- Loading: Disabled, pulsing (model downloading)
- Recording: Red stop icon with pulse animation
- Transcribing: Disabled, fading
When transcription completes, the text is fed into `inputBarAdapter.onCreate()` — making it context-aware: on `/todo` it creates a task, on `/calendar` an event, on `/` it searches.
## CSP requirements

Same as `@mana/local-llm` — no new CSP rules needed. The existing config in `apps/mana/apps/web/src/hooks.server.ts` already allows:

```
script-src:  'wasm-unsafe-eval', https://cdn.jsdelivr.net, blob:
connect-src: https://huggingface.co, https://*.huggingface.co, https://*.hf.co, https://cdn.jsdelivr.net
```
## Browser cache

Models are cached in the browser Cache API under HuggingFace URLs (same as local-llm). `hasModelInCache(modelId)` probes for `config.json` to detect cached models. After first download, subsequent loads are instant.
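A probe along these lines would work; the sketch below assumes the `transformers-cache` Cache API bucket and the standard HuggingFace resolve URL layout (both assumptions, not verified against the package source):

```ts
// Check whether a model's config.json is already in the browser cache.
async function hasModelInCacheSketch(modelId: string): Promise<boolean> {
  if (!('caches' in globalThis)) return false;
  const cache = await caches.open('transformers-cache'); // assumed cache name
  const url = `https://huggingface.co/${modelId}/resolve/main/config.json`;
  return (await cache.match(url)) !== undefined;
}
```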
## Browser support

- WebGPU: Chrome/Edge 113+, Safari 18+ (fastest, ~3-5s for 60s audio)
- WASM fallback: all modern browsers (~15-20s for 60s audio)
- Requires `getUserMedia` for mic access (HTTPS or localhost)
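A feature check consistent with these requirements might look like this (illustrative, not the actual `isLocalSttSupported()` source):

```ts
// navigator.gpu signals WebGPU; mediaDevices.getUserMedia requires a
// secure context (HTTPS or localhost).
function checkSupport() {
  const webgpu = 'gpu' in navigator;
  const mic = typeof navigator.mediaDevices?.getUserMedia === 'function';
  return { webgpu, wasmOnly: !webgpu, mic };
}
```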
## Adding a new model

Add an entry to `src/models.ts`:

```ts
'whisper-medium': {
  modelId: 'onnx-community/whisper-medium',
  displayName: 'Whisper Medium',
  dtype: 'fp32',
  downloadSizeMb: 3000,
  ramUsageMb: 4000,
},
```
The model must be an ONNX build on HuggingFace with a Whisper architecture.
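Once the entry exists, the new key loads like any built-in one (the first call triggers the ~3 GB download):

```ts
import { loadLocalStt } from '@mana/local-stt';

// Idempotent: repeat calls reuse the already-loaded model.
await loadLocalStt('whisper-medium');
```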
## Relationship to existing voice features

| Component | Purpose | Uses local-stt? |
|---|---|---|
| `voiceRecorder` singleton | Record audio as Blob (webm/opus) for server transcription | No |
| `VoiceCaptureBar` | UI bar for dreams/memoro voice capture → sends to mana-stt server | No |
| `useLocalStt()` | Record + transcribe entirely on-device | Yes |
| QuickInputBar mic button | Voice-to-text for any module via `useLocalStt` | Yes |
The existing `voiceRecorder` and `VoiceCaptureBar` are still used for features that need server-side processing (e.g. dreams with server STT). `useLocalStt()` is the privacy-first alternative that never sends audio off-device.