# @mana/local-stt — Browser-Local Speech-to-Text
Client-side speech-to-text that runs entirely in the user's browser via WebGPU (with a WASM fallback). No server roundtrips, no API keys, no audio leaving the device. Uses OpenAI's Whisper models through `@huggingface/transformers` v4 — the same library that powers `@mana/local-llm`.

Don't confuse this with the server-side STT (`services/mana-stt`), which runs Whisper on the GPU server (RTX 3090). This package is the only STT path that keeps audio on the user's device.
## What's in the box

| Field | Value |
|---|---|
| Engine library | `@huggingface/transformers` v4 (transformers.js) |
| Backend | WebGPU (primary), WASM (fallback) |
| Default model | `onnx-community/whisper-tiny` (~150 MB, multilingual) |
| Pipeline | `automatic-speech-recognition` (Whisper encoder-decoder) |
| Audio input | `Float32Array`, 16 kHz mono PCM |
| Chunking | 30 s chunks with 5 s stride overlap (handled by the pipeline) |
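Under the hood this maps to a single transformers.js `pipeline()` call. A minimal sketch of the engine setup, assuming the values from the table above (the actual `engine-impl.ts` wiring may differ):

```ts
import { pipeline } from '@huggingface/transformers';

// Illustrative sketch, not the package's actual engine-impl.ts:
// try WebGPU first, fall back to WASM if it is unavailable or fails to init.
async function createTranscriber(modelId = 'onnx-community/whisper-tiny') {
  try {
    return await pipeline('automatic-speech-recognition', modelId, { device: 'webgpu' });
  } catch {
    return await pipeline('automatic-speech-recognition', modelId, { device: 'wasm' });
  }
}

// pcm16k: Float32Array of 16 kHz mono samples (see the "Audio input" row above).
async function run(pcm16k: Float32Array) {
  const transcriber = await createTranscriber();
  // 30 s windows with 5 s stride overlap, handled by the pipeline itself.
  return transcriber(pcm16k, { chunk_length_s: 30, stride_length_s: 5 });
}
```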
## Available models

| Key | Model | Size | English WER | Multilingual |
|---|---|---|---|---|
| `whisper-tiny` | Whisper Tiny | ~150 MB | ~5.6% | Yes (auto-detect) |
| `whisper-tiny.en` | Whisper Tiny EN | ~150 MB | ~5.6% | No (English only) |
| `whisper-base` | Whisper Base | ~290 MB | ~4.3% | Yes |
| `whisper-base.en` | Whisper Base EN | ~290 MB | ~4.3% | No |
| `whisper-small` | Whisper Small | ~950 MB | ~3.4% | Yes |

Default is `whisper-tiny` — smallest, fastest, multilingual. Users can switch in settings.
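For orientation, the registry behind these keys plausibly looks like the sketch below; the field names match the "Adding a new model" example later in this doc, but the type name and the `ramUsageMb` value are invented here.

```ts
// Hypothetical shape of src/models.ts; modelId/displayName/dtype/downloadSizeMb
// are grounded in this doc, the rest is labeled illustrative.
export interface SttModelConfig {
  modelId: string;        // HuggingFace repo, e.g. 'onnx-community/whisper-tiny'
  displayName: string;
  dtype: string;          // e.g. 'fp32', passed through to transformers.js
  downloadSizeMb: number;
  ramUsageMb: number;
}

export const MODELS: Record<string, SttModelConfig> = {
  'whisper-tiny': {
    modelId: 'onnx-community/whisper-tiny',
    displayName: 'Whisper Tiny',
    dtype: 'fp32',
    downloadSizeMb: 150,
    ramUsageMb: 400, // illustrative value, not from the docs
  },
  // ...whisper-tiny.en, whisper-base, whisper-base.en, whisper-small
};

export const DEFAULT_MODEL = 'whisper-tiny';
```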
## Architecture

Mirrors `@mana/local-llm` exactly:

```
Consumer (Svelte component)
        │
        ▼
svelte.svelte.ts — reactive status ($state), loadLocalStt(), transcribe()
        │
        ▼
engine.ts — main-thread proxy (LocalSttEngine singleton)
        │  postMessage / onmessage
        ▼
worker.ts — Web Worker entry point
        │
        ▼
engine-impl.ts — transformers.js pipeline('automatic-speech-recognition')
        │
        ▼
@huggingface/transformers — ONNX runtime (WebGPU or WASM)
```
The Web Worker isolates the heavy Whisper inference (~3-5s for 60s audio on WebGPU) from the main thread. Audio processing never blocks the UI.
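The postMessage protocol across that boundary might look like the following sketch (message and field names are hypothetical, not the package's actual types):

```ts
// Hypothetical worker protocol; the real engine.ts/worker.ts names may differ.
type ToWorker =
  | { type: 'load'; modelKey: string }
  | { type: 'transcribe'; id: number; audio: Float32Array; language?: string };

type FromWorker =
  | { type: 'progress'; progress: number }      // download progress, 0..1
  | { type: 'ready' }
  | { type: 'chunk'; id: number; text: string } // streaming partial text
  | { type: 'result'; id: number; text: string }
  | { type: 'error'; message: string };

const worker = new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' });

worker.postMessage({ type: 'load', modelKey: 'whisper-tiny' } satisfies ToWorker);
worker.onmessage = (e: MessageEvent<FromWorker>) => {
  if (e.data.type === 'progress') {
    console.log(`download ${(e.data.progress * 100).toFixed(0)}%`);
  }
};
```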
## API surface (Svelte 5 usage)

```svelte
<script lang="ts">
  import {
    getLocalSttStatus,
    loadLocalStt,
    transcribe,
    isLocalSttSupported,
    MODELS,
    DEFAULT_MODEL,
  } from '@mana/local-stt';

  const status = getLocalSttStatus();
  const supported = isLocalSttSupported();

  // Load on-demand (idempotent)
  async function start() {
    await loadLocalStt(DEFAULT_MODEL);
  }

  // Transcribe audio
  let result = $state('');
  async function handleAudio(pcm16k: Float32Array) {
    const out = await transcribe({
      audio: pcm16k,
      language: 'de',
      onChunk: (text) => { result += text; },
    });
    result = out.text;
  }
</script>

{#if !supported}
  <p>WebGPU not available.</p>
{:else if status.current.state === 'downloading'}
  <p>Downloading: {(status.current.progress * 100).toFixed(0)}%</p>
{:else if status.current.state === 'ready'}
  <button onclick={start}>Ready</button>
{/if}
```
Status union: `idle | checking | downloading | loading | ready | error` (same as `@mana/local-llm`).
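Spelled out as a TypeScript sketch (the `progress` and `message` fields are inferred from the usage above and from the error state, respectively):

```ts
// Sketch of the status union; mirrors @mana/local-llm per the line above.
type LocalSttStatus =
  | { state: 'idle' }
  | { state: 'checking' }
  | { state: 'downloading'; progress: number } // 0..1, rendered as % in the UI
  | { state: 'loading' }
  | { state: 'ready' }
  | { state: 'error'; message: string }; // field name assumed
```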
## Audio input format

The `transcribe()` function expects a `Float32Array` of 16 kHz mono PCM samples (values -1.0 to 1.0). The consumer is responsible for:

- Capturing audio (e.g. `navigator.mediaDevices.getUserMedia`)
- Extracting raw PCM from the `AudioContext`
- Resampling to 16 kHz if the mic runs at a different rate (typically 44.1/48 kHz); see the sketch below

The high-level `useLocalStt()` composable in `apps/mana/apps/web/src/lib/components/voice/use-local-stt.svelte.ts` handles all of this automatically.
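For the resampling step, a minimal linear-interpolation resampler looks like this (a sketch assuming mono input; the composable's actual implementation may differ):

```ts
// Resample mono Float32 PCM from inputRate (e.g. 44100 or 48000) to 16 kHz
// using linear interpolation between neighboring samples.
function resampleTo16k(input: Float32Array, inputRate: number): Float32Array {
  const targetRate = 16000;
  if (inputRate === targetRate) return input;
  const outLength = Math.floor((input.length * targetRate) / inputRate);
  const out = new Float32Array(outLength);
  const ratio = inputRate / targetRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```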
## High-level composable: `useLocalStt()`

Located at `apps/mana/apps/web/src/lib/components/voice/use-local-stt.svelte.ts`. Combines mic capture + resampling + transcription in one call:

```svelte
<script lang="ts">
  import { useLocalStt } from '$lib/components/voice/use-local-stt.svelte';

  const stt = useLocalStt({ language: 'de' });
  // stt.state    — 'idle' | 'loading' | 'recording' | 'transcribing'
  // stt.text     — final transcribed text
  // stt.partial  — streaming partial text (per chunk)
  // stt.error    — error message or null
  // stt.toggle() — start recording, or stop + transcribe
  // stt.cancel() — abort without transcribing
</script>

<button onclick={() => stt.toggle()}>
  {stt.state === 'recording' ? 'Stop' : 'Record'}
</button>
<p>{stt.text}</p>
```
Audio pipeline inside the composable:

```
getUserMedia (native sample rate, e.g. 48 kHz)
  → AudioContext + ScriptProcessorNode → collect Float32 chunks
  → on stop: merge all chunks + linear resample to 16 kHz mono
  → transcribe() via @mana/local-stt worker
  → text result
```
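The capture stage of that pipeline can be sketched with standard Web Audio APIs (illustrative, not the composable's exact code):

```ts
// Capture mono Float32 chunks from the mic at the device's native rate.
// ScriptProcessorNode is deprecated but still widely supported.
async function startCapture(onChunk: (chunk: Float32Array) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext(); // native sample rate, e.g. 48 kHz
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1); // mono in/out

  processor.onaudioprocess = (e) => {
    // Copy the buffer: the underlying memory is reused between callbacks.
    onChunk(new Float32Array(e.inputBuffer.getChannelData(0)));
  };
  source.connect(processor);
  processor.connect(ctx.destination); // some browsers need this for onaudioprocess to fire

  return {
    sampleRate: ctx.sampleRate, // needed later for the 16 kHz resample
    stop: () => {
      processor.disconnect();
      source.disconnect();
      stream.getTracks().forEach((t) => t.stop());
      void ctx.close();
    },
  };
}
```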
## UI integration

The QuickInputBar in `(app)/+layout.svelte` has a mic button (left slot) that uses `useLocalStt()`:
- Idle: Microphone icon
- Loading: Disabled, pulsing (model downloading)
- Recording: Red stop icon with pulse animation
- Transcribing: Disabled, fading
When transcription completes, the text is fed into `inputBarAdapter.onCreate()` — making it context-aware: on `/todo` it creates a task, on `/calendar` an event, on `/` it searches.
## CSP requirements

Same as `@mana/local-llm` — no new CSP rules needed. The existing config in `apps/mana/apps/web/src/hooks.server.ts` already allows:

```
script-src:  'wasm-unsafe-eval', https://cdn.jsdelivr.net, blob:
connect-src: https://huggingface.co, https://*.huggingface.co, https://*.hf.co, https://cdn.jsdelivr.net
```
## Browser cache

Models are cached in the browser Cache API under HuggingFace URLs (same as local-llm). `hasModelInCache(modelId)` probes for `config.json` to detect cached models. After first download, subsequent loads are instant.
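A probe along these lines would work; the sketch below assumes the `transformers-cache` Cache API bucket and the standard HuggingFace resolve URL layout (both assumptions, not verified against the package source):

```ts
// Check whether a model's config.json is already in the browser cache.
async function hasModelInCacheSketch(modelId: string): Promise<boolean> {
  if (!('caches' in globalThis)) return false;
  const cache = await caches.open('transformers-cache'); // assumed cache name
  const url = `https://huggingface.co/${modelId}/resolve/main/config.json`;
  return (await cache.match(url)) !== undefined;
}
```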
## Browser support

- WebGPU: Chrome/Edge 113+, Safari 18+ (fastest, ~3-5s for 60s audio)
- WASM fallback: all modern browsers (~15-20s for 60s audio)
- Requires `getUserMedia` for mic access (HTTPS or localhost)
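A feature check consistent with these requirements might look like this (illustrative, not the actual `isLocalSttSupported()` source):

```ts
// navigator.gpu signals WebGPU; mediaDevices.getUserMedia requires a
// secure context (HTTPS or localhost).
function checkSupport() {
  const webgpu = 'gpu' in navigator;
  const mic = typeof navigator.mediaDevices?.getUserMedia === 'function';
  return { webgpu, wasmOnly: !webgpu, mic };
}
```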
## Adding a new model

Add an entry to `src/models.ts`:

```ts
'whisper-medium': {
  modelId: 'onnx-community/whisper-medium',
  displayName: 'Whisper Medium',
  dtype: 'fp32',
  downloadSizeMb: 3000,
  ramUsageMb: 4000,
},
```
The model must be an ONNX build on HuggingFace with a Whisper architecture.
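Once the entry exists, the new key loads like any built-in one (the first call triggers the ~3 GB download):

```ts
import { loadLocalStt } from '@mana/local-stt';

// Idempotent: repeat calls reuse the already-loaded model.
await loadLocalStt('whisper-medium');
```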
## Relationship to existing voice features

| Component | Purpose | Uses local-stt? |
|---|---|---|
| `voiceRecorder` singleton | Record audio as Blob (webm/opus) for server transcription | No |
| `VoiceCaptureBar` | UI bar for dreams/memoro voice capture → sends to mana-stt server | No |
| `useLocalStt()` | Record + transcribe entirely on-device | Yes |
| QuickInputBar mic button | Voice-to-text for any module via `useLocalStt` | Yes |
The existing `voiceRecorder` and `VoiceCaptureBar` are still used for features that need server-side processing (e.g. dreams with server STT). `useLocalStt()` is the privacy-first alternative that never sends audio off-device.