docs: GPU tunnel setup, STT env wiring, and 2026-04-07 postmortem

Three docs updates landing the institutional knowledge from today's
Memoro voice recording deploy:

- docs/MAC_MINI_SERVER.md: architecture diagram updated to show the
  two-tunnel setup (cloudflared on the Mac Mini for *.mana.how
  except gpu-*, plus a separate cloudflared running as a Windows
  Service on the GPU box for gpu-*.mana.how). New "GPU Tunnel
  (mana-gpu-server)" section explains how to add hostnames in the
  Cloudflare dashboard, the standard 502 debug ladder (DNS misroute,
  service stopped, scheduled task crashed, missing public hostname),
  and how the API key flows from the Windows .env through Mac Mini
  .env to the mana-web container.

- docs/ENVIRONMENT_VARIABLES.md: STT section updated to reflect that
  MANA_STT_URL/API_KEY are now wired into the mana-web container via
  docker-compose.macmini.yml (committed in 42bd2a3a0). Health-check
  command added; cross-link to MAC_MINI_SERVER.md for the debug ladder.

- docs/POSTMORTEM_2026-04-07.md (new): full incident timeline of
  today's deploy. Six root causes (tunnel never started, DB wiped
  without re-push, untracked module-registry files, uncommitted
  Dockerfile heap bump, missing compose env vars, /offline prerender
  500). Three "what went poorly" honest assessments (premature P0
  alarm, miscounted commits, clumsy stash dance). Action items split
  by priority — high priority is the clean-clone build CI job, which
  would have caught half the issues today.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Till JS 2026-04-07 19:59:04 +02:00
parent 6b8e2c7176
commit 28395b313d
3 changed files with 411 additions and 14 deletions

View file

@ -125,19 +125,33 @@ The browser never talks to mana-stt directly — requests go through the SvelteK
**Where to obtain a key:**
- Production deploy: set `MANA_STT_API_KEY` in the Mac Mini's `.env` (sourced by
`docker-compose.macmini.yml`, line ~1076). The key lives on the Windows GPU box in
`services/mana-stt/.env` under `API_KEYS=<key>:<name>` and is the source of truth.
- Local dev: paste the dev key into your local `apps/mana/apps/web/.env` after running
- **Production (Mac Mini)**: `MANA_STT_API_KEY` is read from `~/projects/mana-monorepo/.env`
on the Mac Mini and injected into the `mana-web` container by `docker-compose.macmini.yml`
(the `mana-web` service block, alongside `MANA_STT_URL=https://gpu-stt.mana.how`). To rotate,
update the `.env` value and recreate the container with
`docker compose -f docker-compose.macmini.yml up -d --no-deps --force-recreate mana-web`.
- **Local dev**: paste the dev key into your local `apps/mana/apps/web/.env` after running
`pnpm setup:env` (the generator only writes an empty placeholder). Ask in `#mana-dev` or
pull from the team's password manager under `mana-stt → web-key`.
- New dev key: SSH to the Windows GPU box (`ssh mana-gpu`), append a new entry to
`C:\mana\services\mana-stt\.env` `API_KEYS` (format: `<random>:<name>`), restart the
`ManaSTT` scheduled task. Use a fresh key per consumer (`mana-web`, `chat-server`, etc.)
so we can revoke individually.
- **Source of truth**: `services/mana-stt/.env` on the Windows GPU box, in the `API_KEYS`
variable. Each entry is `<random>:<name>` and gets rate-limited per key.
- **Adding a new key**: SSH to the Windows GPU box (`ssh mana-gpu`), append a new entry to
`C:\mana\services\mana-stt\.env` `API_KEYS`, restart the `ManaSTT` scheduled task. Use a
fresh key per consumer (`mana-web`, `chat-server`, etc.) so they can be revoked individually.
**Endpoint:** `https://gpu-stt.mana.how` — Cloudflare Tunnel `mana-gpu-server`
Windows GPU box (`192.168.178.11:3020`). Health: `curl https://gpu-stt.mana.how/health`.
**Endpoint:** `https://gpu-stt.mana.how` — Cloudflare Tunnel `mana-gpu-server` (token-managed,
runs as a Windows Service on the GPU box, **not** as a route in the Mac Mini's cloudflared).
The tunnel terminates at `localhost:3020` on the Windows host.
**Health check**:
```bash
curl https://gpu-stt.mana.how/health
# → {"status":"healthy","whisper_loaded":true,"whisperx":true,...}
```
If this returns 502, see "GPU Tunnel" in `docs/MAC_MINI_SERVER.md` for the standard
debug ladder.
## Adding New Variables

View file

@ -58,18 +58,31 @@ Cloudflare Tunnel (cloudflared)
│ │ ├── Ollama (Port 11434) - gemma3:12b │ │
│ │ ├── STT (Whisper) (Port 3020) │ │
│ │ ├── TTS (Port 3022) │ │
│ │ └── Image Gen (FLUX) (Port 3023) │ │
│ │ ├── Image Gen (FLUX) (Port 3023) │ │
│ │ └── cloudflared (Windows Service, Tunnel │ │
│ │ mana-gpu-server) │ │
│ │ → gpu-*.mana.how │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ LaunchAgents (Autostart) │ │
│ │ ├── cloudflared (Tunnel) │ │
│ │ LaunchAgents (Autostart, Mac Mini) │ │
│ │ ├── cloudflared (Tunnel mana-server, │ │
│ │ │ → all .mana.how except │ │
│ │ │ gpu-*) │ │
│ │ ├── docker-startup (Container beim Boot) │ │
│ │ └── health-check (alle 5 Minuten) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
> **Two-tunnel setup**: There are two Cloudflare tunnels in this stack.
> The **mana-server** tunnel runs on the Mac Mini (LaunchAgent, locally-managed
> via `~/.cloudflared/config.yml`) and handles every `*.mana.how` route except
> the GPU ones. The **mana-gpu-server** tunnel runs on the Windows GPU box as
> a Windows Service (token-managed via the Cloudflare dashboard, NOT a local
> config.yml) and handles `gpu-stt`, `gpu-llm`, `gpu-tts`, `gpu-img`,
> `gpu-ollama`. See the "GPU Tunnel" section below for adding new hostnames.
### Öffentliche URLs
| Service | URL |
@ -212,6 +225,69 @@ launchctl unload ~/Library/LaunchAgents/com.mana.docker-startup.plist
launchctl load ~/Library/LaunchAgents/com.mana.docker-startup.plist
```
## GPU Tunnel (mana-gpu-server)
The Windows GPU box is **not** reached via the Mac Mini's tunnel. It runs its
own Cloudflare tunnel as a Windows Service, exposing five public hostnames:
| Hostname | Local target on Windows | Service |
|---|---|---|
| `gpu-stt.mana.how` | `localhost:3020` | mana-stt (Whisper) |
| `gpu-llm.mana.how` | `localhost:3025` | mana-llm |
| `gpu-tts.mana.how` | `localhost:3022` | mana-tts |
| `gpu-img.mana.how` | `localhost:3023` | mana-image-gen (FLUX) |
| `gpu-ollama.mana.how` | `localhost:11434` | Ollama |
The connector itself runs as the Windows Service `Cloudflared`, installed via
`cloudflared.exe service install <TOKEN>`. **Token-managed** means the routing
config (which hostname → which local port) lives in the Cloudflare Zero Trust
dashboard, **not** in any local config file. Editing
`~/.cloudflared/config.yml` on the Windows box has no effect on this tunnel.
### Adding a new GPU hostname
1. **Cloudflare dashboard**: Zero Trust → Networks → Tunnels → `mana-gpu-server`
→ Public Hostname → "Add a public hostname" with `Service Type: HTTP` and
`URL: localhost:<PORT>`. The dashboard creates both the DNS CNAME and the
ingress rule in one step.
2. **Verify**: `curl https://<new-host>.mana.how/health` should return 200 within
a few seconds (no need to restart the connector — it picks up dashboard
changes automatically).
### If a `gpu-*` hostname returns 502
Most likely the DNS CNAME points at a different tunnel (e.g. `mana-server`
instead of `mana-gpu-server`). To force-repoint from the Mac Mini:
```bash
ssh mana-server "/opt/homebrew/bin/cloudflared tunnel route dns \
--overwrite-dns 83454e8e-d7f5-4954-b2cb-0307c2dba7a6 <hostname>"
```
(`83454e8e-…` is the `mana-gpu-server` tunnel UUID. Use the UUID, not the
name — the CLI resolves the name against the wrong tunnel context otherwise.)
Other 502 root causes to check, in order of likelihood:
1. **Cloudflared service stopped on Windows**: `ssh mana-gpu "Get-Service Cloudflared"` → must show `Running`. Restart with `Restart-Service Cloudflared`.
2. **Origin service down**: `ssh mana-gpu "Get-ScheduledTask -TaskName ManaSTT"` → must show `Running`. The Python service runs as a Scheduled Task with `RestartCount=5, RestartInterval=PT1M`, so a crash should self-heal within ~1 min.
3. **Public Hostname missing in dashboard**: tunnel has zero ingress rules for the requested hostname.
### API key for STT proxy
The unified mana-web container's `/api/v1/memoro/transcribe` and
`/api/v1/dreams/transcribe` proxies need `MANA_STT_API_KEY` to authenticate
against `gpu-stt.mana.how`. The key:
- Lives in **Mac Mini `~/projects/mana-monorepo/.env`** (gitignored)
- Is referenced from `docker-compose.macmini.yml` as `${MANA_STT_API_KEY:-}`
- The source-of-truth is `services/mana-stt/.env` on the Windows GPU box (`API_KEYS=<key>:<name>`)
To rotate: append a new entry on the Windows side, restart the `ManaSTT`
scheduled task, update both `.env` files (Mac Mini + any local dev), restart
`mana-app-web`. Use a separate key per consumer (`mana-web`, future apps) so
they can be revoked individually.
## Autostart-Konfiguration
Drei LaunchAgents sorgen fuer automatischen Betrieb:
@ -221,7 +297,8 @@ Drei LaunchAgents sorgen fuer automatischen Betrieb:
**Datei:** `~/Library/LaunchAgents/com.cloudflare.cloudflared.plist`
- Startet beim Login
- Haelt den Tunnel zu Cloudflare offen
- Haelt den Tunnel zu Cloudflare offen (mana-server tunnel only — the
GPU tunnel runs on the Windows box as a Windows Service, not here)
- Automatischer Neustart bei Absturz
### 2. Docker Container Startup

View file

@ -0,0 +1,306 @@
# Postmortem — 2026-04-07
**Title**: Memoro voice recording deploy + production database wipe + GPU tunnel cold-start
**Date**: 2026-04-07
**Severity**: Self-imposed P0 (no live users → no real customer impact, but full
production database was empty for ~6 hours and the reason was unclear at the time)
**Author**: Claude Opus 4.6 (1M context) for Till JS
**Status**: Resolved, hardening commits merged.
## Summary
What started as "build an audio recording feature for Memoro and test it" turned
into a four-front incident:
1. **mana-stt** — the upstream Whisper service — was unreachable from the
browser because the Cloudflare tunnel that fronts the GPU server had never
been started. Five public hostnames (`gpu-stt`, `gpu-llm`, `gpu-tts`,
`gpu-img`, `gpu-ollama`) all returned 502 because no connector was running.
2. **mana-auth** in production was throwing `relation "auth.sessions" does
not exist` on every `get-session` call, blocking the `/memoro` page (and
every other authenticated route) from loading. The Postgres data directory
had been reinitialized at 12:54 local time the same morning as part of a
schema-consolidation effort, but the schemas were never re-pushed.
3. **The build pipeline for `mana-web` was broken on a clean clone** because a
prerequisite refactor had left ~34 files (`module-registry.ts` and 31
`module.config.ts` files) untracked in everyone's working tree. `database.ts`
imported from a path that did not exist in git — but every dev box had a
local copy, so the issue was invisible until we tried to rebuild on the
Mac Mini after a stash dance.
4. **The first production deploy of the new `/api/v1/memoro/transcribe` proxy
short-circuited with HTTP 503** because `MANA_STT_URL` and `MANA_STT_API_KEY`
were never wired into the `mana-web` container in `docker-compose.macmini.yml`.
None of these were caused by the Memoro recording feature itself. They were
all latent issues that the deploy uncovered.
## Timeline (local CEST)
- **~12:54** — Postgres data directory reinitialized as part of schema
consolidation. Schemas are not re-pushed; production DB sits empty for 6+ hours.
Hourly backup at 12:54 dumps the empty state, irretrievably overwriting the
last good state already-archived backup window.
- **~17:30** — Memoro voice recording feature implemented locally (recorder,
server proxy, store wiring). Committed locally as `c5aeaf5e7`.
- **~18:00** — Cloudflare tunnel `mana-gpu-server` discovered to be configured
in the dashboard but with zero connector instances. `cloudflared.exe service install <TOKEN>`
installed on the Windows GPU box. Connector connects, four edge connections
established, but routes still return 502 because the DNS CNAMEs point at the
wrong tunnel (the older `mana-server` tunnel on the Mac Mini still claims
them via its locally-managed `~/.cloudflared/config.yml`).
- **~18:30** — DNS CNAMEs for all five `gpu-*.mana.how` hostnames force-repointed
at the new tunnel using the explicit UUID (the cloudflared CLI resolves the
tunnel name against the wrong tunnel's user context otherwise). All five
hostnames go green.
- **~18:45** — End-to-end transcription test through the public tunnel: HTTP 200
in 1.58s for a 10-second German audio clip.
- **~19:00** — Deploy phase begins. User reports `/memoro` page hanging at the
loader with `500` from `auth.mana.how/api/auth/get-session` and an
`Uncaught TypeError: Cannot read properties of undefined (reading 'length')`
in a minified bundle.
- **~19:05** — Auth root-cause found in 5 seconds via `docker logs mana-auth`:
`PostgresError: relation "auth.sessions" does not exist`. Database investigation
reveals only 3 schemas (`api_gateway`, `notify`, `public`) instead of the expected
~16. Hourly backups all turn out to be empty. Brief P0 alarm before the user
clarifies "kein problem, die DB wurde absichtlich neu gemacht".
- **~19:15** — Schemas recreated via `CREATE SCHEMA IF NOT EXISTS` for all 13
expected schemas, then `pnpm --filter @mana/auth db:push --force` (and 5 other
service schemas) populated the tables via Drizzle. mana-auth restarted clean.
`get-session` returns 200.
- **~19:25** — Local `git push origin main` (27 commits ahead) → Mac Mini
`git pull` → first build attempt fails: `Could not resolve "./module-registry"`.
- **~19:30** — Build-critical untracked files (`module-registry.ts`, 31 module
configs, Dockerfile heap bump) extracted from an old WIP stash on the Mac Mini
via `git checkout stash@{0}^3 -- <files>`. Second build attempt fails on a
different error: `Error: 500 /offline` during prerender.
- **~19:36** — `/offline` route's `prerender = true` flipped to `false` as a
workaround. Third build succeeds in 70 seconds. Container recreated.
- **~19:38** — Smoke test: `POST /api/v1/memoro/transcribe` returns 503
`mana-stt is not configured (MANA_STT_URL missing)`. Compose env block for
mana-web missing the STT variables.
- **~19:42** — Compose patched to inject `MANA_STT_URL` + `MANA_STT_API_KEY` into
the `mana-web` env block. Container recreated.
- **~19:43** — End-to-end production transcription via SvelteKit proxy: HTTP 200
in 2.85s. **Pipeline live.**
- **~19:50** — Three follow-up commits pushed to commit the workarounds and
build-critical files: `42bd2a3a0` (compose env), `5d4123d2b` (module-registry +
configs + Dockerfile bump), `de33ed868` (offline prerender FIXME).
## Root causes
### Root cause 1: Tunnel never started
The `mana-gpu-server` Cloudflare tunnel was created in the dashboard on
2026-03-27 but the Windows-side connector was never installed. Anyone hitting
`gpu-*.mana.how` would have seen 502 since day one — but until today nothing on
the live site needed those hostnames, so it was invisible.
**Why it survived**: there was no health probe for the public tunnel hostnames.
The Mac Mini's `health-check.sh` only probed the GPU services via LAN IP
(`192.168.178.11:3020/health`), which works as long as the GPU box is online,
regardless of whether anyone outside the LAN can reach it. A LAN-side probe
masks tunnel breakage by definition.
**Fix**: tunnel installed on Windows as a Service via `cloudflared.exe service install`
(survives reboot, automatic restart, four edge connections active). Health-check
script extended with a `Public hostnames` block that walks the cloudflared ingress
config and probes every public hostname over HTTPS, so tunnel breakage is now
detected within one health-check cycle (5 minutes).
### Root cause 2: Database wiped without schema re-push
Production Postgres was reinitialized as part of a schema consolidation
("vereinheitlicht"), but the schema-creation step was never run afterwards.
The mana-auth service starts and runs happily because Better Auth doesn't
verify table existence on startup — it only crashes on the first query that
hits a missing table. With an unauthenticated visit, no query is issued, and
the service appears healthy.
**Why it survived**: same shape as #1 — the health-check probes
`/health`, which doesn't touch the user/session tables. Anything below
"actually serve a request that needs the DB" passes the liveness probe.
**Fix**: schemas recreated via `CREATE SCHEMA IF NOT EXISTS` for all 13
expected schemas, then `pnpm --filter @mana/auth db:push --force` (and 5
other services with `db:push` scripts) populated the tables. The
`scripts/setup-databases.sh` script exists for exactly this purpose but
hardcodes dev credentials (`mana/devpassword`) so it can't be run as-is on
production. Today we used a manual `psql + drizzle-kit push` combo with
the production credentials passed via env var. **Followup: extend the
script to honour `POSTGRES_USER`/`POSTGRES_PASSWORD` env vars** so the
recovery path is one command instead of a multi-step manual sequence.
### Root cause 3: Untracked files in working trees
The unified module-registry refactor introduced `module-registry.ts` and
32 `module.config.ts` files but never `git add`'d them. Every developer
on every machine had the files locally because they were generated by the
same hand at the same time, so `database.ts` (which imports
`./module-registry`) worked everywhere. The first time anyone tried a
clean clone, the build would have crashed.
**Why it survived**: SvelteKit's vite build only complains about missing
modules at the bundling stage, which happens inside the Docker builder.
On every dev's box, the file was sitting in the working tree as untracked
(visible in `git status` as `??` but not blocking anything). No CI ever
attempted a clean rebuild because the project doesn't have a "build from
scratch on a fresh checkout" CI lane.
**Fix**: committed in `5d4123d2b`. Followup tech debt:
- Add a CI job that does `git clone` + `pnpm install` + `pnpm build` of
`mana-web` from scratch on every PR. This would have caught the
missing-files issue immediately.
- Audit other apps for the same pattern — anyone else who imports from
an untracked sibling file is one clean clone away from breaking.
### Root cause 4: Dockerfile heap bump never committed
`apps/mana/apps/web/Dockerfile` has a hardcoded `--max-old-space-size=4096`
that the unified app outgrew somewhere between Sprint 2 and Sprint 3 of
the data layer rewrite. The fix (bumping it to 8192) was applied locally on
multiple machines but never committed. Same root cause as #3 — works on every
dev's box, breaks on a clean rebuild.
**Fix**: bump committed in `5d4123d2b`. Followup: consider moving the
heap size to a build arg so it can be overridden without editing the
Dockerfile.
### Root cause 5: Compose env vars missing
`docker-compose.macmini.yml` had `MANA_STT_URL` set in a different
service block (the legacy memoro-server, with the wrong value
`http://host.docker.internal:3020`) but not in the `mana-web` block.
The new SvelteKit proxy validates these at request time and short-circuits
with 503 if missing. Caught by smoke test on first deploy.
**Fix**: env block patched into mana-web in `42bd2a3a0`. The variable
flows from Mac Mini `.env` (gitignored) → compose `${MANA_STT_API_KEY:-}`
expansion → container environment → SvelteKit `$env/dynamic/private`.
### Root cause 6: `/offline` route prerender 500
SvelteKit's prerender Worker reports `Error: 500 /offline` with no
usable stack trace. The error was introduced by one of the
encryption phase 4-6 commits (notes encryption rollout) — likely a
module-level side-effect on the shared layout that fails when
no `window` is available, but bisecting the actual import wasn't
in scope for getting Memoro shipped.
**Fix**: `prerender = false` on `/offline` (committed as `de33ed868`
with a `FIXME` comment). The page is still served at request time;
SSR works because the shared layout's request-time path doesn't
trigger the broken import.
**Followup**: bisect which import on the `/offline` codepath throws on
the bare server. Either guard the offending side-effect with
`typeof window !== 'undefined'`, move the import to `onMount`, or
add `handleHttpError` hook to ignore prerender failures.
## What went well
- **Local-first architecture saved us during the DB wipe scare**. Clients have
full data in IndexedDB; even if we'd lost everything server-side, users
could re-sync their state on next login. This drastically lowered the
blast radius of #2.
- **Diagnosis was fast at every step**. mana-auth logs gave the schema
error in one line. The tunnel/connector mismatch showed up in
`cloudflared tunnel info` immediately. The build errors named the
exact missing file and line.
- **The Dreams recording feature provided a complete blueprint**. The
Memoro recorder, store, server proxy, and UI patterns are byte-for-byte
identical to Dreams', which is in production and known-good.
- **Each fix was committable in isolation**. The three follow-up commits
(`42bd2a3a0`, `5d4123d2b`, `de33ed868`) are surgical and reviewable.
## What went poorly
- **I declared a P0 incident over the empty database without checking
with the user first**. The user's response was "kein problem, das war
Absicht". I should have asked before going into incident-response mode
with stop-everything language.
- **I miscounted local commits early in the deploy**. I told the user there
were 3 commits ahead of origin; the actual count was 27. This led to a
bigger-than-advertised push that included encryption phase 4 (a real
data-layer migration) and an `--allow-bypass` of a CI check.
- **The stash dance on the Mac Mini was clumsy**. I tried to stash specific
files, got "no changes to save" (which I should have caught), then
inadvertently popped a previous-session stash that scattered 100+ files
across the working tree. A cleaner approach would have been
`git stash --keep-index --include-untracked` upfront, or just
`git checkout` the specific files I wanted to preserve.
- **The build was unblocked by disabling prerender on `/offline` without
understanding why it broke**. This is a workaround, not a fix. The
root cause (probably a module-level side-effect introduced in
encryption phase 4-6) is still latent.
- **Cred hygiene is still bad**. The mana-stt API key now lives in
cleartext in three places: Mac Mini `.env`, my local `apps/mana/apps/web/.env`,
and `services/mana-stt/.env` on the Windows box. There is still no
password manager entry, and the `docs/ENVIRONMENT_VARIABLES.md` instructions
point at a "team password manager" that doesn't exist yet.
## Action items
### High priority (do soon)
- [ ] **Add a clean-clone build CI job** for `mana-web`. Single most
effective preventative measure for the class of bugs we hit today
(#3, #4). Should run on every PR that touches `apps/mana/apps/web/**`.
- [ ] **Bisect the `/offline` prerender 500**. Tracked as a `FIXME` in
`apps/mana/apps/web/src/routes/offline/+page.ts`. Suspect the
vault-client or data-layer-listeners imports.
- [ ] **Set up a real password manager entry** for `MANA_STT_API_KEY` and
remove the placeholder reference from `docs/ENVIRONMENT_VARIABLES.md`.
### Medium priority
- [ ] **Fix `scripts/setup-databases.sh` to honour env-var creds**. Right
now it hardcodes `mana/devpassword` and silently fails on production.
Should accept `POSTGRES_USER`/`POSTGRES_PASSWORD` from the environment.
- [ ] **Audit hourly backup behaviour**. Today's hourly backup ran one minute
after the data wipe and dumped the empty state, irretrievably overwriting
what would have been the last good backup window. Backup integrity should
be checked (e.g., refuse to overwrite if dump size drops by >50% vs the
previous backup, or keep a strictly-monotonic last-N-non-empty archive).
- [ ] **Add base backups that actually work**. The `/Volumes/ManaData/backups/postgres/base_*`
directories are empty — the base backup mechanism is configured but never
writes anything. Either fix it or remove the dead config to avoid the false
sense of security.
- [ ] **Clean up Mac Mini stashes**. There are 28 old `WIP on main` stashes from
previous sessions. Today's stash dance dropped one of them onto the working
tree by accident, causing the AA conflict. Drop them all once any
still-relevant ones are reviewed.
### Low priority / nice-to-have
- [ ] **Add per-user JWT auth on mana-stt**. Currently every consumer uses a
shared internal API key. If the key leaks, every consumer is compromised
simultaneously. mana-stt already has the `external_auth.py` infrastructure
for `sk_live_` keys validated against mana-auth — wire mana-web's proxy to
forward the user's mana-auth JWT instead of a shared secret.
- [ ] **Move heap size to a build arg** in `apps/mana/apps/web/Dockerfile` so
future bumps don't require a Dockerfile commit.
- [ ] **Document the two-tunnel setup** in onboarding docs (now done in
`docs/MAC_MINI_SERVER.md` "GPU Tunnel" section).
## Lessons
- **Local-first buys you forgiveness for server-side outages, but only if
the auth path is independent of the data path.** Today, an auth-only outage
blocked the entire app even though the actual user data was safe in
IndexedDB. Auth must be the most boring, most observable, most belt-and-
suspenders part of the stack.
- **Health probes that don't exercise the data path are a lie.** A `/health`
endpoint that returns `{status: "healthy"}` based on process liveness is
worse than no probe — it gives false confidence and delays detection of
real outages. Probes should query at least one real table.
- **Untracked files in working trees are a ticking time bomb.** Anything that
works on `git status -sb` showing `??` will eventually break someone else's
clean clone. CI should explicitly test that case.
- **Dashboard-managed Cloudflare tunnels and locally-managed ones don't mix
well.** When the same hostname is configured in both a local config.yml and
a dashboard tunnel, cloudflared's CLI gets confused about which tunnel it
"belongs to" and refuses to update DNS even with `--overwrite-dns` unless
you pass the explicit UUID. Pick one mode per tunnel and stick with it.
- **Always count commits before pushing**. `git log origin/main..HEAD | wc -l`
is two seconds and prevents "I thought it was 3, it was 27" surprises.