mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-26 15:14:39 +02:00
Root file cleanup:
- mac-mini-setup.sh → scripts/mac-mini/bootstrap.sh (first-time bootstrap
belongs next to the other mac-mini setup-* scripts)
- test-chat-auth.sh → scripts/test-chat-auth.sh (ad-hoc smoke test, no
reason to live in the repo root)
- cloudflared-config.yml stays in root on purpose — it's the single source
of truth read by scripts/mac-mini/setup-*.sh and scripts/check-status.sh.
Docs:
- docs/POSTMORTEM_2026-04-07.md → docs/postmortems/2026-04-07-memoro-deploy-prod-wipe.md
(creates the postmortems/ home for future entries; descriptive name)
- docs/future/MAIL_SERVER_MAC_MINI_TEMP.md deleted — what it described
("Bereit zur Umsetzung", Stalwart on Mac Mini) is what's actually
running today, documented in docs/MAIL_SERVER.md. The DEDICATED variant
in docs/future/ remains since it's still a real future plan.
Root CLAUDE.md fix:
- @mana/local-store description was wrong — claimed it was legacy/standalone
only, but it's still used by apps/mana/apps/web itself, plus manavoxel,
arcade, and three shared packages.
Not touched (flagged for follow-up):
- NewAppIdeas/ (344K of "Roblox Reimagined" planning notes in repo root) —
user decision: archive externally or move under docs/future/
- Doc giants (PROJECT_OVERVIEW 41k, MATRIX_BOT_ARCHITECTURE 36k, etc.) —
splitting them is its own refactor
- Service CLAUDE.md staleness audit across 18 services — too broad for
this pass
306 lines
17 KiB
Markdown
306 lines
17 KiB
Markdown
# Postmortem — 2026-04-07
|
|
|
|
**Title**: Memoro voice recording deploy + production database wipe + GPU tunnel cold-start
|
|
**Date**: 2026-04-07
|
|
**Severity**: Self-imposed P0 (no live users → no real customer impact, but full
|
|
production database was empty for ~6 hours and the reason was unclear at the time)
|
|
**Author**: Claude Opus 4.6 (1M context) for Till JS
|
|
**Status**: Resolved, hardening commits merged.
|
|
|
|
## Summary
|
|
|
|
What started as "build an audio recording feature for Memoro and test it" turned
|
|
into a four-front incident:
|
|
|
|
1. **mana-stt** — the upstream Whisper service — was unreachable from the
|
|
browser because the Cloudflare tunnel that fronts the GPU server had never
|
|
been started. Five public hostnames (`gpu-stt`, `gpu-llm`, `gpu-tts`,
|
|
`gpu-img`, `gpu-ollama`) all returned 502 because no connector was running.
|
|
2. **mana-auth** in production was throwing `relation "auth.sessions" does
|
|
not exist` on every `get-session` call, blocking the `/memoro` page (and
|
|
every other authenticated route) from loading. The Postgres data directory
|
|
had been reinitialized at 12:54 local time the same morning as part of a
|
|
schema-consolidation effort, but the schemas were never re-pushed.
|
|
3. **The build pipeline for `mana-web` was broken on a clean clone** because a
|
|
prerequisite refactor had left ~34 files (`module-registry.ts` and 31
|
|
`module.config.ts` files) untracked in everyone's working tree. `database.ts`
|
|
imported from a path that did not exist in git — but every dev box had a
|
|
local copy, so the issue was invisible until we tried to rebuild on the
|
|
Mac Mini after a stash dance.
|
|
4. **The first production deploy of the new `/api/v1/memoro/transcribe` proxy
|
|
short-circuited with HTTP 503** because `MANA_STT_URL` and `MANA_STT_API_KEY`
|
|
were never wired into the `mana-web` container in `docker-compose.macmini.yml`.
|
|
|
|
None of these were caused by the Memoro recording feature itself. They were
|
|
all latent issues that the deploy uncovered.
|
|
|
|
## Timeline (local CEST)
|
|
|
|
- **~12:54** — Postgres data directory reinitialized as part of schema
|
|
consolidation. Schemas are not re-pushed; production DB sits empty for 6+ hours.
|
|
Hourly backup at 12:54 dumps the empty state, irretrievably overwriting the
|
|
last good state already-archived backup window.
|
|
- **~17:30** — Memoro voice recording feature implemented locally (recorder,
|
|
server proxy, store wiring). Committed locally as `c5aeaf5e7`.
|
|
- **~18:00** — Cloudflare tunnel `mana-gpu-server` discovered to be configured
|
|
in the dashboard but with zero connector instances. `cloudflared.exe service install <TOKEN>`
|
|
installed on the Windows GPU box. Connector connects, four edge connections
|
|
established, but routes still return 502 because the DNS CNAMEs point at the
|
|
wrong tunnel (the older `mana-server` tunnel on the Mac Mini still claims
|
|
them via its locally-managed `~/.cloudflared/config.yml`).
|
|
- **~18:30** — DNS CNAMEs for all five `gpu-*.mana.how` hostnames force-repointed
|
|
at the new tunnel using the explicit UUID (the cloudflared CLI resolves the
|
|
tunnel name against the wrong tunnel's user context otherwise). All five
|
|
hostnames go green.
|
|
- **~18:45** — End-to-end transcription test through the public tunnel: HTTP 200
|
|
in 1.58s for a 10-second German audio clip.
|
|
- **~19:00** — Deploy phase begins. User reports `/memoro` page hanging at the
|
|
loader with `500` from `auth.mana.how/api/auth/get-session` and an
|
|
`Uncaught TypeError: Cannot read properties of undefined (reading 'length')`
|
|
in a minified bundle.
|
|
- **~19:05** — Auth root-cause found in 5 seconds via `docker logs mana-auth`:
|
|
`PostgresError: relation "auth.sessions" does not exist`. Database investigation
|
|
reveals only 3 schemas (`api_gateway`, `notify`, `public`) instead of the expected
|
|
~16. Hourly backups all turn out to be empty. Brief P0 alarm before the user
|
|
clarifies "kein problem, die DB wurde absichtlich neu gemacht".
|
|
- **~19:15** — Schemas recreated via `CREATE SCHEMA IF NOT EXISTS` for all 13
|
|
expected schemas, then `pnpm --filter @mana/auth db:push --force` (and 5 other
|
|
service schemas) populated the tables via Drizzle. mana-auth restarted clean.
|
|
`get-session` returns 200.
|
|
- **~19:25** — Local `git push origin main` (27 commits ahead) → Mac Mini
|
|
`git pull` → first build attempt fails: `Could not resolve "./module-registry"`.
|
|
- **~19:30** — Build-critical untracked files (`module-registry.ts`, 31 module
|
|
configs, Dockerfile heap bump) extracted from an old WIP stash on the Mac Mini
|
|
via `git checkout stash@{0}^3 -- <files>`. Second build attempt fails on a
|
|
different error: `Error: 500 /offline` during prerender.
|
|
- **~19:36** — `/offline` route's `prerender = true` flipped to `false` as a
|
|
workaround. Third build succeeds in 70 seconds. Container recreated.
|
|
- **~19:38** — Smoke test: `POST /api/v1/memoro/transcribe` returns 503
|
|
`mana-stt is not configured (MANA_STT_URL missing)`. Compose env block for
|
|
mana-web missing the STT variables.
|
|
- **~19:42** — Compose patched to inject `MANA_STT_URL` + `MANA_STT_API_KEY` into
|
|
the `mana-web` env block. Container recreated.
|
|
- **~19:43** — End-to-end production transcription via SvelteKit proxy: HTTP 200
|
|
in 2.85s. **Pipeline live.**
|
|
- **~19:50** — Three follow-up commits pushed to commit the workarounds and
|
|
build-critical files: `42bd2a3a0` (compose env), `5d4123d2b` (module-registry +
|
|
configs + Dockerfile bump), `de33ed868` (offline prerender FIXME).
|
|
|
|
## Root causes
|
|
|
|
### Root cause 1: Tunnel never started
|
|
|
|
The `mana-gpu-server` Cloudflare tunnel was created in the dashboard on
|
|
2026-03-27 but the Windows-side connector was never installed. Anyone hitting
|
|
`gpu-*.mana.how` would have seen 502 since day one — but until today nothing on
|
|
the live site needed those hostnames, so it was invisible.
|
|
|
|
**Why it survived**: there was no health probe for the public tunnel hostnames.
|
|
The Mac Mini's `health-check.sh` only probed the GPU services via LAN IP
|
|
(`192.168.178.11:3020/health`), which works as long as the GPU box is online,
|
|
regardless of whether anyone outside the LAN can reach it. A LAN-side probe
|
|
masks tunnel breakage by definition.
|
|
|
|
**Fix**: tunnel installed on Windows as a Service via `cloudflared.exe service install`
|
|
(survives reboot, automatic restart, four edge connections active). Health-check
|
|
script extended with a `Public hostnames` block that walks the cloudflared ingress
|
|
config and probes every public hostname over HTTPS, so tunnel breakage is now
|
|
detected within one health-check cycle (5 minutes).
|
|
|
|
### Root cause 2: Database wiped without schema re-push
|
|
|
|
Production Postgres was reinitialized as part of a schema consolidation
|
|
("vereinheitlicht"), but the schema-creation step was never run afterwards.
|
|
The mana-auth service starts and runs happily because Better Auth doesn't
|
|
verify table existence on startup — it only crashes on the first query that
|
|
hits a missing table. With an unauthenticated visit, no query is issued, and
|
|
the service appears healthy.
|
|
|
|
**Why it survived**: same shape as #1 — the health-check probes
|
|
`/health`, which doesn't touch the user/session tables. Anything below
|
|
"actually serve a request that needs the DB" passes the liveness probe.
|
|
|
|
**Fix**: schemas recreated via `CREATE SCHEMA IF NOT EXISTS` for all 13
|
|
expected schemas, then `pnpm --filter @mana/auth db:push --force` (and 5
|
|
other services with `db:push` scripts) populated the tables. The
|
|
`scripts/setup-databases.sh` script exists for exactly this purpose but
|
|
hardcodes dev credentials (`mana/devpassword`) so it can't be run as-is on
|
|
production. Today we used a manual `psql + drizzle-kit push` combo with
|
|
the production credentials passed via env var. **Followup: extend the
|
|
script to honour `POSTGRES_USER`/`POSTGRES_PASSWORD` env vars** so the
|
|
recovery path is one command instead of a multi-step manual sequence.
|
|
|
|
### Root cause 3: Untracked files in working trees
|
|
|
|
The unified module-registry refactor introduced `module-registry.ts` and
|
|
32 `module.config.ts` files but never `git add`'d them. Every developer
|
|
on every machine had the files locally because they were generated by the
|
|
same hand at the same time, so `database.ts` (which imports
|
|
`./module-registry`) worked everywhere. The first time anyone tried a
|
|
clean clone, the build would have crashed.
|
|
|
|
**Why it survived**: SvelteKit's vite build only complains about missing
|
|
modules at the bundling stage, which happens inside the Docker builder.
|
|
On every dev's box, the file was sitting in the working tree as untracked
|
|
(visible in `git status` as `??` but not blocking anything). No CI ever
|
|
attempted a clean rebuild because the project doesn't have a "build from
|
|
scratch on a fresh checkout" CI lane.
|
|
|
|
**Fix**: committed in `5d4123d2b`. Followup tech debt:
|
|
|
|
- Add a CI job that does `git clone` + `pnpm install` + `pnpm build` of
|
|
`mana-web` from scratch on every PR. This would have caught the
|
|
missing-files issue immediately.
|
|
- Audit other apps for the same pattern — anyone else who imports from
|
|
an untracked sibling file is one clean clone away from breaking.
|
|
|
|
### Root cause 4: Dockerfile heap bump never committed
|
|
|
|
`apps/mana/apps/web/Dockerfile` has a hardcoded `--max-old-space-size=4096`
|
|
that the unified app outgrew somewhere between Sprint 2 and Sprint 3 of
|
|
the data layer rewrite. The fix (bumping it to 8192) was applied locally on
|
|
multiple machines but never committed. Same root cause as #3 — works on every
|
|
dev's box, breaks on a clean rebuild.
|
|
|
|
**Fix**: bump committed in `5d4123d2b`. Followup: consider moving the
|
|
heap size to a build arg so it can be overridden without editing the
|
|
Dockerfile.
|
|
|
|
### Root cause 5: Compose env vars missing
|
|
|
|
`docker-compose.macmini.yml` had `MANA_STT_URL` set in a different
|
|
service block (the legacy memoro-server, with the wrong value
|
|
`http://host.docker.internal:3020`) but not in the `mana-web` block.
|
|
The new SvelteKit proxy validates these at request time and short-circuits
|
|
with 503 if missing. Caught by smoke test on first deploy.
|
|
|
|
**Fix**: env block patched into mana-web in `42bd2a3a0`. The variable
|
|
flows from Mac Mini `.env` (gitignored) → compose `${MANA_STT_API_KEY:-}`
|
|
expansion → container environment → SvelteKit `$env/dynamic/private`.
|
|
|
|
### Root cause 6: `/offline` route prerender 500
|
|
|
|
SvelteKit's prerender Worker reports `Error: 500 /offline` with no
|
|
usable stack trace. The error was introduced by one of the
|
|
encryption phase 4-6 commits (notes encryption rollout) — likely a
|
|
module-level side-effect on the shared layout that fails when
|
|
no `window` is available, but bisecting the actual import wasn't
|
|
in scope for getting Memoro shipped.
|
|
|
|
**Fix**: `prerender = false` on `/offline` (committed as `de33ed868`
|
|
with a `FIXME` comment). The page is still served at request time;
|
|
SSR works because the shared layout's request-time path doesn't
|
|
trigger the broken import.
|
|
|
|
**Followup**: bisect which import on the `/offline` codepath throws on
|
|
the bare server. Either guard the offending side-effect with
|
|
`typeof window !== 'undefined'`, move the import to `onMount`, or
|
|
add `handleHttpError` hook to ignore prerender failures.
|
|
|
|
## What went well
|
|
|
|
- **Local-first architecture saved us during the DB wipe scare**. Clients have
|
|
full data in IndexedDB; even if we'd lost everything server-side, users
|
|
could re-sync their state on next login. This drastically lowered the
|
|
blast radius of #2.
|
|
- **Diagnosis was fast at every step**. mana-auth logs gave the schema
|
|
error in one line. The tunnel/connector mismatch showed up in
|
|
`cloudflared tunnel info` immediately. The build errors named the
|
|
exact missing file and line.
|
|
- **The Dreams recording feature provided a complete blueprint**. The
|
|
Memoro recorder, store, server proxy, and UI patterns are byte-for-byte
|
|
identical to Dreams', which is in production and known-good.
|
|
- **Each fix was committable in isolation**. The three follow-up commits
|
|
(`42bd2a3a0`, `5d4123d2b`, `de33ed868`) are surgical and reviewable.
|
|
|
|
## What went poorly
|
|
|
|
- **I declared a P0 incident over the empty database without checking
|
|
with the user first**. The user's response was "kein problem, das war
|
|
Absicht". I should have asked before going into incident-response mode
|
|
with stop-everything language.
|
|
- **I miscounted local commits early in the deploy**. I told the user there
|
|
were 3 commits ahead of origin; the actual count was 27. This led to a
|
|
bigger-than-advertised push that included encryption phase 4 (a real
|
|
data-layer migration) and an `--allow-bypass` of a CI check.
|
|
- **The stash dance on the Mac Mini was clumsy**. I tried to stash specific
|
|
files, got "no changes to save" (which I should have caught), then
|
|
inadvertently popped a previous-session stash that scattered 100+ files
|
|
across the working tree. A cleaner approach would have been
|
|
`git stash --keep-index --include-untracked` upfront, or just
|
|
`git checkout` the specific files I wanted to preserve.
|
|
- **The build was unblocked by disabling prerender on `/offline` without
|
|
understanding why it broke**. This is a workaround, not a fix. The
|
|
root cause (probably a module-level side-effect introduced in
|
|
encryption phase 4-6) is still latent.
|
|
- **Cred hygiene is still bad**. The mana-stt API key now lives in
|
|
cleartext in three places: Mac Mini `.env`, my local `apps/mana/apps/web/.env`,
|
|
and `services/mana-stt/.env` on the Windows box. There is still no
|
|
password manager entry, and the `docs/ENVIRONMENT_VARIABLES.md` instructions
|
|
point at a "team password manager" that doesn't exist yet.
|
|
|
|
## Action items
|
|
|
|
### High priority (do soon)
|
|
|
|
- [ ] **Add a clean-clone build CI job** for `mana-web`. Single most
|
|
effective preventative measure for the class of bugs we hit today
|
|
(#3, #4). Should run on every PR that touches `apps/mana/apps/web/**`.
|
|
- [ ] **Bisect the `/offline` prerender 500**. Tracked as a `FIXME` in
|
|
`apps/mana/apps/web/src/routes/offline/+page.ts`. Suspect the
|
|
vault-client or data-layer-listeners imports.
|
|
- [ ] **Set up a real password manager entry** for `MANA_STT_API_KEY` and
|
|
remove the placeholder reference from `docs/ENVIRONMENT_VARIABLES.md`.
|
|
|
|
### Medium priority
|
|
|
|
- [ ] **Fix `scripts/setup-databases.sh` to honour env-var creds**. Right
|
|
now it hardcodes `mana/devpassword` and silently fails on production.
|
|
Should accept `POSTGRES_USER`/`POSTGRES_PASSWORD` from the environment.
|
|
- [ ] **Audit hourly backup behaviour**. Today's hourly backup ran one minute
|
|
after the data wipe and dumped the empty state, irretrievably overwriting
|
|
what would have been the last good backup window. Backup integrity should
|
|
be checked (e.g., refuse to overwrite if dump size drops by >50% vs the
|
|
previous backup, or keep a strictly-monotonic last-N-non-empty archive).
|
|
- [ ] **Add base backups that actually work**. The `/Volumes/ManaData/backups/postgres/base_*`
|
|
directories are empty — the base backup mechanism is configured but never
|
|
writes anything. Either fix it or remove the dead config to avoid the false
|
|
sense of security.
|
|
- [ ] **Clean up Mac Mini stashes**. There are 28 old `WIP on main` stashes from
|
|
previous sessions. Today's stash dance dropped one of them onto the working
|
|
tree by accident, causing the AA conflict. Drop them all once any
|
|
still-relevant ones are reviewed.
|
|
|
|
### Low priority / nice-to-have
|
|
|
|
- [ ] **Add per-user JWT auth on mana-stt**. Currently every consumer uses a
|
|
shared internal API key. If the key leaks, every consumer is compromised
|
|
simultaneously. mana-stt already has the `external_auth.py` infrastructure
|
|
for `sk_live_` keys validated against mana-auth — wire mana-web's proxy to
|
|
forward the user's mana-auth JWT instead of a shared secret.
|
|
- [ ] **Move heap size to a build arg** in `apps/mana/apps/web/Dockerfile` so
|
|
future bumps don't require a Dockerfile commit.
|
|
- [ ] **Document the two-tunnel setup** in onboarding docs (now done in
|
|
`docs/MAC_MINI_SERVER.md` "GPU Tunnel" section).
|
|
|
|
## Lessons
|
|
|
|
- **Local-first buys you forgiveness for server-side outages, but only if
|
|
the auth path is independent of the data path.** Today, an auth-only outage
|
|
blocked the entire app even though the actual user data was safe in
|
|
IndexedDB. Auth must be the most boring, most observable, most belt-and-
|
|
suspenders part of the stack.
|
|
- **Health probes that don't exercise the data path are a lie.** A `/health`
|
|
endpoint that returns `{status: "healthy"}` based on process liveness is
|
|
worse than no probe — it gives false confidence and delays detection of
|
|
real outages. Probes should query at least one real table.
|
|
- **Untracked files in working trees are a ticking time bomb.** Anything that
|
|
works on `git status -sb` showing `??` will eventually break someone else's
|
|
clean clone. CI should explicitly test that case.
|
|
- **Dashboard-managed Cloudflare tunnels and locally-managed ones don't mix
|
|
well.** When the same hostname is configured in both a local config.yml and
|
|
a dashboard tunnel, cloudflared's CLI gets confused about which tunnel it
|
|
"belongs to" and refuses to update DNS even with `--overwrite-dns` unless
|
|
you pass the explicit UUID. Pick one mode per tunnel and stick with it.
|
|
- **Always count commits before pushing**. `git log origin/main..HEAD | wc -l`
|
|
is two seconds and prevents "I thought it was 3, it was 27" surprises.
|