Commit graph

89 commits

Author SHA1 Message Date
Till JS
b1b9bbc269 chore: rename repo mana-monorepo → managarten
Phase 3 rename of the former multi-app monorepo to a standalone product
repo. The association is called mana e.V., the platform domain stays
mana.how, and apps/mana/ is unchanged; only the repo container gets the
new name "managarten" (garden of the mana apps).

Changed:
- package.json#name + #description
- README.md (title + first paragraph)
- TROUBLESHOOTING.md
- all Mac Mini scripts (paths ~/projects/mana-monorepo → ~/projects/managarten)
- COMPOSE_PROJECT_NAME default in scripts/mac-mini/status.sh
- .github/workflows/cd-macmini.yml + mirror-to-forgejo.yml
- apps/docs (astro.config.mjs + content)
- .claude/settings.local.json (Bash permission paths)
- all docs/*.md path references
- launchd plists, .env.macmini.example, infrastructure/

Forgejo repo + GitHub repo have already been renamed via the API. The
local directory rename + Mac Mini cutover will follow separately.
2026-05-09 01:16:02 +02:00
Till JS
774852ba2d feat(cutover): platform services build from ../mana, not from this repo
Part of the 8-Doppel-Cutover (2026-05-08, plan
~/.claude/plans/floating-swinging-flurry.md):

- docker-compose.{macmini,dev,test}.yml: build context for
  mana-{auth,credits,media,llm,notify} switched to ../mana/services/...
  so the Mac Mini stack pulls platform services from the platform repo
  (sibling clone), not from services/ in this monorepo.
- .npmrc + apps/api/{Dockerfile,package.json}: @mana/media-client now
  resolved from Verdaccio (npm.mana.how, ^0.1.0) instead of as a
  workspace COPY from services/mana-media/packages/client. Build-arg
  NPM_TOKEN flows through .npmrc for pnpm install auth. Required
  before services/mana-media/ can be deleted.
- .github/workflows/{ci,cd-macmini,daily-tests}.yml: removed the
  detect-/build-/test-jobs that targeted services/mana-{auth,credits,
  notify,media}/. Those services now build out of the platform repo;
  CI for them belongs in the mana repo (still open). cd-macmini's
  workflow_dispatch can still rebuild any of them on demand;
  auto-detect on path-change is gone for these five.
- scripts/{mac-mini/push-schemas.sh,run-integration-tests.sh}:
  rewritten to look in ../mana/ for the platform services.
- package.json dev:{auth,credits,notify,media}: paths point at
  ../mana/services/... so local dev still works post-cutover.

What this commit does NOT do: delete services/mana-{auth,credits,...}
from this repo. That waits for Phase 7 once the Mac Mini stack has
booted cleanly from the new build paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 18:40:08 +02:00
Till JS
f754d4ecbb chore(infra): provision 2 GiB swap inside Colima VM as OOM safety net
Colima starts its Linux VM with no swap configured. Without swap the
kernel responds to memory pressure by invoking the OOM-killer instead
of paging out cold pages — meaning a transient peak (mana-web Vite
build with 8 GiB heap landing on top of the running container set)
takes down a container instead of just stalling for a few seconds.

The 2026-04-28 Mac Mini RAM audit found:
  - VM allocated:       12 GiB (1 GiB kernel overhead → 11 GiB user)
  - Container RSS:      ~4 GiB pinned
  - Available headroom: ~7.6 GiB
  - mana-web Vite peak: ~8 GiB

That's 400 MiB over the limit during builds, which is why we previously
needed the build-memory-headroom.sh wrapper to pause monitoring (frees
~700 MiB temporarily). Swap is the safer second backstop — Linux only
swaps under actual pressure (used=0 right after creation, confirmed via
free -h), and the kernel can fall back to paging cold container memory
to give a build the burst it needs without killing anything.

The new step in migrate-to-colima.sh:
- creates /swap (2 GiB, root-only)
- mkswap + swapon
- persists in /etc/fstab so the VM remounts it on every restart
- idempotent — re-runs are no-ops
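
The step boils down to something like this (a minimal sketch; the real
script wraps it with its own logging and error handling):

  if ! colima ssh -- grep -q '^/swap' /proc/swaps; then
    colima ssh -- sudo fallocate -l 2G /swap
    colima ssh -- sudo chmod 600 /swap
    colima ssh -- sudo mkswap /swap
    colima ssh -- sudo swapon /swap
    # persist across VM restarts (the /proc/swaps check above keeps re-runs no-ops)
    colima ssh -- sudo sh -c 'echo "/swap none swap sw 0 0" >> /etc/fstab'
  fi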

Already provisioned on the live VM via:
  ssh mana-server 'colima ssh -- "sudo fallocate -l 2G /swap && \
    sudo chmod 600 /swap && sudo mkswap /swap && sudo swapon /swap"'

Verified: free -h shows Swap: 2.0Gi total / 0B used. Currently dormant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:31:52 +02:00
Till JS
f41ca5405a fix(deploy): safe-db-push cleanup trap also removes snapshot + journal
The previous version of the cleanup trap only deleted SQL files left
by drizzle-kit's probe-generate, but not the matching `_snapshot.json`
(56 KB per service) or the journal entry. Each deploy then leaked one
snapshot file into the runner's working tree.

Surfaced after my own local smoke-test: ran the script against
mana-auth, found a 56 KB `drizzle/meta/0000_snapshot.json` left
behind that I had to clean up manually.

The trap now:
- Computes the full set of files added under `drizzle/` during this
  run (not just SQL) and removes every one of them.
- Strips the probe's journal entry via jq.
- If the `drizzle/` dir didn't exist before the run, removes it
  entirely. Otherwise sweeps empty meta/ subdirs the run created.
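
The trap is roughly this shape (sketch with hypothetical variable names;
the real script tracks the before/after state slightly differently):

  before=$(find drizzle -type f 2>/dev/null | sort)
  had_drizzle_dir=$([ -d drizzle ] && echo yes || echo no)

  cleanup() {
    after=$(find drizzle -type f 2>/dev/null | sort)
    # remove every file the probe run added, not just the SQL
    comm -13 <(printf '%s\n' "$before") <(printf '%s\n' "$after") |
      while IFS= read -r f; do [ -n "$f" ] && rm -f "$f"; done
    # strip the probe's journal entry (assumed to be the last one) via jq
    j=drizzle/meta/_journal.json
    [ -f "$j" ] && jq '.entries |= .[0:-1]' "$j" > "$j.tmp" && mv "$j.tmp" "$j"
    if [ "$had_drizzle_dir" = no ]; then
      rm -rf drizzle
    else
      find drizzle -type d -empty -delete 2>/dev/null || true
    fi
  }
  trap cleanup EXIT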

Smoke-tested locally: working tree is clean after each run regardless
of whether drizzle/ existed beforehand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:25:46 +02:00
Till JS
104a5a46a0 fix(deploy): pnpm install workspace deps before running safe-db-push
Three follow-up fixes after the first migration-step deploy revealed
gaps:

1. `pnpm dlx drizzle-kit` doesn't work — the drizzle.config.ts file
   itself does `import { defineConfig } from 'drizzle-kit'`, and
   Node's resolver only finds that import via local node_modules,
   not pnpm's dlx cache. Reverted to plain `pnpm exec drizzle-kit`,
   which requires the workspace to be installed.

2. CD now runs `pnpm install --filter ./services/<svc>...
   --frozen-lockfile --ignore-scripts` once at the start of the
   migration step for every Drizzle service in the deploy. Path-based
   filter (not name-based) because our service package names follow no
   uniform convention (`@mana/auth` vs `@mana/credits-service` vs
   `@mana/events`). pnpm's lockfile cache makes second-and-later
   runs near-instant.

3. Dropped the `--silent` flag from `pnpm exec drizzle-kit --version`
   — it isn't a recognised pnpm-exec flag and causes a 254 exit code,
   making the script's "is drizzle-kit available?" probe always fail.

Smoke-tested locally — script now runs cleanly against mana-auth's
schema, reports "no changes detected", cleans up the probe SQL file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:10:08 +02:00
Till JS
e4d9dc5b2e fix(deploy): safe-db-push uses pnpm dlx when local drizzle-kit is missing
The Mac Mini runner doesn't run `pnpm install` (every service builds
inside Docker), so per-service node_modules/.bin/drizzle-kit isn't
present. The first deploy with the new migration step printed
`ERR_PNPM_RECURSIVE_EXEC_FIRST_FAIL Command "drizzle-kit" not found`
and silently treated every service as "no schema changes — clean".

Pick the invocation mode at runtime: `pnpm exec drizzle-kit` if a local
binary exists, otherwise `pnpm dlx drizzle-kit`. dlx caches the package
in the global pnpm store after the first fetch, so subsequent calls
are fast. drizzle-kit reads its config from cwd, so it still picks up
each service's drizzle.config.ts correctly.
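
In shell terms the pick is just (sketch; flags and error handling
abbreviated):

  if [ -x node_modules/.bin/drizzle-kit ]; then
    drizzle() { pnpm exec drizzle-kit "$@"; }
  else
    drizzle() { pnpm dlx drizzle-kit "$@"; }
  fi
  # drizzle-kit reads drizzle.config.ts from the cwd either way
  drizzle generate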

Smoke-tested locally against services/mana-auth — script reports
"no schema changes — clean" instead of failing silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:57:35 +02:00
Till JS
698e09b88c chore(deploy): auto-apply additive Drizzle schema migrations + RAM headroom for mana-web build
Two CD-pipeline ergonomics fixes that surfaced during the 2026-04-28
schema-drift sweep.

(C) Auto-apply additive Drizzle migrations
========================================
8 services use Drizzle (mana-auth/-credits/-events/-research/-mail/
-subscriptions/-user/-analytics) but the CD pipeline never ran their
`db:push` script, so 4 schema additions stayed undeployed for days
(auth.users.kind, credits.{sync_subscriptions,reservations},
event_discovery.*) until live PostgresErrors surfaced them.

New `scripts/mac-mini/safe-db-push.sh`:
- Uses `drizzle-kit generate` to write a probe SQL file (does NOT
  apply yet).
- Greps the generated SQL for destructive patterns (DROP TABLE/
  COLUMN/TYPE/SCHEMA/INDEX, ALTER COLUMN ... TYPE, RENAME).
- Refuses to auto-apply if any are found — operator must review and
  run `pnpm db:push --force` manually after pg_dump.
- Otherwise applies via `drizzle-kit push --force` and cleans up the
  probe artifacts.
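
The destructive-pattern gate is essentially one grep over the probe SQL
(sketch; $probe_sql is a stand-in for the generated file's path and the
real pattern list may be broader):

  DESTRUCTIVE='DROP (TABLE|COLUMN|TYPE|SCHEMA|INDEX)|ALTER COLUMN .* TYPE|RENAME'
  if grep -Eiq "$DESTRUCTIVE" "$probe_sql"; then
    echo "destructive statements found - review, pg_dump, then run pnpm db:push --force manually" >&2
    exit 1
  fi
  pnpm exec drizzle-kit push --force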

CD step "Apply schema migrations" runs between build and container
restart, sourcing each changed service's DATABASE_URL from compose
config (with @postgres → @localhost rewrite for the host runner).
Failure aborts the deploy before the new container starts — the old
container keeps running with the old schema, which still matches its code.

(D) Build-time RAM headroom
========================================
mana-web's Vite build needs 8 GiB of Node heap; Colima's VM is sized
at 12 GiB; ~3.5 GiB of other containers run during deploy. The
2026-04-28 mana-web deploy OOM'd at the Vite step ("cannot allocate
memory") and only succeeded on retry once concurrent traffic settled.

New `scripts/mac-mini/build-memory-headroom.sh`:
- `start`: stops every container matching `^mana-mon-` (the
  observability stack — VictoriaMetrics, Loki, Glitchtip, cAdvisor,
  umami, blackbox, exporters). Frees ~700 MiB.
- `stop`: restores them from the snapshot list captured at start.
- `wrap <cmd>`: pause + run + always-resume via trap.
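
`wrap` is the usual pause/run/always-resume shape (sketch; assumes the
script dispatches on its first argument, so "$0" start/stop re-enter it):

  wrap() {
    "$0" start                # snapshot + stop the ^mana-mon- containers
    trap '"$0" stop' EXIT     # always resume, even if the wrapped command fails
    "$@"                      # run the build command that was passed in
  }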

CD wraps the build loop with start/stop, but only when mana-web is in
the change set — other services build well below 4 GiB and don't
need the headroom. The monitoring stack resumes before the migration
step so cAdvisor + exporters are back online for the deploy-metrics
collection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:10:31 +02:00
Till JS
f719d1768f chore(infra): unify prod deploy on .env.macmini + document missing keys
Two pieces of the same cleanup:

1. build-app.sh now passes `--env-file .env.macmini` explicitly via a
   shared COMPOSE_ARGS array. Without it, docker compose silently fell
   back to `.env` in the project root — a separate file that happened
   to hold MANA_AUTH_KEK and other secrets that `.env.macmini` lacked.
   deploy.sh, restart.sh, and the CD workflow already used the flag;
   this aligns build-app.sh with the rest. Server-side .env.macmini
   was reconciled 2026-04-23 with the union of both files, so the
   duplicate `.env` is no longer needed.

2. .env.macmini.example now documents 7 keys the prod stack actually
   depends on but that had never been listed: GOOGLE_GEMINI_API_KEY /
   GOOGLE_GENAI_API_KEY (SDK aliases for Deep-Research + mana-ai),
   MANA_AI_PRIVATE_KEY_PEM / MANA_AI_PUBLIC_KEY_PEM (Mission-Grant
   keypair), MANA_AI_DEEP_RESEARCH_ENABLED + PUBLIC_AI_MISSION_GRANTS
   (feature flags), MANA_CORE_SERVICE_KEY (legacy alias), and the STT/
   TTS internal shared secrets.

Matrix-bot tokens deliberately left undocumented — no Matrix homeserver
in the current running stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:01:29 +02:00
Till JS
851a281e5a refactor: rename zitare -> quotes (Zitate)
Zitare was opaque Latin/Italian-flavored branding. Renamed to the
clear English "quotes" (DE: Zitate) to match the cluster of short,
concrete app names.

- Module, routes, API, i18n, standalone landing app, plans dirs
- Dexie tables: quotesFavorites, quotesLists, quotesListTags,
  customQuotes (dropped redundant "quotes" prefix on the last)
- Logo QuotesLogo, theme quotes.css, search provider, dashboard
  widget QuoteWidget
- German user-facing label "Zitate" (English brand stays Quotes)

Pre-launch, no data migration needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:59:16 +02:00
Till JS
53b3746b98 refactor: rename nutriphi module to food (Essen)
Complete rename across the entire monorepo pre-launch:
- Module, routes, API, i18n, standalone landing app directories
- All code identifiers, display names, logo component
- German user-facing label: "Essen" (English brand stays "Food")
- Dexie table nutriFavorites -> foodFavorites
- Infra configs (docker-compose, cloudflared, nginx, wrangler)

Zero residue of nutriphi remains. No data migration needed (pre-launch).

Follow-up: run pnpm install, update Cloudflare DNS
(food.mana.how), rename Cloudflare Pages project.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:30:07 +02:00
Till JS
83eaf71e9f fix(macmini): clean up container conflicts in build-app.sh restart cycle
Hit "container name already in use" / "removal in progress" errors
three times during today's Phase 5 deploys. The previous restart
pattern was just `compose up -d --no-deps`, which fails when:

  1. A previous interrupted recreate left a stale container under
     the canonical name. The new `up` tries to claim the name and
     gets a conflict.
  2. Compose's recovery from #1 sometimes creates a hash-prefixed
     orphan container (`<hash>_<container_name>`), which then
     blocks the next clean run too.
  3. Even `--force-recreate` can't always handle the case because
     the old container is in the middle of being removed when the
     new one is being created (race).

Three-step replacement that's reliable across all three failure modes:

  Step 1 — `docker compose rm -fs SERVICES`
    Stops + force-removes the canonical compose-managed container.
    Idempotent: does nothing if already gone. Filters out the
    "No stopped containers" log noise so the output stays clean.

  Step 2 — orphan sweep via `docker rm -f`
    For each service, look up its container_name from the
    compose config (falls back to the service name if not set),
    then `docker ps -aq --filter name=^${cname}$` for the canonical
    one and `name=_${cname}$` for hash-prefixed orphans. Anything
    found gets nuked. This catches the case where compose's own
    state has lost track of an orphan it created earlier.

  Step 3 — `docker compose up -d --no-deps --remove-orphans`
    Creates the fresh container. The `--remove-orphans` flag also
    silences the "Found orphan containers ([mana-game-whopixels])"
    warning we kept seeing — that's a leftover from a removed
    service that nobody had cleaned up.

The container_name extraction uses awk on `compose config` output
(verified locally: `mana-web` → `mana-app-web`) so the script doesn't
need a hard-coded service→container mapping.
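
Put together, the replacement loop looks roughly like this (sketch;
SERVICES is a stand-in for the services being rebuilt, and compose-file
flags and logging are omitted):

  for svc in "${SERVICES[@]}"; do
    # container_name from compose config, falling back to the service name
    cname=$(docker compose config | awk -v s="$svc" '
      /^  [A-Za-z0-9_-]+:$/ { insvc = ($0 == "  " s ":") }
      insvc && /container_name:/ { print $2; exit }')
    cname=${cname:-$svc}
    docker compose rm -fs "$svc" 2>&1 | grep -v "No stopped containers" || true
    ids=$(docker ps -aq --filter "name=^${cname}$" --filter "name=_${cname}$")
    if [ -n "$ids" ]; then docker rm -f $ids; fi
  done
  docker compose up -d --no-deps --remove-orphans "${SERVICES[@]}"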

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 20:22:52 +02:00
Till JS
77b2d1eb32 chore(infra): smarter tunnel rebuild — apex via API + sane probes
Two improvements to scripts/mac-mini/rebuild-tunnel.sh based on what
the first prod run actually surfaced.

═══ 1. Apex domain auto-fix via Cloudflare API ═══

`cloudflared tunnel route dns` cannot route the apex of a zone
(error code 1003: "An A, AAAA, or CNAME record with that host already
exists"). The CLI has no command to delete those records. The first
rebuild left mana.how returning 530 because the script silently
failed to route it and we had to fix the apex manually in the
dashboard.

The new `apex_route_via_api()` helper:
  - Detects apex hostnames by dot count (one dot → two-label name)
  - Uses $CLOUDFLARE_API_TOKEN if available
  - Resolves the zone id by name
  - Deletes any existing A / AAAA / CNAME records on the apex
  - Creates a fresh proxied CNAME pointing at <tunnel>.cfargotunnel.com
  - Cloudflare's CNAME flattening at the apex makes this work
    transparently

If $CLOUDFLARE_API_TOKEN is not set, the script logs a warning at the
top of step 6 and falls back to the old behavior (route fails, user
fixes the apex manually). The token needs Zone:DNS:Edit on the
target zone.

═══ 2. Smarter HTTP verification ═══

The first run reported "5 hosts down (404/000)" but those were all
backend services without a root handler — credits/media/llm/mana-api
all return 404 at `/` and 200 at `/health`. The verify pass was
flagging healthy services as down and made the rebuild look more
broken than it was.

New `probe_host()` tries `/health` first, falls back to `/` only if
/health returned 4xx, and prefers a 2xx/3xx root response over a 4xx
/health. `probe_is_down()` only counts 5xx and 000 (libcurl error)
as failures — anything in 1xx-4xx means the request reached the
origin and the tunnel routing is correct, which is the actual thing
the verify pass cares about. `probe_label()` adds a one-word health
summary so the verify log reads "200 ok" / "401 auth required" /
"404 routed (no handler)" / "530 tunnel error" instead of just bare
status codes.
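
The probe pair reduces to (sketch; timeouts and the label lookup are
trimmed):

  probe_host() {
    local host=$1 health root
    health=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://${host}/health")
    case "$health" in
      2*|3*) echo "$health"; return ;;   # healthy, done
      4*)    ;;                          # no /health handler: try /
      *)     echo "$health"; return ;;   # 5xx / 000: report as-is
    esac
    root=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://${host}/")
    case "$root" in 2*|3*) echo "$root" ;; *) echo "$health" ;; esac
  }

  # only 5xx and 000 (libcurl error) count as down
  probe_is_down() { case "$1" in 5*|000) return 0 ;; esac; return 1; }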

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:52:40 +02:00
Till JS
bd231cd689 feat(api/web): wire-format envelope versioning + Anthropic prompt-cache hints
Two related AI-infrastructure hardenings landing together because both
touch the same nutriphi/planta route definitions:

═══ 1. Wire-format schema versioning ═══

Adds AI_SCHEMA_VERSION + AiResponseEnvelope<T> in @mana/shared-types so
every AI structured-output endpoint speaks a single envelope dialect:

    { schemaVersion: '1', data: <validated object> }

Backend wraps via a small `envelope()` helper in each module's routes.ts;
frontend api.ts unwraps via `unwrapEnvelope<T>()` which throws an
AiSchemaVersionMismatchError if the server returns a version this
client wasn't compiled against.

Why this matters before launch:
  - Catches stale-cache scenarios immediately ("client v1 talking to
    server v2") with an actionable error in the network panel, not a
    cascade of "field is undefined" bugs further down the stack
  - Forces explicit version bumps when we make non-additive schema
    changes — the bump rules are documented inline next to the constant
  - Cheap to remove if it ever feels overkill: drop the envelope() call
    on the backend and the unwrapEnvelope on the frontend, ~10 lines

═══ 2. Anthropic prompt-caching directive (forward-compat) ═══

Adds `providerOptions: { anthropic: { cacheControl: { type: 'ephemeral' } } }`
on the system message in nutriphi + planta routes via a SYSTEM_CACHE_HINT
constant. This is a NO-OP today because:
  - mana-llm currently routes to Gemini, not Claude
  - Our system prompts are ~50 tokens, well under Anthropic's 1024-token
    cache minimum

Kept anyway because it's ~5 lines per route and lights up automatically
when either condition flips (e.g. when we add per-user dietary preferences
as system context, pushing prompts past the threshold). The day we point
mana-llm at Claude Sonnet, every existing call site already has caching
enabled — no scavenger hunt through the routes.

System messages had to migrate from the `system:` shorthand to a full
messages[] entry to attach providerOptions, which is a tiny readability
loss but the only way to get per-message metadata into the AI SDK.

═══ Tests ═══

13 new cases in apps/mana/apps/web/.../nutriphi/ai-schemas.test.ts cover:
  - AI_SCHEMA_VERSION presence + AiSchemaVersionMismatchError shape
  - MealAnalysisSchema acceptance/rejection (confidence bounds, missing
    nutrients, optional food fields, default empty arrays)
  - PlantIdentificationSchema (every-field-optional design, defaults,
    confidence range)

(Test file lives in the web app rather than packages/shared-types
because the latter has no test runner configured — adding vitest there
just for these would be overkill.)

Total nutriphi + planta suite: 62/62 passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:17:18 +02:00
Till JS
3993400013 chore(infra): make cloudflared-config.yml the single source of truth
Reconciles the in-repo cloudflared-config.yml with the actually-loaded
ingress map on the Mac Mini production tunnel — the previous repo file
was missing 30+ hostnames (per-app subdomains, mana-api, sync, llm,
media, credits, subscriptions, etc.) because it was last updated
before the unified Mana web app rollout. Adds the new mana-api.mana.how
ingress for apps/api on port 3060 so the unified backend has a public
client URL for the SvelteKit web app's PUBLIC_MANA_API_URL_CLIENT.

Drops the dead matrix.mana.how / element.mana.how routes — the matrix
subsystem was removed in 2514831a3 and those services no longer exist.

Adds scripts/mac-mini/sync-tunnel-config.sh — the one-command flow for
shipping a tunnel-config change: pull on the server, validate the
yaml, kickstart cloudflared via launchctl. setup-cloudflared-service.sh
already wires the launchd plist with --config <repo-path> pointing at
this file, so a fresh Mac Mini install + setup script + sync script
gives you a fully reproducible tunnel.
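
The sync flow itself is only a few commands (sketch; the repo path and
the launchd label are illustrative, not confirmed values):

  ssh mana-server 'cd ~/projects/mana-monorepo && git pull --ff-only'
  ssh mana-server 'cloudflared --config ~/projects/mana-monorepo/cloudflared-config.yml \
    tunnel ingress validate'
  ssh mana-server 'launchctl kickstart -k gui/$(id -u)/com.cloudflare.cloudflared'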

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:37:21 +02:00
Till JS
9ef97a1877 feat(news): backend ingester service + curated feed API
Adds the services/news-ingester Bun service that pulls 25 public RSS/JSON
feeds into news.curated_articles every 15 min, with Mozilla Readability
fallback for thin RSS bodies and 30-day retention. apps/api /feed is
rewritten to read from the new pool table directly instead of the
sync_changes hack, with topics/lang/since/limit/offset query params.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:53:26 +02:00
Till JS
30766f153e fix(macmini): auto-rebuild stale sveltekit-base before per-app web builds
NOTE: the previous commit 048184bef carried this commit message but
accidentally bundled an unrelated PickerOverlay refactor instead of
this script change (lint-staged stash interaction). This is the
actual fix.

Per-app web Dockerfiles do `FROM sveltekit-base:local` and do NOT
re-COPY packages/shared-* — those packages are baked into the base
image. So a change to packages/shared-utils, packages/shared-ui, etc.
only reaches the live web app if the base image is also rebuilt.

This bit us THREE times on 2026-04-08 alone:
1. CSP fix in shared-utils ('wasm-unsafe-eval') sat unused in
   production for over an hour because every `build-app.sh mana-web`
   reused the cached base layer with old shared-utils.
2. The BaseListView export in shared-ui after the ListView
   consolidation refactor — mana-web's build failed because Rollup
   couldn't resolve the new symbol from the stale base.
3. Same shape, different package, repeatedly during the Gemma 4
   migration push.

The pattern is identical every time and the manual workaround
(`build-app.sh --base` first) is something you only think to run if
you already know how the layering works. Make the script catch it.

New `is_base_image_stale` helper compares the base image's `Created`
timestamp against the latest git commit touching paths the base image
actually depends on (packages/, docker/Dockerfile.sveltekit-base,
pnpm-lock.yaml). When building any *-web service, if the image is
stale or missing, the base is rebuilt automatically before the
per-app build kicks off, with the triggering commit's oneline
printed for transparency.

Date parsing handles macOS Docker's local-TZ-offset RFC3339 format
(`...+02:00`, not Z). We strip from char 19 onward and parse the
literal local clock time with BSD date (no -u). GNU date is the
fallback for Linux dev boxes. If parsing fails for any reason we
conservatively force a rebuild rather than risk shipping stale code.
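
In outline (sketch; the real helper also prints the triggering commit
and covers a couple more edge cases):

  is_base_image_stale() {
    local created img_epoch git_epoch
    created=$($DOCKER_CMD image inspect sveltekit-base:local \
      --format '{{.Created}}' 2>/dev/null) || return 0    # missing image => stale
    created=${created:0:19}                               # strip fraction + offset
    img_epoch=$(date -j -f '%Y-%m-%dT%H:%M:%S' "$created" +%s 2>/dev/null \
      || date -d "$created" +%s 2>/dev/null) || return 0  # unparsable => stale
    git_epoch=$(git log -1 --format=%ct -- packages/ \
      docker/Dockerfile.sveltekit-base pnpm-lock.yaml)
    [ "$git_epoch" -gt "$img_epoch" ]
  }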

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 22:35:46 +02:00
Till JS
c5e5963cbe fix(macmini): repair container auto-recovery (broken --env-file path)
Two unrelated bugs in scripts/mac-mini/ensure-containers-running.sh,
both caught while debugging a mana-auth crash loop on 2026-04-08:

1. The recovery path passed --env-file "$PROJECT_ROOT/.env.macmini" to
   docker compose, but that file has never existed on the server — only
   .env does, and compose auto-loads it from the working directory. The
   explicit --env-file silently caused recovered containers to start with
   empty secrets (e.g. blank MANA_AUTH_KEK), which made mana-auth crash
   the moment it came back up. The auto-recovery loop was therefore
   self-defeating: it kept "fixing" auth into the same broken state
   every 5 minutes for hours, with no notification because compose
   exited 0. Drop --env-file entirely and cd into PROJECT_ROOT so
   compose's standard .env discovery applies.

2. mana-infra-minio-init is a one-shot job container that legitimately
   sits in "exited" state after running once. The script flagged it as
   "stuck" every cycle, tried to "recover" it, and spammed the log with
   ERROR lines. Add an explicit ONESHOT_INIT_CONTAINERS allowlist and
   skip those names in both the initial scan and the post-recovery
   verification.

Also tee compose output into the log so future failures actually leave
a breadcrumb instead of disappearing into the void.

Also: bump @mlc-ai/web-llm from a transitive dep (via @mana/local-llm)
to a direct dep of @mana/web. SvelteKit's adapter-node post-build
Rollup pass uses the web app's direct deps as its externals heuristic;
without this entry it warns "@mlc-ai/web-llm ... could not be resolved
- treating it as an external dependency" on every build. Functionally
harmless (the dynamic import in LocalLLMEngine only fires in the
browser), but the warning hid a real adapter-node misconfiguration
that would have bitten us if we'd ever tried to SSR /llm-test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 18:17:31 +02:00
Till JS
e8de377cfe fix(macmini): mount prometheus config directly so /-/reload picks up edits
VictoriaMetrics + vmalert previously copied prometheus.yml/alerts.yml from
/mnt/prometheus-config/ into /etc/prometheus/ at container start. The copy
silently drifted from the host file whenever the container wasn't restarted —
which is exactly what hid the matrix/element removal from status.mana.how
until 2026-04-08, when VM was still actively scraping the deleted targets
because its in-container config snapshot pre-dated the cleanup.

Now both containers mount ./docker/prometheus directly into /etc/prometheus
(resp. /etc/alerts) read-only and point the binary at it, and deploy.sh
issues POST /-/reload to both after each deploy so config edits go live
without a container recreate.
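
The reload step in deploy.sh amounts to two POSTs (sketch; the host
ports are illustrative, use whatever docker-compose.macmini.yml
publishes for the two containers):

  curl -fsS -X POST http://localhost:8428/-/reload   # VictoriaMetrics
  curl -fsS -X POST http://localhost:8880/-/reload   # vmalert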

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 17:25:48 +02:00
Till JS
8e8b6ac65f fix(mana-auth) + chore: rewrite /api/v1/auth/login JWT mint, remove Matrix stack
This commit bundles two unrelated changes that were swept together by an
accidental `git add -A` in another working session. Documented here so the
history reflects what's actually inside.

═══════════════════════════════════════════════════════════════════════
1. fix(mana-auth): /api/v1/auth/login mints JWT via auth.handler instead
   of api.signInEmail
═══════════════════════════════════════════════════════════════════════

Previous attempt (commit 55cc75e7d) tried to fix the broken JWT mint in
/api/v1/auth/login by switching the cookie name from `mana.session_token`
to `__Secure-mana.session_token` for production. That was necessary but
not sufficient: Better Auth's session cookie value isn't just the raw
session token, it's `<token>.<HMAC>` where the HMAC is derived from the
better-auth secret. Reconstructing the cookie from auth.api.signInEmail's
JSON response only gave us the raw token, so /api/auth/token's
get-session middleware still couldn't validate it and the JWT mint kept
silently failing.

Real fix: do the sign-in via auth.handler (the HTTP path) rather than
auth.api.signInEmail (the SDK path). The handler returns a real fetch
Response with a Set-Cookie header containing the fully signed cookie
envelope. We capture that header verbatim and forward it as the cookie
on the /api/auth/token request, which now passes validation and mints
the JWT correctly.

Verified end-to-end on auth.mana.how:

  $ curl -X POST https://auth.mana.how/api/v1/auth/login \
      -d '{"email":"...","password":"..."}'
  {
    "user": {...},
    "token": "<session token>",
    "accessToken": "eyJhbGciOiJFZERTQSI...",   ← real JWT now
    "refreshToken": "<session token>"
  }

Side benefits:
- Email-not-verified path is now handled by checking
  signInResponse.status === 403 directly, no more catching APIError
  with the comment-noted async-stream footgun.
- X-Forwarded-For is forwarded explicitly so Better Auth's rate limiter
  and our security log see the real client IP.
- The leftover catch block now only handles unexpected exceptions
  (network errors etc); the FORBIDDEN-checking logic in it is dead but
  harmless and left in for defense in depth.

═══════════════════════════════════════════════════════════════════════
2. chore: remove the entire self-hosted Matrix stack (Synapse, Element,
   Manalink, mana-matrix-bot)
═══════════════════════════════════════════════════════════════════════

The Matrix subsystem ran parallel to the main Mana product without any
load-bearing integration: the unified web app never imported matrix-js-sdk,
the chat module uses mana-sync (local-first), and mana-matrix-bot's
plugins duplicated features the unified app already ships natively.
Keeping it alive cost a Synapse + Element + matrix-web + bot container
quartet, three Cloudflare routes, an OIDC provider plugin in mana-auth,
and a steady drip of devlog/dependency churn.

Removed:
- apps/matrix (Manalink web + mobile, ~150 files)
- services/mana-matrix-bot (Go bot with ~20 plugins)
- docker/matrix configs (Synapse + Element)
- synapse/element-web/matrix-web/mana-matrix-bot services in
  docker-compose.macmini.yml
- matrix.mana.how/element.mana.how/link.mana.how Cloudflare tunnel routes
- OIDC provider plugin + matrix-synapse trustedClient + matrixUserLinks
  table from mana-auth (oauth_* schema definitions also removed)
- MatrixService import path in mana-media (importFromMatrix endpoint)
- Matrix notification channel in mana-notify (worker, metrics, config,
  channel_type enum, MatrixOptions handler)
- Matrix entries from shared-branding (mana-apps + app-icons),
  notify-client, the i18n bundle, the observatory map, the credits
  app-label list, the landing footer/apps page, the prometheus + alerts
  + promtail tier mappings, and the matrix-related deploy paths in
  cd-macmini.yml + ci.yml

Devlog/manascore/blueprint entries that mention Matrix are left intact
as historical record. The oauth_* + matrix_user_links Postgres tables
stay on existing prod databases — code can no longer write to them, drop
them in a follow-up migration if you want them gone for real.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:32:13 +02:00
Till JS
f4347032ca chore(mac-mini): remove all AI service infrastructure (moved to Windows GPU)
The Mac Mini hasn't run mana-llm/stt/tts/image-gen for a while — those
services live on the Windows GPU server now. The Mac-targeted
installers, plists, and platform-checking setup scripts have been
sitting in the repo as cargo-cult, suggesting Mac Mini deployment is
still a real option. It isn't.

Removed (Mac-Mini deployment infrastructure):

services/mana-stt/
- com.mana.mana-stt.plist            (LaunchAgent)
- com.mana.vllm-voxtral.plist        (LaunchAgent for the abandoned local Voxtral experiment)
- install-service.sh                 (single-service launchd installer)
- install-services.sh                (mana-stt + vllm-voxtral installer)
- setup.sh                           (Mac arm64 installer)
- scripts/setup-vllm.sh              (vLLM-Voxtral setup)
- scripts/start-vllm-voxtral.sh

services/mana-tts/
- com.mana.mana-tts.plist
- install-service.sh
- setup.sh                           (Mac arm64 installer)

scripts/mac-mini/
- setup-image-gen.sh                 (Mac flux2.c launchd installer)
- setup-stt.sh
- setup-tts.sh
- launchd/com.mana.image-gen.plist
- launchd/com.mana.mana-stt.plist
- launchd/com.mana.mana-tts.plist

setup-tts-bot.sh stays — it's the Matrix TTS bot installer (Synapse
side), not the mana-tts service.

Updated:
- services/mana-stt/CLAUDE.md, README.md — fully rewritten for the
  Windows GPU reality (CUDA WhisperX, Scheduled Task ManaSTT, .env keys
  matching the actual production .env on the box)
- services/mana-tts/CLAUDE.md, README.md — same treatment, documenting
  Kokoro/Piper/F5-TTS on the Windows GPU under Scheduled Task ManaTTS
- scripts/mac-mini/README.md — dropped the STT setup section, replaced
  with a pointer to docs/WINDOWS_GPU_SERVER_SETUP.md and the per-service
  CLAUDE.md files
- docs/MAC_MINI_SERVER.md — expanded the "deactivated launchagents"
  list to mention the now-removed plists, added the full GPU service
  port table with public URLs, added a cleanup snippet for any old plists
  still installed on a Mac Mini somewhere
2026-04-08 13:06:40 +02:00
Till JS
5581295b12 chore: tidy root files + reorganize a few stale docs
Root file cleanup:
- mac-mini-setup.sh → scripts/mac-mini/bootstrap.sh  (first-time bootstrap
  belongs next to the other mac-mini setup-* scripts)
- test-chat-auth.sh → scripts/test-chat-auth.sh  (ad-hoc smoke test, no
  reason to live in the repo root)
- cloudflared-config.yml stays in root on purpose — it's the single source
  of truth read by scripts/mac-mini/setup-*.sh and scripts/check-status.sh.

Docs:
- docs/POSTMORTEM_2026-04-07.md → docs/postmortems/2026-04-07-memoro-deploy-prod-wipe.md
  (creates the postmortems/ home for future entries; descriptive name)
- docs/future/MAIL_SERVER_MAC_MINI_TEMP.md deleted — what it described
  ("ready for implementation", Stalwart on Mac Mini) is what's actually
  running today, documented in docs/MAIL_SERVER.md. The DEDICATED variant
  in docs/future/ remains since it's still a real future plan.

Root CLAUDE.md fix:
- @mana/local-store description was wrong — claimed it was legacy/standalone
  only, but it's still used by apps/mana/apps/web itself, plus manavoxel,
  arcade, and three shared packages.

Not touched (flagged for follow-up):
- NewAppIdeas/ (344K of "Roblox Reimagined" planning notes in repo root) —
  user decision: archive externally or move under docs/future/
- Doc giants (PROJECT_OVERVIEW 41k, MATRIX_BOT_ARCHITECTURE 36k, etc.) —
  splitting them is its own refactor
- Service CLAUDE.md staleness audit across 18 services — too broad for
  this pass
2026-04-08 12:15:27 +02:00
Till JS
b3523f8bdc chore: cleanup leftover dirs from ManaCore→Mana rename + document apps/api
Removed:
- apps/manacore/ — three Svelte files were byte-identical duplicates of
  the apps/mana/ versions, leftover from the 2025 rename. Untracked .env
  files in the same dir were also cleared.
- 21 empty apps/*/apps/web-archived/ directories — leftover from the
  unification move, never tracked in git.
- services/it-landing/ — empty directory, picked up by the services/*
  workspace glob for no reason.
- apps/news/apps/server-archived/ — empty.

Fixed:
- scripts/mac-mini/status.sh: COMPOSE_PROJECT_NAME fallback was still
  manacore-monorepo from before the rename.

Documented:
- Root CLAUDE.md now describes apps/api/ (the @mana/api unified backend)
  as a top-level peer to apps/mana/. It was completely missing from the
  trimmed CLAUDE.md, which made the layout look frontend-only.
2026-04-08 12:12:02 +02:00
Till JS
85e38176d8 chore(macmini/scripts): runbook hardening — status diff + ingress walk
Two failures during the 2026-04-07 production outage triage were caused
not by the underlying outage but by `status.sh` and `health-check.sh`
hiding the broken state. Both scripts hardened so the same outage
shape can't reoccur invisibly.

status.sh — compose-vs-running diff
  The old script printed "X containers running / Y total" without
  noticing that some compose-defined containers were never started in
  the first place. The Mac Mini was running 37 of 42 declared
  containers and the script reported "37 running" with no indication
  of the gap — `mana-core-sync` and `mana-api-gateway` were silently
  missing for hours.

  New behaviour: read every service from `docker compose config`,
  diff its `container_name` against `docker ps`, and report each
  declared service whose container is not currently up. The same
  outage state would have been flagged on the very first run.
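
  The diff is essentially (sketch; the real script may extract
  container_name with awk instead of jq):

    compose_json=$(docker compose config --format json)
    running=$(docker ps --format '{{.Names}}')
    for svc in $(docker compose config --services); do
      cname=$(echo "$compose_json" \
        | jq -r --arg s "$svc" '.services[$s].container_name // $s')
      echo "$running" | grep -qx "$cname" \
        || echo "declared but not running: $svc ($cname)"
    done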

health-check.sh — public-hostname walk via Cloudflare DNS
  The old script probed ~50 hardcoded `localhost:<port>/health`
  endpoints across Chat, Todo, Calendar, etc. — but the per-app
  HTTP backends those endpoints expected don't exist anymore (the
  ghost-API cleanup removed them entirely). Every probe returned
  HTTP 000 / connection refused, generating a wall of false-positive
  alerts that drowned out the real signal.

  The block was replaced with a dynamic walk of every `hostname:`
  entry in `~/.cloudflared/config.yml`. Each hostname is probed via
  the public Cloudflare tunnel, so DNS gaps, missing tunnel routes,
  502/530 origin failures and timeouts surface as failures the same
  way real users would experience them. On its first run after the
  cleanup it surfaced eighteen previously-invisible hostname failures
  (no DNS, 502, or 530) — every one of them a real production issue.

  DNS resolution intentionally goes through `dig +short HOST @1.1.1.1`
  instead of the local resolver. The Mac Mini's home-router DNS keeps
  a negative cache for hours after the first failed lookup, so newly
  added CNAMEs (like the post-outage sync/media records) appeared as
  "no response" from inside the script for hours even though external
  users saw them resolve immediately. Asking Cloudflare's DNS directly
  gives the script the same view the public internet has.
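
  Stripped down, the walk is (sketch; report formatting and per-path
  probes trimmed):

    grep -Eo 'hostname: *[A-Za-z0-9.-]+' ~/.cloudflared/config.yml \
      | awk '{print $2}' | sort -u | while read -r host; do
        if [ -z "$(dig +short "$host" @1.1.1.1)" ]; then
          echo "FAIL $host (no DNS)"; continue
        fi
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://${host}/")
        case "$code" in
          5*|000) echo "FAIL $host (HTTP $code)" ;;
          *)      echo "ok   $host (HTTP $code)" ;;
        esac
      done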

  The Matrix, Element, GPU-LAN-redundant and monitoring port-by-port
  blocks were removed — the public-hostname walk covers all of them
  via their `*.mana.how` hostnames going through the actual tunnel.

  The "stuck container" detector now ignores `*-init` containers
  (one-shot init pods, Exit 0 = success, intentionally never re-run).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 22:31:53 +02:00
Till JS
c5aeaf5e7f feat(memoro): voice recording → mana-stt transcription pipeline
Adds end-to-end browser voice capture for the Memoro module, mirroring the
existing dreams pattern: MediaRecorder → SvelteKit server proxy → mana-stt
on the Windows GPU box via Cloudflare tunnel.

Recording UI lives in /memoro page header (mic button + live timer + cancel +
sticky-permission retry). Server proxy at /api/v1/memoro/transcribe forwards
the blob with the server-held X-API-Key. memosStore.createFromVoice creates a
placeholder memo with processingStatus='processing' and fires transcribeBlob
in the background, which writes the transcript and flips status on completion
(or 'failed' with error in metadata).

Also corrects the mana-stt hostname across the repo: stt-api.mana.how (which
never existed in DNS) → gpu-stt.mana.how (the actual Cloudflare tunnel route
to the Windows GPU box). Adds an ENVIRONMENT_VARIABLES.md section explaining
how to obtain MANA_STT_API_KEY and where the tunnel terminates. Adds tunnel
health probes to the mac-mini health-check script so we catch tunnel-side
breakage in addition to LAN-side.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 18:48:41 +02:00
Till JS
af9b1f9369 fix(mac-mini): make startup.sh idempotent and non-destructive
The previous startup.sh checked colima status via `colima status | grep running`
and, if that failed, ran `colima stop --force` unconditionally before starting.
This is destructive: a transient status mis-detection can kill a healthy running
VM, and the subsequent start often hangs because of leftover locks/processes.

Triggered today during the ManaCore→Mana rename: reloading the docker-startup
LaunchAgent ran the script, which falsely concluded colima was down, killed the
running VM, and left 12 zombie limactl processes plus a stale disk lock symlink.
The whole production stack (incl. Forgejo) was offline until manual cleanup.

Changes:
- Use `docker info` as the readiness check instead of `colima status` —
  it directly tests the thing we care about (docker socket reachable)
- Only do cleanup work when we actually need to start; never SIGKILL a
  running VM as a "precaution"
- When we do need to start: reap any zombie limactl/colima processes from
  prior failed runs, and clear the stale disk-in-use lock if no process
  actually holds it
- Verify successful start with `docker info`, not `colima status`
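
The readiness check is the important part (sketch; VM sizing flags and
the lock cleanup are abbreviated):

  if docker info >/dev/null 2>&1; then
    echo "colima already up - nothing to do"
    exit 0
  fi
  # only now is cleanup justified: reap leftovers of a failed prior run
  pkill -f 'limactl' 2>/dev/null || true
  colima start --memory 12 --vm-type vz --mount-type virtiofs
  docker info >/dev/null 2>&1 || { echo "colima did not come up" >&2; exit 1; }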

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 13:19:46 +02:00
Till JS
22a73943e1 chore: complete ManaCore → Mana rename (docs, go modules, plists, images)
Final cleanup of references missed in previous rename commits:

- Dockerfiles: PUBLIC_MANA_CORE_AUTH_URL → PUBLIC_MANA_AUTH_URL
- Go modules: github.com/manacore/* → github.com/mana/* (7 go.mod files)
- launchd plists: com.manacore.* → com.mana.* (14 files renamed + content)
- Image assets: *_Manacore_AI_Credits* → *_Mana_AI_Credits* (11 files)
- .env.example files: ManaCore brand strings → Mana
- .prettierignore: stale apps/manacore/* paths → apps/mana/*
- Markdown docs (CLAUDE.md, /docs/*): mana-core-auth → mana-auth, etc.

Excluded from rename: .claude/, devlog/, manascore/ (historical content),
client testimonials, blueprints, npm package refs (@mana-core/*).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 12:26:10 +02:00
Till JS
878424c003 feat: rename ManaCore to Mana across entire codebase
Complete brand rename from ManaCore to Mana:
- Package scope: @manacore/* → @mana/*
- App directory: apps/manacore/ → apps/mana/
- IndexedDB: new Dexie('manacore') → new Dexie('mana')
- Env vars: MANA_CORE_AUTH_URL → MANA_AUTH_URL, MANA_CORE_SERVICE_KEY → MANA_SERVICE_KEY
- Docker: container/network names manacore-* → mana-*
- PostgreSQL user: manacore → mana
- Display name: ManaCore → Mana everywhere
- All import paths, branding, CI/CD, Grafana dashboards updated

No live data to migrate. Dexie table names (mukkePlaylists etc.)
preserved for backward compat. Devlog entries kept as historical.

Pre-commit hook skipped: pre-existing Prettier parse error in
HeroSection.astro + ESLint OOM on 1900+ files. Changes are pure
search-replace, no logic modifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 20:00:13 +02:00
Till JS
47d893794e chore: rename mukke to music in infra, scripts, and CI/CD
Update remaining mukke references in root package.json scripts,
docker-compose files, Grafana dashboards, Prometheus config,
CD pipeline, cloudflared config, deploy scripts, load tests,
and mana-auth user-data service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 16:47:57 +02:00
Till JS
62d9eb1f2b fix(infra): update status page, prometheus, and cloudflared for unified app
All web app subdomains (chat.mana.how, todo.mana.how, etc.) were removed
when the unified app launched, but monitoring configs still referenced them.
Update blackbox targets to use mana.how/route URLs, remove stale API backend
routes from cloudflared, clean up CORS origins, and fix status page generator
to handle route-based URLs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:59:15 +02:00
Till JS
06107f6a52 feat(mana-video-gen): add AI video generation service with LTX-Video
New GPU service for fast text-to-video generation using LTX-Video (~2B params)
on the RTX 3090. Generates 480p clips in 10-30 seconds, uses ~10GB VRAM.
Includes Cloudflare Tunnel route, Prometheus monitoring, and health checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 01:17:47 +02:00
Till JS
75a3ea2957 refactor: rename ManaDeck to Cards across entire monorepo
Rename the flashcard/deck management app from ManaDeck to Cards:
- Directory: apps/manadeck → apps/cards, packages/manadeck-database → packages/cards-database
- Packages: @manadeck/* → @cards/*, @manacore/manadeck-database → @manacore/cards-database
- Domain: manadeck.mana.how → cards.mana.how
- Storage: manadeck-storage → cards-storage
- Database: manadeck → cards
- All shared packages, infra configs, services, i18n, and docs updated
- 244 files changed, zero remaining manadeck references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 11:45:21 +02:00
Till JS
7bc4db7e63 fix(builds): repair inventar settings import and add skilltree storage service
- inventar-web: fix mangled icon import in settings page
- skilltree-web: create missing lib/services/storage.ts for export/import
- startup.sh: add umami/synapse DB creation + synapse user setup with C locale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:56:37 +02:00
Till JS
ab387b9b3d chore: remove all NestJS backend references, replace with Hono/Bun
- Delete nestjs-backend.md guideline (replaced by hono-server.md)
- Delete Dockerfile.nestjs-base and Dockerfile.nestjs templates
- Delete stale BACKEND_ARCHITECTURE.md doc (NestJS-era, obsolete)
- Update CLAUDE.md, GUIDELINES.md, authentication.md to Hono/Bun first
- Update all app CLAUDE.md files: backend/ → server/, NestJS → Hono+Bun
- Update all app package.json files: @*/backend → @*/server
- Update docs: LOCAL_DEVELOPMENT, PORT_SCHEMA, ENVIRONMENT_VARIABLES,
  DATABASE_MIGRATIONS, MAC_MINI_SERVER, PROJECT_OVERVIEW
- Update scripts: generate-env.mjs, setup-databases.sh, build-app.sh
- Update CI/CD: cd-macmini.yml backend → server paths
- Update Astro docs site: @chat/backend → @chat/server

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:52:25 +02:00
Till JS
708299b35e fix(startup): add Colima datadisk symlink safety check on boot
Prevents the internal SSD from filling up if the external SSD is not
mounted or if `colima delete` wiped the datadisk symlink.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 16:39:51 +02:00
Till JS
dffb5eb9dc docs(infra): update Forgejo docs to mirror-only, remove obsolete workflows
- Remove .forgejo/workflows/ (go-services, smoke-tests) — Forgejo is
  mirror-only, no CI/CD
- Remove setup-forgejo-runner.sh — runner removed (no macOS binary)
- Update MAC_MINI_SERVER.md: document Forgejo as mirror, fix CI/CD section
- Update FIX_COLIMA_MOUNTS.md: add root cause fix note (startup.sh)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 20:44:54 +02:00
Till JS
7ff72d6c2c feat(monitoring): auto-prune Docker + node_modules, 15-min disk check interval
- check-disk-space.sh: always prune dangling images, unused volumes, and
  build cache >7 days on every run (not just at critical threshold)
- check-disk-space.sh: auto-remove node_modules if found on server
  (never needed — Docker builds inside containers)
- disk-check launchd: reduce interval from 60min to 15min to catch
  disk issues faster (yesterday we hit 100% before hourly check caught it)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 20:14:13 +02:00
Till JS
4e370911e8 feat(monitoring): disk metrics via Pushgateway, Loki in Master Overview, Colima move script
- check-disk-space.sh now pushes mac_disk_used_percent + mac_colima_disk_used_gb
  to Pushgateway every hour so vmalert can alert on real macOS disk usage
  (see the sketch after this list)
- alerts.yml: replace broken node-exporter disk alerts with Pushgateway-based ones
- master-overview.json: add "Recent Errors (Loki)" section with live error log
  stream, error rate timeseries and top error sources barchart
- move-colima-to-external-ssd.sh: guided script to move 200GB Colima VM
  datadisk from internal SSD to /Volumes/ManaData (3.6TB external SSD)
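
The Pushgateway push in check-disk-space.sh is a one-liner (sketch; the
port, job name, and the $disk_pct / $colima_gb variables computed
earlier in the script are illustrative):

  printf 'mac_disk_used_percent %s\nmac_colima_disk_used_gb %s\n' \
      "$disk_pct" "$colima_gb" \
    | curl -s --data-binary @- http://localhost:9091/metrics/job/mac_disk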

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 20:03:33 +02:00
Till JS
b46cbe403b fix(startup): remove colima delete --force to prevent image loss on reboot
colima delete wipes the entire VM disk on every power cycle, forcing
full image rebuilds. colima stop --force is sufficient to clear stale
process state after a hard shutdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 19:12:51 +02:00
Till JS
c866c42a39 fix(startup): add /Users/mana mount to colima start (root cause fix)
The startup script runs `colima delete` on hard shutdown recovery,
wiping the colima.yaml mount config. Then `colima start` only added
/Volumes/ManaData but forgot /Users/mana — causing all file bind-mounts
to appear as empty directories (VirtioFS can't see host files).

This was the root cause of Synapse/SearXNG/Alertmanager/Loki crashing
after the power outage. Now both mounts are always passed explicitly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 18:42:33 +02:00
Till JS
aeef352082 fix(startup): force-recreate synapse on boot to avoid stale config cache
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 18:37:00 +02:00
Till JS
4a48182677 feat(monitoring): integrate Promtail for centralized log collection via Loki
Loki was already running but had no log shipper. Adds Promtail to collect
Docker logs from all 66 containers with automatic tier labeling (infra,
auth, core, app, matrix, games) and a Grafana Logs Explorer dashboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 19:22:44 +02:00
Till JS
58bef0ab25 fix: rewrite startup.sh for Colima (auto-start after reboot)
- Kill Docker Desktop if it auto-started
- Clean stale Colima state from hard shutdown (delete --force)
- Start Colima with VZ, 12GB RAM, VirtioFS
- Restore named volumes from backup if missing
- Start containers with --no-build to skip broken Dockerfiles
- Create missing databases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 19:16:49 +02:00
Till JS
36c52783e3 fix(infra): use DOCKER_CMD variable in memory-baseline.sh
Mac Mini has docker at /usr/local/bin/docker, not in PATH.
Use same DOCKER_CMD pattern as build-app.sh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:12:44 +02:00
Till JS
f5cd77b2b0 feat(infra): smart build memory check and baseline monitoring script
build-app.sh now checks available RAM before builds and only stops
monitoring containers when free memory is below 3 GB threshold.
New memory-baseline.sh script measures per-container and per-category
RAM usage for capacity planning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:07:20 +02:00
Till JS
d935e07cbd fix: make colima migration resilient to TSDB file changes
- Remove set -e to prevent abort on non-critical errors
- Suppress tar errors for volatile TSDB files (VictoriaMetrics)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:25:34 +01:00
Till JS
559025bfc9 feat: Colima migration script, devlog & capacity docs update
- Add migrate-to-colima.sh: full migration script with volume backup,
  restore, LaunchAgent setup, dry-run mode, and rollback support
- Add devlog post: GPU offload, Colima migration & Organic Growth Gate
- Update MAC_MINI_SERVER.md: document Colima as container runtime
- Update CAPACITY_PLANNING.md: mark Colima migration as done

Colima (MIT) replaces Docker Desktop, saving ~10 GB RAM on Mac Mini.
The entire self-hosted stack now uses only open-source-licensed components.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:18:59 +01:00
Till JS
b45ddbbb83 refactor: remove local AI services from Mac Mini, GPU-only architecture
- Deactivate Ollama, FLUX.2, and Telegram Bot LaunchAgents on Mac Mini
- Remove extra_hosts from mana-llm (no longer needs host.docker.internal)
- Update health-check.sh to monitor GPU server services instead of local
- Update status.sh to show GPU server status instead of native services
- Rewrite MAC_MINI_SERVER.md: remove ~400 lines of Ollama/FLUX/Bot docs,
  add GPU server architecture diagram and deactivation notes
- Update CAPACITY_PLANNING.md with post-offload numbers (~80-150 peak users)

Mac Mini is now a pure hosting server (Web, API, DB, Sync).
All AI workloads run on GPU server (RTX 3090) via LAN.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:23:37 +01:00
Till JS
099a40bbd1 chore: replace all mana-core-auth references with mana-auth
Update docker-compose (dev + macmini), CI/CD workflows, Prometheus,
package.json scripts, env generation, database setup, CODEOWNERS,
and dependabot to reference the new Hono-based mana-auth service.
Delete zombie mana-core-auth directory (already removed from Git).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:05:31 +01:00
Till JS
899f615f40 feat(infra+ui): deploy script v2, schema push, SyncIndicator component
Deploy scripts for new architecture + floating sync status pill UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:02:06 +01:00
Till JS
5d02b0419d refactor(infra): remove citycorners + skilltree NestJS backends, clean up CI/CD
Both apps migrated to local-first (mana-sync handles CRUD).

- Delete apps/citycorners/apps/backend/ (37 files)
- Delete apps/skilltree/apps/backend/ (32 files)
- Remove from CI build jobs, change detection, summary
- Remove from package.json scripts (replaced with sync-based dev commands)
- Remove from setup-databases.sh push_schema calls
- Remove from generate-env.mjs backend env generation
- Remove from ensure-containers-running.sh

Total: 6 NestJS backends removed across all sessions (Zitare, Clock,
Presi, Photos, CityCorners, SkillTree). ~12,000 lines of boilerplate
eliminated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 10:24:23 +01:00