Commit graph

11 commits

Author SHA1 Message Date
Till JS
aeaefaf675 infra(phase 2f-1 rollback): verdaccio stays on the Mac Mini
Phase 2f-1 had moved verdaccio from the Mini to the GPU-Box, but the
storage volume never made it there. The GPU container was empty (no
htpasswd, no @mana/* packages), and external `npm install @mana/foo`
returned 404. Rolled back instead of finishing the storage migration,
because:

- The Mini's standalone verdaccio (~/projects/verdaccio/) has all the
  data, including the claudebot service account and 9 published packages
- npm reads are low anyway (CI builds), and the Mini's disk has room
- It simplifies the user/token lifecycle (one source, no sync
  choreography)

Cleanup:
- DNS: npm.mana.how pointed back at the Mini tunnel via the Cloudflare API
- Mini cloudflared config.yml: npm.mana.how ingress re-added
- GPU-Box: verdaccio container + 3 volumes removed (mana_verdaccio-storage,
  mana_verdaccio-plugins, verdaccio-storage)
- infrastructure/docker-compose.gpu-box.yml: verdaccio service block removed
- infrastructure/verdaccio/config.yaml: deleted (it was the GPU-specific
  bundle; Code/mana holds the canonical copy for the Mini)
- docs/PLAN_OPTION_C.md: Phase 2f marked as ⚠️ partially rolled back
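
The re-added ingress entry looks roughly like the sketch below. The
hostname is from the commit; the local service port is an assumption
(4873 is verdaccio's default), and cloudflared requires a catch-all as
the last rule:

```yaml
ingress:
  - hostname: npm.mana.how
    service: http://localhost:4873   # verdaccio default port (assumption)
  - service: http_status:404         # mandatory catch-all rule
```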

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 23:12:11 +02:00
Till JS
c84742005b infra(phase 2g): mana-research → GPU-Box
Moved the web-research orchestrator (16+ search/LLM providers) to the
GPU-Box. Cross-LAN to mana-auth/mana-credits/mana-llm/mana-search/
postgres/redis (192.168.178.131). research.mana.how now routes to the
mana-gpu-server tunnel (CF config v29). Mini container count 42 → 41.

Switched PUBLIC_MANA_RESEARCH_URL in mana-app-web to the https URL:
Mini containers cannot reach 192.168.178.11 directly (Colima NAT), so
the cross-LAN bridge goes via Cloudflare tunnel, as with mana-ai.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 20:26:10 +02:00
Till JS
e77134bd8b docs(infra): Phase 2f added to PLAN_OPTION_C + hostname table updated to v28
- PLAN_OPTION_C.md: new row covers verdaccio + news-ingester + mana-ai
  with the cross-arch + workspace-deps gotchas
- infrastructure/README.md: hostname table catches up to npm.mana.how
  (Phase 2f-1) and mana-ai.mana.how (Phase 2f-3); config v26 → v28
- infrastructure/.env.gpu-box.example: MANA_SERVICE_KEY +
  MANA_AI_PRIVATE_KEY_PEM block added, with a note that the values
  mirror the Mini's .env.macmini (the matching public half stays on
  mana-auth; that pairing is what makes Mission-Grant decryption work)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:09:28 +02:00
Till JS
f47edc14af feat(gpu-box): mana-ai (AI Mission Runner) migrated, mana-ai.mana.how → GPU tunnel
Phase 2f-3 (final of the 2f-trio). The background tick-loop runner is
the most coupled of the three: it queries mana-api, mana-llm, and
mana-research, and writes through to the mana_sync DB. Wired up via
cross-LAN host-IPs to those Mini-side services + the existing RSA
key-pair for Mission-Grant decryption (MANA_AI_PRIVATE_KEY_PEM moved
into /srv/mana/.env on the GPU-Box; the matching MANA_AI_PUBLIC_KEY_PEM
stays on mana-auth's env-set as before).
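
A hedged sanity check for a key move like this: confirm the private and
public PEM halves still belong together by comparing their RSA moduli.
The demo key below is generated on the fly; in the real setup the halves
are the MANA_AI_PRIVATE_KEY_PEM and MANA_AI_PUBLIC_KEY_PEM values.

```shell
# Generate a throwaway pair so the sketch is self-contained.
openssl genrsa -out /tmp/demo_priv.pem 2048 2>/dev/null
openssl rsa -in /tmp/demo_priv.pem -pubout -out /tmp/demo_pub.pem 2>/dev/null
# A matching pair shares the same modulus.
priv_mod=$(openssl rsa -in /tmp/demo_priv.pem -noout -modulus)
pub_mod=$(openssl rsa -pubin -in /tmp/demo_pub.pem -noout -modulus)
[ "$priv_mod" = "$pub_mod" ] && echo "pair matches"
```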

Bonus rationale: AI Mission Runner now sits in the same compose
network as the GPU-Box's gpu-llm/gpu-ollama tasks, so future
"agent talks to local LLM" paths skip the Cloudflare round-trip.

Tunnel: mana-ai.mana.how repointed at the mana-gpu-server tunnel
(config v28). The Mini-side ingress was removed in the same step.
OTEL_EXPORTER_OTLP_ENDPOINT cleared since Tempo was retired in 2c.

Mini-side: container stopped + removed from docker-compose.macmini.yml.
Running count went from 39 → 42 because of unrelated services that
re-appeared on the latest CD pull (cards-server, memoro-web), but the
actual mana-ai service is gone — net move accomplished.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:40:57 +02:00
Till JS
b03165ce97 feat(gpu-box): news-ingester migrated, Mini compose drops the service block
Phase 2f-2. The RSS/Atom ingester (15-min tick → mana_platform.news.curated_articles)
moved to the GPU-Box. The service has zero hot-path coupling; all
writes go cross-LAN to the Mini's postgres, following the Glitchtip pattern.

Two implementation gotchas worth recording:

1. cross-arch image transfer doesn't work. Saved news-ingester:local
   from the Mini (Apple M4 → linux/arm64), tried `docker load` on the
   GPU-Box (linux/amd64) and got 'exec format error' on every restart.
   Native build on the GPU-Box was the only path forward.

2. The original services/news-ingester/Dockerfile assumes
   pnpm-workspace state from prior builds (no COPY for packages/shared-rss
   in the build context). Fresh builds error with
   ERR_PNPM_WORKSPACE_PKG_NOT_FOUND.

Workaround: a GPU-Box-specific Dockerfile at infrastructure/news-ingester/
that vendors shared-rss into the build via a workspace:* → file:ref
sed swap. Build context is the repo root (sparse-clone provides
packages/shared-rss + services/news-ingester). The Mini-side Dockerfile
stays untouched so existing CD builds aren't disturbed.
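
The sed swap can be sketched as below. The package name @mana/shared-rss
and the workspace:* → file: direction are from the commit; the target
file (the service's package.json) and the exact file: path are
assumptions.

```shell
# Stand-in package.json so the sketch is self-contained.
mkdir -p /tmp/ingester-demo
cat > /tmp/ingester-demo/package.json <<'EOF'
{ "dependencies": { "@mana/shared-rss": "workspace:*" } }
EOF
# Rewrite the workspace protocol to a vendored file: reference.
sed -i 's#"workspace:\*"#"file:../../packages/shared-rss"#' /tmp/ingester-demo/package.json
grep -o 'file:[^"]*' /tmp/ingester-demo/package.json  # → file:../../packages/shared-rss
```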

Mini-side: container stopped + removed from docker-compose.macmini.yml,
running container count 44 → 39.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:27:45 +02:00
Till JS
6e40546119 feat(gpu-box): add verdaccio service + bundle config in repo
Phase 2f-1: verdaccio (npm.mana.how) was the heaviest non-hot-path
service still left on the Mini after Phase 2: a read-mostly registry
hit by CI and local pnpm installs, not latency-critical. Moved into
infrastructure/docker-compose.gpu-box.yml. Storage volume content
(@mana/* packages + htpasswd) migrated via tar stream.
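
The tar-stream pattern, hedged: the real transfer presumably piped a
docker-volume tar over ssh, roughly like the commented command (volume
names from the sibling rollback commit; the host alias is an
assumption). Below it, the same pipe demonstrated between two local
directories so it can actually run:

```shell
# Real-world shape (not run here):
#   docker run --rm -v verdaccio-storage:/d alpine tar -C /d -cf - . \
#     | ssh gpu-box 'docker run --rm -i -v mana_verdaccio-storage:/d alpine tar -C /d -xf -'
mkdir -p /tmp/vol-src /tmp/vol-dst
echo "user:hash" > /tmp/vol-src/htpasswd
# Stream the source tree straight into the destination, preserving layout.
tar -C /tmp/vol-src -cf - . | tar -C /tmp/vol-dst -xf -
cat /tmp/vol-dst/htpasswd  # → user:hash
```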

Config came from the mana-platform repo's
infrastructure/verdaccio/config.yaml. Copied into mana-monorepo so the
GPU-Box's sparse-clone (already pulling scripts/ +
packages/shared-branding) can also bind-mount it without needing a
second repo on the box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:54:37 +02:00
Till JS
cf5349cdd2 feat(gpu-box): adopt photon container into compose with healthcheck
photon was the last 'health: none' container on the GPU-Box —
pre-existing user setup created via raw docker-run before Phase 2.
Adopted into infrastructure/docker-compose.gpu-box.yml with the
exact same image / volumes / cmd / port mapping so the OSM index in
/opt/photon-data survives untouched, plus a curl-based healthcheck
against /api?q=Berlin&limit=1 (Photon has no /health endpoint —
this is the canonical liveness probe).
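
In compose syntax, the adopted probe looks roughly like this sketch. The
query URL and the 120s start_period are from the commit; the port (2322
is Photon's default) and the interval/timeout values are assumptions:

```yaml
healthcheck:
  test: ["CMD", "curl", "-fsS", "http://localhost:2322/api?q=Berlin&limit=1"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 120s   # Java warmup window, per the commit
```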

start_period 120s gives Java the warmup window without false-flagging.
Recreate took ~10s including healthy state, no perceptible downtime
on photon.mana.how.

After this, all 20 GPU-Box containers report healthy. Mac Mini still
has 2 long-standing 'unhealthy' (mana-verdaccio's wget probe is
broken but npm.mana.how serves 200; mana-mail/Stalwart in bootstrap
mode, never configured) — both pre-existing, neither user-impacting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:42:54 +02:00
Till JS
384be93274 feat(gpu-box): healthchecks for glitchtip-worker, gpu-promtail, status-gen
Three containers were running with no healthcheck — Docker showed them
as 'none', so an actual crash inside the container would only surface
once the process itself exited (and got restarted by restart-policy).
Added container-internal probes that don't depend on tools the image
doesn't ship:

- glitchtip-worker: bash + /dev/tcp/glitchtip-redis/6379 — confirms the
  Celery broker is reachable. Bare-metal probe, no extra deps.
- gpu-promtail: bash + /dev/tcp/loki/3100 — confirms the loki sink the
  worker is shipping to is reachable. Replaces the wget-based check
  that errored 'executable file not found' on every tick.
- status-page-gen: stat + date — confirms /output/status.json was
  rewritten in the last 3 min (script writes it every 60s). Catches
  the case where the apk-install loop wedges or the generator
  silently dies.

CMD-SHELL runs through /bin/sh, which is dash on Debian-based images,
and dash doesn't support /dev/tcp; the two TCP probes therefore use
the CMD form with an explicit bash.
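
A compose-level sketch of one such probe, under assumptions: the service
name and port are from the commit, the interval/timeout/retries values
are not.

```yaml
healthcheck:
  # CMD form with explicit bash: /bin/sh may be dash, which lacks /dev/tcp.
  # Opening the fd succeeds (exit 0) only if the broker accepts the connection.
  test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/glitchtip-redis/6379"]
  interval: 30s
  timeout: 5s
  retries: 3
```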

photon stays without a healthcheck — pre-existing user container, not
in this compose file. Adding it would require a recreate which loses
the warm OSM cache.

After rollout: 17/20 GPU-Box containers healthy + 3 'none' (status-nginx,
glitchtip-redis, gpu-node-exporter — all standard upstream images
without built-in /health endpoints; their service is checked indirectly
via downstream consumers' healthchecks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:29:04 +02:00
Till JS
cd888cd54a fix(gpu-box): drop gpu-promtail healthcheck — image has no curl/wget/nc
Promtail v3.0.0 ships a minimal alpine-ish image with only the
promtail binary. The original Mini compose's wget-based healthcheck
errored out with 'executable file not found' on every tick, marking
the container as 'unhealthy' for hours despite Loki actively
receiving logs from it. Restart-policy unless-stopped catches real
crashes anyway, so the healthcheck adds noise without value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:13:47 +02:00
Till JS
6f1b0329f0 docs(infra): document photon.mana.how + cross-LAN workaround pattern
Phase 2c had 3 cross-LAN-routing pain points; Phase 2e + the photon
fix solved 2 of them, so the doc was misleading. Refactored the
"Bekannte Limits" ("known limits") block in PLAN_OPTION_C.md into a
proper cross-LAN-pattern table that lists each known case + its current
status. Phase-2c-original gpu-* and Mini-Promtail entries kept as
the remaining open items, with the same Cloudflare-Tunnel-as-LAN-bridge
workaround spelled out (Loki-HTTP-Push via loki.mana.how would be the
next obvious move).

Plus infrastructure/README.md now lists every active public-hostname
the mana-gpu-server tunnel exposes (v26).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:45:34 +02:00
Till JS
d8a35afd99 infra(gpu-box): commit GPU-Box compose to repo + Phase 2e docs
The GPU-Box stack has been carrying real production workload since
Phase 2c (monitoring) but only existed as a /srv/mana/docker-compose.gpu-box.yml
on the box itself. If the WSL filesystem dies, none of it is
reproducible. Bring the file into infrastructure/ as the source of
truth (the live file on the box must be kept in sync; manual rsync
for now, since there's no CD into the GPU-Box).

Plus:
- infrastructure/.env.gpu-box.example as the secrets template
- infrastructure/README.md describing what runs there + how the
  Cloudflare-tunnel ingress is API-managed (not config.yml)
- .gitignore for the live infrastructure/.env.gpu-box copy
- MAC_MINI_SERVER.md status-page section now points at the GPU-Box
  setup instead of the long-stopped Mini container
- PLAN_OPTION_C.md: Phase 2e row + GPU-Box service tree update

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:28:49 +02:00