photon was the last 'health: none' container on the GPU-Box —
pre-existing user setup created via raw docker-run before Phase 2.
Adopted into infrastructure/docker-compose.gpu-box.yml with the
exact same image / volumes / cmd / port mapping so the OSM index in
/opt/photon-data survives untouched, plus a curl-based healthcheck
against /api?q=Berlin&limit=1 (Photon has no /health endpoint —
this is the canonical liveness probe).
start_period 120s gives the JVM a warmup window in which failed probes
do not count against the container. Recreate took ~10s until the
container reported healthy, with no perceptible downtime on
photon.mana.how.
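For reference, a minimal sketch of the probe as a shell function, assuming
curl is available in the image (the healthcheck relies on it). The helper
name photon_probe and the base URL in the usage line are illustrative
assumptions, not copied from the compose file:

```shell
# Liveness probe for Photon: issue a real geocoding query, since the service
# has no /health endpoint. photon_probe is a hypothetical helper name;
# $1 is the base URL of the Photon instance.
photon_probe() {
  # -f: non-zero exit on HTTP >= 400; -sS: quiet but still print errors;
  # --max-time: bound the probe so a hung JVM cannot stall the check.
  curl -fsS --max-time 5 -o /dev/null "$1/api?q=Berlin&limit=1"
}
```

Usage inside the container would look like
`photon_probe http://localhost:2322 || exit 1` (2322 is assumed here as the
listening port; adjust to the actual mapping).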
After this, all 20 GPU-Box containers report healthy. Mac Mini still
has 2 long-standing 'unhealthy' containers (mana-verdaccio's wget probe
is broken but npm.mana.how serves 200; mana-mail/Stalwart is in bootstrap
mode, never configured). Both are pre-existing, neither is user-impacting.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three containers were running with no healthcheck — Docker showed them
as 'none', so an actual crash inside the container would only surface
once the process itself exited (and got restarted by restart-policy).
Added container-internal probes that don't depend on tools the image
doesn't ship:
- glitchtip-worker: bash + /dev/tcp/glitchtip-redis/6379 — confirms the
Celery broker is reachable. Pure-bash probe, no extra deps.
- gpu-promtail: bash + /dev/tcp/loki/3100 — confirms the loki sink the
promtail agent is shipping to is reachable. Replaces the wget-based check
that errored 'executable file not found' on every tick.
- status-page-gen: stat + date — confirms /output/status.json was
rewritten in the last 3 min (script writes it every 60s). Catches
the case where the apk-install loop wedges or the generator
silently dies.
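A minimal sketch of the stat + date freshness check used for
status-page-gen, written as a standalone function. The helper name
file_fresh is hypothetical, and the sketch assumes a stat that understands
`-c %Y` (GNU coreutils and BusyBox both do):

```shell
# file_fresh PATH MAX_AGE_SECONDS
# Succeeds if PATH exists and was modified at most MAX_AGE_SECONDS ago.
# Hypothetical helper; assumes `stat -c %Y` prints the mtime as epoch seconds.
file_fresh() {
  now=$(date +%s)
  # A missing or unreadable file counts as stale.
  mtime=$(stat -c %Y "$1" 2>/dev/null) || return 1
  [ $((now - mtime)) -le "$2" ]
}
```

With the numbers from this commit the call would be
`file_fresh /output/status.json 180`: three missed 60s writes flip the
container to unhealthy.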
CMD-SHELL runs the probe through /bin/sh, which is dash on Debian-based
images, and dash doesn't support /dev/tcp. The two TCP probes therefore
use the CMD form with an explicit bash.
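A minimal sketch of the /dev/tcp probe, assuming bash is present in the
image. The helper name tcp_probe is hypothetical; the point is that bash is
invoked explicitly rather than relying on whatever /bin/sh happens to be:

```shell
# tcp_probe HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened.
# Hypothetical helper. /dev/tcp is a bash feature, so bash is called
# explicitly; under dash the same redirection would just fail to open a file.
tcp_probe() {
  # exec with a failing redirection makes the non-interactive bash exit non-zero.
  bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For the two containers above this corresponds to
`tcp_probe glitchtip-redis 6379` and `tcp_probe loki 3100`. The sketch has no
connect timeout of its own; the healthcheck's timeout setting is what bounds
a hanging connection.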
photon stays without a healthcheck — pre-existing user container, not
in this compose file. Adding it would require a recreate which loses
the warm OSM cache.
After rollout: 17/20 GPU-Box containers healthy + 3 'none' (status-nginx,
glitchtip-redis, gpu-node-exporter — all standard upstream images
without built-in /health endpoints; their service is checked indirectly
via downstream consumers' healthchecks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promtail v3.0.0 ships a minimal alpine-ish image with only the
promtail binary. The original Mini compose's wget-based healthcheck
errored out with 'executable file not found' on every tick, marking
the container as 'unhealthy' for hours despite Loki actively
receiving logs from it. Restart-policy unless-stopped catches real
crashes anyway, so the healthcheck adds noise without value.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2c had 3 cross-LAN-routing pain points; Phase 2e + the photon
fix solved 2 of them, so the doc was misleading. Refactored the
"Bekannte Limits" ("Known limits") block in PLAN_OPTION_C.md into a proper
cross-LAN-pattern table that lists each known case + its current
status. Phase-2c-original gpu-* and Mini-Promtail entries kept as
the remaining open items, with the same Cloudflare-Tunnel-as-LAN-bridge
workaround spelled out (Loki-HTTP-Push via loki.mana.how would be the
next obvious move).
Plus infrastructure/README.md now lists every active public hostname
that the mana-gpu-server tunnel exposes (v26).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The GPU-Box stack has been carrying real production workload since
Phase 2c (monitoring) but existed only as /srv/mana/docker-compose.gpu-box.yml
on the box itself. If the WSL filesystem dies, none of it is
reproducible. Bring the file into infrastructure/ as the source of
truth (the live file on the box must be kept in sync; manual rsync
for now since there's no CD into the GPU box).
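Since the sync is manual, a drift check can catch a forgotten rsync. A
minimal sketch, assuming both copies are readable from one place (for
example a repo checkout on the box itself); the helper name compose_in_sync
is hypothetical and not part of this change:

```shell
# compose_in_sync REPO_FILE LIVE_FILE
# Succeeds only if both copies are byte-identical.
# Hypothetical helper; cmp -s compares silently and reports via exit status.
compose_in_sync() {
  cmp -s "$1" "$2"
}
```

Example call:
`compose_in_sync infrastructure/docker-compose.gpu-box.yml /srv/mana/docker-compose.gpu-box.yml || echo "compose file drifted"`.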
Plus:
- infrastructure/.env.gpu-box.example as the secrets template
- infrastructure/README.md describing what runs there + how the
Cloudflare-tunnel ingress is API-managed (not config.yml)
- .gitignore for the live infrastructure/.env.gpu-box copy
- MAC_MINI_SERVER.md status-page section now points at the GPU-Box
setup instead of the long-stopped Mini container
- PLAN_OPTION_C.md: Phase 2e row + GPU-Box service tree update
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>