feat(gpu-box): healthchecks for glitchtip-worker, gpu-promtail, status-gen

Three containers were running with no healthcheck — Docker showed them
as 'none', so an actual crash inside the container would only surface
once the process itself exited (and got restarted by restart-policy).
Added container-internal probes that don't depend on tools the image
doesn't ship:

- glitchtip-worker: bash + /dev/tcp/glitchtip-redis/6379 — confirms the
  Celery broker is reachable. Bare-metal probe, no extra deps.
- gpu-promtail: bash + /dev/tcp/loki/3100 — confirms the loki sink the
  worker is shipping to is reachable. Replaces the wget-based check
  that errored 'executable file not found' on every tick.
- status-page-gen: stat + date — confirms /output/status.json was
  rewritten in the last 3 min (script writes it every 60s). Catches
  the case where the apk-install loop wedges or the generator
  silently dies.

CMD-SHELL is /bin/sh which is dash on Debian-based images and dash
doesn't support /dev/tcp — used CMD form with explicit bash for the
two TCP probes.

photon stays without a healthcheck — pre-existing user container, not
in this compose file. Adding it would require a recreate which loses
the warm OSM cache.

After rollout: 17/20 GPU-Box containers healthy + 3 'none' (status-nginx,
glitchtip-redis, gpu-node-exporter — all standard upstream images
without built-in /health endpoints; their service is checked indirectly
via downstream consumers' healthchecks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Till JS 2026-05-07 15:29:04 +02:00
parent 8a90cd296c
commit 384be93274

View file

@ -299,7 +299,12 @@ services:
depends_on:
loki:
condition: service_started
# healthcheck disabled: promtail image has no curl/wget/nc; restart policy handles crashes
healthcheck:
test: ['CMD', 'bash', '-c', 'exec 3<>/dev/tcp/loki/3100']
interval: 60s
timeout: 5s
retries: 3
start_period: 15s
# ============================================
# Phase 2d — Glitchtip mit dedizierter Postgres + Redis (2026-05-06)
@ -378,6 +383,12 @@ services:
SECRET_KEY: ${GLITCHTIP_SECRET_KEY}
GLITCHTIP_DOMAIN: https://glitchtip.mana.how
CELERY_WORKER_AUTOSCALE: '1,3'
healthcheck:
test: ['CMD', 'bash', '-c', 'exec 3<>/dev/tcp/glitchtip-redis/6379']
interval: 60s
timeout: 5s
retries: 3
start_period: 30s
# ============================================
# Phase 2e — Status-Page (2026-05-07): generator + nginx auf GPU-Box.
@ -412,6 +423,12 @@ services:
sh /tmp/generate.sh
sleep 60
done
healthcheck:
test: ['CMD-SHELL', '[ -f /output/status.json ] && [ $$(( $$(date +%s) - $$(stat -c %Y /output/status.json) )) -lt 180 ]']
interval: 90s
timeout: 5s
retries: 2
start_period: 60s
status-nginx:
image: nginx:alpine