managarten/scripts
Till JS 85e38176d8 chore(macmini/scripts): runbook hardening — status diff + ingress walk
Two failures during the 2026-04-07 production outage triage were caused
not by the underlying outage but by `status.sh` and `health-check.sh`
hiding the broken state. Both scripts are now hardened so the same
outage shape cannot recur unnoticed.

status.sh — compose-vs-running diff
  The old script printed "X containers running / Y total" without
  noticing that some compose-defined containers were never started in
  the first place. The Mac Mini was running 37 of 42 declared
  containers and the script reported "37 running" with no indication
  of the gap — `mana-core-sync` and `mana-api-gateway` were silently
  missing for hours.

  New behaviour: read every service from `docker compose config`,
  diff its `container_name` against `docker ps`, and report each
  declared service whose container is not currently up. The same
  outage state would have been flagged on the very first run.

health-check.sh — public-hostname walk via Cloudflare DNS
  The old script probed ~50 hardcoded `localhost:<port>/health`
  endpoints across Chat, Todo, Calendar, etc. — but the per-app
  HTTP backends those endpoints expected don't exist anymore (the
  ghost-API cleanup removed them entirely). Every probe returned
  HTTP 000 / connection refused, generating a wall of false-positive
  alerts that drowned out the real signal.

  The block was replaced with a dynamic walk of every `hostname:`
  entry in `~/.cloudflared/config.yml`. Each hostname is probed via
  the public Cloudflare tunnel, so DNS gaps, missing tunnel routes,
  502/530 origin failures and timeouts surface as failures the same
  way real users would experience them. On its first run after the
  cleanup it surfaced eighteen previously-invisible hostname failures
  (no DNS, 502, or 530) — every one of them a real production issue.
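
  A minimal sketch of such a walk follows. The helper names
  `list_hostnames` and `classify_status` are hypothetical, the parser
  assumes unquoted `hostname:` values on one line, and treating every
  status other than 000, 502 and 530 as healthy is an assumption of the
  sketch rather than the script's documented rule.

```shell
# Sketch only: list_hostnames and classify_status are hypothetical helpers,
# not functions taken from health-check.sh.

# Print the value of every "hostname:" key in a cloudflared config read from
# stdin. Assumes unquoted values on the same line as the key.
list_hostnames() {
  awk '{ sub(/^[ \t]*-?[ \t]*/, "") } $1 == "hostname:" { print $2 }'
}

# Map a curl %{http_code} value to a verdict. curl reports 000 when it got no
# HTTP response at all (DNS failure, refused connection, timeout).
# Treating every other code as "ok" is an assumption of this sketch.
classify_status() {
  case "$1" in
    000)     echo "unreachable" ;;
    502|530) echo "origin-failure" ;;
    *)       echo "ok" ;;
  esac
}

# Live usage (needs network access):
#   list_hostnames < ~/.cloudflared/config.yml | while IFS= read -r host; do
#     code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$host/")
#     printf '%s %s %s\n' "$host" "$code" "$(classify_status "$code")"
#   done
```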

  DNS resolution intentionally goes through `dig +short HOST @1.1.1.1`
  instead of the local resolver. The Mac Mini's home-router DNS keeps
  a negative cache for hours after the first failed lookup, so newly
  added CNAMEs (like the post-outage sync/media records) appeared as
  "no response" from inside the script for hours even though external
  users saw them resolve immediately. Asking Cloudflare's DNS directly
  gives the script the same view the public internet has.
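
  As a sketch, the check could be wrapped like this. The function name
  `resolves_publicly` is hypothetical; the `dig` invocation is the one
  quoted above.

```shell
# Sketch only: resolves_publicly is a hypothetical helper name.
# Succeeds when Cloudflare's resolver returns at least one record for $1.
resolves_publicly() {
  # dig +short prints one record per line and nothing for NXDOMAIN or an
  # empty answer. Diagnostics such as ";; connection timed out" also go to
  # stdout, so lines starting with ";" are filtered out.
  answer=$(dig +short "$1" @1.1.1.1 2>/dev/null | grep -v '^;')
  [ -n "$answer" ]
}

# Usage:
#   resolves_publicly "$host" || echo "no public DNS record for $host"
```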

  The Matrix, Element, GPU-LAN-redundant and monitoring port-by-port
  blocks were removed — the public-hostname walk covers all of them
  via their `*.mana.how` hostnames going through the actual tunnel.

  The "stuck container" detector now ignores `*-init` containers
  (one-shot init containers: Exit 0 means success, and they are
  intentionally never re-run).
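
  One way to express that exclusion, as a sketch: `drop_init_containers`
  is a hypothetical helper, and using exited containers as the candidate
  list is an assumption about how the detector selects them.

```shell
# Sketch only: drop_init_containers is a hypothetical helper name.
# Reads container names on stdin and drops those ending in "-init".
drop_init_containers() {
  awk '!/-init$/'
}

# Live usage (assumes exited containers are the "stuck" candidates):
#   docker ps -a --filter status=exited --format '{{.Names}}' | drop_init_containers
```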

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 22:31:53 +02:00
mac-mini chore(macmini/scripts): runbook hardening — status diff + ingress walk 2026-04-07 22:31:53 +02:00
test-data feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
test-reporting chore: complete ManaCore → Mana rename (docs, go modules, plists, images) 2026-04-07 12:26:10 +02:00
audit-workspace-deps.mjs feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
backup-monitoring.sh feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
check-status.sh feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
create-gift-codes.mjs feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
deploy-metrics.sh fix(deploy): fix image size measurement in deploy metrics 2026-03-20 21:13:03 +01:00
ecosystem-audit.mjs feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
fix-mixed-imports.mjs Fix wrong type 2025-12-04 23:25:25 +01:00
generate-dockerfiles.mjs feat(infra): extend Dockerfile validator to backends and services 2026-03-25 08:57:10 +01:00
generate-env.mjs feat(dreams): voice capture via mana-stt 2026-04-07 14:39:11 +02:00
generate-status-page.sh feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
lighthouse-audit.sh feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
run-tests-with-coverage.sh feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
setup-databases.sh feat(events): add mana-events service + public RSVP flow (Phase 1b) 2026-04-07 14:27:48 +02:00
validate-dockerfiles.mjs feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00
validate-monorepo.mjs feat: rename ManaCore to Mana across entire codebase 2026-04-05 20:00:13 +02:00