Two failures during the 2026-04-07 production outage triage were caused
not by the underlying outage but by `status.sh` and `health-check.sh`
hiding the broken state. Both scripts hardened so the same outage
shape can't reoccur invisibly.
status.sh — compose-vs-running diff
The old script printed "X containers running / Y total" without
noticing that some compose-defined containers were never started in
the first place. The Mac Mini was running 37 of 42 declared
containers and the script reported "37 running" with no indication
of the gap — `mana-core-sync` and `mana-api-gateway` were silently
missing for hours.
New behaviour: read every service from `docker compose config`,
diff its `container_name` against `docker ps`, and report each
declared service whose container is not currently up. The same
outage state would have been flagged on the very first run.
health-check.sh — public-hostname walk via Cloudflare DNS
The old script probed ~50 hardcoded `localhost:<port>/health`
endpoints across Chat, Todo, Calendar, etc. — but the per-app
HTTP backends those endpoints expected don't exist anymore (the
ghost-API cleanup removed them entirely). Every probe returned
HTTP 000 / connection refused, generating a wall of false-positive
alerts that drowned out the real signal.
The block was replaced with a dynamic walk of every `hostname:`
entry in `~/.cloudflared/config.yml`. Each hostname is probed via
the public Cloudflare tunnel, so DNS gaps, missing tunnel routes,
502/530 origin failures and timeouts surface as failures the same
way real users would experience them. On its first run after the
cleanup it surfaced eighteen previously-invisible hostname failures
(no DNS, 502, or 530) — every one of them a real production issue.
DNS resolution intentionally goes through `dig +short HOST @1.1.1.1`
instead of the local resolver. The Mac Mini's home-router DNS keeps
a negative cache for hours after the first failed lookup, so newly
added CNAMEs (like the post-outage sync/media records) appeared as
"no response" from inside the script for hours even though external
users saw them resolve immediately. Asking Cloudflare's DNS directly
gives the script the same view the public internet has.
The Matrix, Element, GPU-LAN-redundant and monitoring port-by-port
blocks were removed — the public-hostname walk covers all of them
via their `*.mana.how` hostnames going through the actual tunnel.
The "stuck container" detector now ignores `*-init` containers
(one-shot init pods, Exit 0 = success, intentionally never re-run).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds end-to-end browser voice capture for the Memoro module, mirroring the
existing dreams pattern: MediaRecorder → SvelteKit server proxy → mana-stt
on the Windows GPU box via Cloudflare tunnel.
Recording UI lives in /memoro page header (mic button + live timer + cancel +
sticky-permission retry). Server proxy at /api/v1/memoro/transcribe forwards
the blob with the server-held X-API-Key. memosStore.createFromVoice creates a
placeholder memo with processingStatus='processing' and fires transcribeBlob
in the background, which writes the transcript and flips status on completion
(or 'failed' with error in metadata).
Also corrects the mana-stt hostname across the repo: stt-api.mana.how (which
never existed in DNS) → gpu-stt.mana.how (the actual Cloudflare tunnel route
to the Windows GPU box). Adds an ENVIRONMENT_VARIABLES.md section explaining
how to obtain MANA_STT_API_KEY and where the tunnel terminates. Adds tunnel
health probes to the mac-mini health-check script so we catch tunnel-side
breakage in addition to LAN-side.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New GPU service for fast text-to-video generation using LTX-Video (~2B params)
on the RTX 3090. Generates 480p clips in 10-30 seconds, uses ~10GB VRAM.
Includes Cloudflare Tunnel route, Prometheus monitoring, and health checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Deactivate Ollama, FLUX.2, and Telegram Bot LaunchAgents on Mac Mini
- Remove extra_hosts from mana-llm (no longer needs host.docker.internal)
- Update health-check.sh to monitor GPU server services instead of local
- Update status.sh to show GPU server status instead of native services
- Rewrite MAC_MINI_SERVER.md: remove ~400 lines of Ollama/FLUX/Bot docs,
add GPU server architecture diagram and deactivation notes
- Update CAPACITY_PLANNING.md with post-offload numbers (~80-150 peak users)
Mac Mini is now a pure hosting server (Web, API, DB, Sync).
All AI workloads run on GPU server (RTX 3090) via LAN.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add GlitchTip to health-check.sh monitoring endpoints
- Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds
- Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint
- Add ManaData disk usage gauge to Grafana system-overview dashboard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Medium priority stability improvements:
Alerting:
- Add vmalert for evaluating Prometheus alert rules
- Add alertmanager for alert routing and grouping
- Add alert-notifier service for Telegram/ntfy notifications
- Enable cadvisor scraping in prometheus config
Disk Monitoring:
- Add check-disk-space.sh for hourly disk monitoring
- Alert on 80% (warning) and 90% (critical) thresholds
- Auto-cleanup Docker when disk is critical
- Add com.manacore.disk-check.plist for LaunchD
Weekly Reports:
- Add weekly-report.sh for system health summary
- Includes: backup status, disk usage, container health,
database stats, error log summary
- Runs every Sunday at 10 AM via LaunchD
Health Check Updates:
- Add checks for vmalert, alertmanager, alert-notifier
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Disable api-gateway and skilltree-web (no working images/Dockerfiles)
- Fix mana-search Dockerfile healthcheck port and endpoint
- Update health-check.sh to skip disabled services
- Fix search service health endpoint (/api/v1/health)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ensure-containers-running.sh to detect and auto-start stuck containers
- Add LaunchD plist for automatic container health checks every 5 minutes
- Update health-check.sh with correct ports (3031/5011 for todo, etc.)
- Update deploy.sh health checks to match docker-compose.macmini.yml
- Fix container name references (mana-infra-postgres instead of manacore-postgres)
This prevents 502 errors when containers get stuck in "Created" status.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The backends exclude /health from the api/v1 prefix, so the correct
endpoint is /health not /api/v1/health. Also added missing services
(Contacts, Storage, Presi) and use /health for web apps.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Web apps: check root URL (/) instead of /health (SvelteKit has no health endpoint)
- Todo backend: fix path to /api/v1/health
- Remove redundant PostgreSQL HTTP check (checked via docker exec)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SSH sessions don't inherit the full PATH, so docker command
wasn't found. Now all scripts explicitly add /usr/local/bin
and /opt/homebrew/bin to PATH.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- setup-autostart.sh: Configure launchd services for boot
- startup.sh: Main startup script (waits for Docker, starts containers)
- health-check.sh: Check all services (runs every 5 min)
- status.sh: Full system status overview
- restart.sh: Restart containers (with --pull and --force options)
- stop.sh: Stop all containers gracefully
- README.md: Complete documentation
Includes optional ntfy.sh push notifications for health check failures.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>