fix(macmini): blackbox-exporter uses 1.1.1.1/8.8.8.8 directly for DNS

Docker's embedded DNS resolver (127.0.0.11) forwards to the host
resolver, which on the Mac Mini forwards to the home router (a
FRITZ!Box). The router keeps a stale negative cache for hours
after a hostname first fails, so any newly added Cloudflare CNAME
(e.g. the GPU public hostnames recreated via the Cloudflare dashboard
during the 2026-04-07 cleanup) appears as "no such host" to the
blackbox probes for the entire negative-cache TTL, even though the
hostname resolves fine via 1.1.1.1 the whole time.
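
The failure mode described above can be sketched as a toy simulation.
This is not the FRITZ!Box implementation; the class name, the one-hour
TTL and the example hostnames are all assumptions made for
illustration. It only shows why a forwarder that remembers failures
keeps answering "no such host" after the record has been created
upstream.

```python
import time
from typing import Callable, Dict, List, Optional


class NegativeCachingResolver:
    """Hypothetical forwarder that remembers failed lookups for a fixed TTL."""

    def __init__(
        self,
        upstream: Callable[[str], Optional[str]],
        negative_ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._failed_until: Dict[str, float] = {}

    def resolve(self, name: str) -> Optional[str]:
        now = self._clock()
        expiry = self._failed_until.get(name)
        if expiry is not None:
            if now < expiry:
                # Served from the negative cache; upstream is never asked.
                return None
            del self._failed_until[name]
        answer = self._upstream(name)
        if answer is None:
            # Remember the failure until the negative TTL runs out.
            self._failed_until[name] = now + self._negative_ttl
        return answer


def demo() -> List[Optional[str]]:
    records: Dict[str, str] = {}  # stands in for the authoritative zone
    now = [0.0]                   # fake clock, in seconds
    router = NegativeCachingResolver(
        records.get, negative_ttl=3600.0, clock=lambda: now[0]
    )
    name = "gpu.example.test"     # illustrative hostname, not a real one
    results = [router.resolve(name)]            # record does not exist yet
    records[name] = "tunnel.example.test"       # CNAME is created upstream
    now[0] = 60.0
    results.append(router.resolve(name))        # router still says "no such host"
    results.append(records.get(name))           # asking upstream directly works
    now[0] = 3601.0
    results.append(router.resolve(name))        # negative TTL expired
    return results
```

Calling `demo()` walks through the sequence: the lookup fails, the
record appears upstream, the forwarder keeps failing until its
negative TTL expires, while a direct upstream query succeeds
throughout.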

Symptom before the fix:
  health-check.sh (uses dig @1.1.1.1)  → All services healthy 
  status.mana.how (via blackbox/VM)    → 4 GPU services down 

The two views contradicted each other: the public-facing status
page reported four healthy services as down, while the operator
runbook reported them as up. That is confusing, and exactly the
kind of monitoring discrepancy a launch should not ship with.
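
The discrepancy comes down to which resolver answers the query. A
minimal sketch, using only the Python standard library, of asking one
specific server for a name (roughly what `dig @1.1.1.1` does) and
reading the DNS response code. The function names are illustrative,
and the assumption that the router's stale answer arrives as NXDOMAIN
(RCODE 3) is not confirmed by this commit.

```python
import random
import socket
import struct

NOERROR = 0   # RCODE for a successful answer
NXDOMAIN = 3  # RCODE for "no such name"


def build_query(name: str, query_id: int, qtype: int = 1) -> bytes:
    """Build a single-question DNS query (qtype 1 = A, class IN)."""
    # Flags 0x0100: standard query with "recursion desired" set.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def response_code(packet: bytes) -> int:
    """Extract the 4-bit RCODE from a DNS response header."""
    _query_id, flags = struct.unpack("!HH", packet[:4])
    return flags & 0x000F


def rcode_from(server: str, name: str, timeout: float = 3.0) -> int:
    """Send one UDP query to `server` and return the RCODE it answers with."""
    query = build_query(name, random.randint(0, 0xFFFF))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        packet, _addr = sock.recvfrom(512)
    return response_code(packet)
```

Comparing `rcode_from("1.1.1.1", host)` with the same call against the
resolver the probes actually use would show whether the two paths
disagree for a given `host`.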

Fix: pin the blackbox container to public DNS (Cloudflare + Google)
in compose. Blackbox now resolves directly against 1.1.1.1, bypassing
the home-router negative cache entirely. After the container was
recreated, the four GPU probes flipped from probe_success=0 to
probe_success=1 within one scrape interval, and status.mana.how went
from 38/42 to 41/42 (only gpu-video remains down; LTX Video Gen is
intentionally not deployed on the Windows GPU box yet).
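
The probe_success flip can be checked by reading the Prometheus text
output that blackbox returns for a probe. A minimal sketch of such a
check follows; the function names and the sample text are
illustrative, and the parser deliberately ignores optional timestamps
and other parts of the exposition format.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample line ("name value") to its float value.

    Simplification: assumes no trailing timestamps; comment lines
    (# HELP / # TYPE) and blank lines are skipped.
    """
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, raw = line.rpartition(" ")
        values[name] = float(raw)
    return values


def probe_succeeded(text: str) -> bool:
    """True when the output contains probe_success with value 1."""
    return parse_metrics(text).get("probe_success") == 1.0


# Illustrative sample, shaped like blackbox probe output.
SAMPLE_OK = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_dns_lookup_time_seconds 0.012
probe_success 1
"""
SAMPLE_FAILED = "probe_dns_lookup_time_seconds 0\nprobe_success 0\n"
```

A failed DNS lookup shows up as `probe_success 0`, so
`probe_succeeded(SAMPLE_FAILED)` is false while
`probe_succeeded(SAMPLE_OK)` is true.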

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Till JS 2026-04-07 23:47:57 +02:00
parent 24001e9545
commit 05ae348b12


@@ -1411,6 +1411,17 @@ services:
     container_name: mana-mon-blackbox
     restart: always
     mem_limit: 128m
+    # Use Cloudflare + Google public resolvers instead of Docker's
+    # embedded DNS (127.0.0.11). Docker DNS forwards to the host
+    # resolver which forwards to the home router (FRITZ!Box), and the
+    # router keeps a stale negative cache for hours after a hostname
+    # first fails. New CNAMEs (e.g. fresh GPU public hostnames added
+    # via the Cloudflare dashboard) appear as "no such host" to the
+    # blackbox probes for the entire negative-cache TTL even though
+    # they resolve fine via 1.1.1.1 directly.
+    dns:
+      - 1.1.1.1
+      - 8.8.8.8
     command: ["--config.file=/etc/blackbox/blackbox.yml"]
     volumes:
       - ./docker/blackbox/blackbox.yml:/etc/blackbox/blackbox.yml:ro