chore(macmini): infra cleanup — compose env, blackbox mem, prometheus gpu probes

Three Mac Mini infrastructure follow-ups bundled:

1. docker-compose.macmini.yml — drop ghost backend env vars from
   the mana-app-web service (todo, calendar, contacts, chat, storage,
   cards, music, nutriphi `PUBLIC_*_API_URL{,_CLIENT}` plus the memoro
   server URLs). The matching consumers were removed in the earlier
   ghost-API cleanup commits, so for several deploys these entries
   have been set in the running container without anything reading
   them. Force-recreating mana-app-web after pulling this commit
   picks up the slimmer env.

2. docker-compose.macmini.yml — bump `mana-mon-blackbox` mem_limit
   from 32m to 128m. blackbox-exporter v0.25 needs more than 32m
   under load and was OOM-restart-looping every ~90 seconds. The
   scraper therefore missed every other window, which left
   `status.mana.how` and the prometheus probe metrics stale.

3. docker/prometheus/prometheus.yml — split `blackbox-gpu` into two
   jobs:
     - `blackbox-gpu` now probes `/health` via the http_health
       module, because the GPU services (whisper STT, FLUX image
       gen, Coqui TTS) return 401/403/404 on `/` by design (auth
       or API-only). The previous http_2xx-on-`/` probe reported
       all three as down even though they answered `/health` with
       200, which inflated the down count on status.mana.how.
     - `blackbox-gpu-root` keeps the http_2xx-on-`/` probe for
       Ollama, which has no `/health` endpoint but does answer
       2xx on its root.
   Both jobs share the same blackbox-exporter relabel rewrite so
   the targets are routed through the exporter container, not
   scraped directly by VictoriaMetrics.
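
For reference, a minimal sketch of what an `http_health` module could
look like in the blackbox-exporter config. The real module definition
is not part of this commit, so the file name, timeout, and accepted
status codes below are assumptions for illustration only:

```yaml
# blackbox.yml (hypothetical sketch, not the deployed config)
modules:
  http_health:
    prober: http
    timeout: 5s                  # assumed value
    http:
      method: GET
      valid_status_codes: [200]  # assumption: only a plain 200 counts as healthy
      follow_redirects: false    # assumption: a redirect away from /health is a failure
```

With a module shaped like this, a 401/403/404 on `/` no longer matters,
because the probed URL is the `/health` path given in the target list.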

Verified post-fix: status.mana.how reports 41/42 services up (only
`gpu-video` remains down — LTX Video Gen is intentionally not
deployed yet on the Windows GPU box).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Till JS 2026-04-07 22:59:38 +02:00
parent 4cfa869f33
commit a55aae6cb5
2 changed files with 31 additions and 25 deletions


@@ -307,18 +307,36 @@ scrape_configs:
       - target_label: __address__
         replacement: blackbox-exporter:9115
 
-  # GPU Server Services
+  # GPU Server Services — probe /health, not /
+  # The GPU services (whisper STT, TTS, FLUX image gen) only return 2xx
+  # on /health; their root path returns 401/403/404 by design (auth or
+  # API-only). Ollama is the exception — its / returns 200, but it has
+  # no /health endpoint, so we keep it on / via a separate target.
   - job_name: 'blackbox-gpu'
+    metrics_path: /probe
+    params:
+      module: [http_health]
+    static_configs:
+      - targets:
+        - https://gpu-stt.mana.how/health
+        - https://gpu-tts.mana.how/health
+        - https://gpu-img.mana.how/health
+        - https://gpu-video.mana.how/health
+    relabel_configs:
+      - source_labels: [__address__]
+        target_label: __param_target
+      - source_labels: [__param_target]
+        target_label: instance
+      - target_label: __address__
+        replacement: blackbox-exporter:9115
+
+  - job_name: 'blackbox-gpu-root'
     metrics_path: /probe
     params:
       module: [http_2xx]
     static_configs:
       - targets:
         - https://gpu-ollama.mana.how
-        - https://gpu-stt.mana.how
-        - https://gpu-tts.mana.how
-        - https://gpu-img.mana.how
-        - https://gpu-video.mana.how
     relabel_configs:
       - source_labels: [__address__]
         target_label: __param_target