mirror of https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-14 19:01:08 +02:00
chore(macmini): infra cleanup — compose env, blackbox mem, prometheus gpu probes

Three Mac Mini infrastructure follow-ups bundled:

1. docker-compose.macmini.yml: drop ghost backend env vars from
   the mana-app-web service (todo, calendar, contacts, chat, storage,
   cards, music, nutriphi `PUBLIC_*_API_URL{,_CLIENT}` plus the memoro
   server URLs). The matching consumers were removed in the earlier
   ghost-API cleanup commits, so these env entries had been wiring
   nothing into the running container for several deploys.
   Force-recreating mana-app-web after pulling this commit will pick
   up the slimmer env automatically.

2. docker-compose.macmini.yml: bump `mana-mon-blackbox` mem_limit
   from 32m to 128m. blackbox-exporter v0.25 uses more than 32m
   under load and was OOM-restart-looping every ~90 seconds, which
   in turn made `status.mana.how` and the Prometheus probe metrics
   stale (the scraper was missing every other window).

3. docker/prometheus/prometheus.yml: split `blackbox-gpu` into two
   jobs:

   - `blackbox-gpu` now probes `/health` via the http_health
     module, because the GPU services (whisper STT, FLUX image
     gen, Coqui TTS) return 401/404 on `/` by design (auth or
     API-only). The previous http_2xx-on-`/` probe was reporting
     all four targets as down even though the three deployed
     services answered `/health` with 200, which inflated the
     down count on status.mana.how.
   - `blackbox-gpu-root` keeps the http_2xx-on-`/` probe for
     Ollama, which has no `/health` endpoint but does answer
     2xx on its root.

   Both jobs share the same blackbox-exporter relabel rewrite so
   the targets are routed through the exporter container, not
   scraped directly by VictoriaMetrics.

Verified post-fix: status.mana.how reports 41/42 services up (only
`gpu-video` remains down; LTX Video Gen is intentionally not
deployed yet on the Windows GPU box).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
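
The memory bump in item 2 is not visible in the diff shown on this page. As a rough sketch only, the change would look something like the excerpt below. The service name is an assumption inferred from the `blackbox-exporter:9115` address used in prometheus.yml, and the image tag and restart policy are likewise assumed, not copied from docker-compose.macmini.yml.

```yaml
# Hypothetical excerpt of docker-compose.macmini.yml.
# Only the container name and the 32m -> 128m change come from the commit message.
services:
  blackbox-exporter:                        # assumed service name
    container_name: mana-mon-blackbox
    image: prom/blackbox-exporter:v0.25.0   # assumed image and tag
    restart: unless-stopped                 # assumed; the commit only mentions a restart loop
    mem_limit: 128m                         # previously 32m
```

Under these assumptions, `docker compose -f docker-compose.macmini.yml up -d blackbox-exporter` would recreate the container with the new limit, since `up` recreates services whose configuration has changed.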
parent 4cfa869f33
commit a55aae6cb5
2 changed files with 31 additions and 25 deletions

@@ -307,18 +307,36 @@ scrape_configs:
       - target_label: __address__
         replacement: blackbox-exporter:9115
 
-  # GPU Server Services
+  # GPU Server Services — probe /health, not /
+  # The GPU services (whisper STT, TTS, FLUX image gen) only return 2xx
+  # on /health; their root path returns 401/403/404 by design (auth or
+  # API-only). Ollama is the exception — its / returns 200, but it has
+  # no /health endpoint, so we keep it on / via a separate target.
   - job_name: 'blackbox-gpu'
     metrics_path: /probe
     params:
+      module: [http_health]
+    static_configs:
+      - targets:
+          - https://gpu-stt.mana.how/health
+          - https://gpu-tts.mana.how/health
+          - https://gpu-img.mana.how/health
+          - https://gpu-video.mana.how/health
+    relabel_configs:
+      - source_labels: [__address__]
+        target_label: __param_target
+      - source_labels: [__param_target]
+        target_label: instance
+      - target_label: __address__
+        replacement: blackbox-exporter:9115
+
+  - job_name: 'blackbox-gpu-root'
+    metrics_path: /probe
+    params:
       module: [http_2xx]
     static_configs:
       - targets:
           - https://gpu-ollama.mana.how
-          - https://gpu-stt.mana.how
-          - https://gpu-tts.mana.how
-          - https://gpu-img.mana.how
-          - https://gpu-video.mana.how
     relabel_configs:
       - source_labels: [__address__]
         target_label: __param_target
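
The `http_health` module used by the `blackbox-gpu` job lives in the blackbox-exporter configuration, which this diff does not show. A minimal sketch of what the two modules could look like, assuming `http_health` only needs to accept a 200 from `/health` (timeouts and the status list are assumptions):

```yaml
# Hypothetical blackbox.yml excerpt; the real module definitions are not part of this commit page.
modules:
  http_health:
    prober: http
    timeout: 5s                    # assumed
    http:
      method: GET
      valid_status_codes: [200]    # assumed: the commit message says /health answers 200
  http_2xx:
    prober: http
    timeout: 5s                    # assumed
    http:
      method: GET                  # valid_status_codes omitted, so any 2xx is accepted
```

With the relabel rules from the diff, a scrape of the first target becomes a request to `http://blackbox-exporter:9115/probe` with `module=http_health` and `target=https://gpu-stt.mana.how/health` as query parameters, and the `instance` label carries the target URL instead of the exporter address.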