feat(mana-ai): Prometheus /metrics endpoint + status.mana.how integration

Wires mana-ai into the existing observability stack so tick throughput,
plan-failure rates, planner latencies, and snapshot refresh health are
visible in Grafana + Prometheus, and the service's uptime surfaces on
status.mana.how under the "Internal" section.

- `src/metrics.ts` — prom-client Registry with `mana_ai_` prefix.
  Counters: ticks_total, plans_produced_total, plans_written_back_total,
  parse_failures_total, mission_errors_total, snapshots_new/updated,
  snapshot_rows_applied_total, http_requests_total.
  Histograms: tick_duration_seconds (0.1–120s),
  planner_request_duration_seconds (0.25–60s),
  http_request_duration_seconds (0.005–10s).
- `src/index.ts` — HTTP middleware labels every request by
  method/path/status; `/metrics` serves the Prometheus text format.
- `src/cron/tick.ts` — increments counters + wraps the tick with
  `tickDuration.startTimer()`. Snapshot refresh stats are passed
  through to the snapshot counters.
- `src/planner/client.ts` — wraps `complete()` in a latency histogram
  timer so planner tail latency shows up separately from tick duration.
- `docker/prometheus/prometheus.yml` —
  1. New `mana-ai` scrape job against `mana-ai:3066/metrics`
     (30s scrape interval).
  2. `http://mana-ai:3066/health` added to the `blackbox-internal`
     job's targets so uptime shows on status.mana.how alongside
     mana-geocoding.
- `scripts/generate-status-page.sh` — friendly label for the new probe:
  `mana-ai:3066/health` → "Mana AI Runner" (generator already iterates
  `blackbox-internal`, no other changes needed).
- `package.json` — prom-client ^15.1.3
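
For readers unfamiliar with the timer pattern used in `src/cron/tick.ts`
and `src/planner/client.ts`, here is a rough, dependency-free sketch of
what a histogram timer does and how the result appears in the Prometheus
text format. It is not the code in this commit, which uses prom-client.
`SketchHistogram`, `timed`, and the bucket bounds are hypothetical names
and values chosen for illustration only.

```typescript
// Hand-rolled illustration; the actual service relies on prom-client.
// All names and bucket bounds here are assumptions, not taken from the repo.
class SketchHistogram {
  private readonly buckets: { le: number; count: number }[];
  private sum = 0;
  private total = 0;

  constructor(
    readonly name: string,
    readonly help: string,
    upperBounds: number[], // ascending upper bounds, in seconds
  ) {
    this.buckets = upperBounds.map((le) => ({ le, count: 0 }));
  }

  // Buckets are cumulative: an observation counts in every bucket
  // whose upper bound is >= the observed value.
  observe(seconds: number): void {
    for (const bucket of this.buckets) {
      if (seconds <= bucket.le) bucket.count += 1;
    }
    this.sum += seconds;
    this.total += 1;
  }

  // Returns a function that records the elapsed time when called.
  // Date.now() has millisecond resolution, which is enough for a sketch.
  startTimer(): () => number {
    const start = Date.now();
    return () => {
      const seconds = (Date.now() - start) / 1000;
      this.observe(seconds);
      return seconds;
    };
  }

  // Render in the Prometheus text exposition format.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} histogram`,
      ...this.buckets.map((b) => `${this.name}_bucket{le="${b.le}"} ${b.count}`),
      `${this.name}_bucket{le="+Inf"} ${this.total}`,
      `${this.name}_sum ${this.sum}`,
      `${this.name}_count ${this.total}`,
    ];
    return lines.join("\n") + "\n";
  }
}

// Wrap an async call so its duration is recorded even when it throws.
async function timed<T>(h: SketchHistogram, fn: () => Promise<T>): Promise<T> {
  const end = h.startTimer();
  try {
    return await fn();
  } finally {
    end();
  }
}
```

A caller would wrap the planner request along the lines of
`await timed(plannerLatency, () => complete(prompt))`, where
`plannerLatency` and `prompt` are placeholders. Recording in `finally`
means failed requests still contribute to the latency distribution.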

All 17 Bun tests still pass; tsc clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Till JS 2026-04-15 01:41:40 +02:00
parent 767b64cdd4
commit 0bf01f434e
9 changed files with 184 additions and 3 deletions


@@ -123,6 +123,15 @@ scrape_configs:
    metrics_path: '/metrics'
    scrape_interval: 30s
  # Mana AI Service (Bun) — background Mission Runner for the AI Workbench.
  # Exposes tick stats, planner-request latencies, snapshot refresh
  # counters, and standard HTTP metrics at /metrics.
  - job_name: 'mana-ai'
    static_configs:
      - targets: ['mana-ai:3066']
    metrics_path: '/metrics'
    scrape_interval: 30s
  # ============================================
  # GPU Server (Windows PC, LAN: 192.168.178.11)
  # ============================================
@@ -297,6 +306,8 @@ scrape_configs:
          # Upstream Pelias health, proxied through the wrapper so the
          # blackbox-exporter doesn't need host.docker.internal access.
          - http://mana-geocoding:3018/health/pelias
          # mana-ai (Mission Runner) — internal-only, no CF tunnel.
          - http://mana-ai:3066/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target