feat(mana-ai): Prometheus /metrics endpoint + status.mana.how integration

Wires mana-ai into the existing observability stack so tick throughput,
plan-failure rates, planner latencies, and snapshot refresh health are
visible in Grafana + Prometheus, and the service's uptime surfaces on
status.mana.how under the "Internal" section.

- `src/metrics.ts`: prom-client Registry with `mana_ai_` prefix.
  Counters: ticks_total, plans_produced_total, plans_written_back_total,
  parse_failures_total, mission_errors_total, snapshots_new/updated,
  snapshot_rows_applied_total, http_requests_total.
  Histograms: tick_duration_seconds (0.1–120s),
  planner_request_duration_seconds (0.25–60s),
  http_request_duration_seconds (0.005–10s).
- `src/index.ts`: HTTP middleware labels every request by
  method/path/status; `/metrics` serves the Prometheus text format.
- `src/cron/tick.ts`: increments counters + wraps the tick with
  `tickDuration.startTimer()`. Snapshot stats fold through.
- `src/planner/client.ts`: wraps `complete()` in a latency histogram
  timer so planner tail latency shows up separately from tick duration.
- `docker/prometheus/prometheus.yml`:
  1. New `mana-ai` scrape job against `mana-ai:3066/metrics` (30s).
  2. `/health` added to the `blackbox-internal` job so uptime shows on
     status.mana.how alongside mana-geocoding.
- `scripts/generate-status-page.sh`: friendly label for the new probe:
  `mana-ai:3066/health` → "Mana AI Runner" (generator already iterates
  `blackbox-internal`, no other changes needed).
- `package.json`: prom-client ^15.1.3

All 17 Bun tests still pass; tsc clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
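
To illustrate the timer-around-the-tick pattern and the text format that `/metrics` serves, here is a minimal dependency-free sketch. It is not the code from `src/metrics.ts`, which uses prom-client. The class names (`SketchCounter`, `SketchHistogram`), the `timedTick` wrapper, the help strings, the bucket boundaries, and the choice to count a tick even when it throws are all assumptions made for illustration; only the metric names and the 0.1–120s range come from the commit message.

```typescript
// Hypothetical stand-in for a Prometheus counter (not prom-client).
class SketchCounter {
  private value = 0;
  constructor(readonly name: string, readonly help: string) {}

  inc(by = 1): void {
    this.value += by;
  }

  // Render in the Prometheus text exposition format.
  render(): string {
    return (
      `# HELP ${this.name} ${this.help}\n` +
      `# TYPE ${this.name} counter\n` +
      `${this.name} ${this.value}\n`
    );
  }
}

// Hypothetical stand-in for a Prometheus histogram (not prom-client).
class SketchHistogram {
  private counts: number[];
  private sum = 0;
  private count = 0;

  constructor(
    readonly name: string,
    readonly help: string,
    readonly buckets: number[], // upper bounds in seconds, ascending
  ) {
    this.counts = buckets.map(() => 0);
  }

  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    // Buckets are cumulative: an observation counts toward every
    // bucket whose upper bound (le) is >= the observed value.
    this.buckets.forEach((le, i) => {
      if (seconds <= le) this.counts[i] += 1;
    });
  }

  // Returns a function that records the elapsed time when called.
  startTimer(): () => number {
    const startMs = Date.now();
    return () => {
      const seconds = (Date.now() - startMs) / 1000;
      this.observe(seconds);
      return seconds;
    };
  }

  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} histogram`,
    ];
    this.buckets.forEach((le, i) => {
      lines.push(`${this.name}_bucket{le="${le}"} ${this.counts[i]}`);
    });
    // The +Inf bucket always equals the total observation count.
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join("\n") + "\n";
  }
}

// Metric names follow the commit message; bucket boundaries are assumed.
const ticksTotal = new SketchCounter("mana_ai_ticks_total", "Ticks executed");
const tickDuration = new SketchHistogram(
  "mana_ai_tick_duration_seconds",
  "Wall-clock duration of one tick",
  [0.1, 1, 10, 120],
);

// Hypothetical wrapper: time the tick and count it even if it throws.
async function timedTick(runTick: () => Promise<void>): Promise<void> {
  const end = tickDuration.startTimer();
  try {
    await runTick();
  } finally {
    ticksTotal.inc();
    end();
  }
}
```

Under these assumptions, a `/metrics` handler would return `ticksTotal.render() + tickDuration.render()` with a `text/plain` content type. Timing the planner call with its own histogram, as the commit does, keeps slow planner responses distinguishable from slow ticks.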
2026-05-14 19:41:09 +02:00 · 2026-04-15 01:41:40 +02:00 · 2026-04-15 01:41:40 +02:00 · 0bf01f434e
commit 0bf01f434e
parent 767b64cdd4
9 changed files with 184 additions and 3 deletions
--- a/docker/prometheus/prometheus.yml
+++ b/docker/prometheus/prometheus.yml
@@ -123,6 +123,15 @@ scrape_configs:
    metrics_path: '/metrics'
    scrape_interval: 30s

+  # Mana AI Service (Bun) — background Mission Runner for the AI Workbench.
+  # Exposes tick stats, planner-request latencies, snapshot refresh
+  # counters, and standard HTTP metrics at /metrics.
+  - job_name: 'mana-ai'
+    static_configs:
+      - targets: ['mana-ai:3066']
+    metrics_path: '/metrics'
+    scrape_interval: 30s
+
  # ============================================
  # GPU Server (Windows PC, LAN: 192.168.178.11)
  # ============================================
@@ -297,6 +306,8 @@ scrape_configs:
          # Upstream Pelias health, proxied through the wrapper so the
          # blackbox-exporter doesn't need host.docker.internal access.
          - http://mana-geocoding:3018/health/pelias
+          # mana-ai (Mission Runner) — internal-only, no CF tunnel.
+          - http://mana-ai:3066/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target