managarten/docker/grafana/dashboards
Till JS 004b3b7fca chore(observability): Grafana dashboard for agent-loop metrics
One focused dashboard covering the M1+M2 instrumentation in a single
view. Sections top-to-bottom:

  1. Service Health — mana-mcp + mana-ai up/down, 1h deny rate,
     compactions/h. The deny rate is the single most important
     number during the POLICY_MODE=log-only soak: any non-zero
     deny/min in log-only mode is real traffic that enforce mode
     would reject.

  2. Policy Gate (mana-mcp)
     - Decisions / sec by outcome (allow/deny/flagged)
     - Deny reasons breakdown — the soak signal for flipping to
       enforce. If one reason dominates, address it before the flip.
     - Tool invocations / sec by outcome (success / handler-error /
       input-invalid)
     - Top 10 invoked tools (24h) — usage heatmap for prioritising
       which tools deserve the best policy-hint tuning.
     - Handler p50/p95/p99 latency per tool.

  3. Reminder Channel (mana-ai)
     - Rate by producer (token-budget, retry-loop, compacted)
     - Rate by severity. The interesting signal is whether
       warn/escalate trend DOWN over time: a downward trend means the
       LLM is actually reacting to the hints. If warn stays flat, the
       producer wording probably isn't landing.

  4. Context Compactor (mana-ai)
     - Triggers/h cumulative
     - Turns folded per compaction (p50/p95). Values < 3 suggest
       MANA_AI_COMPACT_MAX_CTX is misconfigured: the threshold is
       firing on already-short histories.

  5. Mission Runner Baseline — tick duration + planner rounds for
     correlation (e.g. "did enabling the compactor change mean
     tick duration?").
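
For orientation, the Service Health and Policy Gate panels above could
be driven by PromQL along these lines. This is only a sketch: the metric
names (mana_mcp_policy_decisions_total, mana_mcp_tool_invocations_total,
mana_mcp_tool_handler_duration_seconds_bucket) and the label names
(outcome, tool) are assumptions made for illustration, not taken from
agent-loop.json.

```python
# Hypothetical metric and label names; the real ones live in agent-loop.json.
DECISIONS = "mana_mcp_policy_decisions_total"
INVOCATIONS = "mana_mcp_tool_invocations_total"
HANDLER_BUCKETS = "mana_mcp_tool_handler_duration_seconds_bucket"


def deny_rate_expr(window: str = "1h") -> str:
    """Fraction of policy decisions in `window` whose outcome was deny."""
    denied = f'sum(increase({DECISIONS}{{outcome="deny"}}[{window}]))'
    total = f"sum(increase({DECISIONS}[{window}]))"
    return f"{denied} / {total}"


def top_tools_expr(k: int = 10, window: str = "24h") -> str:
    """The k most-invoked tools over `window`."""
    return f"topk({k}, sum by (tool) (increase({INVOCATIONS}[{window}])))"


def latency_quantile_expr(q: float, window: str = "5m") -> str:
    """Per-tool handler latency quantile from a Prometheus histogram."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    return (
        f"histogram_quantile({q}, "
        f"sum by (le, tool) (rate({HANDLER_BUCKETS}[{window}])))"
    )


if __name__ == "__main__":
    print(deny_rate_expr())
    print(top_tools_expr())
    for q in (0.5, 0.95, 0.99):
        print(latency_quantile_expr(q))
```

Note that the deny-rate ratio evaluates to NaN when no decisions were
recorded in the window, so a panel built this way would likely want a
"no data" mapping rather than treating NaN as zero.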

Dashboard provisioning already auto-loads anything in
/var/lib/grafana/dashboards (configured in
docker/grafana/provisioning/dashboards/default.yml), so this dashboard
is live after the next Grafana restart. Its UID is agent-loop.
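
Because every file in the provisioned directory is loaded, a quick
pre-restart check that each dashboard JSON carries a distinct top-level
"uid" can catch collisions early. The sketch below is illustrative only;
the helper names dashboard_uids and duplicate_uids are hypothetical and
not part of this repository.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def dashboard_uids(directory: str) -> Dict[str, Optional[str]]:
    """Map each *.json file name in `directory` to its top-level "uid"."""
    uids: Dict[str, Optional[str]] = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            uids[path.name] = json.load(fh).get("uid")
    return uids


def duplicate_uids(uids: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Return every uid that is shared by more than one file."""
    by_uid: Dict[str, List[str]] = {}
    for name, uid in uids.items():
        if uid is not None:
            by_uid.setdefault(uid, []).append(name)
    return {uid: names for uid, names in by_uid.items() if len(names) > 1}
```

Run against docker/grafana/dashboards, dashboard_uids should report
agent-loop for agent-loop.json, and duplicate_uids should come back
empty.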

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:09:32 +02:00
File                      Last modified               Last commit
agent-loop.json           2026-04-23 18:09:32 +02:00  chore(observability): Grafana dashboard for agent-loop metrics
application-details.json  2026-04-05 20:00:13 +02:00  feat: rename ManaCore to Mana across entire codebase
auth-service.json         2026-04-05 20:00:13 +02:00  feat: rename ManaCore to Mana across entire codebase
backends.json             2026-04-05 20:00:13 +02:00  feat: rename ManaCore to Mana across entire codebase
business-metrics.json     2026-04-05 20:00:13 +02:00  feat: rename ManaCore to Mana across entire codebase
database-details.json     2026-04-05 20:00:13 +02:00  feat: rename ManaCore to Mana across entire codebase
deploy-tracking.json      2026-03-20 17:35:41 +01:00  fix(infra): fix deploy tracking dashboard datasource UIDs and instant queries
error-tracking.json       2026-03-19 21:14:09 +01:00  feat(grafana): add GlitchTip error tracking dashboard
logs-explorer.json        2026-04-08 16:47:54 +02:00  chore(matrix): final scrub of stale matrix references
mana-llm.json             2026-03-24 11:16:27 +01:00  feat(monitoring): add LLM Grafana dashboard, Prometheus scraping, and alerts
master-overview.json      2026-04-14 20:59:16 +02:00  refactor: rename zitare -> quotes (Zitate)
system-overview.json      2026-04-14 20:59:16 +02:00  refactor: rename zitare -> quotes (Zitate)
uptime.json               2026-04-11 16:11:01 +02:00  feat(monitoring): add mana-geocoding + Pelias to prod compose, Prometheus, Grafana, and status.mana.how
user-statistics.json      2026-04-05 20:00:13 +02:00  feat: rename ManaCore to Mana across entire codebase