managarten/docker/grafana
Till JS 004b3b7fca chore(observability): Grafana dashboard for agent-loop metrics
One focused dashboard covering the M1+M2 instrumentation in a single
view. Sections top-to-bottom:

  1. Service Health — mana-mcp + mana-ai up/down, 1h deny rate,
     compactions/h. The deny rate is the single most important
     number during the POLICY_MODE=log-only soak: any non-zero
     deny rate in log-only mode is real traffic that enforce mode
     would reject.
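
A minimal sketch of the deny-rate reading described above, assuming the
deny count is exposed as a monotonically increasing counter sampled at
the start and end of a window. The function names and the reset handling
are illustrative assumptions, not taken from the dashboard JSON.

```python
def deny_rate_per_min(count_start: float, count_end: float, window_seconds: float) -> float:
    """Average denies per minute between two readings of an increasing counter."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta = count_end - count_start
    if delta < 0:
        # Counter went backwards: assume the service restarted and the
        # counter began again at zero (an assumption of this sketch).
        delta = count_end
    return delta * 60.0 / window_seconds


def enforce_would_reject(count_start: float, count_end: float, window_seconds: float) -> bool:
    """True if any traffic in the window would have been rejected in enforce mode."""
    return deny_rate_per_min(count_start, count_end, window_seconds) > 0


# Hypothetical readings: 30 denies over one hour.
print(deny_rate_per_min(100, 130, 3600))
```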

  2. Policy Gate (mana-mcp)
     - Decisions / sec by outcome (allow/deny/flagged)
     - Deny reasons breakdown — the soak signal for flipping to
       enforce. If one reason dominates, address it before the flip.
     - Tool invocations / sec by outcome (success / handler-error /
       input-invalid)
     - Top 10 invoked tools (24h) — usage heatmap for prioritising
       which tools deserve the best policy-hint tuning.
     - Handler p50/p95/p99 latency per tool.
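
The "one reason dominates" check on the deny-reasons panel could be
approximated as below. This is a sketch only: the 50% share threshold
and the reason names in the usage line are hypothetical and do not come
from mana-mcp.

```python
from typing import Optional


def dominant_deny_reason(counts: dict[str, int], min_share: float = 0.5) -> Optional[str]:
    """Return the deny reason holding at least min_share of all denies, if any."""
    total = sum(counts.values())
    if total == 0:
        return None  # no denies at all, nothing to address
    reason, n = max(counts.items(), key=lambda kv: kv[1])
    return reason if n / total >= min_share else None


# Hypothetical reason labels, for illustration only.
print(dominant_deny_reason({"missing-scope": 8, "rate-limit": 2}))
```

A non-None result is the reason to address before flipping to enforce.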

  3. Reminder Channel (mana-ai)
     - Rate by producer (token-budget, retry-loop, compacted)
     - Rate by severity. The signal to watch is whether
       warn/escalate trend DOWN over time: a downward trend means
       the LLM is actually reacting to the hints. If warn stays
       flat, the producer wording probably isn't landing.
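
One way to put a number on "trending down" is a least-squares slope over
evenly spaced rate samples. The sketch below assumes the warn rate has
already been exported as a plain list (for example one value per day);
the helper names are illustrative.

```python
def trend_slope(samples: list[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per sample step)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def hints_are_landing(warn_rates: list[float]) -> bool:
    """True when the warn rate is falling over the sampled period."""
    return trend_slope(warn_rates) < 0


print(trend_slope([10, 8, 6, 4]))
```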

  4. Context Compactor (mana-ai)
     - Triggers/h cumulative
     - Turns folded per compaction (p50/p95). Values < 3 flag
       MANA_AI_COMPACT_MAX_CTX misconfig — the threshold is firing
       on already-short histories.
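
The "values < 3" rule can be read as a simple check on the median number
of turns folded. This sketch assumes the per-compaction counts are
available as a list and that the median (p50) is the value compared; the
message does not say which percentile the rule applies to.

```python
from statistics import median

MIN_TURNS_FOLDED = 3  # threshold quoted in the panel description


def compactor_looks_misconfigured(turns_folded: list[int]) -> bool:
    """True when the median compaction folds fewer than MIN_TURNS_FOLDED turns."""
    if not turns_folded:
        return False  # no compactions observed, nothing to judge
    return median(turns_folded) < MIN_TURNS_FOLDED


print(compactor_looks_misconfigured([1, 2, 2, 5]))
```

A True result suggests revisiting MANA_AI_COMPACT_MAX_CTX.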

  5. Mission Runner Baseline — tick duration + planner rounds for
     correlation (e.g. "did enabling the compactor change mean
     tick duration?").

Dashboard provisioning already auto-loads anything in
/var/lib/grafana/dashboards
(docker/grafana/provisioning/dashboards/default.yml), so this is
live after the next Grafana restart. UID: agent-loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:09:32 +02:00
dashboards chore(observability): Grafana dashboard for agent-loop metrics 2026-04-23 18:09:32 +02:00
provisioning feat(mana-ai): OpenTelemetry tracing + Grafana Tempo backend 2026-04-16 15:21:23 +02:00