managarten/docker
Till JS 004b3b7fca chore(observability): Grafana dashboard for agent-loop metrics
One focused dashboard covering the M1+M2 instrumentation in a single
view. Sections top-to-bottom:

  1. Service Health — mana-mcp + mana-ai up/down, 1h deny rate,
     compactions/h. The deny rate is the single most important
     number during the POLICY_MODE=log-only soak: a non-zero
     deny/min in log-only mode is real traffic that enforce mode
     would reject.

  2. Policy Gate (mana-mcp)
     - Decisions / sec by outcome (allow/deny/flagged)
     - Deny reasons breakdown — the soak signal for flipping to
       enforce. If one reason dominates, address it before the flip.
     - Tool invocations / sec by outcome (success / handler-error /
       input-invalid)
     - Top 10 invoked tools (24h) — usage heatmap for prioritising
       which tools deserve the best policy-hint tuning.
     - Handler p50/p95/p99 latency per tool.

  3. Reminder Channel (mana-ai)
     - Rate by producer (token-budget, retry-loop, compacted)
     - Rate by severity. The interesting signal is whether
       warn/escalate trend DOWN over time: a downward trend means
       the LLM is actually reacting to the hints. If warn stays
       flat, the producer wording probably isn't landing.

  4. Context Compactor (mana-ai)
     - Triggers/h cumulative
     - Turns folded per compaction (p50/p95). Values < 3 suggest a
       misconfigured MANA_AI_COMPACT_MAX_CTX: the threshold is
       firing on already-short histories.

  5. Mission Runner Baseline — tick duration + planner rounds for
     correlation (e.g. "did enabling the compactor change mean
     tick duration?").
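
The soak checks described in sections 1, 2 and 4 can be sketched as a few
pure functions. This is only an illustration of the reasoning, not code from
the repository: the function names, the input shapes and the 50% dominance
threshold are assumptions, and the real numbers would come from the
dashboard's Prometheus queries.

```typescript
// Hypothetical helpers; names, shapes and the default threshold are assumptions.

/** Denials per minute over a window. Any value above 0 in log-only mode
 *  is traffic that enforce mode would reject. */
function denyPerMinute(denyCount: number, windowMinutes: number): number {
  return windowMinutes <= 0 ? 0 : denyCount / windowMinutes;
}

/** Returns the deny reason whose share of all denials exceeds `threshold`,
 *  or null when no single reason dominates (or there are no denials). */
function dominantDenyReason(
  byReason: Record<string, number>,
  threshold = 0.5,
): string | null {
  const total = Object.values(byReason).reduce((a, b) => a + b, 0);
  if (total === 0) return null;
  for (const [reason, count] of Object.entries(byReason)) {
    if (count / total > threshold) return reason;
  }
  return null;
}

/** Fewer than 3 turns folded per compaction hints that the compaction
 *  threshold fires on already-short histories. */
function compactorLooksMisconfigured(turnsFoldedP50: number): boolean {
  return turnsFoldedP50 < 3;
}
```

For example, `dominantDenyReason({ "rate-limit": 8, scope: 2 })` returns
`"rate-limit"` (the reason labels here are made up), which would be the one
to address before flipping to enforce.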

Dashboard provisioning already auto-loads anything in
/var/lib/grafana/dashboards
(docker/grafana/provisioning/dashboards/default.yml), so this is
live after the next Grafana restart. Dashboard UID: agent-loop.
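
One way to confirm the dashboard was picked up after the restart is to ask
Grafana's HTTP API for it by UID (`GET /api/dashboards/uid/:uid`). The sketch
below is an assumption-laden illustration: the base URL is a placeholder, the
helper names are hypothetical, and the HTTP call is injected as `get` so that
authentication and the choice of client stay up to the caller.

```typescript
// Hypothetical helpers; only the /api/dashboards/uid/:uid route is Grafana's own.

/** Builds the Grafana API URL for a dashboard UID, tolerating trailing slashes. */
function dashboardByUidUrl(baseUrl: string, uid: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/api/dashboards/uid/${encodeURIComponent(uid)}`;
}

/** Resolves to true when the injected `get` reports a successful response.
 *  `get` is an assumed wrapper around whatever HTTP client is in use. */
async function isDashboardLoaded(
  baseUrl: string,
  uid: string,
  get: (url: string) => Promise<{ ok: boolean }>,
): Promise<boolean> {
  const res = await get(dashboardByUidUrl(baseUrl, uid));
  return res.ok;
}
```

With a `fetch`-based `get` that adds the required credentials,
`isDashboardLoaded("http://localhost:3000", "agent-loop", get)` should resolve
to `true` once provisioning has loaded the file.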

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:09:32 +02:00
Name | Last commit | Date
alert-notifier | feat: rename ManaCore to Mana across entire codebase | 2026-04-05 20:00:13 +02:00
alertmanager | feat: rename ManaCore to Mana across entire codebase | 2026-04-05 20:00:13 +02:00
blackbox | feat(monitoring): add uptime monitoring via Blackbox Exporter | 2026-03-31 17:43:25 +02:00
grafana | chore(observability): Grafana dashboard for agent-loop metrics | 2026-04-23 18:09:32 +02:00
init-db | feat(mail): add mana-mail service and frontend module (Phase 1 MVP) | 2026-04-13 20:35:54 +02:00
loki | feat(gpu-server): complete GPU server setup with AI services, monitoring, and public access | 2026-03-27 21:35:30 +01:00
nginx | refactor: rename zitare -> quotes (Zitate) | 2026-04-14 20:59:16 +02:00
postgres | fix(infra): use postgres -c flags instead of config_file override | 2026-03-24 11:42:42 +01:00
prometheus | chore(observability): scrape mana-mcp at :3069 | 2026-04-23 14:24:13 +02:00
promtail | fix(mana-auth) + chore: rewrite /api/v1/auth/login JWT mint, remove Matrix stack | 2026-04-08 16:32:13 +02:00
shared | 🐛 fix(docker): add missing build-shared-packages.sh script for Docker builds | 2025-12-25 20:51:15 +01:00
templates | chore: remove all NestJS backend references, replace with Hono/Bun | 2026-03-31 16:52:25 +02:00
tempo | feat(mana-ai): OpenTelemetry tracing + Grafana Tempo backend | 2026-04-16 15:21:23 +02:00
Dockerfile.hono-server | feat(infra): add docker-compose for new Hono services + DB init | 2026-03-28 17:54:24 +01:00
Dockerfile.sveltekit-base | fix(docker): remove deleted subscriptions pkg + add shared-ai to sveltekit-base | 2026-04-16 16:15:01 +02:00