One focused dashboard covering the M1+M2 instrumentation in a single
view. Sections top-to-bottom:
1. Service Health — mana-mcp + mana-ai up/down, 1h deny rate,
compactions/h. The deny rate is the single most important
number during POLICY_MODE=log-only soak: a non-zero
deny/min in log-only means real traffic that enforce mode
would reject.
2. Policy Gate (mana-mcp)
- Decisions / sec by outcome (allow/deny/flagged)
- Deny reasons breakdown — the soak signal for flipping to
enforce. If one reason dominates, address it before the flip.
- Tool invocations / sec by outcome (success / handler-error /
input-invalid)
- Top 10 invoked tools (24h) — usage heatmap for prioritising
which tools deserve the best policy-hint tuning.
- Handler p50/p95/p99 latency per tool.
3. Reminder Channel (mana-ai)
- Rate by producer (token-budget, retry-loop, compacted)
- Rate by severity. The interesting signal is whether
warn/escalate trend DOWN over time — it means the LLM is
actually reacting to the hints. If warn stays flat, the
producer wording probably isn't landing.
4. Context Compactor (mana-ai)
- Triggers/h cumulative
- Turns folded per compaction (p50/p95). Values < 3 flag
MANA_AI_COMPACT_MAX_CTX misconfig — the threshold is firing
on already-short histories.
5. Mission Runner Baseline — tick duration + planner rounds for
correlation (e.g. "did enabling the compactor change mean
tick duration?").
Dashboard provisioning already auto-loads anything in /var/lib/grafana/
dashboards (docker/grafana/provisioning/dashboards/default.yml), so
this is live after the next grafana restart. UID agent-loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production deployment + observability for the self-hosted geocoding stack:
**docker-compose.macmini.yml**
- New mana-geocoding container (port 3018, internal-only — no traefik
labels, no Cloudflare route). Uses host.docker.internal to reach the
Pelias API on the host's pelias compose stack. Dockerfile added under
services/mana-geocoding/ using the same Bun/Hono pattern as mana-events.
**Prometheus**
- New blackbox-internal job probing mana-geocoding:3018/health, the
Pelias API on host.docker.internal:4000/v1/status, and Elasticsearch
at host.docker.internal:9200/_cluster/health. Kept separate from
blackbox-api which is reserved for public HTTPS endpoints.
**status.mana.how (generate-status-page.sh)**
- Include blackbox-internal in the metric query and add an "Interne
Dienste" section with its own summary card, right between Infrastruktur
and GPU Dienste. Summary grid goes from 4 to 5 columns with a
900px breakpoint.
- friendly_name() now handles http:// URLs and rewrites container-name
hosts like mana-geocoding:3018/health → "Mana Geocoding",
host.docker.internal:4000 → "Pelias API",
host.docker.internal:9200 → "Pelias Elasticsearch".
**Grafana uptime dashboard**
- Add an "Internal" series to the "Alle Dienste — Uptime-Verlauf" panel
- New "Interne Dienste Status" table panel showing per-instance up/down
- New "Geocoding Ø Latenz" stat panel for probe_duration_seconds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A grep audit after the previous matrix removal commits found a handful
of stragglers in non-runtime files that the earlier sweeps missed:
- services/mana-llm/CLAUDE.md: removed matrix-ollama-bot from the
consumer-apps diagram and from the related-services table
- services/mana-video-gen/CLAUDE.md: removed "Matrix Bots" integration
bullet
- packages/notify-client/README.md: removed sendMatrix() doc entry
(the method itself was already gone in the prior cleanup)
- docker/grafana/dashboards/logs-explorer.json: dropped the "Matrix
Stack" log row that queried tier="matrix" (would show no data forever)
- docker/grafana/dashboards/master-overview.json: dropped the "Matrix
Bots" stat panel that counted up{job=~"matrix-.*-bot"}
- apps/mana/apps/landing/src/data/ecosystem-health.json: regenerated via
scripts/ecosystem-audit.mjs to drop matrix from the app list, icon
counts, file analytics, top offenders and authGuard missing list
- .gitignore: removed services/matrix-stt-bot/data/ pattern (the
service itself was deleted long ago)
Production-side stragglers also addressed (not in this commit):
- DROP USER synapse on prod Postgres (the parallel cleanup commit
2514831a3 dropped DATABASE matrix + DATABASE synapse but left the
role behind)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The matrix subsystem was removed in a prior commit. This commit cleans
up the small leftovers that grep found:
- docker-compose.macmini.yml: dropped the "Matrix Stack" port-range
comment, the "matrix" category from the naming convention, and a
stale watchtower comment about Matrix notifications.
- packages/credits/src/operations.ts: removed AI_BOT_CHAT credit
operation type and its definition. It was the billing entry for "Chat
with AI via Matrix bot" — no callers left.
- services/mana-credits gifts schema + service + validation: removed the
targetMatrixId column / param / Zod field. The corresponding
PostgreSQL column was dropped manually with
`ALTER TABLE gifts.gift_codes DROP COLUMN target_matrix_id` on prod.
- docker/grafana/dashboards/{master,system}-overview.json: removed the
`up{job="synapse"}` panel queries — they would have shown No Data
forever now that Synapse is gone.
Production-side cleanup performed in parallel (not in this commit):
- Stopped + removed mana-matrix-{synapse,element,web,bot} containers
- Removed mana-matrix-bot:local, matrix-web:latest,
matrixdotorg/synapse:latest, vectorim/element-web:latest images (~3 GB)
- Removed mana-matrix-bots-data Docker volume
- Removed /Volumes/ManaData/matrix/ media store (4.3 MB)
- DROP DATABASE matrix; DROP DATABASE synapse; on Postgres
Cosmetic leftovers intentionally untouched:
- Eisenhower matrix in todo (LayoutMode 'matrix') — productivity concept
- ${{ matrix.service }} in .github/workflows — GitHub Actions strategy
- services/mana-media/apps/api/dist/.../matrix/* — stale build output
(not in git, regenerated next mana-media build)
- check-disk-space.sh now pushes mac_disk_used_percent + mac_colima_disk_used_gb
to Pushgateway every hour so vmalert can alert on real macOS disk usage
- alerts.yml: replace broken node-exporter disk alerts with Pushgateway-based ones
- master-overview.json: add "Recent Errors (Loki)" section with live error log
stream, error rate timeseries and top error sources barchart
- move-colima-to-external-ssd.sh: guided script to move 200GB Colima VM
datadisk from internal SSD to /Volumes/ManaData (3.6TB external SSD)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loki was already running but had no log shipper. Adds Promtail to collect
Docker logs from all 66 containers with automatic tier labeling (infra,
auth, core, app, matrix, games) and a Grafana Logs Explorer dashboard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire mana-llm service into the monitoring stack:
Prometheus (docker/prometheus/prometheus.yml):
- Add mana-llm scrape job (port 3025, 15s interval)
- Include mana-llm in ServiceDown alert expression
Alerts (docker/prometheus/alerts.yml):
- New llm_alerts group with 4 rules:
- LLMServiceDown: mana-llm down > 1 min (critical)
- LLMHighErrorRate: > 10% errors for 5 min (warning)
- OllamaProviderDown: > 50% requests via Google fallback (warning)
- LLMSlowResponses: p95 > 30s for 5 min (warning)
Grafana Dashboard (docker/grafana/dashboards/mana-llm.json):
- 6 stat panels: status, req/min, error rate, fallback rate, latency, tokens/min
- Requests by Provider (stacked area: Ollama vs Google vs OpenRouter)
- Tokens by Type (prompt vs completion)
- Latency Percentiles (p50, p90, p99)
- Latency by Provider comparison
- Requests by Model breakdown
- Errors by Type
- Google Fallback Rate over time (with threshold coloring)
- Provider Distribution pie chart (24h)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add GlitchTip to health-check.sh monitoring endpoints
- Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds
- Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint
- Add ManaData disk usage gauge to Grafana system-overview dashboard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MetricsModule to 8 backends missing it (photos, zitare, mukke,
planta, picture, storage, presi, nutriphi)
- Enable Prometheus scraping for all 15 backends in prometheus.yml
(was only 6, with 3 commented out and 6 missing entirely)
- Update ServiceDown alert rule to cover all 15 backends
- Update Grafana dashboards (backends, master-overview, system-overview)
with all backend services in health panels
- Fix imprecise regex in application-details dashboard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add explicit uid: deploy-tracking to datasource provisioning
- Add instant: true to all Prometheus stat/gauge panel queries
- Pushgateway gauges need instant queries, not range queries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument the CD pipeline to record per-deploy and per-service metrics
(build time, image size, startup time, health status) into PostgreSQL and
push gauges to Pushgateway. Adds a Grafana dashboard with 13 panels covering
deploy frequency, build performance, service health, and history.
New files:
- scripts/mac-mini/init-deploy-tracking.sql (idempotent DDL)
- scripts/deploy-metrics.sh (bash library for CI)
- docker/grafana/provisioning/datasources/deploy-tracking.yml
- docker/grafana/dashboards/deploy-tracking.json
Modified:
- docker/prometheus/prometheus.yml (pushgateway scrape job)
- .github/workflows/cd-macmini.yml (build/health instrumentation)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PostgreSQL datasource pointing to GlitchTip database
- Add Error Tracking dashboard with 7 panels:
- Total Open Issues (stat)
- Issues by Project (pie chart)
- Total Events (stat)
- Projects Tracked (stat)
- Resolved vs Unresolved (stat)
- New Issues Over Time (stacked bar chart, 30 days)
- Recent Issues (table with 50 latest, color-coded levels)
- Dashboard links to GlitchTip UI for detailed investigation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix LoggerService mock in better-auth.service.spec.ts
- Fix name assertion in auth.controller.spec.ts (empty string fallback)
- Fix createRemoteJWKSet mock in jwt-auth.guard.spec.ts
- Add Grafana dashboard for Auth Service monitoring
- Add 10 auth-specific Prometheus alert rules
- Update production readiness plan to 100% complete
All 199 unit tests passing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add node-exporter service to docker-compose for CPU/Memory/Disk monitoring
- Enable node-exporter scrape target in Prometheus config
- Update System Overview dashboard with Host System section:
- CPU, Memory, Disk usage gauges
- Total RAM, Total Disk, Uptime, Load stats
- CPU & Memory over time graph
- Network I/O graph
- Add Node Exporter to service status panel
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Removed node_* metrics (node-exporter not deployed)
- Removed container_last_seen (cAdvisor not deployed)
- Added Service Status, Traffic Overview, Database sections
- All panels now use available Prometheus metrics
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Added Total Requests counter for overall user interaction
- Added Requests/sec for current load visibility
- Reduced panel width to fit 8 metrics in one row
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move Key Metrics section to top of dashboard
- Add new panels: Services UP, Apps Running, Matrix Bots, Avg Response Time
- Reorganize layout for better overview at a glance
- Remove CPU/Memory/Disk (no node-exporter), add Redis Keys
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace Prometheus with VictoriaMetrics (2-year retention)
- Add DuckDB analytics module for business KPIs (unlimited retention)
- Add master overview dashboard combining all metrics
- Add business metrics dashboard for user growth tracking
- Add backup script for VictoriaMetrics snapshots and DuckDB
- Add ADR documentation for monitoring stack decision
Analytics API endpoints:
- GET /api/v1/analytics/health - Service health
- GET /api/v1/analytics/latest - Latest metrics snapshot
- GET /api/v1/analytics/growth - User growth over time
- GET /api/v1/analytics/monthly - Monthly aggregates
- POST /api/v1/analytics/snapshot - Manual snapshot trigger
- Add user metrics to mana-core-auth MetricsService:
- auth_users_total: Total registered users
- auth_users_verified: Email-verified users
- auth_users_created_today/this_week/this_month
- Create Grafana user-statistics dashboard with:
- User overview stats (total, verified, verification rate, new today)
- Registration period breakdown (today/week/month)
- User growth trends over time
- Enhance telegram-stats-bot /users command:
- Add yesterday comparison with trends
- Add week-over-week comparison
- Add mini bar chart for last 7 days registration
- Include user stats in daily Telegram report
Reverting 618c58c5 which broke the CI workflow.
Will re-add notifications after fixing the issue.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add notify-start job with Telegram notification for build start
- Add notify-complete job with build status and duration notification
- Push CI metrics to Prometheus Pushgateway for Grafana visualization
- Create CI/CD Grafana dashboard with build status, duration, and history
- Add Pushgateway scrape config to Prometheus
Requires TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and PUSHGATEWAY_URL secrets.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New dashboards:
- Application Details: Node.js runtime (heap, event loop, GC),
HTTP details (status codes, methods, top routes), error analysis
- Database Details: PostgreSQL and Redis metrics with detailed breakdowns
Alerting rules (docker/prometheus/alerts.yml):
- Service: down, high/very high error rate, slow response time
- Infrastructure: high CPU/memory/disk usage
- Database: PostgreSQL/Redis down, high connections, low cache hit
- Container: high CPU/memory, restarts
All dashboards include service selector variable for filtering.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add MetricsModule with prom-client for todo backend
- Add MetricsInterceptor for request tracking
- Update COMMANDS.md with presi and storage commands
- Update Grafana dashboards for backend monitoring
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>