node-exporter runs inside the VM and can't see the host macOS disks directly.
Use custom mac_disk_used_percent metrics pushed via Pushgateway instead.
Also add a ColimaVMDiskLarge alert that fires when the datadisk exceeds 150 GB.
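
A minimal sketch of how a host-side script could push this metric, for
illustration only. The job name "mac_disk", the "mountpoint" label, and the
gateway address http://localhost:9091 are assumptions, not taken from this
change; only the metric name mac_disk_used_percent comes from the commit.

```python
import shutil
import urllib.request


def used_percent(total: int, free: int) -> float:
    """Percentage of the disk in use, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * (total - free) / total, 2)


def exposition_line(mountpoint: str, percent: float) -> str:
    """One sample in the Prometheus text exposition format."""
    # The "mountpoint" label name is an assumption for this sketch.
    return f'mac_disk_used_percent{{mountpoint="{mountpoint}"}} {percent}\n'


def push(mountpoint: str, gateway: str = "http://localhost:9091") -> None:
    """Measure one mountpoint on the host and push it to Pushgateway."""
    usage = shutil.disk_usage(mountpoint)
    body = exposition_line(mountpoint, used_percent(usage.total, usage.free))
    # PUT replaces all metrics in the group for the (hypothetical) job "mac_disk".
    request = urllib.request.Request(
        f"{gateway}/metrics/job/mac_disk",
        data=body.encode("utf-8"),
        method="PUT",
    )
    urllib.request.urlopen(request, timeout=5).close()
```

Calling push("/") from a scheduled job on the macOS host would keep the gauge
current without relying on node-exporter inside the VM.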
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire mana-llm service into the monitoring stack:
Prometheus (docker/prometheus/prometheus.yml):
- Add mana-llm scrape job (port 3025, 15s interval)
- Include mana-llm in ServiceDown alert expression
Alerts (docker/prometheus/alerts.yml):
- New llm_alerts group with 4 rules:
  - LLMServiceDown: mana-llm down > 1 min (critical)
  - LLMHighErrorRate: > 10% errors for 5 min (warning)
  - OllamaProviderDown: > 50% requests via Google fallback (warning)
  - LLMSlowResponses: p95 > 30s for 5 min (warning)
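
The OllamaProviderDown condition can be read as a ratio check over request
counts per provider. The sketch below shows that logic in Python rather than
PromQL; the provider keys "ollama" and "google" are assumed label values, and
the actual rule expression in alerts.yml may differ.

```python
def fallback_ratio(requests_by_provider: dict[str, float]) -> float:
    """Share of all requests that were served by the Google fallback."""
    total = sum(requests_by_provider.values())
    if total == 0:
        # No traffic at all: report no fallback rather than dividing by zero.
        return 0.0
    return requests_by_provider.get("google", 0.0) / total


def ollama_provider_down(
    requests_by_provider: dict[str, float], threshold: float = 0.5
) -> bool:
    """True when strictly more than `threshold` of requests hit the fallback."""
    return fallback_ratio(requests_by_provider) > threshold
```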
Grafana Dashboard (docker/grafana/dashboards/mana-llm.json):
- 6 stat panels: status, req/min, error rate, fallback rate, latency, tokens/min
- Requests by Provider (stacked area: Ollama vs Google vs OpenRouter)
- Tokens by Type (prompt vs completion)
- Latency Percentiles (p50, p90, p99)
- Latency by Provider comparison
- Requests by Model breakdown
- Errors by Type
- Google Fallback Rate over time (with threshold coloring)
- Provider Distribution pie chart (24h)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add GlitchTip to health-check.sh monitoring endpoints
- Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds
- Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint
- Add ManaData disk usage gauge to Grafana system-overview dashboard
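
The native disk check can be sketched as follows. health-check.sh is a shell
script; this Python version only illustrates the threshold logic. Treating 80%
as the warning level and 90% as the critical level is an assumption about how
the two thresholds are used.

```python
import shutil

WARN_PERCENT = 80  # assumed warning threshold
CRIT_PERCENT = 90  # assumed critical threshold


def classify(percent: float) -> str:
    """Map a used-space percentage to a health status."""
    if percent >= CRIT_PERCENT:
        return "critical"
    if percent >= WARN_PERCENT:
        return "warning"
    return "ok"


def check(paths: tuple[str, ...] = ("/", "/Volumes/ManaData")) -> dict[str, str]:
    """Classify each mountpoint; unmounted paths are reported as unavailable."""
    results: dict[str, str] = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            results[path] = "unavailable"
            continue
        results[path] = classify(100.0 * usage.used / usage.total)
    return results
```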
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MetricsModule to 8 backends missing it (photos, zitare, mukke,
  planta, picture, storage, presi, nutriphi)
- Enable Prometheus scraping for all 15 backends in prometheus.yml
  (was only 6, with 3 commented out and 6 missing entirely)
- Update ServiceDown alert rule to cover all 15 backends
- Update Grafana dashboards (backends, master-overview, system-overview)
  with all backend services in health panels
- Fix imprecise regex in application-details dashboard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix LoggerService mock in better-auth.service.spec.ts
- Fix name assertion in auth.controller.spec.ts (empty string fallback)
- Fix createRemoteJWKSet mock in jwt-auth.guard.spec.ts
- Add Grafana dashboard for Auth Service monitoring
- Add 10 auth-specific Prometheus alert rules
- Update production readiness plan to 100% complete
All 199 unit tests passing.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New dashboards:
- Application Details: Node.js runtime (heap, event loop, GC),
  HTTP details (status codes, methods, top routes), error analysis
- Database Details: PostgreSQL and Redis metrics with detailed breakdowns
Alerting rules (docker/prometheus/alerts.yml):
- Service: down, high/very high error rate, slow response time
- Infrastructure: high CPU/memory/disk usage
- Database: PostgreSQL/Redis down, high connections, low cache hit ratio
- Container: high CPU/memory, restarts
All dashboards include a service selector variable for filtering.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>