managarten

mirror of https://github.com/Memo-2023/mana-monorepo.git synced 2026-05-14 19:01:08 +02:00

Author	SHA1	Message	Date
Till JS	402baf7c7f	feat(monitoring): add uptime monitoring via Blackbox Exporter - scripts/check-status.sh: parallel HTTP check of all mana.how domains from cloudflared-config.yml - docker/blackbox/blackbox.yml: Blackbox Exporter config (http_2xx, http_health modules) - docker-compose.macmini.yml: blackbox-exporter container (port 9115, 32 MB RAM) - docker/prometheus/prometheus.yml: 4 scrape jobs (blackbox-web, blackbox-api, blackbox-infra, blackbox-gpu) - docker/prometheus/alerts.yml: 5 alert rules (WebAppDown, APIDown, InfraToolDown, GPUServiceDown, SlowHTTPResponse) - docker/grafana/dashboards/uptime.json: Grafana uptime dashboard with status tables and history - package.json: check:status script Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-31 17:43:25 +02:00
Till JS	be1096ec85	fix(monitoring): update disk alerts to use mac_disk_used_percent metrics node-exporter runs in VM and can't see host macOS disks directly. Use custom mac_disk_used_percent metrics pushed via Pushgateway instead. Also add ColimaVMDiskLarge alert when datadisk exceeds 150 GB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-30 20:01:46 +02:00
Till JS	169821de1a	feat(monitoring): add LLM Grafana dashboard, Prometheus scraping, and alerts Wire mana-llm service into the monitoring stack: Prometheus (docker/prometheus/prometheus.yml): - Add mana-llm scrape job (port 3025, 15s interval) - Include mana-llm in ServiceDown alert expression Alerts (docker/prometheus/alerts.yml): - New llm_alerts group with 4 rules: - LLMServiceDown: mana-llm down > 1 min (critical) - LLMHighErrorRate: > 10% errors for 5 min (warning) - OllamaProviderDown: > 50% requests via Google fallback (warning) - LLMSlowResponses: p95 > 30s for 5 min (warning) Grafana Dashboard (docker/grafana/dashboards/mana-llm.json): - 6 stat panels: status, req/min, error rate, fallback rate, latency, tokens/min - Requests by Provider (stacked area: Ollama vs Google vs OpenRouter) - Tokens by Type (prompt vs completion) - Latency Percentiles (p50, p90, p99) - Latency by Provider comparison - Requests by Model breakdown - Errors by Type - Google Fallback Rate over time (with threshold coloring) - Provider Distribution pie chart (24h) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 11:16:27 +01:00
Till JS	143112f77a	feat(observability): add mana-search, mana-media, and Synapse to monitoring - Add Prometheus scraping for mana-search (port 3020, already has metrics) - Add Prometheus scraping for mana-media (port 3015, MetricsModule added) - Add Prometheus scraping for Matrix Synapse (port 9002, already enabled) - Add MetricsModule to mana-media with media_ prefix - Update Dockerfile for mana-media to include shared-nestjs-metrics - Replace hardcoded ServiceDown alert list with dynamic regex (.*-backend\|mana-core-auth\|mana-search\|mana-media\|synapse) - Replace hardcoded backends.json query with dynamic regex - Add Search, Media, Synapse to master-overview and system-overview dashboards Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 10:46:59 +01:00
Till JS	c8de944c8d	feat(monitoring): add GlitchTip health check and disk space monitoring - Add GlitchTip to health-check.sh monitoring endpoints - Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds - Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint - Add ManaData disk usage gauge to Grafana system-overview dashboard Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 09:33:09 +01:00
Till JS	6fa6509fa5	feat(observability): add metrics and monitoring for all 15 backends - Add MetricsModule to 8 backends missing it (photos, zitare, mukke, planta, picture, storage, presi, nutriphi) - Enable Prometheus scraping for all 15 backends in prometheus.yml (was only 6, with 3 commented out and 6 missing entirely) - Update ServiceDown alert rule to cover all 15 backends - Update Grafana dashboards (backends, master-overview, system-overview) with all backend services in health panels - Fix imprecise regex in application-details dashboard Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 09:09:04 +01:00
Till-JS	fe33f4b355	✅ fix(mana-core-auth): complete production readiness with test fixes - Fix LoggerService mock in better-auth.service.spec.ts - Fix name assertion in auth.controller.spec.ts (empty string fallback) - Fix createRemoteJWKSet mock in jwt-auth.guard.spec.ts - Add Grafana dashboard for Auth Service monitoring - Add 10 auth-specific Prometheus alert rules - Update production readiness plan to 100% complete All 199 unit tests passing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 14:18:58 +01:00
Till-JS	8c259a008b	feat(monitoring): add comprehensive Grafana dashboards and alerting New dashboards: - Application Details: Node.js runtime (heap, event loop, GC), HTTP details (status codes, methods, top routes), error analysis - Database Details: PostgreSQL and Redis metrics with detailed breakdowns Alerting rules (docker/prometheus/alerts.yml): - Service: down, high/very high error rate, slow response time - Infrastructure: high CPU/memory/disk usage - Database: PostgreSQL/Redis down, high connections, low cache hit - Container: high CPU/memory, restarts All dashboards include service selector variable for filtering. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 09:47:18 +01:00
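
Commit 143112f77a replaces the hardcoded ServiceDown list with a regex. The sketch below illustrates which job labels such a pattern would select. It is not taken from the repository: it assumes the `\|` sequences in the commit message are escaping and that the intended pattern uses plain `|` alternation. The job names are hypothetical. Prometheus anchors label-matcher regexes at both ends, so `re.fullmatch` is used as the closest Python equivalent.

```python
import re

# Pattern from the commit message, with "\|" read as plain "|"
# (an assumption about how the message was escaped).
SERVICE_DOWN_PATTERN = r".*-backend|mana-core-auth|mana-search|mana-media|synapse"


def is_covered(job: str) -> bool:
    """Return True if the job label matches the whole pattern.

    fullmatch mirrors Prometheus, which anchors label-matcher
    regexes at the start and end of the label value.
    """
    return re.fullmatch(SERVICE_DOWN_PATTERN, job) is not None


# Hypothetical job names, for illustration only.
jobs = ["chat-backend", "mana-search", "synapse", "mana-llm", "backend-proxy"]
covered = [job for job in jobs if is_covered(job)]
print(covered)  # ['chat-backend', 'mana-search', 'synapse']
```

Under this reading, a job such as `mana-llm` is not selected by the pattern as written in 143112f77a; the later commit 169821de1a states that mana-llm was added to the ServiceDown expression separately.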

8 commits
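
Commit c8de944c8d mentions disk space checks with 80%/90% thresholds. A minimal sketch of that kind of check follows. The function names, the "warning"/"critical" labels, and the use of `>=` at the boundaries are assumptions for illustration; the actual logic in health-check.sh is not shown in this log.

```python
import shutil


def disk_used_percent(path: str) -> float:
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100.0


def classify_disk_usage(used_percent: float,
                        warn: float = 80.0,
                        crit: float = 90.0) -> str:
    """Map a usage percentage to a severity label (labels are assumed)."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError(f"used_percent out of range: {used_percent}")
    if used_percent >= crit:
        return "critical"
    if used_percent >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # "/" and "/Volumes/ManaData" are the paths named in the commit message;
    # only "/" is checked here because the second may not exist elsewhere.
    percent = disk_used_percent("/")
    print(f"/ is {percent:.1f}% full: {classify_disk_usage(percent)}")
```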