Commit graph

18 commits

Author SHA1 Message Date
Till JS
d7799ec95d refactor(photos): remove NestJS backend, use local-first + direct mana-media
The Photos NestJS backend was a proxy to mana-media that enriched
responses with local album/favorite/tag data. Now:

- Albums store → local-first via albumCollection + albumItemCollection
- Favorites → local-first via favoriteCollection (toggle in IndexedDB)
- Photo tags → local-first via photoTagCollection
- Photo listing/stats → direct mana-media API calls from frontend
- Upload → direct mana-media upload from frontend
- Delete → direct mana-media delete from frontend

Removed 27 TypeScript files, 1 Docker container, 1 port (3039).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 02:18:03 +01:00
Till JS
dd2f814cf3 refactor(presi): replace NestJS backend with lightweight Hono server
The Presi NestJS backend (40 source files, 50 deps) was a CRUD wrapper
around decks, slides, and themes — all now handled by local-first sync.

Only the share-link feature requires server-side state (public URLs
without auth), so a minimal Hono + Bun server replaces the entire
NestJS backend:

- apps/presi/apps/server/ — Hono server with share routes + GDPR admin
  Uses @manacore/shared-hono for auth (JWKS), health, admin, errors
- Web app API client stripped to share-only (270 lines → 90 lines)
- Removed from docker-compose, CI/CD, Prometheus, env generation
- NestJS backend deleted (40 TS files, 8 test specs, 3038 lines)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 02:08:40 +01:00
Till JS
32939fbfb5 refactor(infra): remove zitare + clock NestJS backends, add shared-hono package
Both apps are fully local-first via Dexie.js + mana-sync. Their NestJS
backends were pure CRUD wrappers (20 + 31 source files) that are no
longer needed.

Changes:
- Add packages/shared-hono: JWT auth via JWKS (jose), Drizzle DB factory,
  health route, generic GDPR admin handler, error middleware
- Migrate zitare lists page from fetch() to listsStore (local-first)
- Rewrite clock timers store from API-based to timerCollection (Dexie)
- Update clock +layout.svelte CommandBar search to use local collections
- Remove zitare-backend + clock-backend from docker-compose, CI/CD,
  Prometheus, env generation, setup scripts
- Add docs/TECHNOLOGY_AUDIT_2026_03.md with full repo analysis

Net result: -2 Docker containers, -2 ports, -2728 lines of code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:43:46 +01:00
Till JS
a31ccc6c62 feat(infra): add api.mana.how route + Prometheus scrape targets for Go services
- Cloudflare Tunnel: api.mana.how → localhost:3060 (Go API Gateway)
- Prometheus: scrape targets for mana-api-gateway:3060 and mana-matrix-bot:4000

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 21:27:04 +01:00
Till JS
169821de1a feat(monitoring): add LLM Grafana dashboard, Prometheus scraping, and alerts
Wire mana-llm service into the monitoring stack:

Prometheus (docker/prometheus/prometheus.yml):
- Add mana-llm scrape job (port 3025, 15s interval)
- Include mana-llm in ServiceDown alert expression

Alerts (docker/prometheus/alerts.yml):
- New llm_alerts group with 4 rules:
  - LLMServiceDown: mana-llm down > 1 min (critical)
  - LLMHighErrorRate: > 10% errors for 5 min (warning)
  - OllamaProviderDown: > 50% of requests via Google fallback (warning)
  - LLMSlowResponses: p95 > 30s for 5 min (warning)

Grafana Dashboard (docker/grafana/dashboards/mana-llm.json):
- 6 stat panels: status, req/min, error rate, fallback rate, latency, tokens/min
- Requests by Provider (stacked area: Ollama vs Google vs OpenRouter)
- Tokens by Type (prompt vs completion)
- Latency Percentiles (p50, p90, p99)
- Latency by Provider comparison
- Requests by Model breakdown
- Errors by Type
- Google Fallback Rate over time (with threshold coloring)
- Provider Distribution pie chart (24h)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 11:16:27 +01:00
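The threshold rules listed in the llm_alerts group of commit 169821de1a can be sanity-checked with a small sketch like the one below. It is not the contents of alerts.yml: the `LlmWindow` type, its field names, and `firing_alerts` are hypothetical. The "for 5 min" hold duration is not modeled, and LLMServiceDown is left out because it depends on scrape status rather than request counts.

```python
from dataclasses import dataclass


@dataclass
class LlmWindow:
    """Hypothetical aggregate of mana-llm traffic over one evaluation window."""
    requests: int
    errors: int
    google_fallback_requests: int
    p95_latency_seconds: float


def firing_alerts(window: LlmWindow) -> list[str]:
    """Return the names of the threshold rules that the window would trip.

    Thresholds are taken from the commit message; treating them as strict
    "greater than" comparisons is an assumption.
    """
    alerts: list[str] = []
    if window.requests > 0:
        if window.errors / window.requests > 0.10:
            alerts.append("LLMHighErrorRate")
        if window.google_fallback_requests / window.requests > 0.50:
            alerts.append("OllamaProviderDown")
    if window.p95_latency_seconds > 30:
        alerts.append("LLMSlowResponses")
    return alerts


if __name__ == "__main__":
    sample = LlmWindow(requests=200, errors=30,
                       google_fallback_requests=20, p95_latency_seconds=12.5)
    print(firing_alerts(sample))
```

A window sitting exactly on a threshold (for example, 10% errors) does not fire in this sketch, which matches the ">" wording of the commit message.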
Till JS
143112f77a feat(observability): add mana-search, mana-media, and Synapse to monitoring
- Add Prometheus scraping for mana-search (port 3020, already has metrics)
- Add Prometheus scraping for mana-media (port 3015, MetricsModule added)
- Add Prometheus scraping for Matrix Synapse (port 9002, already enabled)
- Add MetricsModule to mana-media with media_ prefix
- Update Dockerfile for mana-media to include shared-nestjs-metrics
- Replace hardcoded ServiceDown alert list with dynamic regex
  (.*-backend|mana-core-auth|mana-search|mana-media|synapse)
- Replace hardcoded backends.json query with dynamic regex
- Add Search, Media, Synapse to master-overview and system-overview dashboards

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 10:46:59 +01:00
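To see which jobs the dynamic ServiceDown regex from commit 143112f77a would select, the pattern can be tried against a few job names. This is a minimal sketch: Prometheus anchors regex label matchers at both ends, which `re.fullmatch` imitates here. The helper name `is_covered` and the sample job list are illustrative assumptions.

```python
import re

# Pattern quoted from the commit message. Prometheus fully anchors regex
# label matchers, so re.fullmatch is used to imitate that behavior.
SERVICE_DOWN_PATTERN = re.compile(
    r"(.*-backend|mana-core-auth|mana-search|mana-media|synapse)"
)


def is_covered(job: str) -> bool:
    """Return True if the job name would be selected by the alert regex."""
    return SERVICE_DOWN_PATTERN.fullmatch(job) is not None


if __name__ == "__main__":
    # Sample job names for illustration only.
    for job in ["chat-backend", "mana-media", "synapse", "mana-llm", "grafana"]:
        print(f"{job}: {is_covered(job)}")
```

Any job ending in `-backend` is covered without being listed by name. Services outside that naming scheme, such as mana-llm, have to be added to the expression explicitly, as the later commit 169821de1a does.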
Till JS
c8de944c8d feat(monitoring): add GlitchTip health check and disk space monitoring
- Add GlitchTip to health-check.sh monitoring endpoints
- Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds
- Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint
- Add ManaData disk usage gauge to Grafana system-overview dashboard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:33:09 +01:00
Till JS
6fa6509fa5 feat(observability): add metrics and monitoring for all 15 backends
- Add MetricsModule to 8 backends missing it (photos, zitare, mukke,
  planta, picture, storage, presi, nutriphi)
- Enable Prometheus scraping for all 15 backends in prometheus.yml
  (was only 6, with 3 commented out and 6 missing entirely)
- Update ServiceDown alert rule to cover all 15 backends
- Update Grafana dashboards (backends, master-overview, system-overview)
  with all backend services in health panels
- Fix imprecise regex in application-details dashboard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:09:04 +01:00
Till JS
3f91c4656a feat(infra): add deploy tracking with PostgreSQL, Pushgateway & Grafana dashboard
Instrument the CD pipeline to record per-deploy and per-service metrics
(build time, image size, startup time, health status) into PostgreSQL and
push gauges to Pushgateway. Adds a Grafana dashboard with 13 panels covering
deploy frequency, build performance, service health, and history.

New files:
- scripts/mac-mini/init-deploy-tracking.sql (idempotent DDL)
- scripts/deploy-metrics.sh (bash library for CI)
- docker/grafana/provisioning/datasources/deploy-tracking.yml
- docker/grafana/dashboards/deploy-tracking.json

Modified:
- docker/prometheus/prometheus.yml (pushgateway scrape job)
- .github/workflows/cd-macmini.yml (build/health instrumentation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 17:08:03 +01:00
Till-JS
acc8de36ee feat(monitoring): add alerting stack and maintenance scripts
Medium-priority stability improvements:

Alerting:
- Add vmalert for evaluating Prometheus alert rules
- Add alertmanager for alert routing and grouping
- Add alert-notifier service for Telegram/ntfy notifications
- Enable cadvisor scraping in prometheus config

Disk Monitoring:
- Add check-disk-space.sh for hourly disk monitoring
- Alert on 80% (warning) and 90% (critical) thresholds
- Auto-cleanup Docker when disk is critical
- Add com.manacore.disk-check.plist for LaunchD

Weekly Reports:
- Add weekly-report.sh for system health summary
- Includes: backup status, disk usage, container health,
  database stats, error log summary
- Runs every Sunday at 10 AM via LaunchD

Health Check Updates:
- Add checks for vmalert, alertmanager, alert-notifier

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 13:46:57 +01:00
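The 80%/90% disk thresholds described in commit acc8de36ee can be illustrated with a short sketch. It is not a port of check-disk-space.sh: the function names are hypothetical, and treating the thresholds as inclusive (">=") is an assumption. The Docker auto-cleanup step is not reproduced.

```python
import shutil

WARNING_PERCENT = 80.0   # warning threshold from the commit message
CRITICAL_PERCENT = 90.0  # critical threshold from the commit message


def classify_usage(percent_used: float) -> str:
    """Map a disk usage percentage to 'ok', 'warning', or 'critical'."""
    if percent_used >= CRITICAL_PERCENT:
        return "critical"
    if percent_used >= WARNING_PERCENT:
        return "warning"
    return "ok"


def percent_used(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100.0


if __name__ == "__main__":
    pct = percent_used("/")
    print(f"/ is {pct:.1f}% full: {classify_usage(pct)}")
```

Keeping the classification separate from the measurement makes the thresholds easy to test without touching a real disk.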
Till-JS
fe33f4b355 fix(mana-core-auth): complete production readiness with test fixes
- Fix LoggerService mock in better-auth.service.spec.ts
- Fix name assertion in auth.controller.spec.ts (empty string fallback)
- Fix createRemoteJWKSet mock in jwt-auth.guard.spec.ts
- Add Grafana dashboard for Auth Service monitoring
- Add 10 auth-specific Prometheus alert rules
- Update production readiness plan to 100% complete

All 199 unit tests passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 14:18:58 +01:00
Till-JS
7aa5115c78 📊 feat(monitoring): add node-exporter for host system metrics
- Add node-exporter service to docker-compose for CPU/Memory/Disk monitoring
- Enable node-exporter scrape target in Prometheus config
- Update System Overview dashboard with Host System section:
  - CPU, Memory, Disk usage gauges
  - Total RAM, Total Disk, Uptime, Load stats
  - CPU & Memory over time graph
  - Network I/O graph
- Add Node Exporter to service status panel

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 12:38:44 +01:00
Till-JS
1b39aa8308 🔧 fix(prometheus): disable non-existent scrape targets
Commented out:
- node-exporter (container not deployed)
- cadvisor (container not deployed)
- storage/presi/nutriphi-backend (no /metrics endpoint yet)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 05:53:22 +01:00
Till-JS
dac6a85427 🔧 fix(prometheus): correct backend ports and add missing services
- chat-backend: 3002 → 3030
- todo-backend: 3018 → 3031
- calendar-backend: 3016 → 3032
- clock-backend: 3017 → 3033
- contacts-backend: 3015 → 3034
- Add storage-backend (3035), presi-backend (3036), nutriphi-backend (3037)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 05:51:50 +01:00
Till-JS
edf13b7102 revert: fix CI by reverting Telegram notifications
Reverting 618c58c5 which broke the CI workflow.
Will re-add notifications after fixing the issue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 10:40:10 +01:00
Till-JS
618c58c519 feat(ci): add Telegram notifications and Grafana CI/CD dashboard
- Add notify-start job with Telegram notification for build start
- Add notify-complete job with build status and duration notification
- Push CI metrics to Prometheus Pushgateway for Grafana visualization
- Create CI/CD Grafana dashboard with build status, duration, and history
- Add Pushgateway scrape config to Prometheus

Requires TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and PUSHGATEWAY_URL secrets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 10:31:17 +01:00
Till-JS
8c259a008b feat(monitoring): add comprehensive Grafana dashboards and alerting
New dashboards:
- Application Details: Node.js runtime (heap, event loop, GC),
  HTTP details (status codes, methods, top routes), error analysis
- Database Details: PostgreSQL and Redis metrics with detailed breakdowns

Alerting rules (docker/prometheus/alerts.yml):
- Service: down, high/very high error rate, slow response time
- Infrastructure: high CPU/memory/disk usage
- Database: PostgreSQL/Redis down, high connections, low cache hit ratio
- Container: high CPU/memory, restarts

All dashboards include service selector variable for filtering.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 09:47:18 +01:00
Till-JS
6d86a08d63 feat: add monitoring dashboard (Prometheus + Grafana + Umami + Admin)
Phase 1: Infrastructure
- Add docker/prometheus/prometheus.yml with scrape configs for all services
- Add docker/grafana/provisioning for auto-configured datasources
- Add docker/grafana/dashboards (system-overview, backends-docker)
- Update docker-compose.macmini.yml with monitoring services:
  - prometheus, grafana, node-exporter, cadvisor
  - postgres-exporter, redis-exporter, umami
- Add grafana.mana.how and analytics.mana.how to Caddyfile

Phase 2: Backend Metrics
- Create packages/shared-nestjs-metrics with:
  - MetricsModule (auto /metrics endpoint)
  - MetricsService (Counter, Histogram, Gauge helpers)
  - MetricsMiddleware (auto HTTP request tracking)

Phase 3: Umami Web Analytics
- Add Umami tracking scripts to all landing pages
- Add Umami tracking scripts to all web apps
- Create scripts/mac-mini/setup-umami-db.sh

Phase 4: Admin Dashboard (ManaCore Web)
- Add admin routes: /admin, /admin/users, /admin/system
- Create StatCard, QuickLinks, UserTable components
- Add Admin link to navigation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 15:31:39 +01:00