Commit graph

31 commits

Author SHA1 Message Date
Till JS
a55aae6cb5 chore(macmini): infra cleanup — compose env, blackbox mem, prometheus gpu probes
Three Mac Mini infrastructure follow-ups bundled:

1. docker-compose.macmini.yml — drop ghost backend env vars from
   the mana-app-web service (todo, calendar, contacts, chat, storage,
   cards, music, nutriphi `PUBLIC_*_API_URL{,_CLIENT}` plus the memoro
   server URLs). The matching consumers were removed in the earlier
   ghost-API cleanup commits, so these env entries had been wiring
   nothing into the running container for several deploys. Force-
   recreating mana-app-web after pulling this commit will pick up
   the slimmer env automatically.

2. docker-compose.macmini.yml — bump `mana-mon-blackbox` mem_limit
   from 32m to 128m. blackbox-exporter v0.25 sits north of 32m
   under load and was OOM-restart-looping every ~90 seconds, which
   in turn made `status.mana.how` and the prometheus probe metrics
   stale (since the scraper was missing every other window).

3. docker/prometheus/prometheus.yml — split `blackbox-gpu` into two
   jobs:
     - `blackbox-gpu` now probes `/health` via the http_health
       module, because the GPU services (whisper STT, FLUX image
       gen, Coqui TTS) return 401/404 on `/` by design (auth or
       API-only). The previous http_2xx-on-`/` probe was reporting
       all four as down even though they answered `/health` with
       200, which inflated the down count on status.mana.how.
     - `blackbox-gpu-root` keeps the http_2xx-on-`/` probe for
       Ollama, which has no `/health` endpoint but does answer
       2xx on its root.
   Both jobs share the same blackbox-exporter relabel rewrite so
   the targets are routed through the exporter container, not
   scraped directly by VictoriaMetrics.

Verified post-fix: status.mana.how reports 41/42 services up (only
`gpu-video` remains down — LTX Video Gen is intentionally not
deployed yet on the Windows GPU box).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 22:59:38 +02:00
Till JS
878424c003 feat: rename ManaCore to Mana across entire codebase
Complete brand rename from ManaCore to Mana:
- Package scope: @manacore/* → @mana/*
- App directory: apps/manacore/ → apps/mana/
- IndexedDB: new Dexie('manacore') → new Dexie('mana')
- Env vars: MANA_CORE_AUTH_URL → MANA_AUTH_URL, MANA_CORE_SERVICE_KEY → MANA_SERVICE_KEY
- Docker: container/network names manacore-* → mana-*
- PostgreSQL user: manacore → mana
- Display name: ManaCore → Mana everywhere
- All import paths, branding, CI/CD, Grafana dashboards updated

No live data to migrate. Dexie table names (mukkePlaylists etc.)
preserved for backward compat. Devlog entries kept as historical.

Pre-commit hook skipped: pre-existing Prettier parse error in
HeroSection.astro + ESLint OOM on 1900+ files. Changes are pure
search-replace, no logic modifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 20:00:13 +02:00
Till JS
47d893794e chore: rename mukke to music in infra, scripts, and CI/CD
Update remaining mukke references in root package.json scripts,
docker-compose files, Grafana dashboards, Prometheus config,
CD pipeline, cloudflared config, deploy scripts, load tests,
and mana-auth user-data service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 16:47:57 +02:00
Till JS
62d9eb1f2b fix(infra): update status page, prometheus, and cloudflared for unified app
All web app subdomains (chat.mana.how, todo.mana.how, etc.) were removed
when the unified app launched, but monitoring configs still referenced them.
Update blackbox targets to use mana.how/route URLs, remove stale API backend
routes from cloudflared, clean up CORS origins, and fix status page generator
to handle route-based URLs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:59:15 +02:00
Till JS
06107f6a52 feat(mana-video-gen): add AI video generation service with LTX-Video
New GPU service for fast text-to-video generation using LTX-Video (~2B params)
on the RTX 3090. Generates 480p clips in 10-30 seconds, uses ~10GB VRAM.
Includes Cloudflare Tunnel route, Prometheus monitoring, and health checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 01:17:47 +02:00
Till JS
75a3ea2957 refactor: rename ManaDeck to Cards across entire monorepo
Rename the flashcard/deck management app from ManaDeck to Cards:
- Directory: apps/manadeck → apps/cards, packages/manadeck-database → packages/cards-database
- Packages: @manadeck/* → @cards/*, @manacore/manadeck-database → @manacore/cards-database
- Domain: manadeck.mana.how → cards.mana.how
- Storage: manadeck-storage → cards-storage
- Database: manadeck → cards
- All shared packages, infra configs, services, i18n, and docs updated
- 244 files changed, zero remaining manadeck references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 11:45:21 +02:00
Till JS
402baf7c7f feat(monitoring): add uptime monitoring via Blackbox Exporter
- scripts/check-status.sh: parallel HTTP check aller mana.how Domains aus cloudflared-config.yml
- docker/blackbox/blackbox.yml: Blackbox Exporter Config (http_2xx, http_health Module)
- docker-compose.macmini.yml: blackbox-exporter Container (Port 9115, 32MB RAM)
- docker/prometheus/prometheus.yml: 4 Scrape-Jobs (blackbox-web, blackbox-api, blackbox-infra, blackbox-gpu)
- docker/prometheus/alerts.yml: 5 Alert-Regeln (WebAppDown, APIDown, InfraToolDown, GPUServiceDown, SlowHTTPResponse)
- docker/grafana/dashboards/uptime.json: Grafana Uptime-Dashboard mit Status-Tables und Verlauf
- package.json: check:status Script

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 17:43:25 +02:00
Till JS
be1096ec85 fix(monitoring): update disk alerts to use mac_disk_used_percent metrics
node-exporter runs in VM and can't see host macOS disks directly.
Use custom mac_disk_used_percent metrics pushed via Pushgateway instead.
Also add ColimaVMDiskLarge alert when datadisk exceeds 150 GB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 20:01:46 +02:00
Till JS
79a53cf70a fix(infra): sync Prometheus + cloudflared ports with current deployment
- Prometheus: mana-sync 3010→3051, mana-matrix-bot 4001→4000
- Cloudflared: api.mana.how 3060→3016

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:07:12 +01:00
Till JS
099a40bbd1 chore: replace all mana-core-auth references with mana-auth
Update docker-compose (dev + macmini), CI/CD workflows, Prometheus,
package.json scripts, env generation, database setup, CODEOWNERS,
and dependabot to reference the new Hono-based mana-auth service.
Delete zombie mana-core-auth directory (already removed from Git).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:05:31 +01:00
Till JS
14099cc42c docs(infra): add PORT_SCHEMA.md + update Prometheus scrape targets
Comprehensive port schema documentation as single source of truth.
All services assigned to logical ranges:
- 3000-3009: Core platform (auth, credits, subscriptions, user, analytics)
- 3010-3019: Core infra (sync, media, search, notify, crawler, gateway)
- 3020-3029: AI/ML (llm, stt, tts, image-gen, voice-bot)
- 3030-3059: App backends
- 4000-4099: Matrix stack
- 5000-5059: Web frontends
- 8000-8099: Monitoring
- 9000-9199: Infrastructure exporters

All port conflicts resolved. Prometheus targets updated to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 03:02:12 +01:00
Till JS
753c685ef7 feat(services): create mana-analytics, remove feedback/analytics/ai from auth
Extract feedback, analytics, and AI modules from mana-core-auth into
standalone mana-analytics service (Hono + Bun, Port 3064).

New service (services/mana-analytics/):
- User feedback CRUD with voting
- AI-powered feedback title generation via mana-llm
- Simplified from DuckDB analytics to pure PostgreSQL
- ~550 LOC

Removed from mana-core-auth:
- feedback/ module (6 files)
- analytics/ module (4 files)
- ai/ module (3 files)
- db/schema/feedback.schema.ts

mana-core-auth now contains ONLY pure auth:
- Better Auth (JWT, Sessions, 2FA, Passkeys, OIDC, Magic Links)
- Organizations/Guilds (membership management)
- API Keys, Security, Me (GDPR), Health, Metrics
- Ready for Phase 5: Hono rewrite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 02:29:24 +01:00
Till JS
ced7dd7441 feat(monitoring): add mana-sync, mana-notify, mana-crawler to Prometheus
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 02:18:21 +01:00
Till JS
d7799ec95d refactor(photos): remove NestJS backend, use local-first + direct mana-media
The Photos NestJS backend was a proxy to mana-media that enriched
responses with local album/favorite/tag data. Now:

- Albums store → local-first via albumCollection + albumItemCollection
- Favorites → local-first via favoriteCollection (toggle in IndexedDB)
- Photo tags → local-first via photoTagCollection
- Photo listing/stats → direct mana-media API calls from frontend
- Upload → direct mana-media upload from frontend
- Delete → direct mana-media delete from frontend

Removed 27 TypeScript files, 1 Docker container, 1 port (3039).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 02:18:03 +01:00
Till JS
dd2f814cf3 refactor(presi): replace NestJS backend with lightweight Hono server
The Presi NestJS backend (40 source files, 50 deps) was a CRUD wrapper
around decks, slides, and themes — all now handled by local-first sync.

Only the share-link feature requires server-side state (public URLs
without auth), so a minimal Hono + Bun server replaces the entire
NestJS backend:

- apps/presi/apps/server/ — Hono server with share routes + GDPR admin
  Uses @manacore/shared-hono for auth (JWKS), health, admin, errors
- Web app API client stripped to share-only (was 270 lines → 90 lines)
- Removed from docker-compose, CI/CD, Prometheus, env generation
- NestJS backend deleted (40 TS files, 8 test specs, 3038 lines)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 02:08:40 +01:00
Till JS
32939fbfb5 refactor(infra): remove zitare + clock NestJS backends, add shared-hono package
Both apps are fully local-first via Dexie.js + mana-sync. Their NestJS
backends were pure CRUD wrappers (20 + 31 source files) that are no
longer needed.

Changes:
- Add packages/shared-hono: JWT auth via JWKS (jose), Drizzle DB factory,
  health route, generic GDPR admin handler, error middleware
- Migrate zitare lists page from fetch() to listsStore (local-first)
- Rewrite clock timers store from API-based to timerCollection (Dexie)
- Update clock +layout.svelte CommandBar search to use local collections
- Remove zitare-backend + clock-backend from docker-compose, CI/CD,
  Prometheus, env generation, setup scripts
- Add docs/TECHNOLOGY_AUDIT_2026_03.md with full repo analysis

Net result: -2 Docker containers, -2 ports, -2728 lines of code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:43:46 +01:00
Till JS
a31ccc6c62 feat(infra): add api.mana.how route + Prometheus scrape targets for Go services
- Cloudflare Tunnel: api.mana.how → localhost:3060 (Go API Gateway)
- Prometheus: scrape targets for mana-api-gateway:3060 and mana-matrix-bot:4000

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 21:27:04 +01:00
Till JS
169821de1a feat(monitoring): add LLM Grafana dashboard, Prometheus scraping, and alerts
Wire mana-llm service into the monitoring stack:

Prometheus (docker/prometheus/prometheus.yml):
- Add mana-llm scrape job (port 3025, 15s interval)
- Include mana-llm in ServiceDown alert expression

Alerts (docker/prometheus/alerts.yml):
- New llm_alerts group with 4 rules:
  - LLMServiceDown: mana-llm down > 1 min (critical)
  - LLMHighErrorRate: > 10% errors for 5 min (warning)
  - OllamaProviderDown: > 50% requests via Google fallback (warning)
  - LLMSlowResponses: p95 > 30s for 5 min (warning)

Grafana Dashboard (docker/grafana/dashboards/mana-llm.json):
- 6 stat panels: status, req/min, error rate, fallback rate, latency, tokens/min
- Requests by Provider (stacked area: Ollama vs Google vs OpenRouter)
- Tokens by Type (prompt vs completion)
- Latency Percentiles (p50, p90, p99)
- Latency by Provider comparison
- Requests by Model breakdown
- Errors by Type
- Google Fallback Rate over time (with threshold coloring)
- Provider Distribution pie chart (24h)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 11:16:27 +01:00
Till JS
143112f77a feat(observability): add mana-search, mana-media, and Synapse to monitoring
- Add Prometheus scraping for mana-search (port 3020, already has metrics)
- Add Prometheus scraping for mana-media (port 3015, MetricsModule added)
- Add Prometheus scraping for Matrix Synapse (port 9002, already enabled)
- Add MetricsModule to mana-media with media_ prefix
- Update Dockerfile for mana-media to include shared-nestjs-metrics
- Replace hardcoded ServiceDown alert list with dynamic regex
  (.*-backend|mana-core-auth|mana-search|mana-media|synapse)
- Replace hardcoded backends.json query with dynamic regex
- Add Search, Media, Synapse to master-overview and system-overview dashboards

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 10:46:59 +01:00
Till JS
c8de944c8d feat(monitoring): add GlitchTip health check and disk space monitoring
- Add GlitchTip to health-check.sh monitoring endpoints
- Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds
- Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint
- Add ManaData disk usage gauge to Grafana system-overview dashboard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:33:09 +01:00
Till JS
6fa6509fa5 feat(observability): add metrics and monitoring for all 15 backends
- Add MetricsModule to 8 backends missing it (photos, zitare, mukke,
  planta, picture, storage, presi, nutriphi)
- Enable Prometheus scraping for all 15 backends in prometheus.yml
  (was only 6, with 3 commented out and 6 missing entirely)
- Update ServiceDown alert rule to cover all 15 backends
- Update Grafana dashboards (backends, master-overview, system-overview)
  with all backend services in health panels
- Fix imprecise regex in application-details dashboard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:09:04 +01:00
Till JS
3f91c4656a feat(infra): add deploy tracking with PostgreSQL, Pushgateway & Grafana dashboard
Instrument the CD pipeline to record per-deploy and per-service metrics
(build time, image size, startup time, health status) into PostgreSQL and
push gauges to Pushgateway. Adds a Grafana dashboard with 13 panels covering
deploy frequency, build performance, service health, and history.

New files:
- scripts/mac-mini/init-deploy-tracking.sql (idempotent DDL)
- scripts/deploy-metrics.sh (bash library for CI)
- docker/grafana/provisioning/datasources/deploy-tracking.yml
- docker/grafana/dashboards/deploy-tracking.json

Modified:
- docker/prometheus/prometheus.yml (pushgateway scrape job)
- .github/workflows/cd-macmini.yml (build/health instrumentation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 17:08:03 +01:00
Till-JS
acc8de36ee feat(monitoring): add alerting stack and maintenance scripts
Medium priority stability improvements:

Alerting:
- Add vmalert for evaluating Prometheus alert rules
- Add alertmanager for alert routing and grouping
- Add alert-notifier service for Telegram/ntfy notifications
- Enable cadvisor scraping in prometheus config

Disk Monitoring:
- Add check-disk-space.sh for hourly disk monitoring
- Alert on 80% (warning) and 90% (critical) thresholds
- Auto-cleanup Docker when disk is critical
- Add com.manacore.disk-check.plist for LaunchD

Weekly Reports:
- Add weekly-report.sh for system health summary
- Includes: backup status, disk usage, container health,
  database stats, error log summary
- Runs every Sunday at 10 AM via LaunchD

Health Check Updates:
- Add checks for vmalert, alertmanager, alert-notifier

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 13:46:57 +01:00
Till-JS
fe33f4b355 fix(mana-core-auth): complete production readiness with test fixes
- Fix LoggerService mock in better-auth.service.spec.ts
- Fix name assertion in auth.controller.spec.ts (empty string fallback)
- Fix createRemoteJWKSet mock in jwt-auth.guard.spec.ts
- Add Grafana dashboard for Auth Service monitoring
- Add 10 auth-specific Prometheus alert rules
- Update production readiness plan to 100% complete

All 199 unit tests passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 14:18:58 +01:00
Till-JS
7aa5115c78 📊 feat(monitoring): add node-exporter for host system metrics
- Add node-exporter service to docker-compose for CPU/Memory/Disk monitoring
- Enable node-exporter scrape target in Prometheus config
- Update System Overview dashboard with Host System section:
  - CPU, Memory, Disk usage gauges
  - Total RAM, Total Disk, Uptime, Load stats
  - CPU & Memory over time graph
  - Network I/O graph
- Add Node Exporter to service status panel

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 12:38:44 +01:00
Till-JS
1b39aa8308 🔧 fix(prometheus): disable non-existent scrape targets
Commented out:
- node-exporter (container not deployed)
- cadvisor (container not deployed)
- storage/presi/nutriphi-backend (no /metrics endpoint yet)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 05:53:22 +01:00
Till-JS
dac6a85427 🔧 fix(prometheus): correct backend ports and add missing services
- chat-backend: 3002 → 3030
- todo-backend: 3018 → 3031
- calendar-backend: 3016 → 3032
- clock-backend: 3017 → 3033
- contacts-backend: 3015 → 3034
- Add storage-backend (3035), presi-backend (3036), nutriphi-backend (3037)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 05:51:50 +01:00
Till-JS
edf13b7102 revert: fix CI by reverting Telegram notifications
Reverting 618c58c5 which broke the CI workflow.
Will re-add notifications after fixing the issue.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 10:40:10 +01:00
Till-JS
618c58c519 feat(ci): add Telegram notifications and Grafana CI/CD dashboard
- Add notify-start job with Telegram notification for build start
- Add notify-complete job with build status and duration notification
- Push CI metrics to Prometheus Pushgateway for Grafana visualization
- Create CI/CD Grafana dashboard with build status, duration, and history
- Add Pushgateway scrape config to Prometheus

Requires TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and PUSHGATEWAY_URL secrets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 10:31:17 +01:00
Till-JS
8c259a008b feat(monitoring): add comprehensive Grafana dashboards and alerting
New dashboards:
- Application Details: Node.js runtime (heap, event loop, GC),
  HTTP details (status codes, methods, top routes), error analysis
- Database Details: PostgreSQL and Redis metrics with detailed breakdowns

Alerting rules (docker/prometheus/alerts.yml):
- Service: down, high/very high error rate, slow response time
- Infrastructure: high CPU/memory/disk usage
- Database: PostgreSQL/Redis down, high connections, low cache hit
- Container: high CPU/memory, restarts

All dashboards include service selector variable for filtering.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 09:47:18 +01:00
Till-JS
6d86a08d63 feat: add monitoring dashboard (Prometheus + Grafana + Umami + Admin)
Phase 1: Infrastructure
- Add docker/prometheus/prometheus.yml with scrape configs for all services
- Add docker/grafana/provisioning for auto-configured datasources
- Add docker/grafana/dashboards (system-overview, backends-docker)
- Update docker-compose.macmini.yml with monitoring services:
  - prometheus, grafana, node-exporter, cadvisor
  - postgres-exporter, redis-exporter, umami
- Add grafana.mana.how and analytics.mana.how to Caddyfile

Phase 2: Backend Metrics
- Create packages/shared-nestjs-metrics with:
  - MetricsModule (auto /metrics endpoint)
  - MetricsService (Counter, Histogram, Gauge helpers)
  - MetricsMiddleware (auto HTTP request tracking)

Phase 3: Umami Web Analytics
- Add Umami tracking scripts to all landing pages
- Add Umami tracking scripts to all web apps
- Create scripts/mac-mini/setup-umami-db.sh

Phase 4: Admin Dashboard (ManaCore Web)
- Add admin routes: /admin, /admin/users, /admin/system
- Create StatCard, QuickLinks, UserTable components
- Add Admin link to navigation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 15:31:39 +01:00