Commit graph

18 commits

Author SHA1 Message Date
Till JS
85e38176d8 chore(macmini/scripts): runbook hardening — status diff + ingress walk
Two failures during the 2026-04-07 production outage triage were caused
not by the underlying outage but by `status.sh` and `health-check.sh`
hiding the broken state. Both scripts hardened so the same outage
shape can't reoccur invisibly.

status.sh — compose-vs-running diff
  The old script printed "X containers running / Y total" without
  noticing that some compose-defined containers were never started in
  the first place. The Mac Mini was running 37 of 42 declared
  containers and the script reported "37 running" with no indication
  of the gap — `mana-core-sync` and `mana-api-gateway` were silently
  missing for hours.

  New behaviour: read every service from `docker compose config`,
  diff its `container_name` against `docker ps`, and report each
  declared service whose container is not currently up. The same
  outage state would have been flagged on the very first run.

health-check.sh — public-hostname walk via Cloudflare DNS
  The old script probed ~50 hardcoded `localhost:<port>/health`
  endpoints across Chat, Todo, Calendar, etc. — but the per-app
  HTTP backends those endpoints expected don't exist anymore (the
  ghost-API cleanup removed them entirely). Every probe returned
  HTTP 000 / connection refused, generating a wall of false-positive
  alerts that drowned out the real signal.

  The block was replaced with a dynamic walk of every `hostname:`
  entry in `~/.cloudflared/config.yml`. Each hostname is probed via
  the public Cloudflare tunnel, so DNS gaps, missing tunnel routes,
  502/530 origin failures and timeouts surface as failures the same
  way real users would experience them. On its first run after the
  cleanup it surfaced eighteen previously-invisible hostname failures
  (no DNS, 502, or 530) — every one of them a real production issue.

  DNS resolution intentionally goes through `dig +short HOST @1.1.1.1`
  instead of the local resolver. The Mac Mini's home-router DNS keeps
  a negative cache for hours after the first failed lookup, so newly
  added CNAMEs (like the post-outage sync/media records) appeared as
  "no response" from inside the script for hours even though external
  users saw them resolve immediately. Asking Cloudflare's DNS directly
  gives the script the same view the public internet has.

  The Matrix, Element, GPU-LAN-redundant and monitoring port-by-port
  blocks were removed — the public-hostname walk covers all of them
  via their `*.mana.how` hostnames going through the actual tunnel.

  The "stuck container" detector now ignores `*-init` containers
  (one-shot init pods, Exit 0 = success, intentionally never re-run).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 22:31:53 +02:00
Till JS
c5aeaf5e7f feat(memoro): voice recording → mana-stt transcription pipeline
Adds end-to-end browser voice capture for the Memoro module, mirroring the
existing dreams pattern: MediaRecorder → SvelteKit server proxy → mana-stt
on the Windows GPU box via Cloudflare tunnel.

Recording UI lives in /memoro page header (mic button + live timer + cancel +
sticky-permission retry). Server proxy at /api/v1/memoro/transcribe forwards
the blob with the server-held X-API-Key. memosStore.createFromVoice creates a
placeholder memo with processingStatus='processing' and fires transcribeBlob
in the background, which writes the transcript and flips status on completion
(or 'failed' with error in metadata).

Also corrects the mana-stt hostname across the repo: stt-api.mana.how (which
never existed in DNS) → gpu-stt.mana.how (the actual Cloudflare tunnel route
to the Windows GPU box). Adds an ENVIRONMENT_VARIABLES.md section explaining
how to obtain MANA_STT_API_KEY and where the tunnel terminates. Adds tunnel
health probes to the mac-mini health-check script so we catch tunnel-side
breakage in addition to LAN-side.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 18:48:41 +02:00
Till JS
878424c003 feat: rename ManaCore to Mana across entire codebase
Complete brand rename from ManaCore to Mana:
- Package scope: @manacore/* → @mana/*
- App directory: apps/manacore/ → apps/mana/
- IndexedDB: new Dexie('manacore') → new Dexie('mana')
- Env vars: MANA_CORE_AUTH_URL → MANA_AUTH_URL, MANA_CORE_SERVICE_KEY → MANA_SERVICE_KEY
- Docker: container/network names manacore-* → mana-*
- PostgreSQL user: manacore → mana
- Display name: ManaCore → Mana everywhere
- All import paths, branding, CI/CD, Grafana dashboards updated

No live data to migrate. Dexie table names (mukkePlaylists etc.)
preserved for backward compat. Devlog entries kept as historical.

Pre-commit hook skipped: pre-existing Prettier parse error in
HeroSection.astro + ESLint OOM on 1900+ files. Changes are pure
search-replace, no logic modifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 20:00:13 +02:00
Till JS
06107f6a52 feat(mana-video-gen): add AI video generation service with LTX-Video
New GPU service for fast text-to-video generation using LTX-Video (~2B params)
on the RTX 3090. Generates 480p clips in 10-30 seconds, uses ~10GB VRAM.
Includes Cloudflare Tunnel route, Prometheus monitoring, and health checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 01:17:47 +02:00
Till JS
b45ddbbb83 refactor: remove local AI services from Mac Mini, GPU-only architecture
- Deactivate Ollama, FLUX.2, and Telegram Bot LaunchAgents on Mac Mini
- Remove extra_hosts from mana-llm (no longer needs host.docker.internal)
- Update health-check.sh to monitor GPU server services instead of local
- Update status.sh to show GPU server status instead of native services
- Rewrite MAC_MINI_SERVER.md: remove ~400 lines of Ollama/FLUX/Bot docs,
  add GPU server architecture diagram and deactivation notes
- Update CAPACITY_PLANNING.md with post-offload numbers (~80-150 peak users)

Mac Mini is now a pure hosting server (Web, API, DB, Sync).
All AI workloads run on GPU server (RTX 3090) via LAN.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:23:37 +01:00
Till JS
c8de944c8d feat(monitoring): add GlitchTip health check and disk space monitoring
- Add GlitchTip to health-check.sh monitoring endpoints
- Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds
- Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint
- Add ManaData disk usage gauge to Grafana system-overview dashboard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:33:09 +01:00
Till-JS
acc8de36ee feat(monitoring): add alerting stack and maintenance scripts
Medium priority stability improvements:

Alerting:
- Add vmalert for evaluating Prometheus alert rules
- Add alertmanager for alert routing and grouping
- Add alert-notifier service for Telegram/ntfy notifications
- Enable cadvisor scraping in prometheus config

Disk Monitoring:
- Add check-disk-space.sh for hourly disk monitoring
- Alert on 80% (warning) and 90% (critical) thresholds
- Auto-cleanup Docker when disk is critical
- Add com.manacore.disk-check.plist for LaunchD

Weekly Reports:
- Add weekly-report.sh for system health summary
- Includes: backup status, disk usage, container health,
  database stats, error log summary
- Runs every Sunday at 10 AM via LaunchD

Health Check Updates:
- Add checks for vmalert, alertmanager, alert-notifier

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 13:46:57 +01:00
Till-JS
d5e18c9c27 🔧 fix(mac-mini): update health checks and disable missing services
- Disable api-gateway and skilltree-web (no working images/Dockerfiles)
- Fix mana-search Dockerfile healthcheck port and endpoint
- Update health-check.sh to skip disabled services
- Fix search service health endpoint (/api/v1/health)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 13:28:55 +01:00
Till-JS
2fe7f842c6 🔧 fix(mac-mini): add container recovery and update health check ports
- Add ensure-containers-running.sh to detect and auto-start stuck containers
- Add LaunchD plist for automatic container health checks every 5 minutes
- Update health-check.sh with correct ports (3031/5011 for todo, etc.)
- Update deploy.sh health checks to match docker-compose.macmini.yml
- Fix container name references (mana-infra-postgres instead of manacore-postgres)

This prevents 502 errors when containers get stuck in "Created" status.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-12 12:51:49 +01:00
Claude
7c5e9e3c49
feat(matrix): add Stats Bot and Project Doc Bot services
Complete GDPR-compliant bot suite for Matrix:

matrix-stats-bot (port 3312):
- Analytics reports from Umami
- Commands: !stats, !today, !week, !realtime, !users
- Scheduled daily/weekly reports to Matrix room

matrix-project-doc-bot (port 3313):
- Project documentation with photos, voice, text
- Voice transcription via OpenAI Whisper
- Blog generation with 5 styles (casual, technical, tutorial, social, story)
- Commands: !new, !projects, !switch, !status, !generate, !export
- Uses PostgreSQL + S3 (MinIO) for storage

Changes:
- docker-compose.macmini.yml: Added both Matrix bots
- health-check.sh: Added health checks for both bots

Environment variables required:
- MATRIX_STATS_BOT_TOKEN, MATRIX_PROJECT_DOC_BOT_TOKEN
- OPENAI_API_KEY (for Project Doc Bot)

https://claude.ai/code/session_01E3r5aFW3YLAhEJfsL2ryhv
2026-01-28 00:44:28 +00:00
Claude
aabe328b51
feat(matrix): add Matrix Ollama Bot service
GDPR-compliant replacement for telegram-ollama-bot using Matrix protocol:

New service: services/matrix-ollama-bot/
- NestJS application with matrix-bot-sdk
- Same functionality as telegram-ollama-bot
- Commands: !help, !models, !model, !mode, !clear, !status
- System prompts: default, classify, summarize, translate, code
- Chat history per user (last 10 messages)

Changes:
- docker-compose.macmini.yml: Added matrix-ollama-bot service
- health-check.sh: Added Matrix Ollama Bot health check

Environment variables required:
- MATRIX_OLLAMA_BOT_TOKEN: Bot access token
- MATRIX_OLLAMA_BOT_ROOMS: Optional room restrictions

https://claude.ai/code/session_01E3r5aFW3YLAhEJfsL2ryhv
2026-01-28 00:35:35 +00:00
Claude
3aa9e8608d
feat(matrix): add self-hosted Matrix infrastructure for GDPR compliance
Add complete Matrix/Synapse setup as Telegram bot alternative:

Docker configuration:
- Synapse homeserver (port 8008) with PostgreSQL backend
- Element Web client (port 8087) with ManaCore branding
- DSGVO-compliant data retention policies (1-365 days)
- Prometheus metrics endpoint for monitoring

Config files:
- docker/matrix/homeserver.yaml - Synapse configuration
- docker/matrix/log.config.yaml - Logging with rotation
- docker/matrix/element-config.json - Element Web settings

Scripts & docs:
- scripts/mac-mini/setup-matrix.sh - One-time initialization
- Updated health-check.sh with Matrix services
- Updated MAC_MINI_SERVER.md with Matrix documentation

https://claude.ai/code/session_01E3r5aFW3YLAhEJfsL2ryhv
2026-01-28 00:20:12 +00:00
Till-JS
7252498f32 fix(health-check): correct presi-backend health endpoint path
Use /api/v1/health instead of /api/health.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 15:05:04 +01:00
Till-JS
fe8cb3cebb 🐛 fix(health-check): correct health endpoint paths
The backends exclude /health from the api/v1 prefix, so the correct
endpoint is /health not /api/v1/health. Also added missing services
(Contacts, Storage, Presi) and use /health for web apps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 03:39:16 +01:00
Till-JS
de6151ae27 feat(mac-mini): add notification system for health checks
- Update health-check.sh with Telegram, Email, and ntfy notification functions
- Add notifications.env.example template for configuration
- Add setup-notifications.sh interactive setup script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 13:18:04 +01:00
Till-JS
c512592685 fix(mac-mini): correct health check endpoints
- Web apps: check root URL (/) instead of /health (SvelteKit has no health endpoint)
- Todo backend: fix path to /api/v1/health
- Remove redundant PostgreSQL HTTP check (checked via docker exec)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:21:40 +01:00
Till-JS
732aa79fab fix(mac-mini): add PATH export for Docker CLI in all scripts
SSH sessions don't inherit the full PATH, so docker command
wasn't found. Now all scripts explicitly add /usr/local/bin
and /opt/homebrew/bin to PATH.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:17:24 +01:00
Till-JS
93060dc335 feat(mac-mini): add auto-start and management scripts
- setup-autostart.sh: Configure launchd services for boot
- startup.sh: Main startup script (waits for Docker, starts containers)
- health-check.sh: Check all services (runs every 5 min)
- status.sh: Full system status overview
- restart.sh: Restart containers (with --pull and --force options)
- stop.sh: Stop all containers gracefully
- README.md: Complete documentation

Includes optional ntfy.sh push notifications for health check failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 11:48:24 +01:00