- check-disk-space.sh now pushes mac_disk_used_percent + mac_colima_disk_used_gb
to Pushgateway every hour so vmalert can alert on real macOS disk usage
- alerts.yml: replace broken node-exporter disk alerts with Pushgateway-based ones
- master-overview.json: add "Recent Errors (Loki)" section with live error log
stream, error rate timeseries and top error sources barchart
- move-colima-to-external-ssd.sh: guided script to move 200GB Colima VM
datadisk from internal SSD to /Volumes/ManaData (3.6TB external SSD)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
node-exporter runs in VM and can't see host macOS disks directly.
Use custom mac_disk_used_percent metrics pushed via Pushgateway instead.
Also add ColimaVMDiskLarge alert when datadisk exceeds 150 GB.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
relabel drop removed the entire stream before labels were set, causing
the "at least one label pair required" error. pipeline_stages drop runs
after labels are established, which is correct for filtering by tier.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Containers that don't match any tier regex had no labels, causing Loki to
reject the stream with "at least one label pair is required".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
todo-web, calendar-web, contacts-web, mana-web all depend on
@manacore/shared-links but it was missing from the base image COPY list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vmalert: was copying prometheus.yml into /etc/alerts/ causing parse
failure. Now only copies alerts.yml (the actual rules file).
synapse: mana-auth (Better Auth) has no OIDC discovery endpoint,
so disable OIDC and enable password auth until OIDC is implemented.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Loki was already running but had no log shipper. Adds Promtail to collect
Docker logs from all 66 containers with automatic tier labeling (infra,
auth, core, app, matrix, games) and a Grafana Logs Explorer dashboard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build-app.sh now checks available RAM before builds and only stops
monitoring containers when free memory is below 3 GB threshold.
New memory-baseline.sh script measures per-container and per-category
RAM usage for capacity planning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
patchedDependencies was already cleaned in package.json. The sed
command was mangling the JSON structure.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
shared-errors, shared-logger, shared-llm, notify-client are not
needed by SvelteKit web apps. Their presence caused transitive
dependency conflicts (astro check failing).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Route all AI workloads (Ollama, STT, TTS, Image Gen) to GPU server
(192.168.178.11) via LAN instead of host.docker.internal
- Upgrade default model to gemma3:12b and max concurrent to 5
- Add daily signup limit service (MAX_DAILY_SIGNUPS env var)
- Add GET /api/v1/auth/signup-status public endpoint
- Add k6 load test suite (web-apps, auth, sync-websocket, ollama)
- Add capacity planning documentation
- Fix: add eslint-config to sveltekit-base and calendar Dockerfiles
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After extensive package restructuring (deletions, consolidations, new
packages), the frozen lockfile causes resolution failures in Docker.
Use --no-frozen-lockfile until lockfile stabilizes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add 8 packages that were created after the base image was defined:
subscriptions, credits, shared-hono, shared-storage, shared-landing-ui,
shared-llm, notify-client, shared-errors, shared-logger
Fixes: Rollup failed to resolve @manacore/subscriptions during web builds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace deleted shared-feedback-* and shared-help-* packages with
consolidated @manacore/feedback and @manacore/help packages.
Add local-store and shared-auth-stores packages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add mana-user (3062), mana-subscriptions (3063), mana-analytics (3064)
to docker-compose with health checks and traefik labels
- Replace old NestJS Tier 3 app backends (~300 lines) with comment
placeholder for Hono compute servers (need shared Dockerfile)
- Create docker/Dockerfile.hono-server — shared Bun Dockerfile for
all 14 app compute servers (ARG APP for build context)
- Add 5 new databases to setup-databases.sh: mana_auth, mana_credits,
mana_user, mana_subscriptions, mana_analytics, mana_sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract feedback, analytics, and AI modules from mana-core-auth into
standalone mana-analytics service (Hono + Bun, Port 3064).
New service (services/mana-analytics/):
- User feedback CRUD with voting
- AI-powered feedback title generation via mana-llm
- Simplified from DuckDB analytics to pure PostgreSQL
- ~550 LOC
Removed from mana-core-auth:
- feedback/ module (6 files)
- analytics/ module (4 files)
- ai/ module (3 files)
- db/schema/feedback.schema.ts
mana-core-auth now contains ONLY pure auth:
- Better Auth (JWT, Sessions, 2FA, Passkeys, OIDC, Magic Links)
- Organizations/Guilds (membership management)
- API Keys, Security, Me (GDPR), Health, Metrics
- Ready for Phase 5: Hono rewrite
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Photos NestJS backend was a proxy to mana-media that enriched
responses with local album/favorite/tag data. Now:
- Albums store → local-first via albumCollection + albumItemCollection
- Favorites → local-first via favoriteCollection (toggle in IndexedDB)
- Photo tags → local-first via photoTagCollection
- Photo listing/stats → direct mana-media API calls from frontend
- Upload → direct mana-media upload from frontend
- Delete → direct mana-media delete from frontend
Removed 27 TypeScript files, 1 Docker container, 1 port (3039).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Presi NestJS backend (40 source files, 50 deps) was a CRUD wrapper
around decks, slides, and themes — all now handled by local-first sync.
Only the share-link feature requires server-side state (public URLs
without auth), so a minimal Hono + Bun server replaces the entire
NestJS backend:
- apps/presi/apps/server/ — Hono server with share routes + GDPR admin
Uses @manacore/shared-hono for auth (JWKS), health, admin, errors
- Web app API client stripped to share-only (was 270 lines → 90 lines)
- Removed from docker-compose, CI/CD, Prometheus, env generation
- NestJS backend deleted (40 TS files, 8 test specs, 3038 lines)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both apps are fully local-first via Dexie.js + mana-sync. Their NestJS
backends were pure CRUD wrappers (20 + 31 source files) that are no
longer needed.
Changes:
- Add packages/shared-hono: JWT auth via JWKS (jose), Drizzle DB factory,
health route, generic GDPR admin handler, error middleware
- Migrate zitare lists page from fetch() to listsStore (local-first)
- Rewrite clock timers store from API-based to timerCollection (Dexie)
- Update clock +layout.svelte CommandBar search to use local collections
- Remove zitare-backend + clock-backend from docker-compose, CI/CD,
Prometheus, env generation, setup scripts
- Add docs/TECHNOLOGY_AUDIT_2026_03.md with full repo analysis
Net result: -2 Docker containers, -2 ports, -2728 lines of code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Cloudflare Tunnel: api.mana.how → localhost:3060 (Go API Gateway)
- Prometheus: scrape targets for mana-api-gateway:3060 and mana-matrix-bot:4000
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add docker/Dockerfile.sveltekit-base: pre-built base with all 34 shared
packages (mirrors nestjs-base pattern), eliminates redundant COPY/build
steps from individual web Dockerfiles
- Add scripts/mac-mini/build-app.sh: stops monitoring stack before build
to free RAM, auto-restarts on exit (trap cleanup)
- Migrate todo web Dockerfile to use sveltekit-base:local (47 COPY lines
→ 2, 4 build steps → 0)
- Update CD workflow to build sveltekit-base when deploying web apps
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validator now checks 52 Dockerfiles (web + backend + service).
Fixed 10 missing COPYs across backends, services, and nestjs-base.
Generator also supports backend/service Dockerfiles with markers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two infrastructure improvements for tech independence:
1. Cloudflare Fallback Documentation (docs/CLOUDFLARE_FALLBACK.md):
- Plan B: WireGuard + Caddy on Hetzner VPS (€3.79/mo)
- Complete Caddyfile with all 30+ subdomains
- Step-by-step failover checklist (~15 min to switch)
- Plan C: Direct IP with ISP
2. Self-Hosted Landing Pages (eliminates Cloudflare Pages dependency):
- Nginx container (mana-infra-landings) on port 4400
- Multi-site config: each subdomain → separate dist/ folder
- Build script: scripts/mac-mini/build-landings.sh
- Cloudflare Tunnel ingress rules for 10 landing page domains
- Storage: /Volumes/ManaData/landings/ on external SSD
- Domains: it, chats, pics, zitares, presis, clocks,
manadeck, nutriphi, citycorners, docs
Migration path: Build landings locally, set Cloudflare DNS to
tunnel instead of Pages, then decommission CF Pages projects.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The config_file override replaced the entire default PostgreSQL config
including listen_addresses, breaking inter-container communication.
Use inline -c flags instead which only override specific parameters.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire mana-llm service into the monitoring stack:
Prometheus (docker/prometheus/prometheus.yml):
- Add mana-llm scrape job (port 3025, 15s interval)
- Include mana-llm in ServiceDown alert expression
Alerts (docker/prometheus/alerts.yml):
- New llm_alerts group with 4 rules:
- LLMServiceDown: mana-llm down > 1 min (critical)
- LLMHighErrorRate: > 10% errors for 5 min (warning)
- OllamaProviderDown: > 50% requests via Google fallback (warning)
- LLMSlowResponses: p95 > 30s for 5 min (warning)
Grafana Dashboard (docker/grafana/dashboards/mana-llm.json):
- 6 stat panels: status, req/min, error rate, fallback rate, latency, tokens/min
- Requests by Provider (stacked area: Ollama vs Google vs OpenRouter)
- Tokens by Type (prompt vs completion)
- Latency Percentiles (p50, p90, p99)
- Latency by Provider comparison
- Requests by Model breakdown
- Errors by Type
- Google Fallback Rate over time (with threshold coloring)
- Provider Distribution pie chart (24h)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New project with three apps:
- Landing (Astro): static site with SVG illustrations, location data
- Backend (NestJS, port 3025): CRUD API for locations + favorites, Drizzle ORM, auth via mana-core-auth
- Web (SvelteKit, port 5196): Tailwind 4, PillNav, auth (login/register/SSO), Leaflet map, favorites with optimistic updates, theme/settings
Infrastructure: DB init SQL, setup-databases.sh, generate-env.mjs, root package.json scripts, Dockerfiles, docker-compose.macmini.yml (backend:3025, web:5022), Cloudflare wrangler.toml.
Branding: registered in shared-branding (AppId, APP_BRANDING, APP_ICONS, MANA_APPS, CitycornersLogo).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add GlitchTip to health-check.sh monitoring endpoints
- Add native disk space checks for / and /Volumes/ManaData with 80%/90% thresholds
- Extend Prometheus disk alerts to include /host_mnt/Volumes/ManaData mountpoint
- Add ManaData disk usage gauge to Grafana system-overview dashboard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LightWrite was replaced by Mukke on the same ports (5180/3010).
Update reverse proxy to use mukke.mana.how and mukke-api.mana.how.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MetricsModule to 8 backends missing it (photos, zitare, mukke,
planta, picture, storage, presi, nutriphi)
- Enable Prometheus scraping for all 15 backends in prometheus.yml
(was only 6, with 3 commented out and 6 missing entirely)
- Update ServiceDown alert rule to cover all 15 backends
- Update Grafana dashboards (backends, master-overview, system-overview)
with all backend services in health panels
- Fix imprecise regex in application-details dashboard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure sw.js, manifest.webmanifest, and registerSW.js are never cached
by the browser or CDN so service worker updates are picked up immediately
after deploys. Uses a reusable Caddy snippet imported by all web app blocks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add docker/Dockerfile.nestjs-base with all shared packages pre-built
- Convert 6 backend Dockerfiles (chat, todo, calendar, clock, contacts,
mukke) to inherit from nestjs-base:local
- Fix bugs: duplicate shared-nestjs-setup builds (mukke), unnecessary
shared-error-tracking rebuild in production stage (chat, clock)
- CD pipeline builds base image before services when backends deploy
- Net reduction: 317 lines removed, 112 added (-205 lines)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dockerfile, docker-compose service (port 5100), Caddy and cloudflared
routing for the WhoPixels game. PORT is now configurable via env var.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add explicit uid: deploy-tracking to datasource provisioning
- Add instant: true to all Prometheus stat/gauge panel queries
- Pushgateway gauges need instant queries, not range queries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument the CD pipeline to record per-deploy and per-service metrics
(build time, image size, startup time, health status) into PostgreSQL and
push gauges to Pushgateway. Adds a Grafana dashboard with 13 panels covering
deploy frequency, build performance, service health, and history.
New files:
- scripts/mac-mini/init-deploy-tracking.sql (idempotent DDL)
- scripts/deploy-metrics.sh (bash library for CI)
- docker/grafana/provisioning/datasources/deploy-tracking.yml
- docker/grafana/dashboards/deploy-tracking.json
Modified:
- docker/prometheus/prometheus.yml (pushgateway scrape job)
- .github/workflows/cd-macmini.yml (build/health instrumentation)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PostgreSQL datasource pointing to GlitchTip database
- Add Error Tracking dashboard with 7 panels:
- Total Open Issues (stat)
- Issues by Project (pie chart)
- Total Events (stat)
- Projects Tracked (stat)
- Resolved vs Unresolved (stat)
- New Issues Over Time (stacked bar chart, 30 days)
- Recent Issues (table with 50 latest, color-coded levels)
- Dashboard links to GlitchTip UI for detailed investigation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Infrastructure:
- Add GlitchTip (web + worker) to docker-compose.macmini.yml (port 8020)
- Add glitchtip.mana.how to Cloudflare Tunnel config
- Add glitchtip database to init-db SQL
- Add GLITCHTIP_DSN to .env.development
Shared Package (@manacore/shared-error-tracking):
- initErrorTracking() - Sentry-compatible init with GlitchTip DSN
- captureException(), captureMessage(), setUser(), setTag(), flush()
- SentryExceptionFilter for NestJS (captures 5xx errors only)
- Graceful no-op when DSN is not configured
Integration:
- Add instrument.ts to calendar, contacts, todo backends
- Import instrument.ts before app bootstrap in all 3 main.ts files
- Error tracking auto-initializes when GLITCHTIP_DSN env var is set
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Create NestJS backend on port 3020 with 4 modules (space, document, ai, token)
- Add Drizzle schema with 5 tables (spaces, documents, token_transactions, model_prices, user_tokens)
- Rewrite web services (spaces, documents, tokens, ai) to use shared API client instead of Supabase
- Move AI API keys server-side (Azure OpenAI, Google Gemini)
- Add seed script for model prices (gpt-4.1, gemini-pro, gemini-flash)
- Add 70 unit tests across 4 test suites (space, document, token, ai services)
- Add monorepo integration (setup-databases.sh, generate-env.mjs, docker init-db, root scripts)
- Remove @supabase/supabase-js dependency and delete supabase.ts from web app
- Update CLAUDE.md with full API documentation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>