Moved the Web-Research-Orchestrator (16+ search/LLM providers) to the
GPU-Box. Cross-LAN access for mana-auth/mana-credits/mana-llm/mana-search/
postgres/redis (192.168.178.131). research.mana.how now routes to the
mana-gpu-server tunnel (CF config v29). Mini container count 42 → 41.
PUBLIC_MANA_RESEARCH_URL in mana-app-web switched to the https URL, since
Mini containers cannot reach 192.168.178.11 directly (Colima NAT); the
cross-LAN bridge therefore goes through the Cloudflare tunnel, as with mana-ai.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mana-ai's /metrics endpoint is no longer exposed on Mini's
192.168.178.131:3067 (service moved to GPU-Box, no public /metrics
tunnel since the endpoint is internal). The blackbox-api job
already probes mana-ai.mana.how/health for liveness, which gives
us up/down without needing the metrics scrape.
The status page is now 58/58 UP after VictoriaMetrics rolled past the
stale 3067 samples.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After Phase 2f-3, mana-ai lives on the GPU-Box, so the
blackbox-internal docker-DNS probe (http://mana-ai:3066/health) is
gone: that target sits in a Docker network the blackbox-exporter
can't reach across the LAN. Move the probe into blackbox-api against
the public hostname; this gives the same up/down signal and also
exercises the Cloudflare-tunnel hop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups against the status-page DOWN list:
photon-self (photon.mana.how route):
mana-geocoding's /health/photon-self pings the photon backend, which
lives as a Docker container on the GPU-Box (port 2322). PHOTON_SELF_API_URL
was http://192.168.178.11:2322 — the Mini host can hit that fine, but
Docker containers on the Mini can't (a Colima NAT quirk we keep running
into). Routed photon through the mana-gpu-server tunnel (config v26) and
flipped the env var to https://photon.mana.how. The probe goes UP, and
geocoding for sensitive queries (privacy: 'local' provider tier) actually
works now too; it was effectively orphaned before.
whopxl removed everywhere it still lingered:
Container hasn't existed on the Mini in months (no compose service,
no source dir under apps/, no listener on :5100 — only the dead
cloudflared route + a stale CORS_ORIGINS entry on mana-auth). Cleaned
cloudflared-config.yml, prometheus.yml blackbox-web target, and the
mana-auth CORS list. Old DNS CNAME for whopxl.mana.how stays for now;
no harm.
Plus, while we were here: who-api.mana.how/api/decks is the right probe
for who-server's deck catalogue (the /api/decks route lives on who-api,
not on who.mana.how, which is the SSR shell).
Live: status.mana.how shows 58/59 UP; the last 'whopxl' entry will
fall off after VM's TSDB rolls past the probe_success staleness window.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit revealed status.mana.how was probing only the unified mana-app
path-routes (mana.how/{module}) plus a couple of GPU services. None
of the standalone deployments were monitored, and three probe targets
were stale.
Changes:
- prometheus.yml blackbox-web: drop mana.how/{context,who} (context
module was dropped 2026-04-29; mana.how/who never existed —
/who is a standalone stack on its own subdomain). Add the eight
hosts that DO have separate deployments today: whopxl, manavoxel,
memoro (landing), cards (Phase-1 spinoff), who.mana.how/cantina,
npm (Verdaccio).
- prometheus.yml blackbox-api: add memoro-api/health,
memoro-audio/health, who-api.mana.how/api/decks,
admin.mana.how/health (admin's root is auth-walled, only /health
returns 200).
- prometheus.yml blackbox-gpu: add gpu-llm.mana.how/health (was
missing; gpu-stt/tts/img/video were in, gpu-llm was somehow not).
- cloudflared-config.yml: restore who.mana.how → :5092 +
who-api.mana.how → :3092. The DNS CNAME points at the Mini tunnel
but the route entries had been lost during a previous compose
cleanup, so every who.* request was hitting the catch-all 404 and
the standalone Bun stack was effectively orphaned at the edge
(PM2 + LaunchAgent all healthy on Mini, just no public route).
Live state after rollout: status.mana.how shows 57/59 services UP,
the two remaining DOWN are pre-existing — photon-self (Phase-2c
cross-LAN routing limitation, documented in PLAN_OPTION_C.md) and
whopxl-web (container not running on the Mini, separate issue).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Arcade lives as its own pnpm workspace at ~/Documents/Code/arcade
now, with no @mana/* coupling. This drops every reference and the
games/ directory from the monorepo.
Removes:
- games/ directory (89 files: web + server + 22 HTML games + screenshots)
- @arcade/web, @arcade/server pnpm workspace entries (games/* globs)
- arcade scripts in root package.json (4 scripts)
- arcade.mana.how from mana-auth trusted origins + CORS_ORIGINS
- arcade entries in mana-apps registry, app-icons, URL overrides
- arcade.mana.how from cloudflared tunnel + prometheus blackbox probes
- arcade-web service block in docker-compose.macmini.yml
- generate-env.mjs entries for arcade server + web
- BRANDING_ONLY 'arcade' entry in registry consistency spec
- dead arcade translation keys in GuestWelcomeModal (DE+EN)
- arcade mention in CLAUDE.md, authentication guideline, MODULE_REGISTRY
Verified:
- services/mana-auth/src/auth/sso-config.spec.ts: 8/8 pass
- pnpm install regenerates lockfile cleanly (-536 lines)
- no remaining 'arcade' refs outside historical snapshot docs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pelias was retired from the Mac mini on 2026-04-28; photon-self
(self-hosted Photon on mana-gpu) has been the live primary since then.
This removes the now-dead Pelias adapter, config, tests, and the
services/mana-geocoding/pelias/ stack — the entire compose file, the
geojsonify_place_details.js patch, the setup.sh import script.
Provider chain is now `photon-self → photon → nominatim`. The chain
keeps its `privacy: 'local' | 'public'` split, sensitive-query
blocking, coord quantization, and aggressive caching unchanged.
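As an illustration of how that chain behaves (the types and function shapes
below are assumptions for the sketch, not the actual mana-geocoding code):

```ts
// Illustrative sketch only; not the real mana-geocoding implementation.
type PrivacyTier = 'local' | 'public';

interface GeoResult { lat: number; lon: number; label: string }

interface Provider {
  name: 'photon-self' | 'photon' | 'nominatim';
  privacy: PrivacyTier;              // photon-self is the only 'local'-tier provider
  search: (q: string) => Promise<GeoResult[]>;
}

async function geocode(chain: Provider[], q: string, sensitive: boolean): Promise<GeoResult[]> {
  for (const provider of chain) {
    // Sensitive queries never leave the LAN: only 'local' providers are eligible.
    if (sensitive && provider.privacy !== 'local') continue;
    try {
      const results = await provider.search(q);
      if (results.length > 0) return results;   // first non-empty answer wins
    } catch {
      // transient provider failure: fall through to the next tier
    }
  }
  return [];   // every eligible provider failed or returned nothing
}
```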
Three direct calls to nominatim.openstreetmap.org that bypassed
mana-geocoding now route through the wrapper:
- citycorners/add-city + citycorners/cities/[slug]/add use the shared
searchAddress() client (browser → same-origin proxy → mana-geocoding
→ photon-self).
- memoro mobile drops its OSM reverse-geocoding fallback entirely;
Expo's on-device reverse-geocoding stays as the sole path. Routing
through the wrapper would require a memoro-server proxy endpoint —
a follow-up if Expo's quality proves insufficient.
Other behavioral changes:
- CACHE_PUBLIC_TTL_MS dropped from 7d → 1h. The long TTL was a
privacy-amplification trick from the Pelias era; with photon-self
serving the bulk of traffic, a transient cross-LAN blip was pinning
cached fallback answers for days. 1h gives quick recovery.
- /health/pelias renamed to /health/photon-self; prometheus blackbox
config + status-page generator updated.
- mana-geocoding container no longer needs `extra_hosts:
host.docker.internal:host-gateway` (was only there for the
Pelias-on-host-network era).
113 tests passing. CLAUDE.md rewritten to reflect the post-Pelias
architecture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with c94ab01c6 which added the real /metrics endpoint. Without a
scrape job the policy_decisions_total counter has nowhere to go and
the soak period is flying blind.
30s interval to match mana-ai. Same job shape as mana-ai — any Grafana
dashboard that auto-discovers services via labels will pick this up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires mana-ai into the existing observability stack so tick throughput,
plan-failure rates, planner latencies, and snapshot refresh health are
visible in Grafana + Prometheus, and the service's uptime surfaces on
status.mana.how under the "Internal" section.
- `src/metrics.ts` — prom-client Registry with `mana_ai_` prefix.
Counters: ticks_total, plans_produced_total, plans_written_back_total,
parse_failures_total, mission_errors_total, snapshots_new/updated,
snapshot_rows_applied_total, http_requests_total.
Histograms: tick_duration_seconds (0.1–120s), planner_request_duration_seconds
(0.25–60s), http_request_duration_seconds (0.005–10s); see the sketch below.
- `src/index.ts` — HTTP middleware labels every request by
method/path/status; `/metrics` serves the Prometheus text format.
- `src/cron/tick.ts` — increments counters + wraps the tick with
`tickDuration.startTimer()`. Snapshot stats fold through.
- `src/planner/client.ts` — wraps `complete()` in a latency histogram
timer so planner tail latency shows up separately from tick duration.
- `docker/prometheus/prometheus.yml` —
1. New `mana-ai` scrape job against `mana-ai:3066/metrics` (30s).
2. `/health` added to the `blackbox-internal` job so uptime shows on
status.mana.how alongside mana-geocoding.
- `scripts/generate-status-page.sh` — friendly label for the new probe:
`mana-ai:3066/health` → "Mana AI Runner" (generator already iterates
`blackbox-internal`, no other changes needed).
- `package.json` — prom-client ^15.1.3
All 17 Bun tests still pass; tsc clean.
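For reference, a minimal sketch of how the prom-client registry and the
`/metrics` route can be wired (help strings, buckets, and file layout here
are illustrative, not copied from the actual sources):

```ts
// src/metrics.ts (sketch)
import { Counter, Histogram, Registry } from 'prom-client';

export const registry = new Registry();

export const ticksTotal = new Counter({
  name: 'mana_ai_ticks_total',
  help: 'Total runner ticks executed',
  registers: [registry],
});

export const tickDuration = new Histogram({
  name: 'mana_ai_tick_duration_seconds',
  help: 'Wall-clock duration of one tick',
  buckets: [0.1, 1, 5, 15, 60, 120],   // spans the 0.1–120s range mentioned above
  registers: [registry],
});

// src/index.ts (sketch): serve the Prometheus text format
import { Hono } from 'hono';

const app = new Hono();
app.get('/metrics', async (c) =>
  c.text(await registry.metrics(), 200, { 'Content-Type': registry.contentType })
);

// src/cron/tick.ts (sketch): count the tick and time it
export async function runTick(tick: () => Promise<void>) {
  ticksTotal.inc();
  const end = tickDuration.startTimer();
  try {
    await tick();
  } finally {
    end();
  }
}
```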
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eventstream was confusingly branded "Events" in the app registry,
colliding with the real events calendar module. Renamed to activity
(DE: Aktivität) since it's a live activity feed across all modules.
cycles -> period (DE: Periode) makes the menstrual-tracking module
self-describing. Tables cycles/cycleDayLogs/cycleSymptoms renamed to
periods/periodDayLogs/periodSymptoms; field cycleId -> periodId;
TimeBlockType 'cycle' -> 'period'; domain event CycleDayLogged ->
PeriodDayLogged. Generic "cycle" usages (billing, lifecycle, breath,
bicycle, import cycles) left untouched.
Constant disambiguation: prior DEFAULT_PERIOD_LENGTH (bleeding days)
renamed to DEFAULT_BLEEDING_DAYS; prior DEFAULT_CYCLE_LENGTH (28d full
cycle) is now DEFAULT_PERIOD_LENGTH.
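Expressed in code (the bleeding-days value is illustrative; only the 28-day
full cycle is stated above):

```ts
// Before the rename (ambiguous): DEFAULT_PERIOD_LENGTH meant bleeding days,
// DEFAULT_CYCLE_LENGTH meant the 28-day full cycle.
export const DEFAULT_BLEEDING_DAYS = 5;    // was DEFAULT_PERIOD_LENGTH (value illustrative)
export const DEFAULT_PERIOD_LENGTH = 28;   // full cycle in days, was DEFAULT_CYCLE_LENGTH
```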
Pre-launch, no data migration needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
blackbox-exporter can't resolve host.docker.internal on Colima, so
probes of host.docker.internal:4000 and :9200 always fail. Instead,
add a /health/pelias endpoint on the Hono wrapper that proxies to
the Pelias API, and update prometheus.yml to probe the wrapper's
proxied health endpoint.
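A hedged sketch of what that wrapper endpoint can look like (env var name
and timeout are assumptions; the /v1/status path matches the Pelias API
status endpoint probed elsewhere in this log):

```ts
// Sketch of the Hono-side health proxy; not the literal mana-geocoding code.
import { Hono } from 'hono';

const app = new Hono();
const PELIAS_API_URL = process.env.PELIAS_API_URL ?? 'http://host.docker.internal:4000';

app.get('/health/pelias', async (c) => {
  try {
    // The wrapper container maps host.docker.internal via extra_hosts, so it can
    // reach the Pelias API even though the blackbox-exporter cannot resolve it.
    const res = await fetch(`${PELIAS_API_URL}/v1/status`, {
      signal: AbortSignal.timeout(3000),
    });
    return c.json({ ok: res.ok }, res.ok ? 200 : 502);
  } catch {
    return c.json({ ok: false }, 502);
  }
});

export default app;
```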
Also simplifies the status page friendly_name() now that we don't
need to display the host.docker.internal targets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Production deployment + observability for the self-hosted geocoding stack:
**docker-compose.macmini.yml**
- New mana-geocoding container (port 3018, internal-only — no traefik
labels, no Cloudflare route). Uses host.docker.internal to reach the
Pelias API on the host's pelias compose stack. Dockerfile added under
services/mana-geocoding/ using the same Bun/Hono pattern as mana-events.
**Prometheus**
- New blackbox-internal job probing mana-geocoding:3018/health, the
Pelias API on host.docker.internal:4000/v1/status, and Elasticsearch
at host.docker.internal:9200/_cluster/health. Kept separate from
blackbox-api which is reserved for public HTTPS endpoints.
**status.mana.how (generate-status-page.sh)**
- Include blackbox-internal in the metric query and add an "Interne
Dienste" section with its own summary card, right between Infrastruktur
and GPU Dienste. Summary grid goes from 4 to 5 columns with a
900px breakpoint.
- friendly_name() now handles http:// URLs and rewrites container-name
hosts like mana-geocoding:3018/health → "Mana Geocoding",
host.docker.internal:4000 → "Pelias API",
host.docker.internal:9200 → "Pelias Elasticsearch".
**Grafana uptime dashboard**
- Add an "Internal" series to the "Alle Dienste — Uptime-Verlauf" panel
- New "Interne Dienste Status" table panel showing per-instance up/down
- New "Geocoding Ø Latenz" stat panel for probe_duration_seconds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blackbox web probes were missing: body, journal, dreams, firsts,
cycles, events, finance, places, who, news, mail. These modules
exist in mana-apps.ts and are deployed but were never added to
prometheus.yml — so they didn't show on status.mana.how.
Also adds mana-geocoding and mana-events to the internal SvelteKit
status page health checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The workbench-registry app id 'inventar' did not match its
@mana/shared-branding MANA_APPS counterpart 'inventory', so the tier-
gating join in apps/web/src/lib/app-registry/registry.ts silently
failed for the inventory module — it fell into the "no MANA_APPS
entry, default visible" fallback and was effectively un-gated. The
codebase had also voted overwhelmingly for 'inventar' (53 files) vs
'inventory' (3 files in shared-branding), so the long-standing
mismatch was just bookkeeping debt waiting to bite.
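Roughly what that silent fallback looked like, simplified (tier names and
the helper are assumptions, not the literal registry.ts code):

```ts
// Simplified illustration of the un-gating; not the actual tier-gating join.
type Tier = 'free' | 'plus' | 'pro';
interface BrandingEntry { minTier: Tier }

function isVisible(brandingById: Record<string, BrandingEntry>, appId: string, userTier: Tier): boolean {
  const branding = brandingById[appId];   // 'inventar' found no 'inventory' entry → undefined
  if (!branding) return true;             // "no MANA_APPS entry, default visible" fallback
  const order: Tier[] = ['free', 'plus', 'pro'];
  return order.indexOf(userTier) >= order.indexOf(branding.minTier);
}
```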
Pre-release, no live data, so the cleanest fix is to align everything
on the English 'inventory':
- Workbench-registry id, module.config.ts appId, module folder, route
folder and i18n locale folder all renamed via git mv
- Standalone apps/inventar/ workspace package renamed
- All imports, store identifiers (InventarEvents → InventoryEvents,
INVENTAR_GUEST_SEED, inventarModuleConfig), i18n keys and href/goto
paths follow the rename
- The German display label "Inventar" is preserved everywhere it is a
user-visible string (page titles, i18n values, toast labels)
- Dexie table prefixes (invCollections, invItems, …) are unchanged
- Drive-by fix: ListView.svelte was querying non-existent
inventarCollections/inventarItems tables — corrected to the actual
invCollections/invItems names from module.config
- The "inventar ↔ inventory id mismatch" workaround comment in
registry.ts is removed since the mismatch no longer exists
module-registry.ts also picks up the user's parallel newsModuleConfig
addition because both edits land in the same import block — keeping
them split would have left the build in an inconsistent state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit bundles two unrelated changes that were swept together by an
accidental `git add -A` in another working session. Documented here so the
history reflects what's actually inside.
═══════════════════════════════════════════════════════════════════════
1. fix(mana-auth): /api/v1/auth/login mints JWT via auth.handler instead
of api.signInEmail
═══════════════════════════════════════════════════════════════════════
Previous attempt (commit 55cc75e7d) tried to fix the broken JWT mint in
/api/v1/auth/login by switching the cookie name from `mana.session_token`
to `__Secure-mana.session_token` for production. That was necessary but
not sufficient: Better Auth's session cookie value isn't just the raw
session token, it's `<token>.<HMAC>` where the HMAC is derived from the
better-auth secret. Reconstructing the cookie from auth.api.signInEmail's
JSON response only gave us the raw token, so /api/auth/token's
get-session middleware still couldn't validate it and the JWT mint kept
silently failing.
Real fix: do the sign-in via auth.handler (the HTTP path) rather than
auth.api.signInEmail (the SDK path). The handler returns a real fetch
Response with a Set-Cookie header containing the fully signed cookie
envelope. We capture that header verbatim and forward it as the cookie
on the /api/auth/token request, which now passes validation and mints
the JWT correctly.
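A hedged sketch of the handler-based flow (paths, helper names, and error
handling here are illustrative, not the exact mana-auth code):

```ts
import { auth } from '../auth'; // Better Auth instance (import path illustrative)

// Sketch: sign in via the HTTP handler to obtain the fully signed
// `<token>.<HMAC>` cookie, then forward it verbatim to the JWT mint endpoint.
export async function loginAndMintJwt(baseURL: string, email: string, password: string, clientIp: string) {
  const signInResponse = await auth.handler(
    new Request(`${baseURL}/api/auth/sign-in/email`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Forwarded-For': clientIp,   // let the rate limiter / security log see the real client IP
      },
      body: JSON.stringify({ email, password }),
    })
  );

  if (signInResponse.status === 403) {
    // email-not-verified path, handled directly off the status code
    throw new Error('EMAIL_NOT_VERIFIED');
  }

  // The Set-Cookie header carries the signed cookie envelope; rebuilding the
  // cookie from the JSON body would lose the HMAC half.
  const cookie = signInResponse.headers.get('set-cookie') ?? '';
  const tokenResponse = await auth.handler(
    new Request(`${baseURL}/api/auth/token`, { headers: { cookie } })
  );
  return (await tokenResponse.json()) as { token: string };
}
```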
Verified end-to-end on auth.mana.how:
$ curl -X POST https://auth.mana.how/api/v1/auth/login \
    -d '{"email":"...","password":"..."}'
{
  "user": {...},
  "token": "<session token>",
  "accessToken": "eyJhbGciOiJFZERTQSI...",   ← real JWT now
  "refreshToken": "<session token>"
}
Side benefits:
- Email-not-verified path is now handled by checking
signInResponse.status === 403 directly, no more catching APIError
with the comment-noted async-stream footgun.
- X-Forwarded-For is forwarded explicitly so Better Auth's rate limiter
and our security log see the real client IP.
- The leftover catch block now only handles unexpected exceptions
(network errors etc); the FORBIDDEN-checking logic in it is dead but
harmless and left in for defense in depth.
═══════════════════════════════════════════════════════════════════════
2. chore: remove the entire self-hosted Matrix stack (Synapse, Element,
Manalink, mana-matrix-bot)
═══════════════════════════════════════════════════════════════════════
The Matrix subsystem ran parallel to the main Mana product without any
load-bearing integration: the unified web app never imported matrix-js-sdk,
the chat module uses mana-sync (local-first), and mana-matrix-bot's
plugins duplicated features the unified app already ships natively.
Keeping it alive cost a Synapse + Element + matrix-web + bot container
quartet, three Cloudflare routes, an OIDC provider plugin in mana-auth,
and a steady drip of devlog/dependency churn.
Removed:
- apps/matrix (Manalink web + mobile, ~150 files)
- services/mana-matrix-bot (Go bot with ~20 plugins)
- docker/matrix configs (Synapse + Element)
- synapse/element-web/matrix-web/mana-matrix-bot services in
docker-compose.macmini.yml
- matrix.mana.how/element.mana.how/link.mana.how Cloudflare tunnel routes
- OIDC provider plugin + matrix-synapse trustedClient + matrixUserLinks
table from mana-auth (oauth_* schema definitions also removed)
- MatrixService import path in mana-media (importFromMatrix endpoint)
- Matrix notification channel in mana-notify (worker, metrics, config,
channel_type enum, MatrixOptions handler)
- Matrix entries from shared-branding (mana-apps + app-icons),
notify-client, the i18n bundle, the observatory map, the credits
app-label list, the landing footer/apps page, the prometheus + alerts
+ promtail tier mappings, and the matrix-related deploy paths in
cd-macmini.yml + ci.yml
Devlog/manascore/blueprint entries that mention Matrix are left intact
as historical record. The oauth_* + matrix_user_links Postgres tables
stay on existing prod databases — code can no longer write to them; drop
them in a follow-up migration if you want them gone for real.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three Mac Mini infrastructure follow-ups bundled:
1. docker-compose.macmini.yml — drop ghost backend env vars from
the mana-app-web service (todo, calendar, contacts, chat, storage,
cards, music, nutriphi `PUBLIC_*_API_URL{,_CLIENT}` plus the memoro
server URLs). The matching consumers were removed in the earlier
ghost-API cleanup commits, so these env entries had been wiring
nothing into the running container for several deploys. Force-
recreating mana-app-web after pulling this commit will pick up
the slimmer env automatically.
2. docker-compose.macmini.yml — bump `mana-mon-blackbox` mem_limit
from 32m to 128m. blackbox-exporter v0.25 sits north of 32m
under load and was OOM-restart-looping every ~90 seconds, which
in turn made `status.mana.how` and the prometheus probe metrics
stale (since the scraper was missing every other window).
3. docker/prometheus/prometheus.yml — split `blackbox-gpu` into two
jobs:
- `blackbox-gpu` now probes `/health` via the http_health
module, because the GPU services (whisper STT, FLUX image
gen, Coqui TTS) return 401/404 on `/` by design (auth or
API-only). The previous http_2xx-on-`/` probe was reporting
all four as down even though they answered `/health` with
200, which inflated the down count on status.mana.how.
- `blackbox-gpu-root` keeps the http_2xx-on-`/` probe for
Ollama, which has no `/health` endpoint but does answer
2xx on its root.
Both jobs share the same blackbox-exporter relabel rewrite so
the targets are routed through the exporter container, not
scraped directly by VictoriaMetrics.
Verified post-fix: status.mana.how reports 41/42 services up (only
`gpu-video` remains down — LTX Video Gen is intentionally not
deployed yet on the Windows GPU box).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All web app subdomains (chat.mana.how, todo.mana.how, etc.) were removed
when the unified app launched, but monitoring configs still referenced them.
Update blackbox targets to use mana.how/route URLs, remove stale API backend
routes from cloudflared, clean up CORS origins, and fix status page generator
to handle route-based URLs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New GPU service for fast text-to-video generation using LTX-Video (~2B params)
on the RTX 3090. Generates 480p clips in 10-30 seconds, uses ~10GB VRAM.
Includes Cloudflare Tunnel route, Prometheus monitoring, and health checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract feedback, analytics, and AI modules from mana-core-auth into
standalone mana-analytics service (Hono + Bun, Port 3064).
New service (services/mana-analytics/):
- User feedback CRUD with voting
- AI-powered feedback title generation via mana-llm
- Simplified from DuckDB analytics to pure PostgreSQL
- ~550 LOC
Removed from mana-core-auth:
- feedback/ module (6 files)
- analytics/ module (4 files)
- ai/ module (3 files)
- db/schema/feedback.schema.ts
mana-core-auth now contains ONLY pure auth:
- Better Auth (JWT, Sessions, 2FA, Passkeys, OIDC, Magic Links)
- Organizations/Guilds (membership management)
- API Keys, Security, Me (GDPR), Health, Metrics
- Ready for Phase 5: Hono rewrite
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Photos NestJS backend was a proxy to mana-media that enriched
responses with local album/favorite/tag data. Now:
- Albums store → local-first via albumCollection + albumItemCollection
- Favorites → local-first via favoriteCollection (toggle in IndexedDB)
- Photo tags → local-first via photoTagCollection
- Photo listing/stats → direct mana-media API calls from frontend
- Upload → direct mana-media upload from frontend
- Delete → direct mana-media delete from frontend
Removed 27 TypeScript files, 1 Docker container, 1 port (3039).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Presi NestJS backend (40 source files, 50 deps) was a CRUD wrapper
around decks, slides, and themes — all now handled by local-first sync.
Only the share-link feature requires server-side state (public URLs
without auth), so a minimal Hono + Bun server replaces the entire
NestJS backend:
- apps/presi/apps/server/ — Hono server with share routes + GDPR admin
Uses @manacore/shared-hono for auth (JWKS), health, admin, errors
- Web app API client stripped to share-only (was 270 lines → 90 lines)
- Removed from docker-compose, CI/CD, Prometheus, env generation
- NestJS backend deleted (40 TS files, 8 test specs, 3038 lines)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both apps are fully local-first via Dexie.js + mana-sync. Their NestJS
backends were pure CRUD wrappers (20 + 31 source files) that are no
longer needed.
Changes:
- Add packages/shared-hono: JWT auth via JWKS (jose), Drizzle DB factory,
health route, generic GDPR admin handler, error middleware (JWKS auth sketch below)
- Migrate zitare lists page from fetch() to listsStore (local-first)
- Rewrite clock timers store from API-based to timerCollection (Dexie)
- Update clock +layout.svelte CommandBar search to use local collections
- Remove zitare-backend + clock-backend from docker-compose, CI/CD,
Prometheus, env generation, setup scripts
- Add docs/TECHNOLOGY_AUDIT_2026_03.md with full repo analysis
Net result: -2 Docker containers, -2 ports, -2728 lines of code
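For reference, a minimal sketch of what the JWKS-based auth middleware in
shared-hono might look like (env var name and JWKS path are assumptions):

```ts
// Hedged sketch; not the literal @manacore/shared-hono code.
import { createRemoteJWKSet, jwtVerify } from 'jose';
import type { MiddlewareHandler } from 'hono';

type AuthEnv = { Variables: { userId: string } };

// mana-auth publishes its signing keys as a JWKS; the URL here is an assumption.
const jwks = createRemoteJWKSet(new URL(`${process.env.AUTH_URL}/api/auth/jwks`));

export const requireAuth: MiddlewareHandler<AuthEnv> = async (c, next) => {
  const token = c.req.header('authorization')?.replace(/^Bearer /i, '');
  if (!token) return c.json({ error: 'missing token' }, 401);
  try {
    const { payload } = await jwtVerify(token, jwks);
    c.set('userId', payload.sub ?? '');   // verified subject for downstream handlers
    await next();
  } catch {
    return c.json({ error: 'invalid token' }, 401);
  }
};
```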
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Cloudflare Tunnel: api.mana.how → localhost:3060 (Go API Gateway)
- Prometheus: scrape targets for mana-api-gateway:3060 and mana-matrix-bot:4000
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire mana-llm service into the monitoring stack:
Prometheus (docker/prometheus/prometheus.yml):
- Add mana-llm scrape job (port 3025, 15s interval)
- Include mana-llm in ServiceDown alert expression
Alerts (docker/prometheus/alerts.yml):
- New llm_alerts group with 4 rules:
- LLMServiceDown: mana-llm down > 1 min (critical)
- LLMHighErrorRate: > 10% errors for 5 min (warning)
- OllamaProviderDown: > 50% requests via Google fallback (warning)
- LLMSlowResponses: p95 > 30s for 5 min (warning)
Grafana Dashboard (docker/grafana/dashboards/mana-llm.json):
- 6 stat panels: status, req/min, error rate, fallback rate, latency, tokens/min
- Requests by Provider (stacked area: Ollama vs Google vs OpenRouter)
- Tokens by Type (prompt vs completion)
- Latency Percentiles (p50, p90, p99)
- Latency by Provider comparison
- Requests by Model breakdown
- Errors by Type
- Google Fallback Rate over time (with threshold coloring)
- Provider Distribution pie chart (24h)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add MetricsModule to 8 backends missing it (photos, zitare, mukke,
planta, picture, storage, presi, nutriphi)
- Enable Prometheus scraping for all 15 backends in prometheus.yml
(was only 6, with 3 commented out and 6 missing entirely)
- Update ServiceDown alert rule to cover all 15 backends
- Update Grafana dashboards (backends, master-overview, system-overview)
with all backend services in health panels
- Fix imprecise regex in application-details dashboard
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instrument the CD pipeline to record per-deploy and per-service metrics
(build time, image size, startup time, health status) into PostgreSQL and
push gauges to Pushgateway. Adds a Grafana dashboard with 13 panels covering
deploy frequency, build performance, service health, and history.
New files:
- scripts/mac-mini/init-deploy-tracking.sql (idempotent DDL)
- scripts/deploy-metrics.sh (bash library for CI)
- docker/grafana/provisioning/datasources/deploy-tracking.yml
- docker/grafana/dashboards/deploy-tracking.json
Modified:
- docker/prometheus/prometheus.yml (pushgateway scrape job)
- .github/workflows/cd-macmini.yml (build/health instrumentation)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Medium priority stability improvements:
Alerting:
- Add vmalert for evaluating Prometheus alert rules
- Add alertmanager for alert routing and grouping
- Add alert-notifier service for Telegram/ntfy notifications
- Enable cadvisor scraping in prometheus config
Disk Monitoring:
- Add check-disk-space.sh for hourly disk monitoring
- Alert on 80% (warning) and 90% (critical) thresholds
- Auto-cleanup Docker when disk is critical
- Add com.manacore.disk-check.plist for launchd
Weekly Reports:
- Add weekly-report.sh for system health summary
- Includes: backup status, disk usage, container health,
database stats, error log summary
- Runs every Sunday at 10 AM via launchd
Health Check Updates:
- Add checks for vmalert, alertmanager, alert-notifier
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add node-exporter service to docker-compose for CPU/Memory/Disk monitoring
- Enable node-exporter scrape target in Prometheus config
- Update System Overview dashboard with Host System section:
- CPU, Memory, Disk usage gauges
- Total RAM, Total Disk, Uptime, Load stats
- CPU & Memory over time graph
- Network I/O graph
- Add Node Exporter to service status panel
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Commented out:
- node-exporter (container not deployed)
- cadvisor (container not deployed)
- storage/presi/nutriphi-backend (no /metrics endpoint yet)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reverting 618c58c5 which broke the CI workflow.
Will re-add notifications after fixing the issue.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add notify-start job with Telegram notification for build start
- Add notify-complete job with build status and duration notification
- Push CI metrics to Prometheus Pushgateway for Grafana visualization
- Create CI/CD Grafana dashboard with build status, duration, and history
- Add Pushgateway scrape config to Prometheus
Requires TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID, and PUSHGATEWAY_URL secrets.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New dashboards:
- Application Details: Node.js runtime (heap, event loop, GC),
HTTP details (status codes, methods, top routes), error analysis
- Database Details: PostgreSQL and Redis metrics with detailed breakdowns
Alerting rules (docker/prometheus/alerts.yml):
- Service: down, high/very high error rate, slow response time
- Infrastructure: high CPU/memory/disk usage
- Database: PostgreSQL/Redis down, high connections, low cache hit
- Container: high CPU/memory, restarts
All dashboards include service selector variable for filtering.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>