Arcade lives as its own pnpm workspace at ~/Documents/Code/arcade
now, with no @mana/* coupling. This drops every reference and the
games/ directory from the monorepo.
Removes:
- games/ directory (89 files: web + server + 22 HTML games + screenshots)
- @arcade/web, @arcade/server pnpm workspace entries (games/* globs)
- arcade scripts in root package.json (4 scripts)
- arcade.mana.how from mana-auth trusted origins + CORS_ORIGINS
- arcade entries in mana-apps registry, app-icons, URL overrides
- arcade.mana.how from cloudflared tunnel + prometheus blackbox probes
- arcade-web service block in docker-compose.macmini.yml
- generate-env.mjs entries for arcade server + web
- BRANDING_ONLY 'arcade' entry in registry consistency spec
- dead arcade translation keys in GuestWelcomeModal (DE+EN)
- arcade mention in CLAUDE.md, authentication guideline, MODULE_REGISTRY
Verified:
- services/mana-auth/src/auth/sso-config.spec.ts: 8/8 pass
- pnpm install regenerates lockfile cleanly (-536 lines)
- no remaining 'arcade' refs outside historical snapshot docs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pelias was retired from the Mac mini on 2026-04-28; photon-self
(self-hosted Photon on mana-gpu) has been the live primary since then.
This removes the now-dead Pelias adapter, config, tests, and the
services/mana-geocoding/pelias/ stack — the entire compose file, the
geojsonify_place_details.js patch, the setup.sh import script.
Provider chain is now `photon-self → photon → nominatim`. The chain
keeps its `privacy: 'local' | 'public'` split, sensitive-query
blocking, coord quantization, and aggressive caching unchanged.
Three direct calls to nominatim.openstreetmap.org that bypassed
mana-geocoding now route through the wrapper:
- citycorners/add-city + citycorners/cities/[slug]/add use the shared
searchAddress() client (browser → same-origin proxy → mana-geocoding
→ photon-self).
- memoro mobile drops its OSM reverse-geocoding fallback entirely;
Expo's on-device reverse-geocoding stays as the sole path. Routing
through the wrapper would require a memoro-server proxy endpoint —
a follow-up if Expo's quality proves insufficient.
Other behavioral changes:
- CACHE_PUBLIC_TTL_MS dropped from 7d → 1h. The long TTL was a
privacy-amplification trick from the Pelias era; with photon-self
serving the bulk of traffic, a transient cross-LAN blip was pinning
cached fallback answers for days. 1h gives quick recovery.
- /health/pelias renamed to /health/photon-self; prometheus blackbox
config + status-page generator updated.
- mana-geocoding container no longer needs `extra_hosts:
host.docker.internal:host-gateway` (was only there for the
Pelias-on-host-network era).
113 tests passing. CLAUDE.md rewritten to reflect the post-Pelias
architecture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mana-web depends on @mana/shared-privacy (visibility-system) but it was
missing from the base-image COPY list, causing pnpm install to fail
inside the build container with ERR_PNPM_WORKSPACE_PKG_NOT_FOUND.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One focused dashboard covering the M1+M2 instrumentation in a single
view. Sections top-to-bottom:
1. Service Health — mana-mcp + mana-ai up/down, 1h deny rate,
compactions/h. The deny rate is the single most important
number during POLICY_MODE=log-only soak: a non-zero
deny/min in log-only means real traffic that enforce mode
would reject.
2. Policy Gate (mana-mcp)
- Decisions / sec by outcome (allow/deny/flagged)
- Deny reasons breakdown — the soak signal for flipping to
enforce. If one reason dominates, address it before the flip.
- Tool invocations / sec by outcome (success / handler-error /
input-invalid)
- Top 10 invoked tools (24h) — usage heatmap for prioritising
which tools deserve the best policy-hint tuning.
- Handler p50/p95/p99 latency per tool.
3. Reminder Channel (mana-ai)
- Rate by producer (token-budget, retry-loop, compacted)
- Rate by severity. The interesting signal is whether
warn/escalate trend DOWN over time — it means the LLM is
actually reacting to the hints. If warn stays flat, the
producer wording probably isn't landing.
4. Context Compactor (mana-ai)
- Triggers/h cumulative
- Turns folded per compaction (p50/p95). Values < 3 flag
MANA_AI_COMPACT_MAX_CTX misconfig — the threshold is firing
on already-short histories.
5. Mission Runner Baseline — tick duration + planner rounds for
correlation (e.g. "did enabling the compactor change mean
tick duration?").
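For reference, the section-1 soak signal reduces to a one-line PromQL sketch (assuming the policy gate's policy_decisions_total counter and an outcome label matching the allow/deny/flagged breakdown above):

```promql
# 1h deny rate during the POLICY_MODE=log-only soak; non-zero here
# means real traffic that enforce mode would reject.
sum(rate(policy_decisions_total{outcome="deny"}[1h]))
```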
Dashboard provisioning already auto-loads anything in /var/lib/grafana/
dashboards (docker/grafana/provisioning/dashboards/default.yml), so
this goes live after the next Grafana restart. Dashboard UID: agent-loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with c94ab01c6 which added the real /metrics endpoint. Without a
scrape job the policy_decisions_total counter has nowhere to go and
the soak period is flying blind.
30s interval to match mana-ai. Same job shape as mana-ai — any Grafana
dashboard that auto-discovers services via labels will pick this up.
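The added job is roughly this shape (sketch only — the mana-mcp port is a placeholder, not from this commit; the path and 30s interval are):

```yaml
# docker/prometheus/prometheus.yml — sketch, not the literal diff
scrape_configs:
  - job_name: mana-mcp
    scrape_interval: 30s          # matches the mana-ai job
    metrics_path: /metrics
    static_configs:
      - targets: ['mana-mcp:PORT']    # PORT is a placeholder
```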
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
packages/subscriptions was deleted in the credits-merge cleanup; the
COPY line in the base Dockerfile broke every subsequent --no-cache
build. Also adds packages/shared-ai which was missing (webapp depends
on it since the Multi-Agent Workbench rollout).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 4 — everything needed to flip the Mission Key-Grant feature on
safely per deployment. No new behaviour; purely operational plumbing.
- PUBLIC_AI_MISSION_GRANTS feature flag (default off). hooks.server.ts
injects window.__PUBLIC_AI_MISSION_GRANTS__, api/config.ts exposes
isMissionGrantsEnabled(). Grant UI (dialog + status box) and the
Workbench "Datenzugriff" tab both hide when the flag is off.
- PUBLIC_MANA_AI_URL added to the injection set so the webapp can reach
the new audit endpoint from production.
- Prometheus alerts (new mana_ai_alerts group):
- ManaAIServiceDown (warning, 2m)
- ManaAIGrantScopeViolation (critical, 0m) — MUST stay at 0; any
increment pages immediately
- ManaAIGrantSkipsHigh (warning, 15m) — flags keypair drift
- ManaAIPlannerParseFailures (warning, 10m) — prompt/LLM drift
- Runbook in docs/plans/ai-mission-key-grant.md: initial keypair gen,
leak-response procedure (rotate + invalidate all grants + audit),
scope-violation triage.
- User-facing doc in apps/docs security.mdx: new "AI Mission Grants"
section with the three hard constraints (ZK users blocked, scope
changes invalidate cryptographically, revocation is one click) plus
an honest threat-model comparison column showing where grants shift
the tradeoff.
Rollout remaining (not code): generate keypair on Mac Mini, provision
MANA_AI_PRIVATE_KEY_PEM + MANA_AI_PUBLIC_KEY_PEM via Docker secrets,
flip PUBLIC_AI_MISSION_GRANTS=true starting with till-only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wires mana-ai into the existing observability stack so tick throughput,
plan-failure rates, planner latencies, and snapshot refresh health are
visible in Grafana + Prometheus, and the service's uptime surfaces on
status.mana.how under the "Internal" section.
- `src/metrics.ts` — prom-client Registry with `mana_ai_` prefix.
Counters: ticks_total, plans_produced_total, plans_written_back_total,
parse_failures_total, mission_errors_total, snapshots_new/updated,
snapshot_rows_applied_total, http_requests_total.
Histograms: tick_duration_seconds (0.1–120s),
planner_request_duration_seconds (0.25–60s),
http_request_duration_seconds (0.005–10s).
- `src/index.ts` — HTTP middleware labels every request by
method/path/status; `/metrics` serves the Prometheus text format.
- `src/cron/tick.ts` — increments counters + wraps the tick with
`tickDuration.startTimer()`. Snapshot stats fold through.
- `src/planner/client.ts` — wraps `complete()` in a latency histogram
timer so planner tail latency shows up separately from tick duration.
- `docker/prometheus/prometheus.yml` —
1. New `mana-ai` scrape job against `mana-ai:3066/metrics` (30s).
2. `/health` added to the `blackbox-internal` job so uptime shows on
status.mana.how alongside mana-geocoding.
- `scripts/generate-status-page.sh` — friendly label for the new probe:
`mana-ai:3066/health` → "Mana AI Runner" (generator already iterates
`blackbox-internal`, no other changes needed).
- `package.json` — prom-client ^15.1.3
All 17 Bun tests still pass; tsc clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eventstream was confusingly branded "Events" in the app registry,
colliding with the real events calendar module. Renamed to activity
(DE: Aktivität) since it's a live activity feed across all modules.
cycles -> period (DE: Periode) makes the menstrual-tracking module
self-describing. Tables cycles/cycleDayLogs/cycleSymptoms renamed to
periods/periodDayLogs/periodSymptoms; field cycleId -> periodId;
TimeBlockType 'cycle' -> 'period'; domain event CycleDayLogged ->
PeriodDayLogged. Generic "cycle" usages (billing, lifecycle, breath,
bicycle, import cycles) left untouched.
Constant disambiguation: prior DEFAULT_PERIOD_LENGTH (bleeding days)
renamed to DEFAULT_BLEEDING_DAYS; prior DEFAULT_CYCLE_LENGTH (28d full
cycle) is now DEFAULT_PERIOD_LENGTH.
Pre-launch, no data migration needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backend: Hono/Bun service on port 3042 with JMAP client for Stalwart,
account provisioning (@mana.how addresses on user registration),
thread/message/send/label API endpoints, and JWT + service-key auth.
Frontend: Mail module with 3-column inbox UI (mailboxes, thread list,
detail/compose), local-first encrypted drafts in Dexie, and API-driven
thread fetching. Scoped CSS with theme tokens.
Integration: Dexie v11 schema, mail pgSchema in mana_platform,
mana-auth fire-and-forget hook for account provisioning,
getManaMailUrl() in API config, app registry + branding update.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
blackbox-exporter can't resolve host.docker.internal on Colima, so
probes of host.docker.internal:4000 and :9200 always fail. Instead,
add a /health/pelias endpoint on the Hono wrapper that proxies to
the Pelias API, and update prometheus.yml to probe the wrapper's
proxied health endpoint.
Also simplifies the status page friendly_name() now that we don't
need to display the host.docker.internal targets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Production deployment + observability for the self-hosted geocoding stack:
**docker-compose.macmini.yml**
- New mana-geocoding container (port 3018, internal-only — no traefik
labels, no Cloudflare route). Uses host.docker.internal to reach the
Pelias API on the host's pelias compose stack. Dockerfile added under
services/mana-geocoding/ using the same Bun/Hono pattern as mana-events.
**Prometheus**
- New blackbox-internal job probing mana-geocoding:3018/health, the
Pelias API on host.docker.internal:4000/v1/status, and Elasticsearch
at host.docker.internal:9200/_cluster/health. Kept separate from
blackbox-api which is reserved for public HTTPS endpoints.
**status.mana.how (generate-status-page.sh)**
- Include blackbox-internal in the metric query and add an "Interne
Dienste" section with its own summary card, right between Infrastruktur
and GPU Dienste. Summary grid goes from 4 to 5 columns with a
900px breakpoint.
- friendly_name() now handles http:// URLs and rewrites container-name
hosts like mana-geocoding:3018/health → "Mana Geocoding",
host.docker.internal:4000 → "Pelias API",
host.docker.internal:9200 → "Pelias Elasticsearch".
**Grafana uptime dashboard**
- Add an "Internal" series to the "Alle Dienste — Uptime-Verlauf" panel
- New "Interne Dienste Status" table panel showing per-instance up/down
- New "Geocoding Ø Latenz" stat panel for probe_duration_seconds
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blackbox web probes were missing: body, journal, dreams, firsts,
cycles, events, finance, places, who, news, mail. These modules
exist in mana-apps.ts and are deployed but were never added to
prometheus.yml — so they didn't show on status.mana.how.
Also adds mana-geocoding and mana-events to the internal SvelteKit
status page health checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The workbench-registry app id 'inventar' did not match its
@mana/shared-branding MANA_APPS counterpart 'inventory', so the tier-
gating join in apps/web/src/lib/app-registry/registry.ts silently
failed for the inventory module — it fell into the "no MANA_APPS
entry, default visible" fallback and was effectively un-gated. The
codebase had also voted overwhelmingly for 'inventar' (53 files) vs
'inventory' (3 files in shared-branding), so the long-standing
mismatch was just bookkeeping debt waiting to bite.
Pre-release, no live data, so the cleanest fix is to align everything
on the English 'inventory':
- Workbench-registry id, module.config.ts appId, module folder, route
folder and i18n locale folder all renamed via git mv
- Standalone apps/inventar/ workspace package renamed
- All imports, store identifiers (InventarEvents → InventoryEvents,
INVENTAR_GUEST_SEED, inventarModuleConfig), i18n keys and href/goto
paths follow the rename
- The German display label "Inventar" is preserved everywhere it is a
user-visible string (page titles, i18n values, toast labels)
- Dexie table prefixes (invCollections, invItems, …) are unchanged
- Drive-by fix: ListView.svelte was querying non-existent
inventarCollections/inventarItems tables — corrected to the actual
invCollections/invItems names from module.config
- The "inventar ↔ inventory id mismatch" workaround comment in
registry.ts is removed since the mismatch no longer exists
module-registry.ts also picks up the user's parallel newsModuleConfig
addition because both edits land in the same import block — keeping
them split would have left the build in an inconsistent state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fourth stale package COPY in three days. The pattern is unfortunately
predictable: package gets removed in a parallel cleanup commit, the
Dockerfile.sveltekit-base entry stays behind, nobody notices because
nobody runs the base build manually anymore. Then is_base_image_stale
fires the next time something in packages/ changes and the build
falls over.
Long-term: add a pre-flight check to build-app.sh that validates
every COPY-referenced path actually exists before kicking off Docker.
Failing fast is much friendlier than failing 30 seconds into a Docker
layer.
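A minimal sketch of that pre-flight, demonstrated against a throwaway Dockerfile so the logic runs end to end (the file and package names below are fabricated; in build-app.sh the same loop would read the real Dockerfile.sveltekit-base and exit non-zero on any MISSING line):

```shell
# Demo: flag COPY-referenced package paths that no longer exist on disk.
tmp=$(mktemp -d)
mkdir -p "$tmp/packages/exists"
printf 'COPY packages/exists packages/exists\nCOPY packages/gone packages/gone\n' \
  > "$tmp/Dockerfile.sveltekit-base"

result=$(cd "$tmp" && awk '/^COPY packages\//{print $2}' Dockerfile.sveltekit-base |
  while read -r src; do [ -d "$src" ] || echo "MISSING: $src"; done)
echo "$result"
# → MISSING: packages/gone
```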
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three more removed packages had stale COPY entries in the base
Dockerfile, blocking the build the moment is_base_image_stale tried
to rebuild the image:
- packages/credit-operations (deleted in NestJS→Hono migration)
- packages/shared-api-client (same)
- packages/shared-splitscreen (separate cleanup)
Same shape as the shared-subscription-types/-ui removal earlier
today (commit a9178ec2f). The deletions go in cleanup commits and
the Dockerfile lines stay behind because nobody runs --base
manually anymore — until is_base_image_stale picks up a packages/
change and tries to rebuild, at which point COPY of a non-existent
path bricks the build.
Removed both the COPY lines AND the corresponding `cd /app/packages/
{credit-operations,shared-api-client} && pnpm build` lines from the
post-install build chain so they can't accidentally re-introduce
the references.
Verified by `grep '^COPY packages/' Dockerfile.sveltekit-base |
awk '{print $2}' | while read -r pkg; do [ ! -d "$pkg" ] &&
echo "MISSING: $pkg"; done` returning empty.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adding an app to a workbench scene threw DataCloneError. scenesState
is a $state array, so current.openApps was a Svelte 5 proxy and
spreading it into a new array left proxy entries inside; IndexedDB's
structured clone refuses to serialise those. Snapshot before handing
the array to patchScene / createScene so Dexie sees plain objects.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The base image referenced packages/shared-subscription-types and
packages/shared-subscription-ui, which were consolidated into
packages/subscriptions a while back and no longer exist on disk.
`build-app.sh --base` therefore failed every time with:
failed to compute cache key: "/packages/shared-subscription-ui": not found
That latent failure was harmless until today: the CSP fix for WebLLM
in @mana/shared-utils never made it into the live mana-web container
because shared-utils lives inside sveltekit-base:local (not COPYed by
the per-app Dockerfile), and rebuilding the base was impossible. With
the stale lines removed the base image rebuilds, picks up the current
shared-utils, and downstream apps inherit the fixed CSP automatically.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A grep audit after the previous matrix removal commits found a handful
of stragglers in non-runtime files that the earlier sweeps missed:
- services/mana-llm/CLAUDE.md: removed matrix-ollama-bot from the
consumer-apps diagram and from the related-services table
- services/mana-video-gen/CLAUDE.md: removed "Matrix Bots" integration
bullet
- packages/notify-client/README.md: removed sendMatrix() doc entry
(the method itself was already gone in the prior cleanup)
- docker/grafana/dashboards/logs-explorer.json: dropped the "Matrix
Stack" log row that queried tier="matrix" (would show no data forever)
- docker/grafana/dashboards/master-overview.json: dropped the "Matrix
Bots" stat panel that counted up{job=~"matrix-.*-bot"}
- apps/mana/apps/landing/src/data/ecosystem-health.json: regenerated via
scripts/ecosystem-audit.mjs to drop matrix from the app list, icon
counts, file analytics, top offenders and authGuard missing list
- .gitignore: removed services/matrix-stt-bot/data/ pattern (the
service itself was deleted long ago)
Production-side stragglers also addressed (not in this commit):
- DROP USER synapse on prod Postgres (the parallel cleanup commit
2514831a3 dropped DATABASE matrix + DATABASE synapse but left the
role behind)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The matrix subsystem was removed in a prior commit. This commit cleans
up the small leftovers that grep found:
- docker-compose.macmini.yml: dropped the "Matrix Stack" port-range
comment, the "matrix" category from the naming convention, and a
stale watchtower comment about Matrix notifications.
- packages/credits/src/operations.ts: removed AI_BOT_CHAT credit
operation type and its definition. It was the billing entry for "Chat
with AI via Matrix bot" — no callers left.
- services/mana-credits gifts schema + service + validation: removed the
targetMatrixId column / param / Zod field. The corresponding
PostgreSQL column was dropped manually with
`ALTER TABLE gifts.gift_codes DROP COLUMN target_matrix_id` on prod.
- docker/grafana/dashboards/{master,system}-overview.json: removed the
`up{job="synapse"}` panel queries — they would have shown No Data
forever now that Synapse is gone.
Production-side cleanup performed in parallel (not in this commit):
- Stopped + removed mana-matrix-{synapse,element,web,bot} containers
- Removed mana-matrix-bot:local, matrix-web:latest,
matrixdotorg/synapse:latest, vectorim/element-web:latest images (~3 GB)
- Removed mana-matrix-bots-data Docker volume
- Removed /Volumes/ManaData/matrix/ media store (4.3 MB)
- DROP DATABASE matrix; DROP DATABASE synapse; on Postgres
Cosmetic leftovers intentionally untouched:
- Eisenhower matrix in todo (LayoutMode 'matrix') — productivity concept
- ${{ matrix.service }} in .github/workflows — GitHub Actions strategy
- services/mana-media/apps/api/dist/.../matrix/* — stale build output
(not in git, regenerated next mana-media build)
This commit bundles two unrelated changes that were swept together by an
accidental `git add -A` in another working session. Documented here so the
history reflects what's actually inside.
═══════════════════════════════════════════════════════════════════════
1. fix(mana-auth): /api/v1/auth/login mints JWT via auth.handler instead
of api.signInEmail
═══════════════════════════════════════════════════════════════════════
Previous attempt (commit 55cc75e7d) tried to fix the broken JWT mint in
/api/v1/auth/login by switching the cookie name from `mana.session_token`
to `__Secure-mana.session_token` for production. That was necessary but
not sufficient: Better Auth's session cookie value isn't just the raw
session token, it's `<token>.<HMAC>` where the HMAC is derived from the
better-auth secret. Reconstructing the cookie from auth.api.signInEmail's
JSON response only gave us the raw token, so /api/auth/token's
get-session middleware still couldn't validate it and the JWT mint kept
silently failing.
Real fix: do the sign-in via auth.handler (the HTTP path) rather than
auth.api.signInEmail (the SDK path). The handler returns a real fetch
Response with a Set-Cookie header containing the fully signed cookie
envelope. We capture that header verbatim and forward it as the cookie
on the /api/auth/token request, which now passes validation and mints
the JWT correctly.
Verified end-to-end on auth.mana.how:
$ curl -X POST https://auth.mana.how/api/v1/auth/login \
-d '{"email":"...","password":"..."}'
{
"user": {...},
"token": "<session token>",
"accessToken": "eyJhbGciOiJFZERTQSI...", ← real JWT now
"refreshToken": "<session token>"
}
Side benefits:
- Email-not-verified path is now handled by checking
signInResponse.status === 403 directly, no more catching APIError
with the comment-noted async-stream footgun.
- X-Forwarded-For is forwarded explicitly so Better Auth's rate limiter
and our security log see the real client IP.
- The leftover catch block now only handles unexpected exceptions
(network errors etc); the FORBIDDEN-checking logic in it is dead but
harmless and left in for defense in depth.
═══════════════════════════════════════════════════════════════════════
2. chore: remove the entire self-hosted Matrix stack (Synapse, Element,
Manalink, mana-matrix-bot)
═══════════════════════════════════════════════════════════════════════
The Matrix subsystem ran parallel to the main Mana product without any
load-bearing integration: the unified web app never imported matrix-js-sdk,
the chat module uses mana-sync (local-first), and mana-matrix-bot's
plugins duplicated features the unified app already ships natively.
Keeping it alive cost a Synapse + Element + matrix-web + bot container
quartet, three Cloudflare routes, an OIDC provider plugin in mana-auth,
and a steady drip of devlog/dependency churn.
Removed:
- apps/matrix (Manalink web + mobile, ~150 files)
- services/mana-matrix-bot (Go bot with ~20 plugins)
- docker/matrix configs (Synapse + Element)
- synapse/element-web/matrix-web/mana-matrix-bot services in
docker-compose.macmini.yml
- matrix.mana.how/element.mana.how/link.mana.how Cloudflare tunnel routes
- OIDC provider plugin + matrix-synapse trustedClient + matrixUserLinks
table from mana-auth (oauth_* schema definitions also removed)
- MatrixService import path in mana-media (importFromMatrix endpoint)
- Matrix notification channel in mana-notify (worker, metrics, config,
channel_type enum, MatrixOptions handler)
- Matrix entries from shared-branding (mana-apps + app-icons),
notify-client, the i18n bundle, the observatory map, the credits
app-label list, the landing footer/apps page, the prometheus + alerts
+ promtail tier mappings, and the matrix-related deploy paths in
cd-macmini.yml + ci.yml
Devlog/manascore/blueprint entries that mention Matrix are left intact
as historical record. The oauth_* + matrix_user_links Postgres tables
stay on existing prod databases — code can no longer write to them, drop
them in a follow-up migration if you want them gone for real.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three Mac Mini infrastructure follow-ups bundled:
1. docker-compose.macmini.yml — drop ghost backend env vars from
the mana-app-web service (todo, calendar, contacts, chat, storage,
cards, music, nutriphi `PUBLIC_*_API_URL{,_CLIENT}` plus the memoro
server URLs). The matching consumers were removed in the earlier
ghost-API cleanup commits, so these env entries had been wiring
nothing into the running container for several deploys. Force-
recreating mana-app-web after pulling this commit will pick up
the slimmer env automatically.
2. docker-compose.macmini.yml — bump `mana-mon-blackbox` mem_limit
from 32m to 128m. blackbox-exporter v0.25 sits north of 32m
under load and was OOM-restart-looping every ~90 seconds, which
in turn made `status.mana.how` and the prometheus probe metrics
stale (since the scraper was missing every other window).
3. docker/prometheus/prometheus.yml — split `blackbox-gpu` into two
jobs:
- `blackbox-gpu` now probes `/health` via the http_health
module, because the GPU services (whisper STT, FLUX image
gen, Coqui TTS) return 401/404 on `/` by design (auth or
API-only). The previous http_2xx-on-`/` probe was reporting
all four as down even though they answered `/health` with
200, which inflated the down count on status.mana.how.
- `blackbox-gpu-root` keeps the http_2xx-on-`/` probe for
Ollama, which has no `/health` endpoint but does answer
2xx on its root.
Both jobs share the same blackbox-exporter relabel rewrite so
the targets are routed through the exporter container, not
scraped directly by VictoriaMetrics.
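The split looks roughly like this in prometheus.yml (sketch — GPU hostnames/ports are placeholders; the job names and module split are from this commit, and the exporter port assumes blackbox-exporter's default 9115):

```yaml
- job_name: blackbox-gpu
  metrics_path: /probe
  params:
    module: [http_health]            # probe /health; GPU APIs 401/404 on /
  static_configs:
    - targets:
        - http://GPU_HOST:PORT/health    # whisper STT / FLUX / Coqui TTS
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: mana-mon-blackbox:9115   # route through the exporter

- job_name: blackbox-gpu-root
  metrics_path: /probe
  params:
    module: [http_2xx]               # Ollama: no /health, but 2xx on /
  static_configs:
    - targets:
        - http://GPU_HOST:PORT/          # placeholder
  # same relabel_configs as blackbox-gpu so probes route via the exporter
```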
Verified post-fix: status.mana.how reports 41/42 services up (only
`gpu-video` remains down — LTX Video Gen is intentionally not
deployed yet on the Windows GPU box).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All web app subdomains (chat.mana.how, todo.mana.how, etc.) were removed
when the unified app launched, but monitoring configs still referenced them.
Update blackbox targets to use mana.how/route URLs, remove stale API backend
routes from cloudflared, clean up CORS origins, and fix status page generator
to handle route-based URLs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
shared-hono depends on @manacore/shared-logger but it was missing from
the base image COPY list, causing pnpm install to fail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Upgrade shared-logger to dual-mode: JSON lines in production, console
in dev. Adds configureLogger() for service name + request ID.
- Add requestLogger middleware to shared-hono with request ID generation
and structured request/response logging.
- Align Promtail config with new JSON field names (requestId, ts, service).
- Add PUBLIC_GLITCHTIP_DSN + PUBLIC_UMAMI_WEBSITE_ID to mana-web docker config.
- Add /status page that polls all backend /health endpoints server-side.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the frontend unification (single IndexedDB) on the backend.
All services now use pgSchema() for isolation within one shared database,
enabling cross-schema JOINs, simplified ops, and zero DB setup for new apps.
- Migrate 7 services from pgTable() to pgSchema(): mana-user (usr),
mana-media (media), todo, traces, presi, uload, cards
- Update all DATABASE_URLs in .env.development, docker-compose, configs
- Rewrite init-db scripts for 2 databases + 12 schemas
- Rewrite setup-databases.sh for consolidated architecture
- Update shared-drizzle-config default to mana_platform
- Update CLAUDE.md with new database architecture docs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add @manacore/local-llm to both the sveltekit-base and manacore web
Dockerfiles so pnpm can resolve the workspace dependency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New GPU service for fast text-to-video generation using LTX-Video (~2B params)
on the RTX 3090. Generates 480p clips in 10-30 seconds, uses ~10GB VRAM.
Includes Cloudflare Tunnel route, Prometheus monitoring, and health checks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- generate-status-page.sh now also writes status.json alongside index.html
Format: { updated, summary: {up, total}, services: { appName: bool } }
- nginx status.mana.how serves status.json with CORS headers (public read)
and explicit location block to avoid rewrite to index.html
- ManaScore index page fetches status.json client-side on load and
injects green ● LIVE / red ● DOWN badge next to each app's status chip
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pnpm skips workspace linking when glob patterns like apps/*/apps/* from
pnpm-workspace.yaml match no directories. This caused @manacore/feedback
and other packages to be copied but not linked in node_modules. Fix adds
a post-install step that creates symlinks for all packages/* entries.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- check-disk-space.sh now pushes mac_disk_used_percent + mac_colima_disk_used_gb
to Pushgateway every hour so vmalert can alert on real macOS disk usage
- alerts.yml: replace broken node-exporter disk alerts with Pushgateway-based ones
- master-overview.json: add "Recent Errors (Loki)" section with live error log
stream, error rate timeseries and top error sources barchart
- move-colima-to-external-ssd.sh: guided script to move 200GB Colima VM
datadisk from internal SSD to /Volumes/ManaData (3.6TB external SSD)
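The hourly push can be sketched like this (the Pushgateway URL, job name, and df parsing are assumptions — only the metric name mac_disk_used_percent comes from this commit):

```shell
# Compute host disk usage and format it as Prometheus exposition text.
used_pct=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
payload="mac_disk_used_percent ${used_pct}"
echo "$payload"
# In check-disk-space.sh the payload would then be pushed, e.g.:
# echo "$payload" | curl --data-binary @- http://localhost:9091/metrics/job/mac_disk
```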
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
node-exporter runs in VM and can't see host macOS disks directly.
Use custom mac_disk_used_percent metrics pushed via Pushgateway instead.
Also add ColimaVMDiskLarge alert when datadisk exceeds 150 GB.
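A plausible shape for the vmalert rule (group name, `for` duration, and severity are assumptions; the metric name and 150 GB threshold are from this commit):

```yaml
groups:
  - name: mac-disk
    rules:
      - alert: ColimaVMDiskLarge
        expr: mac_colima_disk_used_gb > 150
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Colima datadisk above 150 GB ({{ $value }} GB)"
```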
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
relabel drop removed the entire stream before labels were set, causing
the "at least one label pair required" error. pipeline_stages drop runs
after labels are established, which is correct for filtering by tier.
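In Promtail terms, the fix moves the filter from relabel_configs into a pipeline stage that runs once labels exist — roughly (the tier value shown is illustrative):

```yaml
pipeline_stages:
  - match:
      selector: '{tier="infra"}'   # label is already set by this point
      action: drop                 # drop matching entries instead of relabel-dropping the stream
```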
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Containers that don't match any tier regex had no labels, causing Loki to
reject the stream with "at least one label pair is required".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
todo-web, calendar-web, contacts-web, mana-web all depend on
@manacore/shared-links but it was missing from the base image COPY list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vmalert: prometheus.yml was being copied into /etc/alerts/, causing a
parse failure. Now only alerts.yml (the actual rules file) is copied.
synapse: mana-auth (Better Auth) has no OIDC discovery endpoint,
so disable OIDC and enable password auth until OIDC is implemented.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Loki was already running but had no log shipper. Adds Promtail to collect
Docker logs from all 66 containers with automatic tier labeling (infra,
auth, core, app, matrix, games) and a Grafana Logs Explorer dashboard.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build-app.sh now checks available RAM before builds and only stops
monitoring containers when free memory is below 3 GB threshold.
New memory-baseline.sh script measures per-container and per-category
RAM usage for capacity planning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>