managarten/docs/observability/website.md

Website Builder — Observability

Shipped 2026-04-23 (M7).

Every metric below is exposed on mana-api's /metrics scrape endpoint (port 3060, unauthenticated; it relies on the reverse proxy to keep it off the public internet).

Metrics

Write path

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| mana_api_website_publish_total | Counter | result = success \| slug_taken \| invalid \| error | Publish-attempt outcome mix. |
| mana_api_website_publish_duration_seconds | Histogram | | End-to-end publish latency (validation + transaction). |
| mana_api_website_domain_verify_total | Counter | result = verified \| failed | Custom-domain DNS check outcomes. |

Public surface

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| mana_api_website_public_reads_total | Counter | result = hit \| not_found | Anonymous reads of /public/sites/:slug. |
| mana_api_website_public_read_age_seconds | Histogram | | Age of the served snapshot at read time. A bimodal distribution (many <10s AND many >1h) tells you the edge cache is working. |
| mana_api_website_host_resolve_total | Counter | result = hit \| miss \| error | Custom-host → slug resolutions from the SvelteKit hook. |
| mana_api_website_submissions_total | Counter | result = received \| spam \| rate_limit \| not_found \| invalid | Form submissions received. |
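To make the `result` label convention concrete, here is a minimal, dependency-free sketch of a labelled counter that records one outcome per publish attempt and renders the Prometheus text exposition format. `LabelledCounter`, `recordPublish`, and the help string are hypothetical names for illustration; the real service presumably uses a metrics library, and this is not its implementation.

```typescript
// Hypothetical stand-in for a Prometheus counter with a single label.
type PublishResult = "success" | "slug_taken" | "invalid" | "error";

class LabelledCounter<L extends string> {
  private readonly counts = new Map<L, number>();
  readonly name: string;
  readonly help: string;
  readonly labelName: string;

  constructor(name: string, help: string, labelName: string) {
    this.name = name;
    this.help = help;
    this.labelName = labelName;
  }

  inc(label: L, by = 1): void {
    this.counts.set(label, (this.counts.get(label) ?? 0) + by);
  }

  get(label: L): number {
    return this.counts.get(label) ?? 0;
  }

  // Render the counter in the Prometheus text exposition format.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const [label, value] of this.counts) {
      lines.push(`${this.name}{${this.labelName}="${label}"} ${value}`);
    }
    return lines.join("\n") + "\n";
  }
}

const publishTotal = new LabelledCounter<PublishResult>(
  "mana_api_website_publish_total",
  "Publish attempts by outcome.", // assumed help text
  "result",
);

// Hypothetical wrapper: every attempt increments exactly one outcome,
// and an unhandled exception is counted as "error".
function recordPublish(attempt: () => PublishResult): PublishResult {
  let result: PublishResult = "error";
  try {
    result = attempt();
  } catch {
    result = "error";
  } finally {
    publishTotal.inc(result);
  }
  return result;
}
```

Counting in a `finally` block keeps the outcome mix complete: the sum over all `result` values always equals the number of attempts, which is what the ratio queries in this document rely on.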

Quick PromQL queries

Publish success rate (30 min rolling):

sum(rate(mana_api_website_publish_total{result="success"}[30m]))
/
sum(rate(mana_api_website_publish_total[30m]))

p95 publish latency:

histogram_quantile(0.95, sum by (le) (rate(mana_api_website_publish_duration_seconds_bucket[10m])))
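For readers unsure what that p95 number means: `histogram_quantile` estimates the quantile by linear interpolation inside the bucket that contains the target rank. The sketch below mirrors that idea in simplified form. The bucket bounds in the usage note are made up, and the handling of out-of-range `q` and malformed input is deliberately reduced to returning `NaN`, which differs from Prometheus itself.

```typescript
interface Bucket {
  le: number;    // upper bound; Infinity for the +Inf bucket
  count: number; // cumulative count (or rate) of observations <= le
}

// Simplified sketch of quantile estimation over cumulative buckets.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last.le !== Infinity || last.count === 0) return NaN;

  const rank = q * last.count;
  const i = sorted.findIndex((b) => b.count >= rank);
  // Rank falls into the +Inf bucket: report the highest finite bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return lower;
  // Assume observations are spread evenly inside the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}
```

With hypothetical buckets `le=0.1 → 50`, `le=0.5 → 90`, `le=1 → 98`, `le=+Inf → 100`, the 0.95 rank (95) lands in the `(0.5, 1]` bucket, so the estimate is `0.5 + 0.5 * (95 - 90) / 8 = 0.8125` seconds. The result is only as precise as the bucket layout around the quantile of interest.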

Custom-host resolve hit rate (production target: >98% once bindings stabilise):

sum(rate(mana_api_website_host_resolve_total{result="hit"}[5m]))
/
sum(rate(mana_api_website_host_resolve_total[5m]))

Spam-to-received ratio (form submissions):

sum(rate(mana_api_website_submissions_total{result="spam"}[1h]))
/
sum(rate(mana_api_website_submissions_total{result=~"received|spam"}[1h]))

Alerts

  • website-publish-failure-spike: fires when rate(mana_api_website_publish_total{result="error"}[10m]) > 0.1/s. Indicates DB trouble or an unhandled exception path.
  • website-public-cold: fires when rate(mana_api_website_public_reads_total[1h]) > 10/s AND rate(mana_api_website_public_read_age_seconds_bucket{le="10"}[1h]) / rate(mana_api_website_public_read_age_seconds_count[1h]) > 0.5 (this needs a histogram bucket boundary at 10s). More than half the traffic hitting fresh snapshots means the edge cache isn't doing its job, usually a CF config drift.
  • website-domain-verify-failed-burst: fires when increase(mana_api_website_domain_verify_total{result="failed"}[1h]) > 20. Either ops broke the DNS target (CNAME not pointing anywhere) or one angry user is retrying verification over and over.
  • website-form-spam-storm: fires when rate(mana_api_website_submissions_total{result="spam"}[5m]) > 1/s. The honeypot is holding, but a motivated attacker might move on to CAPTCHA-busting next.
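The website-public-cold rule combines a traffic floor with a ratio, which is easy to misread. A minimal sketch of the same condition as a plain predicate, assuming the two per-second rates have already been computed over the window (`ReadRates` and `isPublicCold` are hypothetical names):

```typescript
interface ReadRates {
  readsPerSec: number;      // all public reads, averaged over the window
  freshReadsPerSec: number; // reads whose snapshot age was <= 10 s
}

// True when traffic is high enough to judge AND more than half of it
// is served from snapshots younger than 10 seconds.
function isPublicCold(
  r: ReadRates,
  minTraffic = 10,
  maxFreshShare = 0.5,
): boolean {
  if (r.readsPerSec <= minTraffic) return false; // too little traffic to judge
  return r.freshReadsPerSec / r.readsPerSec > maxFreshShare;
}
```

The traffic floor matters: without it, a quiet site with a handful of reads right after a publish would trip the ratio on its own.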

Dashboard

Grafana dashboard lives at grafana.internal/d/website-builder (add it to the existing "Mana Services" folder). Panels: publish volume + outcome mix, publish latency heatmap, submissions/spam split, host-resolve hit ratio, domain-verify trend.

Orphan-asset GC

Read-only scan script at apps/api/scripts/gc-website-assets.ts. Run manually for now:

MANA_SERVICE_KEY=… DATABASE_URL=… bun apps/api/scripts/gc-website-assets.ts

The script:

  1. Walks every published_snapshots.blob and submissions.payload to collect referenced mediaIds.
  2. Asks mana-media for everything scoped to app=website.
  3. Reports items older than 30 days that aren't referenced anywhere.
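The three steps above boil down to a set difference with an age filter. The sketch below is illustrative only: it assumes media references appear under `mediaId` / `mediaIds` keys inside the JSON blobs, and `MediaItem`, `collectMediaIds`, and `findOrphans` are hypothetical names, not the script's actual code.

```typescript
interface MediaItem {
  id: string;
  createdAt: Date;
}

// Step 1: walk arbitrary JSON and collect ids stored under the
// (assumed) `mediaId` and `mediaIds` keys.
function collectMediaIds(
  value: unknown,
  out: Set<string> = new Set(),
): Set<string> {
  if (Array.isArray(value)) {
    for (const v of value) collectMediaIds(v, out);
  } else if (value !== null && typeof value === "object") {
    for (const [key, v] of Object.entries(value)) {
      if (key === "mediaId" && typeof v === "string") {
        out.add(v);
      } else if (key === "mediaIds" && Array.isArray(v)) {
        for (const id of v) if (typeof id === "string") out.add(id);
      } else {
        collectMediaIds(v, out);
      }
    }
  }
  return out;
}

// Steps 2 and 3: given everything the media service knows about,
// report items that are unreferenced and older than the cutoff.
function findOrphans(
  referenced: Set<string>,
  media: MediaItem[],
  now: Date,
  minAgeDays = 30,
): MediaItem[] {
  const cutoff = now.getTime() - minAgeDays * 24 * 60 * 60 * 1000;
  return media.filter(
    (m) => !referenced.has(m.id) && m.createdAt.getTime() < cutoff,
  );
}
```

The age filter protects uploads that are not yet referenced by any published snapshot, for example an image added to a draft that has not been published.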

Current status: report-only. No deletion. After 2 to 3 weeks of production reports showing that the candidate list is stable and free of false positives, we flip a --delete flag in a follow-up commit.

Future (M7.x)

  • Per-site view counts. Would require a cheap counter table (website.site_views { site_id, day, count }) incremented from the public-read handler. Skipped in M7 first-pass because the analytics block already covers the per-visit needs; add when someone asks for a dashboard inside the editor.
  • Cloudflare hostname status reconciliation. Once the CF SaaS API is wired, a periodic poller should compare our custom_domains.status against CF's hostname.ssl.status and flag drift.
  • Submission-payload retention job. Fields are kept indefinitely today; when target-delivery lands (M4.x) the job runs after delivery and nulls the payload, keeping only IDs + status.
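As a rough illustration of the per-site view counter idea, the sketch below uses an in-memory map as a stand-in for the proposed website.site_views { site_id, day, count } table. `recordView` and `dayKey` are hypothetical, and bucketing by UTC day is an assumption; a real version would be a database upsert in the public-read handler.

```typescript
// In-memory stand-in for the proposed website.site_views table,
// keyed by "<site_id>:<YYYY-MM-DD>".
const siteViews = new Map<string, number>();

function dayKey(siteId: string, at: Date): string {
  return `${siteId}:${at.toISOString().slice(0, 10)}`; // UTC day (assumed)
}

// Increment the counter for this site and day; returns the new count.
function recordView(siteId: string, at: Date): number {
  const key = dayKey(siteId, at);
  const next = (siteViews.get(key) ?? 0) + 1;
  siteViews.set(key, next);
  return next;
}
```

One row per site per day keeps the table small and makes the write a single increment, which is what "cheap" would mean in practice here.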