managarten/docs/observability/website.md
Till JS 4fc9d6c59c feat(wardrobe): module foundation — garments + outfits space-scoped data layer (M1)
M1 of docs/plans/wardrobe-module.md — pure data layer + backend plumbing,
zero UI (that's M2). A user can now hold a digital wardrobe per space:
brand merch, club Trikots, family Kleiderschrank, team Kostüme, practice
Dresscode, and personal closet all live as separate pools under the same
Dexie tables, space-scoped like tags/scenes/agents after Phase 2c.

Data model — two tables, no join:

- wardrobeGarments (Dexie v41): single clothing items / accessories.
  Indexed on `category` + `createdAt` + `isArchived`. Encrypted:
  name/brand/color/size/material/tags/notes. Plaintext: category,
  mediaIds, counters, timestamps — all indexed or structural.
  `mediaIds[0]` is the primary photo used for try-on; additional
  ids are alternate views (back, detail) for M7.

- wardrobeOutfits (Dexie v41): named compositions referencing
  garment ids. Encrypted: name/description/tags. Plaintext:
  garmentIds (FK array), occasion (closed enum — useful for
  undecrypted filtering), season, booleans, lastTryOn snapshot.

- picture.images gains `wardrobeOutfitId?: string | null` as a
  plaintext back-reference. Try-on results land in the Picture
  gallery like any other generation; the outfit detail view
  queries them via this id rather than maintaining a third table.

Space scope:

- `wardrobe` added to all five explicit allowlists in shared-types/
  spaces.ts (personal is wildcard, no edit needed). Each space type
  gets a one-line comment explaining the real-world use case.
- App registry: `wardrobe` entry in shared-branding/mana-apps.ts
  with a rose→fuchsia gradient icon (T-shirt on hanger silhouette),
  color #e11d48, tier 'beta', status 'beta'.
- Module registry: wardrobeModuleConfig imported + appended to
  MODULE_CONFIGS so SYNC_APP_MAP picks it up automatically.

Backend:

- MAX_REFERENCE_IMAGES bumped 4 → 8 in picture/generate-with-
  reference (plus the client-side default in ReferenceImagePicker).
  Justified with a comment: face + body + top + bottom + shoes +
  outerwear + 2 accessories = 8. Cost doesn't scale with ref count
  (OpenAI bills per output), so the bump is a pure capability
  expansion with no credit-side risk.
- New POST /api/v1/wardrobe/garments/upload wraps uploadImageToMedia
  with app='wardrobe'. Registered under /api/v1/wardrobe in index.ts.
  Pattern 1:1 with the profile/me-images/upload endpoint; tier-gating
  falls out of wardrobe NOT being in RESOURCE_MODULES (tier='guest'
  works — consistent with picture's plain CRUD).

Stores emit domain events (WardrobeGarmentAdded, WardrobeOutfitCreated,
WardrobeOutfitTryOn, etc.) so later mana-ai missions can observe
activity without polling.

No UI in this commit. M2 (Garments-Grundlayer) wires the route + grid
+ upload-zone; M3 the Outfit composer; M4 the Try-On integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:27:37 +02:00

84 lines
4.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Website Builder — Observability
_Shipped 2026-04-23 (M7)._
Every metric below lives on mana-api's `/metrics` scrape endpoint (port 3060, unauthenticated — relies on reverse-proxy to keep it off the public internet).
## Metrics
### Write path
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| `mana_api_website_publish_total` | Counter | `result` = success \| slug_taken \| invalid \| error | Publish-attempt outcome mix. |
| `mana_api_website_publish_duration_seconds` | Histogram | — | End-to-end publish latency (validation + transaction). |
| `mana_api_website_domain_verify_total` | Counter | `result` = verified \| failed | Custom-domain DNS check outcomes. |
### Public surface
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| `mana_api_website_public_reads_total` | Counter | `result` = hit \| not_found | Anonymous reads of `/public/sites/:slug`. |
| `mana_api_website_public_read_age_seconds` | Histogram | — | Age of the served snapshot at read time. A bimodal distribution (many <10s AND many >1h) tells you the edge cache is working. |
| `mana_api_website_host_resolve_total` | Counter | `result` = hit \| miss \| error | Custom-host → slug resolutions from the SvelteKit hook. |
| `mana_api_website_submissions_total` | Counter | `result` = received \| spam \| rate_limit \| not_found \| invalid | Form submissions received. |
## Quick PromQL queries
**Publish success rate (30 min rolling):**
```promql
sum(rate(mana_api_website_publish_total{result="success"}[30m]))
/
sum(rate(mana_api_website_publish_total[30m]))
```
**p95 publish latency:**
```promql
histogram_quantile(0.95, sum by (le) (rate(mana_api_website_publish_duration_seconds_bucket[10m])))
```
**Custom-host resolve hit rate (production target: >98% once bindings stabilise):**
```promql
sum(rate(mana_api_website_host_resolve_total{result="hit"}[5m]))
/
sum(rate(mana_api_website_host_resolve_total[5m]))
```
**Spam-to-received ratio (form submissions):**
```promql
sum(rate(mana_api_website_submissions_total{result="spam"}[1h]))
/
sum(rate(mana_api_website_submissions_total{result=~"received|spam"}[1h]))
```
## Alerts (recommended)
- **`website-publish-failure-spike`** — fires when `rate(mana_api_website_publish_total{result="error"}[10m]) > 0.1/s`. Indicates DB trouble or an unhandled exception path.
- **`website-public-cold`** — fires when `rate(mana_api_website_public_reads_total[1h]) > 10/s AND rate(mana_api_website_public_read_age_seconds_count{le="10"}[1h]) / rate(mana_api_website_public_read_age_seconds_count[1h]) > 0.5`. Half the traffic is hitting fresh snapshots = the edge cache isn't doing its job, usually a CF config drift.
- **`website-domain-verify-failed-burst`** — fires when `increase(mana_api_website_domain_verify_total{result="failed"}[1h]) > 20`. Either ops broke the DNS target (CNAME not pointing anywhere) or one angry user is thrashing.
- **`website-form-spam-storm`** — fires when `rate(mana_api_website_submissions_total{result="spam"}[5m]) > 1/s`. Honeypot is holding, but a motivated attacker might move on to CAPTCHA-busting next.
## Dashboard
Grafana dashboard lives at `grafana.internal/d/website-builder` (add it to the existing "Mana Services" folder). Panels: publish volume + outcome mix, publish latency heatmap, submissions/spam split, host-resolve hit ratio, domain-verify trend.
## Orphan-asset GC
Read-only scan script at `apps/api/scripts/gc-website-assets.ts`. Run manually for now:
```bash
MANA_SERVICE_KEY=DATABASE_URL=… bun apps/api/scripts/gc-website-assets.ts
```
The script:
1. Walks every `published_snapshots.blob` and `submissions.payload` to collect referenced `mediaId`s.
2. Asks mana-media for everything scoped to `app=website`.
3. Reports items older than 30 days that aren't referenced anywhere.
**Current status: report-only.** No deletion. After 23 weeks of production reports showing the candidate list is stable and doesn't include false positives, we flip a `--delete` flag in a follow-up commit.
## Future (M7.x)
- Per-site view counts. Would require a cheap counter table (`website.site_views { site_id, day, count }`) incremented from the public-read handler. Skipped in M7 first-pass because the analytics block already covers the per-visit needs; add when someone asks for a dashboard inside the editor.
- Cloudflare hostname status reconciliation. Once the CF SaaS API is wired, a periodic poller should compare our `custom_domains.status` against CF's `hostname.ssl.status` and flag drift.
- Submission-payload retention job. Fields are kept indefinitely today; when target-delivery lands (M4.x) the job runs after delivery and nulls the payload, keeping only IDs + status.