feat(mana-ai): OpenTelemetry tracing + Grafana Tempo backend

Add distributed tracing to the mana-ai background runner so mission
execution can be visualized end-to-end in Grafana.

Instrumentation (services/mana-ai/):
- tracing.ts: OTel provider setup with OTLP/HTTP exporter, withSpan() helper
- tick.ts: tick.planMission span with mission/agent/user attributes
- client.ts: planner.complete span with LLM model, tokens, latency

Infrastructure:
- docker/tempo/tempo.yaml: Grafana Tempo config (OTLP HTTP on 4318)
- docker-compose: tempo service, tempo_data volume, and the
  OTEL_EXPORTER_OTLP_ENDPOINT env var for mana-ai
- docker/grafana/provisioning/datasources/tempo.yml: auto-provisioned
  Tempo datasource

Trace flow (attribute values are examples):
  tick.planMission (root span)
    └── planner.complete (child span)
        ├── llm.model = "gpt-4o-mini"
        ├── llm.tokens.total = 1234
        └── llm.response.length = 567
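
The nesting above can be sketched without any dependencies. This is a minimal illustration of the withSpan() pattern, not the code in tracing.ts: SketchSpan, finishedSpans, the explicit parent argument, and the mission.id attribute name are assumptions made for the example, and the in-memory recorder stands in for the real OTel provider and OTLP exporter.

```typescript
// Dependency-free stand-in for a tracer; NOT the OpenTelemetry API.
type AttrValue = string | number | boolean;

interface SketchSpan {
  name: string;
  parent: SketchSpan | null;
  attributes: Record<string, AttrValue>;
  status: "unset" | "ok" | "error";
  ended: boolean;
  setAttribute(key: string, value: AttrValue): void;
}

// Spans are appended here when they end (children before parents).
const finishedSpans: SketchSpan[] = [];

// Runs fn inside a span, records success or failure, and always ends the span.
async function withSpan<T>(
  name: string,
  parent: SketchSpan | null,
  fn: (span: SketchSpan) => Promise<T>,
): Promise<T> {
  const attributes: Record<string, AttrValue> = {};
  const span: SketchSpan = {
    name,
    parent,
    attributes,
    status: "unset",
    ended: false,
    setAttribute: (key, value) => {
      attributes[key] = value;
    },
  };
  try {
    const result = await fn(span);
    span.status = "ok";
    return result;
  } catch (err) {
    span.status = "error";
    span.setAttribute(
      "error.message",
      err instanceof Error ? err.message : String(err),
    );
    throw err; // the caller still sees the original failure
  } finally {
    span.ended = true;
    finishedSpans.push(span);
  }
}

// Hypothetical caller that mirrors the trace flow shown above.
async function planMission(missionId: string): Promise<string> {
  return withSpan("tick.planMission", null, async (root) => {
    root.setAttribute("mission.id", missionId); // assumed attribute name
    return withSpan("planner.complete", root, async (child) => {
      const text = "stub plan"; // stands in for the LLM response
      child.setAttribute("llm.model", "gpt-4o-mini");
      child.setAttribute("llm.response.length", text.length);
      return text;
    });
  });
}
```

Calling planMission("m-1") leaves two entries in finishedSpans: planner.complete first, then tick.planMission as its parent. A real tracer would normally propagate the parent through context rather than an explicit argument.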

Enable: set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
  (docker-compose sets http://tempo:4318 for mana-ai)
View: Grafana → Explore → Tempo datasource

Also removes the broken @mana/subscriptions workspace ref from arcade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Till JS 2026-04-16 15:21:23 +02:00
parent 8def989ed9
commit 76577869e1
9 changed files with 456 additions and 259 deletions

@@ -332,6 +332,7 @@ services:
       # mana-auth. Used to unwrap per-mission data keys at tick time.
       # Absent → all grants skip silently with reason="not-configured".
       MANA_AI_PRIVATE_KEY_PEM: ${MANA_AI_PRIVATE_KEY_PEM:-}
+      OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo:4318
     ports:
       - "3067:3067"
     healthcheck:
@@ -1230,6 +1231,25 @@ services:
       retries: 3
       start_period: 10s

+  tempo:
+    image: grafana/tempo:2.6.1
+    container_name: mana-mon-tempo
+    restart: always
+    mem_limit: 256m
+    command: ["-config.file=/etc/tempo/tempo.yaml"]
+    volumes:
+      - ./docker/tempo:/etc/tempo:ro
+      - tempo_data:/var/tempo
+    ports:
+      - "4318:4318" # OTLP HTTP receiver
+      - "3200:3200" # Tempo API (for Grafana)
+    healthcheck:
+      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://127.0.0.1:3200/ready"]
+      interval: 300s
+      timeout: 10s
+      retries: 3
+      start_period: 10s
+
   loki:
     image: grafana/loki:3.0.0
     container_name: mana-mon-loki
@@ -1666,3 +1686,5 @@ volumes:
     name: mana-loki-data
   stalwart_data:
     name: mana-stalwart-data
+  tempo_data:
+    name: mana-tempo-data