🐛 fix(mana-core-auth): use EdDSA for OIDC id_token signing

Set useJWTPlugin: true so that id_tokens are signed with EdDSA keys
from JWKS instead of HS256. This fixes the Synapse OIDC integration,
which verifies tokens via the JWKS endpoint.
Till-JS 2026-02-01 13:24:55 +01:00
parent 5c61a4ed0f
commit efb077b9ea
22 changed files with 1605 additions and 142 deletions

@ -0,0 +1,306 @@
# Mana Core Auth - Disaster Recovery
## Overview
This document describes backup, recovery, and disaster recovery procedures for the Mana Core Auth service.
## Data Assets
### Critical Data
| Data | Location | Recovery Priority |
|------|----------|-------------------|
| User accounts | `auth.users` table | Critical |
| Sessions | `auth.sessions` table | High (can regenerate) |
| JWKS keys | `auth.jwks` table | Critical |
| Organizations | `auth.organizations` table | Critical |
| Credit balances | `credits.balances` table | Critical |
### Non-Critical Data (Can Regenerate)
- Sessions (users can re-login)
- Verification tokens (users can request new ones)
- Rate limit counters (stored in Redis)
## Backup Strategy
### Database Backups
#### Automated Daily Backups
```bash
#!/bin/bash
# backup-database.sh
# Abort on any error, including a pg_dump failure inside the pipeline
set -euo pipefail
BACKUP_DIR="/backups/mana-core-auth"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/manacore_auth_${DATE}.sql.gz"
# Create backup
mkdir -p "$BACKUP_DIR"
pg_dump "$DATABASE_URL" | gzip > "$BACKUP_FILE"
# Keep last 30 days
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +30 -delete
# Upload to S3 (optional)
aws s3 cp "$BACKUP_FILE" "s3://your-backup-bucket/mana-core-auth/"
```
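A backup file that exists but is empty or truncated is easy to overlook. The following is a minimal sketch of a post-backup check, assuming the gzip-compressed dump format used by the script above. `verify_backup` is a hypothetical helper name, not something the service ships.

```bash
# verify_backup FILE
# Returns 0 when FILE exists, is non-empty, and passes gzip's integrity test.
# Hypothetical helper: it does not prove that the SQL inside restores cleanly.
verify_backup() {
    if [ ! -s "$1" ]; then
        echo "backup missing or empty: $1" >&2
        return 1
    fi
    if ! gzip -t "$1" 2>/dev/null; then
        echo "backup is not a valid gzip stream: $1" >&2
        return 1
    fi
    return 0
}
```

Calling `verify_backup "$BACKUP_FILE" || exit 1` right after the `pg_dump` line would make the script's exit code reflect a bad dump. Only a test restore confirms that the backup is actually usable.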
#### Before Major Changes
Always create a manual backup before:
- Database migrations
- Schema changes
- Bulk data operations
```bash
pg_dump "$DATABASE_URL" > pre_migration_backup.sql
```
### Redis Backups (if used)
Redis holds only ephemeral data (sessions and rate limit counters), so no backup is required. If you still want a snapshot:
```bash
# Create RDB snapshot
redis-cli BGSAVE
# Copy dump.rdb to backup location
cp /var/lib/redis/dump.rdb /backups/redis/
```
### JWKS Key Backup
The JWKS keys are critical for JWT validation. Back them up separately:
```bash
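# Export JWKS keys with client-side \copy (server-side COPY writes on the
# database host and needs superuser or pg_write_server_files privileges)
psql "$DATABASE_URL" -c "\copy auth.jwks TO '/backups/jwks_backup.csv' CSV HEADER"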
```
## Recovery Procedures
### Scenario 1: Database Corruption
1. **Stop the service**
```bash
docker stop mana-core-auth
```
2. **Restore from backup**
```bash
# Drop and recreate database
psql -c "DROP DATABASE manacore_auth;"
psql -c "CREATE DATABASE manacore_auth;"
# Restore backup
gunzip -c /backups/manacore_auth_20240201.sql.gz | psql manacore_auth
```
3. **Verify data integrity**
```bash
psql manacore_auth -c "SELECT COUNT(*) FROM auth.users;"
psql manacore_auth -c "SELECT COUNT(*) FROM auth.jwks;"
```
4. **Restart the service**
```bash
docker start mana-core-auth
```
5. **Verify health**
```bash
curl http://localhost:3001/health/ready
```
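Right after a restart the readiness endpoint may fail for a short while, so a single `curl` can give a misleading result. Below is a small retry sketch; `retry_until` is a hypothetical helper, and the attempt count and delay are arbitrary values chosen for illustration.

```bash
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
# Note: uses plain (global) shell variables to stay POSIX-compatible.
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        n=$((n + 1))
        # Only sleep when another attempt will follow
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, `retry_until 30 2 curl -fsS -o /dev/null http://localhost:3001/health/ready` polls the readiness endpoint for up to about a minute and fails if it never returns a success status.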
### Scenario 2: JWKS Key Loss
If JWKS keys are lost, all existing JWTs become invalid.
1. **Option A: Restore from backup**
```bash
# Client-side \copy reads the file from the machine running psql
psql "$DATABASE_URL" -c "\copy auth.jwks FROM '/backups/jwks_backup.csv' CSV HEADER"
```
2. **Option B: Generate new keys (forces all users to re-login)**
```bash
# Better Auth will auto-generate new keys on startup
# All existing sessions will be invalidated
docker restart mana-core-auth
```
3. **Notify affected services**
- All services caching the old JWKS need to refresh
- Users will need to log in again
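After either option it helps to confirm that the JWKS endpoint is publishing keys again before telling other services to refresh. The sketch below counts keys in a JWKS document by matching the `kty` member, which RFC 7517 requires on every key. `count_jwks_keys` is a hypothetical helper and uses plain text matching, so treat it as a quick sanity check and not a JSON parser.

```bash
# count_jwks_keys
# Reads a JWKS document on stdin and prints how many keys it contains,
# counting occurrences of the "kty" member (required on every JWK).
count_jwks_keys() {
    grep -o '"kty"[[:space:]]*:' | wc -l | tr -d ' '
}
```

One possible use is `curl -fsS http://localhost:3001/api/v1/auth/jwks | count_jwks_keys`. A result of `0` means the service has not published any key yet.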
### Scenario 3: Complete Service Failure
1. **Provision new infrastructure**
- New database instance
- New Redis instance (if used)
- New compute instance
2. **Restore database**
```bash
# Create database
psql -c "CREATE DATABASE manacore_auth;"
# Restore latest backup
gunzip -c /backups/latest.sql.gz | psql manacore_auth
```
3. **Update DNS/Load Balancer**
- Point to new service instance
4. **Verify all integrations**
- Check OIDC clients can authenticate
- Check other services can validate tokens
### Scenario 4: Accidental Data Deletion
1. **Identify affected data**
```sql
-- Count soft-deleted users (hard-deleted rows must be compared against a backup)
SELECT COUNT(*) FROM auth.users WHERE deleted_at IS NOT NULL;
```
2. **Restore from point-in-time backup**
```bash
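# PITR needs a base backup plus archived WAL (pg_restore has no time-target option).
# After restoring the base backup into $PGDATA, set in postgresql.conf:
#   recovery_target_time = '2024-02-01 10:00:00'   (plus a restore_command for the WAL archive)
touch "$PGDATA/recovery.signal"   # then start PostgreSQL to replay WAL up to that time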
```
3. **Selective restore**
```sql
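-- Restore specific users from a backup copy.
-- PostgreSQL cannot query another database directly, so this assumes the backup
-- was restored into a scratch database and its auth schema is exposed here as
-- backup_auth (for example via postgres_fdw and IMPORT FOREIGN SCHEMA).
INSERT INTO auth.users
SELECT * FROM backup_auth.users
WHERE id IN ('user1', 'user2');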
```
## Key Rotation
### Scheduled Key Rotation
JWKS keys should be rotated periodically (recommended: every 90 days).
1. **Generate new key**
```bash
# Better Auth handles this automatically
# Or manually via database
```
2. **Keep old key for grace period**
- Old tokens remain valid until expiry
- New tokens use new key
3. **Remove old key after grace period**
```sql
DELETE FROM auth.jwks
WHERE created_at < NOW() - INTERVAL '7 days'
AND id != (SELECT id FROM auth.jwks ORDER BY created_at DESC LIMIT 1);
```
### Emergency Key Rotation
If keys are compromised:
1. **Immediately revoke old keys**
```sql
DELETE FROM auth.jwks;
```
2. **Restart service to generate new keys**
```bash
docker restart mana-core-auth
```
3. **Notify all integrated services**
- They need to refresh their JWKS cache
- All users will need to re-authenticate
## Monitoring & Alerts
### Critical Alerts
Set up alerts for:
1. **Backup failures**
- Backup script exit code != 0
- Backup file size = 0
2. **Database health**
- Connection failures
- Replication lag (if applicable)
3. **Service health**
- /health/ready returning non-200
- High error rate
### Recovery Time Objectives
| Scenario | RTO | RPO |
|----------|-----|-----|
| Service restart | 5 min | 0 |
| Database restore | 30 min | 24h (daily backup) |
| Complete rebuild | 2 hours | 24h |
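The 24h RPO only holds while the newest backup is less than a day old. The following sketch checks the age of a backup file; `backup_is_fresh` is a hypothetical helper, and it relies on the `-mmin` test available in GNU and BSD `find` (it is not part of POSIX).

```bash
# backup_is_fresh FILE MAX_AGE_HOURS
# Returns 0 when FILE exists and was modified less than MAX_AGE_HOURS ago.
backup_is_fresh() {
    [ -f "$1" ] || return 1
    max_minutes=$(( $2 * 60 ))
    # find prints the path only if the modification time is recent enough
    [ -n "$(find "$1" -mmin "-$max_minutes" 2>/dev/null)" ]
}
```

A monitoring job could run this against the most recent file in the backup directory and alert when it returns non-zero. A threshold slightly above 24 hours avoids false alarms when a backup runs a few minutes late.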
## Runbook
### Daily Operations
- [ ] Verify backup completed
- [ ] Check monitoring dashboards
- [ ] Review error logs
### Weekly Operations
- [ ] Test backup restoration (staging)
- [ ] Review security logs
- [ ] Check disk space
### Monthly Operations
- [ ] Full disaster recovery drill
- [ ] Review and update this document
- [ ] Verify all contact information is current
## Contact Information
| Role | Contact |
|------|---------|
| On-call Engineer | oncall@yourcompany.com |
| Database Admin | dba@yourcompany.com |
| Security Team | security@yourcompany.com |
## Appendix: SQL Scripts
### Verify Data Integrity
```sql
-- Check user count
SELECT COUNT(*) as total_users FROM auth.users;
-- Check for orphaned data
SELECT COUNT(*) as orphaned_sessions
FROM auth.sessions s
LEFT JOIN auth.users u ON s.user_id = u.id
WHERE u.id IS NULL;
-- Check JWKS keys
SELECT id, created_at FROM auth.jwks ORDER BY created_at DESC;
-- Check credit balances
SELECT COUNT(*) as users_with_balance
FROM credits.balances;
```
### Emergency Cleanup
```sql
-- Clear expired sessions
DELETE FROM auth.sessions WHERE expires_at < NOW();
-- Clear expired verification tokens
DELETE FROM auth.verification_tokens WHERE expires_at < NOW();
```

@ -0,0 +1,299 @@
# Mana Core Auth - Production Deployment Guide
## Prerequisites
Before deploying to production, ensure you have:
1. **PostgreSQL Database** - Version 14+ recommended
2. **Redis** (optional for a single instance, required for multi-instance deployments) - For session storage
3. **SMTP Server** - For email verification and password reset
4. **Stripe Account** - For credit system (optional)
5. **Domain with SSL** - HTTPS is required for secure cookies
## Environment Variables
### Required in Production
```env
NODE_ENV=production
PORT=3001
# Database (REQUIRED)
DATABASE_URL=postgresql://user:password@host:5432/manacore_auth
# Public URL (REQUIRED)
# Used for email verification links, OIDC callbacks
BASE_URL=https://auth.yourdomain.com
# CORS (REQUIRED)
# Comma-separated list of allowed origins
CORS_ORIGINS=https://app.yourdomain.com,https://admin.yourdomain.com
# JWT Configuration
JWT_ISSUER=manacore
JWT_AUDIENCE=manacore
```
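A missing required variable is cheaper to catch before the container starts than from the service logs. Below is a minimal preflight sketch; `check_required_env` is a hypothetical helper and not part of the service, which performs its own validation at startup.

```bash
# check_required_env NAME...
# Reports every listed variable that is unset or empty; returns 1 if any is.
# Names must be plain identifiers because they are expanded with eval.
check_required_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the variables marked REQUIRED above, a deploy script could run `check_required_env DATABASE_URL BASE_URL CORS_ORIGINS || exit 1`.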
### Recommended in Production
```env
# Redis for session storage
REDIS_HOST=redis.yourdomain.com
REDIS_PORT=6379
REDIS_PASSWORD=your-redis-password
# SMTP for emails
SMTP_HOST=smtp.brevo.com
SMTP_PORT=587
SMTP_USER=your-smtp-user
SMTP_PASSWORD=your-smtp-password
SMTP_FROM=ManaCore <noreply@yourdomain.com>
# Stripe for credits
STRIPE_SECRET_KEY=sk_live_...
STRIPE_PUBLISHABLE_KEY=pk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...
# Error tracking
SENTRY_DSN=https://...@sentry.io/...
# Logging
LOG_LEVEL=info
```
## Deployment Options
### Option 1: Docker (Recommended)
```bash
# Build the image
docker build -t mana-core-auth:latest -f services/mana-core-auth/Dockerfile .
# Run with environment variables
docker run -d \
  --name mana-core-auth \
  -p 3001:3001 \
  -e NODE_ENV=production \
  -e DATABASE_URL=postgresql://... \
  -e BASE_URL=https://auth.yourdomain.com \
  -e CORS_ORIGINS=https://app.yourdomain.com \
  -e REDIS_HOST=redis \
  mana-core-auth:latest
```
### Option 2: Docker Compose
```yaml
version: '3.8'
services:
  auth:
    build:
      context: .
      dockerfile: services/mana-core-auth/Dockerfile
    ports:
      - "3001:3001"
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://manacore:${DB_PASSWORD}@db:5432/manacore_auth
      BASE_URL: https://auth.yourdomain.com
      CORS_ORIGINS: https://app.yourdomain.com
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD}  # must match --requirepass on the redis service
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      test: ["CMD", "node", "-e", "require('http').get('http://localhost:3001/health/ready', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: manacore
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: manacore_auth
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U manacore"]
      interval: 10s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis_data:/data
volumes:
  postgres_data:
  redis_data:
```
### Option 3: Kubernetes
See `k8s/` directory for Kubernetes manifests (if available).
## Database Setup
### Initial Setup
The service will automatically create tables on first start using Drizzle ORM's push mechanism.
```bash
# For manual schema push (development)
pnpm db:push
# For production migrations
pnpm db:migrate
```
### Migration Strategy
1. **Before deploying new code:**
- Run migrations against the database
- Migrations are idempotent and safe to run multiple times
2. **Rolling deployments:**
- Ensure migrations are backwards-compatible
- Deploy migration first, then new code
- Use advisory locks to prevent concurrent migrations
```bash
# Run migrations manually
DATABASE_URL=postgresql://... pnpm db:migrate
```
### Rollback Strategy
1. **Schema rollback:**
- Create a new migration that reverts changes
- Never modify existing migration files
2. **Data rollback:**
- Take database backups before major changes
- Use point-in-time recovery if available
## Health Checks
The service exposes three health check endpoints:
| Endpoint | Purpose | Use Case |
|----------|---------|----------|
| `/health` | Basic health | Load balancer health check |
| `/health/live` | Liveness probe | Kubernetes liveness probe |
| `/health/ready` | Readiness probe | Kubernetes readiness probe |
### Kubernetes Probes
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 3001
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 3001
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```
## Monitoring
### Prometheus Metrics
Metrics are exposed at `/metrics`:
- `http_requests_total` - Total HTTP requests
- `http_request_duration_seconds` - Request duration histogram
### Grafana Dashboard
Import the dashboard from `monitoring/grafana/dashboards/mana-core-auth.json`.
### Alerting
Recommended alerts:
1. **High error rate**: >5% 5xx responses
2. **Slow response time**: p99 > 2s
3. **Database connection failures**: health check failures
4. **Rate limiting triggered**: high 429 responses
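To make the error-rate threshold concrete, the sketch below evaluates the ">5% 5xx" rule from two counters using integer arithmetic. `error_rate_exceeds` is a hypothetical helper for illustration; in a real setup this rule would normally live in the alerting system, and how the 5xx count is derived from `http_requests_total` depends on label names that are not specified here.

```bash
# error_rate_exceeds TOTAL ERRORS THRESHOLD_PERCENT
# Returns 0 when ERRORS / TOTAL is strictly greater than THRESHOLD_PERCENT.
# Integer arithmetic only: ERRORS * 100 > TOTAL * THRESHOLD_PERCENT.
error_rate_exceeds() {
    [ "$1" -gt 0 ] || return 1   # no traffic, nothing to alert on
    [ $(( $2 * 100 )) -gt $(( $1 * $3 )) ]
}
```

With 1000 requests, 60 errors trip a 5% threshold (6%), while exactly 50 errors do not, because the rule is a strict "greater than".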
## Security Checklist
Before going live:
- [ ] HTTPS is configured (required for secure cookies)
- [ ] CORS_ORIGINS only includes trusted domains
- [ ] Database password is strong and not in code
- [ ] Redis password is set
- [ ] SMTP credentials are production credentials
- [ ] Stripe keys are live (not test) keys
- [ ] LOG_LEVEL is set to 'info' or 'warn' (not 'debug')
- [ ] Rate limiting is enabled
- [ ] Health checks are configured in load balancer
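Some of these items are visible in the environment and can be checked mechanically. The following is a minimal sketch covering four of them; `check_security_env` is a hypothetical helper, the `sk_live_` prefix is taken from the example configuration above, and an unset `LOG_LEVEL` or `STRIPE_SECRET_KEY` is deliberately not flagged.

```bash
# check_security_env
# Checks a few checklist items that can be read from environment variables.
# Prints one line per problem and returns 1 if any was found.
check_security_env() {
    problems=0
    case "${BASE_URL:-}" in
        https://*) ;;
        *) echo "BASE_URL is not an https:// URL" >&2; problems=1 ;;
    esac
    case "${CORS_ORIGINS:-}" in
        *"*"*) echo "CORS_ORIGINS contains a wildcard" >&2; problems=1 ;;
    esac
    case "${LOG_LEVEL:-}" in
        debug) echo "LOG_LEVEL is debug" >&2; problems=1 ;;
    esac
    case "${STRIPE_SECRET_KEY:-}" in
        ""|sk_live_*) ;;
        *) echo "STRIPE_SECRET_KEY is not a live key" >&2; problems=1 ;;
    esac
    return "$problems"
}
```

The remaining items, such as password strength and load balancer health checks, still need a manual review.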
## Troubleshooting
### Service won't start
1. Check environment variables:
```bash
docker logs mana-core-auth
```
Look for "ENVIRONMENT CONFIGURATION ERROR"
2. Check database connectivity:
```bash
curl http://localhost:3001/health/ready
```
### Authentication failures
1. Check JWKS endpoint:
```bash
curl http://localhost:3001/api/v1/auth/jwks
```
2. Verify JWT issuer/audience match between services
### Email not sending
1. Check SMTP configuration
2. Look for email logs (emails are logged in development)
3. Verify sender domain is authorized
## Scaling
### Horizontal Scaling
The service is stateless and can be horizontally scaled:
1. Use Redis for session storage (required for multi-instance)
2. Use a load balancer with sticky sessions (optional)
3. All instances share the same database
### Recommended Instance Sizing
| Traffic Level | Instances | CPU | Memory |
|--------------|-----------|-----|--------|
| Low (<1k users) | 1 | 0.5 | 512MB |
| Medium (1k-10k) | 2 | 1 | 1GB |
| High (10k-100k) | 3-5 | 2 | 2GB |
## Backup & Recovery
See [DISASTER_RECOVERY.md](./DISASTER_RECOVERY.md) for backup and recovery procedures.