mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-25 02:24:38 +02:00
🐛 fix(mana-core-auth): use EdDSA for OIDC id_token signing
Set useJWTPlugin: true so id_tokens are signed with EdDSA keys from JWKS instead of HS256. This fixes Synapse OIDC integration which verifies tokens via JWKS endpoint.
parent
5c61a4ed0f
commit
efb077b9ea
22 changed files with 1605 additions and 142 deletions
306
services/mana-core-auth/docs/DISASTER_RECOVERY.md
Normal file
@@ -0,0 +1,306 @@
# Mana Core Auth - Disaster Recovery

## Overview

This document describes backup, recovery, and disaster recovery procedures for the Mana Core Auth service.

## Data Assets

### Critical Data

| Data | Location | Recovery Priority |
|------|----------|-------------------|
| User accounts | `auth.users` table | Critical |
| Sessions | `auth.sessions` table | High (can regenerate) |
| JWKS keys | `auth.jwks` table | Critical |
| Organizations | `auth.organizations` table | Critical |
| Credit balances | `credits.balances` table | Critical |

### Non-Critical Data (Can Regenerate)

- Sessions (users can re-login)
- Verification tokens (users can request new ones)
- Rate limit counters (stored in Redis)

## Backup Strategy

### Database Backups

#### Automated Daily Backups

```bash
#!/bin/bash
# backup-database.sh
# Exit on errors, unset variables, and failures anywhere in a pipeline,
# so a failed pg_dump is not masked by a successful gzip.
set -euo pipefail

BACKUP_DIR="/backups/mana-core-auth"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/manacore_auth_${DATE}.sql.gz"

# Create backup
mkdir -p "$BACKUP_DIR"
pg_dump "$DATABASE_URL" | gzip > "$BACKUP_FILE"

# Keep last 30 days
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +30 -delete

# Upload to S3 (optional)
aws s3 cp "$BACKUP_FILE" "s3://your-backup-bucket/mana-core-auth/"
```

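A dump that was written is not necessarily a dump that can be restored. The following is a minimal sketch of a post-backup sanity check; `verify_backup` is a hypothetical helper name, not something that exists in the repository. It only confirms that the file is non-empty and is a valid gzip stream, not that the SQL inside is complete.

```bash
# verify_backup FILE: succeed only if FILE is a non-empty, valid gzip archive.
# Hypothetical helper for illustration; not part of mana-core-auth.
verify_backup() {
  file=$1
  # -s: file exists and has a size greater than zero
  if [ ! -s "$file" ]; then
    echo "backup missing or empty: $file" >&2
    return 1
  fi
  # gzip -t tests the integrity of the compressed stream without extracting it
  if ! gzip -t "$file" 2>/dev/null; then
    echo "backup is not a valid gzip archive: $file" >&2
    return 1
  fi
  return 0
}
```

Called as `verify_backup "$BACKUP_FILE"` at the end of the backup script, a failure would make the script exit non-zero, which is the condition the backup alerts look for.
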
#### Before Major Changes

Always create a manual backup before:

- Database migrations
- Schema changes
- Bulk data operations

```bash
pg_dump "$DATABASE_URL" > pre_migration_backup.sql
```

### Redis Backups (if used)

Redis data is ephemeral (sessions, rate limit counters), so no backup is required. If you still want a snapshot:

```bash
# Create RDB snapshot (BGSAVE runs in the background; confirm completion with `redis-cli LASTSAVE`)
redis-cli BGSAVE

# Copy dump.rdb to backup location
cp /var/lib/redis/dump.rdb /backups/redis/
```

### JWKS Key Backup

The JWKS keys are critical for JWT validation. Back them up separately:

```bash
# Export JWKS keys
# \copy writes the file on the machine running psql; a server-side COPY ... TO
# writes on the database host and needs superuser or pg_write_server_files rights.
psql "$DATABASE_URL" -c "\copy auth.jwks TO '/backups/jwks_backup.csv' CSV HEADER"
```

## Recovery Procedures

### Scenario 1: Database Corruption

1. **Stop the service**
   ```bash
   docker stop mana-core-auth
   ```

2. **Restore from backup**
   ```bash
   # Drop and recreate database
   psql -c "DROP DATABASE manacore_auth;"
   psql -c "CREATE DATABASE manacore_auth;"

   # Restore backup (substitute the file name produced by backup-database.sh)
   gunzip -c /backups/mana-core-auth/manacore_auth_20240201_020000.sql.gz | psql manacore_auth
   ```

3. **Verify data integrity**
   ```bash
   psql manacore_auth -c "SELECT COUNT(*) FROM auth.users;"
   psql manacore_auth -c "SELECT COUNT(*) FROM auth.jwks;"
   ```

4. **Restart the service**
   ```bash
   docker start mana-core-auth
   ```

5. **Verify health**
   ```bash
   curl http://localhost:3001/health/ready
   ```

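After a restart the readiness endpoint may not answer immediately, so a single `curl` can fail even though the recovery succeeded. One way to handle this is to poll with a bounded number of attempts. This is a sketch only; `wait_until` is a hypothetical helper, and the attempt count and delay are assumptions to tune for your environment.

```bash
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Hypothetical helper for illustration; not part of mana-core-auth.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt will follow
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: up to 30 attempts, 2 seconds apart (-f makes curl fail on HTTP errors)
# wait_until 30 2 curl -fsS http://localhost:3001/health/ready
```
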
### Scenario 2: JWKS Key Loss

If JWKS keys are lost, tokens signed with them can no longer be verified, so all existing JWTs become invalid.

1. **Option A: Restore from backup**
   ```bash
   # \copy reads the file from the machine running psql
   psql "$DATABASE_URL" -c "\copy auth.jwks FROM '/backups/jwks_backup.csv' CSV HEADER"
   ```

2. **Option B: Generate new keys (forces all users to re-login)**
   ```bash
   # Better Auth will auto-generate new keys on startup
   # All existing sessions will be invalidated
   docker restart mana-core-auth
   ```

3. **Notify affected services**
   - All services caching the old JWKS need to refresh
   - Users will need to log in again

### Scenario 3: Complete Service Failure

1. **Provision new infrastructure**
   - New database instance
   - New Redis instance (if used)
   - New compute instance

2. **Restore database**
   ```bash
   # Create database
   psql -c "CREATE DATABASE manacore_auth;"

   # Restore latest backup
   gunzip -c /backups/latest.sql.gz | psql manacore_auth
   ```

3. **Update DNS/Load Balancer**
   - Point to new service instance

4. **Verify all integrations**
   - Check OIDC clients can authenticate
   - Check other services can validate tokens

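The backup script writes timestamped files and does not maintain a `latest.sql.gz` link, so the newest dump has to be located before restoring. A minimal sketch, assuming the `manacore_auth_<YYYYmmdd_HHMMSS>.sql.gz` naming from the backup script; `latest_backup` is a hypothetical helper name.

```bash
# latest_backup DIR: print the newest manacore_auth_*.sql.gz in DIR.
# Hypothetical helper for illustration; not part of mana-core-auth.
latest_backup() {
  dir=$1
  newest=""
  for f in "$dir"/manacore_auth_*.sql.gz; do
    # Skip the unexpanded pattern when nothing matches
    [ -e "$f" ] || continue
    # Globs expand in sorted order, so the last match carries the latest timestamp
    newest=$f
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}

# Example:
# gunzip -c "$(latest_backup /backups/mana-core-auth)" | psql manacore_auth
```

This relies on the timestamp format sorting chronologically by name, which avoids depending on file modification times that may change when backups are copied between hosts.
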
### Scenario 4: Accidental Data Deletion

1. **Identify affected data**
   ```sql
   -- Check what's missing
   SELECT COUNT(*) FROM auth.users WHERE deleted_at IS NOT NULL;
   ```

2. **Restore from point-in-time backup**
   ```bash
   # Requires WAL archiving plus a base backup; pg_restore has no point-in-time option.
   # After restoring the base backup into the data directory (PostgreSQL 12+, $PGDATA assumed):
   echo "recovery_target_time = '2024-02-01 10:00:00'" >> "$PGDATA/postgresql.conf"
   touch "$PGDATA/recovery.signal"   # restore_command must also be set; then start PostgreSQL
   ```

3. **Selective restore**
   ```sql
   -- Restore specific users from a copy of the backup.
   -- PostgreSQL cannot query across databases, so first load the backup's
   -- auth.users into a schema in this database (assumed here: backup_auth).
   INSERT INTO auth.users
   SELECT * FROM backup_auth.users
   WHERE id IN ('user1', 'user2');
   ```

## Key Rotation

### Scheduled Key Rotation

JWKS keys should be rotated periodically (recommended: every 90 days).

1. **Generate new key**
   ```bash
   # Better Auth handles this automatically
   # Or manually via database
   ```

2. **Keep old key for grace period**
   - Old tokens remain valid until expiry
   - New tokens use new key

3. **Remove old key after grace period**
   ```sql
   DELETE FROM auth.jwks
   WHERE created_at < NOW() - INTERVAL '7 days'
   AND id != (SELECT id FROM auth.jwks ORDER BY created_at DESC LIMIT 1);
   ```

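Whether a key is due for rotation comes down to comparing its age with the 90-day recommendation. The arithmetic can be sketched as follows; both function names are hypothetical, and the way the creation time is obtained (shown in the comment) is an assumption based on the `created_at` column used in the queries above.

```bash
# key_age_days CREATED_EPOCH NOW_EPOCH: whole days between two Unix timestamps.
# Hypothetical helper for illustration; not part of mana-core-auth.
key_age_days() {
  echo $(( ($2 - $1) / 86400 ))
}

# needs_rotation CREATED_EPOCH NOW_EPOCH [MAX_DAYS]
# Succeeds when the key is at least MAX_DAYS old (default 90).
needs_rotation() {
  max_days=${3:-90}
  [ "$(key_age_days "$1" "$2")" -ge "$max_days" ]
}

# Example (assumes psql access to the auth database):
# created=$(psql "$DATABASE_URL" -Atc \
#   "SELECT EXTRACT(EPOCH FROM created_at)::bigint FROM auth.jwks ORDER BY created_at DESC LIMIT 1")
# needs_rotation "$created" "$(date +%s)" && echo "newest JWKS key is due for rotation"
```
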
### Emergency Key Rotation

If keys are compromised:

1. **Immediately revoke old keys**
   ```sql
   DELETE FROM auth.jwks;
   ```

2. **Restart service to generate new keys**
   ```bash
   docker restart mana-core-auth
   ```

3. **Notify all integrated services**
   - They need to refresh their JWKS cache
   - All users will need to re-authenticate

## Monitoring & Alerts

### Critical Alerts

Set up alerts for:

1. **Backup failures**
   - Backup script exit code != 0
   - Backup file size = 0

2. **Database health**
   - Connection failures
   - Replication lag (if applicable)

3. **Service health**
   - /health/ready returning non-200
   - High error rate

### Recovery Time Objectives

| Scenario | RTO | RPO |
|----------|-----|-----|
| Service restart | 5 min | 0 |
| Database restore | 30 min | 24h (daily backup) |
| Complete rebuild | 2 hours | 24h |

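The 24h RPO only holds if a dump newer than 24 hours actually exists. A minimal sketch of that check, assuming backups live in one directory as `*.sql.gz` files; `has_recent_backup` is a hypothetical helper name, and `-mmin` is an extension available in GNU and BSD `find`.

```bash
# has_recent_backup DIR [MAX_AGE_MINUTES]
# Succeeds if DIR holds a *.sql.gz file modified within the limit.
# Hypothetical helper for illustration; not part of mana-core-auth.
has_recent_backup() {
  dir=$1
  max_age_minutes=${2:-1440}   # 1440 minutes = 24 hours, the RPO in the table above
  # -mmin -N matches files modified less than N minutes ago
  recent=$(find "$dir" -name '*.sql.gz' -mmin "-$max_age_minutes" 2>/dev/null)
  [ -n "$recent" ]
}

# Example:
# has_recent_backup /backups/mana-core-auth || echo "no backup in the last 24h" >&2
```
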
## Runbook

### Daily Operations

- [ ] Verify backup completed
- [ ] Check monitoring dashboards
- [ ] Review error logs

### Weekly Operations

- [ ] Test backup restoration (staging)
- [ ] Review security logs
- [ ] Check disk space

### Monthly Operations

- [ ] Full disaster recovery drill
- [ ] Review and update this document
- [ ] Verify all contact information is current

## Contact Information

| Role | Contact |
|------|---------|
| On-call Engineer | oncall@yourcompany.com |
| Database Admin | dba@yourcompany.com |
| Security Team | security@yourcompany.com |

## Appendix: SQL Scripts

### Verify Data Integrity

```sql
-- Check user count
SELECT COUNT(*) as total_users FROM auth.users;

-- Check for orphaned data
SELECT COUNT(*) as orphaned_sessions
FROM auth.sessions s
LEFT JOIN auth.users u ON s.user_id = u.id
WHERE u.id IS NULL;

-- Check JWKS keys
SELECT id, created_at FROM auth.jwks ORDER BY created_at DESC;

-- Check credit balances
SELECT COUNT(*) as users_with_balance
FROM credits.balances;
```

### Emergency Cleanup

```sql
-- Clear expired sessions
DELETE FROM auth.sessions WHERE expires_at < NOW();

-- Clear expired verification tokens
DELETE FROM auth.verification_tokens WHERE expires_at < NOW();
```

299
services/mana-core-auth/docs/PRODUCTION_DEPLOYMENT.md
Normal file
@@ -0,0 +1,299 @@
# Mana Core Auth - Production Deployment Guide

## Prerequisites

Before deploying to production, ensure you have:

1. **PostgreSQL Database** - Version 14+ recommended
2. **Redis** (optional but recommended) - For session storage
3. **SMTP Server** - For email verification and password reset
4. **Stripe Account** - For credit system (optional)
5. **Domain with SSL** - HTTPS is required for secure cookies

## Environment Variables

### Required in Production

```env
NODE_ENV=production
PORT=3001

# Database (REQUIRED)
DATABASE_URL=postgresql://user:password@host:5432/manacore_auth

# Public URL (REQUIRED)
# Used for email verification links, OIDC callbacks
BASE_URL=https://auth.yourdomain.com

# CORS (REQUIRED)
# Comma-separated list of allowed origins
CORS_ORIGINS=https://app.yourdomain.com,https://admin.yourdomain.com

# JWT Configuration
JWT_ISSUER=manacore
JWT_AUDIENCE=manacore
```

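A missing variable is easier to diagnose before the container starts than from its logs. The following preflight check is a sketch under stated assumptions: the function names are hypothetical, and the variable list simply mirrors the block above (the service's own startup validation may differ).

```bash
# missing_vars NAME...: print each named variable that is unset or empty.
# Hypothetical helper for illustration; not part of mana-core-auth.
missing_vars() {
  for name in "$@"; do
    # eval performs an indirect lookup; names come from a fixed list, not user input
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# check_required_env: fail if any variable from the list above is missing.
check_required_env() {
  missing=$(missing_vars NODE_ENV PORT DATABASE_URL BASE_URL CORS_ORIGINS JWT_ISSUER JWT_AUDIENCE)
  if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Missing required environment variables:" $missing >&2
    return 1
  fi
}
```

Running `check_required_env` in a deployment script before `docker run` would stop the rollout early with the names of the missing variables.
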
### Recommended in Production

```env
# Redis for session storage
REDIS_HOST=redis.yourdomain.com
REDIS_PORT=6379
REDIS_PASSWORD=your-redis-password

# SMTP for emails
SMTP_HOST=smtp.brevo.com
SMTP_PORT=587
SMTP_USER=your-smtp-user
SMTP_PASSWORD=your-smtp-password
SMTP_FROM=ManaCore <noreply@yourdomain.com>

# Stripe for credits
STRIPE_SECRET_KEY=sk_live_...
STRIPE_PUBLISHABLE_KEY=pk_live_...
STRIPE_WEBHOOK_SECRET=whsec_...

# Error tracking
SENTRY_DSN=https://...@sentry.io/...

# Logging
LOG_LEVEL=info
```

## Deployment Options

### Option 1: Docker (Recommended)

```bash
# Build the image
docker build -t mana-core-auth:latest -f services/mana-core-auth/Dockerfile .

# Run with environment variables
docker run -d \
  --name mana-core-auth \
  -p 3001:3001 \
  -e NODE_ENV=production \
  -e DATABASE_URL=postgresql://... \
  -e BASE_URL=https://auth.yourdomain.com \
  -e CORS_ORIGINS=https://app.yourdomain.com \
  -e REDIS_HOST=redis \
  mana-core-auth:latest
```

### Option 2: Docker Compose

```yaml
version: '3.8'

services:
  auth:
    build:
      context: .
      dockerfile: services/mana-core-auth/Dockerfile
    ports:
      - "3001:3001"
    environment:
      NODE_ENV: production
      DATABASE_URL: postgresql://manacore:${DB_PASSWORD}@db:5432/manacore_auth
      BASE_URL: https://auth.yourdomain.com
      CORS_ORIGINS: https://app.yourdomain.com
      REDIS_HOST: redis
      REDIS_PORT: 6379
      # Must match the --requirepass value of the redis service below
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    healthcheck:
      test: ["CMD", "node", "-e", "require('http').get('http://localhost:3001/health/ready', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: manacore
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: manacore_auth
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U manacore"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:
```

### Option 3: Kubernetes

See `k8s/` directory for Kubernetes manifests (if available).

## Database Setup

### Initial Setup

The service will automatically create tables on first start using Drizzle ORM's push mechanism.

```bash
# For manual schema push (development)
pnpm db:push

# For production migrations
pnpm db:migrate
```

### Migration Strategy

1. **Before deploying new code:**
   - Run migrations against the database
   - Migrations are idempotent and safe to run multiple times

2. **Rolling deployments:**
   - Ensure migrations are backwards-compatible
   - Deploy migration first, then new code
   - Use advisory locks to prevent concurrent migrations

```bash
# Run migrations manually
DATABASE_URL=postgresql://... pnpm db:migrate
```

### Rollback Strategy

1. **Schema rollback:**
   - Create a new migration that reverts changes
   - Never modify existing migration files

2. **Data rollback:**
   - Take database backups before major changes
   - Use point-in-time recovery if available

## Health Checks

The service exposes three health check endpoints:

| Endpoint | Purpose | Use Case |
|----------|---------|----------|
| `/health` | Basic health | Load balancer health check |
| `/health/live` | Liveness probe | Kubernetes liveness probe |
| `/health/ready` | Readiness probe | Kubernetes readiness probe |

### Kubernetes Probes

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 3001
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3001
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```

## Monitoring

### Prometheus Metrics

Metrics are exposed at `/metrics`:

- `http_requests_total` - Total HTTP requests
- `http_request_duration_seconds` - Request duration histogram

### Grafana Dashboard

Import the dashboard from `monitoring/grafana/dashboards/mana-core-auth.json`.

### Alerting

Recommended alerts:

1. **High error rate**: >5% 5xx responses
2. **Slow response time**: p99 > 2s
3. **Database connection failures**: health check failures
4. **Rate limiting triggered**: high 429 responses

## Security Checklist

Before going live:

- [ ] HTTPS is configured (required for secure cookies)
- [ ] CORS_ORIGINS only includes trusted domains
- [ ] Database password is strong and not in code
- [ ] Redis password is set
- [ ] SMTP credentials are production credentials
- [ ] Stripe keys are live (not test) keys
- [ ] LOG_LEVEL is set to 'info' or 'warn' (not 'debug')
- [ ] Rate limiting is enabled
- [ ] Health checks are configured in load balancer

## Troubleshooting

### Service won't start

1. Check environment variables:
   ```bash
   docker logs mana-core-auth
   ```
   Look for "ENVIRONMENT CONFIGURATION ERROR"

2. Check database connectivity:
   ```bash
   curl http://localhost:3001/health/ready
   ```

### Authentication failures

1. Check JWKS endpoint:
   ```bash
   curl http://localhost:3001/api/v1/auth/jwks
   ```

2. Verify JWT issuer/audience match between services

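Since this commit switches id_token signing to EdDSA keys from the JWKS, a quick check is whether the endpoint actually publishes such a key. The sketch below assumes the key uses the Ed25519 curve, which JWKs express as `"kty":"OKP"` with `"crv":"Ed25519"`; a deployment configured for a different curve would need a different pattern. `jwks_has_ed25519` is a hypothetical helper name.

```bash
# jwks_has_ed25519: read a JWKS document on stdin and succeed if any key
# declares the Ed25519 curve. Hypothetical helper; not part of mana-core-auth.
jwks_has_ed25519() {
  # Matches "crv":"Ed25519" with optional whitespace around the colon
  grep -Eq '"crv"[[:space:]]*:[[:space:]]*"Ed25519"'
}

# Example:
# curl -fsS http://localhost:3001/api/v1/auth/jwks | jwks_has_ed25519 \
#   || echo "no Ed25519 key published" >&2
```

A pattern match keeps the check free of extra tooling; a JSON-aware tool would be more robust if one is available on the host.
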
### Email not sending

1. Check SMTP configuration
2. Look for email logs (emails are logged in development)
3. Verify sender domain is authorized

## Scaling

### Horizontal Scaling

The service is stateless and can be horizontally scaled:

1. Use Redis for session storage (required for multi-instance)
2. Use a load balancer with sticky sessions (optional)
3. All instances share the same database

### Recommended Instance Sizing

| Traffic Level | Instances | CPU | Memory |
|--------------|-----------|-----|--------|
| Low (<1k users) | 1 | 0.5 | 512MB |
| Medium (1k-10k) | 2 | 1 | 1GB |
| High (10k-100k) | 3-5 | 2 | 2GB |

## Backup & Recovery

See [DISASTER_RECOVERY.md](./DISASTER_RECOVERY.md) for backup and recovery procedures.