managarten/services/mana-core-auth/docs/DISASTER_RECOVERY.md
Till-JS efb077b9ea 🐛 fix(mana-core-auth): use EdDSA for OIDC id_token signing
Set useJWTPlugin: true so id_tokens are signed with EdDSA keys
from JWKS instead of HS256. This fixes Synapse OIDC integration
which verifies tokens via JWKS endpoint.
2026-02-01 13:24:55 +01:00

6.7 KiB

Mana Core Auth - Disaster Recovery

Overview

This document describes backup, recovery, and disaster recovery procedures for the Mana Core Auth service.

Data Assets

Critical Data

Data Location Recovery Priority
User accounts auth.users table Critical
Sessions auth.sessions table High (can regenerate)
JWKS keys auth.jwks table Critical
Organizations auth.organizations table Critical
Credit balances credits.balances table Critical

Non-Critical Data (Can Regenerate)

  • Sessions (users can re-login)
  • Verification tokens (users can request new ones)
  • Rate limit counters (stored in Redis)

Backup Strategy

Database Backups

Automated Daily Backups

#!/bin/bash
# backup-database.sh

BACKUP_DIR="/backups/mana-core-auth"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/manacore_auth_${DATE}.sql.gz"

# Create backup
pg_dump "$DATABASE_URL" | gzip > "$BACKUP_FILE"

# Keep last 30 days
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +30 -delete

# Upload to S3 (optional)
aws s3 cp "$BACKUP_FILE" "s3://your-backup-bucket/mana-core-auth/"

Before Major Changes

Always create a manual backup before:

  • Database migrations
  • Schema changes
  • Bulk data operations
pg_dump "$DATABASE_URL" > pre_migration_backup.sql

Redis Backups (if used)

Redis data is ephemeral (sessions). No backup required, but you can:

# Create RDB snapshot
redis-cli BGSAVE

# Copy dump.rdb to backup location
cp /var/lib/redis/dump.rdb /backups/redis/

JWKS Key Backup

The JWKS keys are critical for JWT validation. Back them up separately:

# Export JWKS keys
psql "$DATABASE_URL" -c "COPY auth.jwks TO '/backups/jwks_backup.csv' CSV HEADER;"

Recovery Procedures

Scenario 1: Database Corruption

  1. Stop the service

    docker stop mana-core-auth
    
  2. Restore from backup

    # Drop and recreate database
    psql -c "DROP DATABASE manacore_auth;"
    psql -c "CREATE DATABASE manacore_auth;"
    
    # Restore backup
    gunzip -c /backups/manacore_auth_20240201.sql.gz | psql manacore_auth
    
  3. Verify data integrity

    psql manacore_auth -c "SELECT COUNT(*) FROM auth.users;"
    psql manacore_auth -c "SELECT COUNT(*) FROM auth.jwks;"
    
  4. Restart the service

    docker start mana-core-auth
    
  5. Verify health

    curl http://localhost:3001/health/ready
    

Scenario 2: JWKS Key Loss

If JWKS keys are lost, all existing JWTs become invalid.

  1. Option A: Restore from backup

    psql "$DATABASE_URL" -c "COPY auth.jwks FROM '/backups/jwks_backup.csv' CSV HEADER;"
    
  2. Option B: Generate new keys (forces all users to re-login)

    # Better Auth will auto-generate new keys on startup
    # All existing sessions will be invalidated
    docker restart mana-core-auth
    
  3. Notify affected services

    • All services caching the old JWKS need to refresh
    • Users will need to log in again

Scenario 3: Complete Service Failure

  1. Provision new infrastructure

    • New database instance
    • New Redis instance (if used)
    • New compute instance
  2. Restore database

    # Create database
    psql -c "CREATE DATABASE manacore_auth;"
    
    # Restore latest backup
    gunzip -c /backups/latest.sql.gz | psql manacore_auth
    
  3. Update DNS/Load Balancer

    • Point to new service instance
  4. Verify all integrations

    • Check OIDC clients can authenticate
    • Check other services can validate tokens

Scenario 4: Accidental Data Deletion

  1. Identify affected data

    -- Check what's missing
    SELECT COUNT(*) FROM auth.users WHERE deleted_at IS NOT NULL;
    
  2. Restore from point-in-time backup

    # If using PostgreSQL with WAL archiving
    pg_restore --target-time="2024-02-01 10:00:00" backup.dump
    
  3. Selective restore

    -- Restore specific users from backup database
    INSERT INTO auth.users
    SELECT * FROM backup_db.auth.users
    WHERE id IN ('user1', 'user2');
    

Key Rotation

Scheduled Key Rotation

JWKS keys should be rotated periodically (recommended: every 90 days).

  1. Generate new key

    # Better Auth handles this automatically
    # Or manually via database
    
  2. Keep old key for grace period

    • Old tokens remain valid until expiry
    • New tokens use new key
  3. Remove old key after grace period

    DELETE FROM auth.jwks
    WHERE created_at < NOW() - INTERVAL '7 days'
    AND id != (SELECT id FROM auth.jwks ORDER BY created_at DESC LIMIT 1);
    

Emergency Key Rotation

If keys are compromised:

  1. Immediately revoke old keys

    DELETE FROM auth.jwks;
    
  2. Restart service to generate new keys

    docker restart mana-core-auth
    
  3. Notify all integrated services

    • They need to refresh their JWKS cache
    • All users will need to re-authenticate

Monitoring & Alerts

Critical Alerts

Set up alerts for:

  1. Backup failures

    • Backup script exit code != 0
    • Backup file size = 0
  2. Database health

    • Connection failures
    • Replication lag (if applicable)
  3. Service health

    • /health/ready returning non-200
    • High error rate

Recovery Time Objectives

Scenario RTO RPO
Service restart 5 min 0
Database restore 30 min 24h (daily backup)
Complete rebuild 2 hours 24h

Runbook

Daily Operations

  • Verify backup completed
  • Check monitoring dashboards
  • Review error logs

Weekly Operations

  • Test backup restoration (staging)
  • Review security logs
  • Check disk space

Monthly Operations

  • Full disaster recovery drill
  • Review and update this document
  • Verify all contact information is current

Contact Information

Role Contact
On-call Engineer oncall@yourcompany.com
Database Admin dba@yourcompany.com
Security Team security@yourcompany.com

Appendix: SQL Scripts

Verify Data Integrity

-- Check user count
SELECT COUNT(*) as total_users FROM auth.users;

-- Check for orphaned data
SELECT COUNT(*) as orphaned_sessions
FROM auth.sessions s
LEFT JOIN auth.users u ON s.user_id = u.id
WHERE u.id IS NULL;

-- Check JWKS keys
SELECT id, created_at FROM auth.jwks ORDER BY created_at DESC;

-- Check credit balances
SELECT COUNT(*) as users_with_balance
FROM credits.balances;

Emergency Cleanup

-- Clear expired sessions
DELETE FROM auth.sessions WHERE expires_at < NOW();

-- Clear expired verification tokens
DELETE FROM auth.verification_tokens WHERE expires_at < NOW();