
Downtime Prevention Plan for uLoad

Problem Analysis

The uLoad project recently went completely down, which caused critical problems:

  • Main application unreachable
  • Redirects not working
  • User experience severely degraded
  • Potential data loss/inconsistency

Current Architecture Analysis

Technology Stack

  • Frontend: SvelteKit 2.22 with Svelte 5.0
  • Backend: PocketBase (https://pb.ulo.ad)
  • Hosting: Hetzner VPS with Coolify
  • Database: PocketBase SQLite with a persistent volume
  • Deployment: Docker with Supervisor (multi-service container)

Critical Single Points of Failure

  1. PocketBase Dependency

    • The entire application depends on PocketBase availability
    • No fallback mechanisms implemented
    • Timeout configuration too aggressive (5 seconds)
  2. Single Server Setup

    • One Hetzner VPS for the entire infrastructure
    • No redundancy or load balancing
    • Coolify as a single point of failure
  3. Container Architecture

    • SvelteKit and PocketBase in one container
    • Supervisor as process manager
    • No health checks between services
  4. Rate Limiting

    • In-memory store (loses data on restart)
    • No Redis backend for persistence
    • Potential blocking of legitimate traffic

Immediate Measures (Quick Wins)

1. Improved Error Handling & Fallbacks

PocketBase Connection Resilience

// src/lib/pocketbase-resilient.ts
class ResilientPocketBase {
	private maxRetries = 3;
	private backoffMs = 1000;

	// Retries a failed operation with exponential backoff (1s, 2s, 4s, ...).
	async withRetry<T>(operation: () => Promise<T>): Promise<T> {
		for (let i = 0; i <= this.maxRetries; i++) {
			try {
				return await operation();
			} catch (error) {
				if (i === this.maxRetries) throw error;
				await this.delay(this.backoffMs * Math.pow(2, i));
			}
		}
		throw new Error('Max retries exceeded'); // unreachable, satisfies the type checker
	}

	private delay(ms: number): Promise<void> {
		return new Promise((resolve) => setTimeout(resolve, ms));
	}
}
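To illustrate the retry behavior, here is a self-contained sketch of the same backoff pattern (the helper below is a standalone stand-in, not existing project code): an operation that fails twice still succeeds on the third attempt without surfacing an error.

```typescript
// Standalone sketch of the backoff pattern used by ResilientPocketBase.
async function withRetry<T>(
	operation: () => Promise<T>,
	maxRetries = 3,
	backoffMs = 10 // short delay for the demo; production would use ~1000ms
): Promise<T> {
	let lastError: unknown;
	for (let i = 0; i <= maxRetries; i++) {
		try {
			return await operation();
		} catch (error) {
			lastError = error;
			if (i < maxRetries) {
				await new Promise((r) => setTimeout(r, backoffMs * 2 ** i));
			}
		}
	}
	throw lastError;
}

// A flaky operation that fails twice before succeeding:
let attempts = 0;
const result = await withRetry(async () => {
	attempts++;
	if (attempts < 3) throw new Error('transient network error');
	return 'ok';
});
console.log(result, attempts);
```
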

Graceful Degradation

  • Cache-based fallbacks for critical data
  • Offline mode for basic functionality
  • Error boundaries in all components
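A minimal sketch of the cache-based fallback idea, assuming an in-process map as the cache (a real implementation would use Redis; the helper name is hypothetical):

```typescript
// Hypothetical cache-based fallback: serve fresh data when the backend
// responds, fall back to the last known value when it does not.
const lastKnown = new Map<string, unknown>();

async function withFallback<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
	try {
		const fresh = await fetcher();
		lastKnown.set(key, fresh); // remember the last successful response
		return fresh;
	} catch (error) {
		if (lastKnown.has(key)) return lastKnown.get(key) as T; // stale but usable
		throw error; // nothing cached yet, surface the error
	}
}

// First call primes the cache; the second fails but is served stale:
const fresh = await withFallback('links', async () => ['a.example']);
const stale = await withFallback('links', async (): Promise<string[]> => {
	throw new Error('PocketBase down');
});
console.log(fresh, stale);
```
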

2. Enhanced Monitoring

Health Check Improvements

// extend src/routes/health/+server.ts
import { json, type RequestHandler } from '@sveltejs/kit';
import { building } from '$app/environment';

export const GET: RequestHandler = async () => {
	const services = {
		sveltekit: 'running',
		pocketbase: await checkPocketBaseDetailed(),
		database: await checkDatabaseHealth(),
	};

	const checks = {
		canCreateLink: await testLinkCreation(),
		canAuthenticate: await testAuthentication(),
		canServeStatic: await testStaticFiles(),
	};

	// Only the service states decide the overall status; memory and uptime are
	// informational and must not be part of the 'running' comparison.
	const overallStatus = Object.values(services).every((s) => s === 'running')
		? 'healthy'
		: 'degraded';

	const health = {
		status: overallStatus,
		timestamp: new Date().toISOString(),
		environment: building ? 'build' : 'runtime',
		services,
		checks,
		memory: process.memoryUsage(),
		uptime: process.uptime(),
	};

	return json(health, {
		status: overallStatus === 'healthy' ? 200 : 503,
	});
};

External Monitoring Setup

  • Uptime Robot/Pingdom for external monitoring
  • Slack/Discord webhooks for alerts
  • Grafana dashboard for metrics
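As an illustration, the alerting side can be sketched as a small check that posts to a Slack-compatible webhook when the health endpoint fails. The fetcher is injected so the logic can be exercised without network access; all URLs and names here are assumptions, not existing code.

```typescript
// Hedged sketch: probe the health endpoint, alert on failure.
type Fetcher = (
	url: string,
	init?: { method?: string; headers?: Record<string, string>; body?: string }
) => Promise<{ ok: boolean }>;

async function checkAndAlert(healthUrl: string, webhookUrl: string, fetcher: Fetcher): Promise<boolean> {
	try {
		const res = await fetcher(healthUrl);
		if (res.ok) return true; // healthy, nothing to do
	} catch {
		// a network error counts as "down"; fall through to the alert
	}
	await fetcher(webhookUrl, {
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify({ text: `uLoad health check failed: ${healthUrl}` }),
	});
	return false;
}

// Stubbed fetcher that reports the primary as down:
const calls: string[] = [];
const stub: Fetcher = async (url) => {
	calls.push(url);
	return { ok: url !== 'https://ulo.ad/health' };
};
const healthy = await checkAndAlert('https://ulo.ad/health', 'https://hooks.example/alert', stub);
console.log(healthy, calls);
```
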

3. Improved Rate Limiting

Redis Backend for Rate Limiting

// src/lib/server/redis-rate-limiter.ts
import Redis from 'ioredis';

class RedisRateLimiter {
	private redis: Redis;

	constructor() {
		this.redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
	}

	// Fixed-window limiter: INCR the counter and start the TTL only on the
	// first hit of a window; re-setting EXPIRE on every request would keep
	// extending the window and the counter would never reset.
	async checkLimit(key: string, limit: number, windowMs: number): Promise<boolean> {
		const count = await this.redis.incr(key);
		if (count === 1) {
			await this.redis.expire(key, Math.ceil(windowMs / 1000));
		}
		return count <= limit;
	}
}
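The fixed-window semantics can be exercised without a live Redis by abstracting INCR/EXPIRE behind a small interface (the names below are hypothetical; in production the store is Redis itself). The key detail: the TTL is set only when the counter is first created.

```typescript
// Sketch of the fixed-window logic with a pluggable counter store.
interface CounterStore {
	incr(key: string): Promise<number>;
	expire(key: string, seconds: number): Promise<void>;
}

async function checkLimit(store: CounterStore, key: string, limit: number, windowMs: number): Promise<boolean> {
	const count = await store.incr(key);
	if (count === 1) {
		// Start the TTL only when the window opens.
		await store.expire(key, Math.ceil(windowMs / 1000));
	}
	return count <= limit;
}

// In-memory stand-in (TTL handling omitted for brevity):
const counters = new Map<string, number>();
const memoryStore: CounterStore = {
	async incr(key) {
		const n = (counters.get(key) ?? 0) + 1;
		counters.set(key, n);
		return n;
	},
	async expire() {},
};

const decisions: boolean[] = [];
for (let i = 0; i < 5; i++) {
	decisions.push(await checkLimit(memoryStore, 'ip:203.0.113.7', 3, 60_000));
}
console.log(decisions);
```

With a limit of 3 per window, the first three requests pass and the rest are blocked until the window expires.
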

4. Database Backup Strategy

Automated PocketBase Backups

#!/bin/bash
# scripts/backup-pocketbase.sh
set -euo pipefail

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/app/backups"
PB_DATA="/app/pb_data"

mkdir -p "$BACKUP_DIR"

# Create backup
tar -czf "$BACKUP_DIR/pb_backup_$DATE.tar.gz" -C "$PB_DATA" .

# Keep only the last 7 days
find "$BACKUP_DIR" -name "pb_backup_*.tar.gz" -mtime +7 -delete

# Upload to S3/object storage (optional)
if [ -n "${S3_BUCKET:-}" ]; then
  aws s3 cp "$BACKUP_DIR/pb_backup_$DATE.tar.gz" "s3://$S3_BUCKET/backups/"
fi

Medium-Term Measures (1-4 Weeks)

1. Infrastructure Redundancy

Load Balancer Setup

# docker-compose.prod-ha.yml
version: '3.8'
services:
  nginx-lb:
    image: nginx:alpine
    ports:
      - '80:80'
      - '443:443'
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - app1
      - app2

  # NOTE: both app instances mount the same SQLite volume. PocketBase/SQLite
  # does not support multiple concurrent writers, so either only one instance
  # may handle writes, or PocketBase must run as a separate shared service
  # (see "Separated Services Architecture").
  app1:
    build: .
    environment:
      - INSTANCE_ID=app1
    volumes:
      - pb_data:/app/pb_data

  app2:
    build: .
    environment:
      - INSTANCE_ID=app2
    volumes:
      - pb_data:/app/pb_data

volumes:
  pb_data:

Multi-Region Deployment

  • Primary server: Hetzner, Germany
  • Backup server: AWS/DigitalOcean in a different region
  • DNS failover with a low TTL (60 seconds)
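A quick back-of-the-envelope check of what the low TTL buys: worst-case failover time is roughly the monitor's detection interval plus the DNS TTL (the 60-second check interval is an assumption, e.g. an external uptime check).

```typescript
// Rough failover-time estimate under the assumptions above.
const dnsTtlSeconds = 60;        // low TTL as proposed
const checkIntervalSeconds = 60; // assumed monitoring interval
const worstCaseFailoverSeconds = checkIntervalSeconds + dnsTtlSeconds;
console.log(worstCaseFailoverSeconds);
```

So clients should stop resolving a dead server within about two minutes, compared to hours with a default TTL of 3600 or more.
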

2. Separated Services Architecture

PocketBase as a Separate Service

# docker-compose.services.yml
services:
  pocketbase:
    image: spectado/pocketbase:latest
    volumes:
      - ./pb_data:/pb/pb_data
    ports:
      - '8090:8090'
    restart: unless-stopped
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8090/api/health']
      interval: 30s
      timeout: 10s
      retries: 3

  app:
    build: .
    environment:
      - POCKETBASE_URL=http://pocketbase:8090
    depends_on:
      pocketbase:
        condition: service_healthy
    restart: unless-stopped

3. Caching Layer

Redis for Caching & Sessions

// src/lib/cache/redis-cache.ts
import Redis from 'ioredis';
// (adjust these import paths to the project layout)
import { pb } from '$lib/pocketbase';
import type { Link } from '$lib/types';

export class CacheManager {
	private redis: Redis;

	constructor() {
		this.redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
	}

	async get<T>(key: string): Promise<T | null> {
		const cached = await this.redis.get(key);
		return cached ? JSON.parse(cached) : null;
	}

	async set(key: string, value: unknown, ttlSeconds: number = 3600): Promise<void> {
		await this.redis.setex(key, ttlSeconds, JSON.stringify(value));
	}

	// Cache frequently requested data
	async getCachedLinks(userId: string): Promise<Link[]> {
		const cacheKey = `user:${userId}:links`;
		let links = await this.get<Link[]>(cacheKey);

		if (!links) {
			links = await pb.collection('links').getFullList({ filter: `user_id="${userId}"` });
			await this.set(cacheKey, links, 300); // cache for 5 minutes
		}

		return links;
	}
}

Long-Term Measures (1-3 Months)

1. Database Migration Strategy

PostgreSQL as Primary Database

// Alternative to PocketBase for better scalability
// src/lib/database/postgresql.ts
import { drizzle } from 'drizzle-orm/postgres-js';
import postgres from 'postgres';

const connectionString = process.env.DATABASE_URL!;
const client = postgres(connectionString);
export const db = drizzle(client);

// Migrating to PostgreSQL brings:
// - Better performance under high load
// - Replication support
// - Mature backup/recovery tooling
// - Connection pooling

Database Cluster Setup

  • Primary-replica replication
  • Read replicas for analytics
  • Automated failover
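To make the read-replica idea concrete, here is a hypothetical query router that sends writes to the primary and round-robins reads across replicas. The Db interface is a stand-in for real connection pools, not existing project code.

```typescript
// Hypothetical read/write split across a primary and two replicas.
interface Db {
	name: string;
	query(sql: string): Promise<string>;
}

function makeDb(name: string): Db {
	return { name, async query(sql) { return `${name}:${sql}`; } };
}

const primary = makeDb('primary');
const replicas = [makeDb('replica-1'), makeDb('replica-2')];
let rr = 0;

function routeQuery(sql: string): Db {
	if (!/^\s*select/i.test(sql)) return primary; // writes always hit the primary
	return replicas[rr++ % replicas.length];      // reads round-robin
}

const readTarget = routeQuery('SELECT * FROM links').name;
const writeTarget = routeQuery('INSERT INTO links VALUES (1)').name;
console.log(readTarget, writeTarget);
```

One design caveat: replica reads are eventually consistent, so read-your-own-writes paths (e.g. right after creating a link) should be pinned to the primary.
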

2. CDN Integration

Cloudflare Setup

  • DNS-level protection against DDoS
  • Edge caching for static assets (with any required adjustments in src/app.html)
  • SSL/TLS termination
  • Rate limiting at the edge level

3. Microservices Architecture

Service Separation

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   API Gateway   │    │   Auth Service  │
│   (SvelteKit)   │    │   (Kong/Nginx)  │    │   (Custom)      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Link Service  │    │  Analytics Svc  │    │  Redirect Svc   │
│   (Create/CRUD) │    │  (Tracking)     │    │  (Core Feature) │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Deployment & Operations

1. CI/CD Pipeline Improvements

GitHub Actions Workflow

# .github/workflows/deploy.yml
name: Deploy to Production
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Tests
        run: |
          npm ci
          npm run test
          npm run test:e2e

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Primary
        run: |
          # Coolify Deployment
          curl -X POST ${{ secrets.COOLIFY_WEBHOOK }}

      - name: Health Check
        run: |
          # Wait for the deployment to become healthy
          for i in {1..30}; do
            if curl -f https://ulo.ad/health; then
              echo "Deployment successful"
              exit 0
            fi
            sleep 10
          done
          exit 1

      - name: Rollback on Failure
        if: failure()
        run: |
          # Automatic rollback on failure
          curl -X POST ${{ secrets.COOLIFY_ROLLBACK_WEBHOOK }}

2. Monitoring & Alerting

Prometheus + Grafana Setup

# monitoring/docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'

  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - '3001:3000'
    volumes:
      - grafana-storage:/var/lib/grafana

  alertmanager:
    image: prom/alertmanager
    ports:
      - '9093:9093'
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  grafana-storage:

Custom Metrics

// src/lib/metrics/prometheus.ts
import { register, collectDefaultMetrics, Counter, Histogram } from 'prom-client';

collectDefaultMetrics();

export const httpRequestsTotal = new Counter({
	name: 'http_requests_total',
	help: 'Total number of HTTP requests',
	labelNames: ['method', 'route', 'status_code'],
});

export const httpRequestDuration = new Histogram({
	name: 'http_request_duration_seconds',
	help: 'Duration of HTTP requests in seconds',
	labelNames: ['method', 'route'],
});

export const linkRedirects = new Counter({
	name: 'link_redirects_total',
	help: 'Total number of link redirects',
	// NOTE: one label value per short code can explode cardinality; consider
	// labelling only `success` and tracking hot links elsewhere.
	labelNames: ['short_code', 'success'],
});

3. Disaster Recovery Plan

Automated Recovery Scripts

#!/bin/bash
# scripts/disaster-recovery.sh

# 1. Check service status
check_services() {
  if ! curl -f https://ulo.ad/health; then
    echo "Primary service down, starting recovery..."
    return 1
  fi
}

# 2. Switch to backup server
activate_backup() {
  # Update DNS to point to backup server
  curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$DNS_RECORD_ID" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data '{"content":"'"$BACKUP_SERVER_IP"'"}'
}

# 3. Restore from backup
restore_from_backup() {
  # Download latest backup
  aws s3 cp s3://$BACKUP_BUCKET/latest.tar.gz /tmp/restore.tar.gz

  # Extract and restore
  tar -xzf /tmp/restore.tar.gz -C /app/pb_data/

  # Restart services
  docker-compose restart
}

# Main recovery flow
if ! check_services; then
  activate_backup
  restore_from_backup

  # Send alert
  curl -X POST "$SLACK_WEBHOOK" -H 'Content-Type: application/json' \
    -d '{"text":"Disaster recovery activated for ulo.ad"}'
fi

Testing & Validation

1. Chaos Engineering

Fault Injection Tests

// tests/chaos/network-failures.test.ts
describe('Network Failure Scenarios', () => {
	test('should handle PocketBase timeout gracefully', async () => {
		// Simulate PocketBase timing out
		mockPocketBaseTimeout();

		const response = await app.request('/api/links', {
			method: 'POST',
			body: JSON.stringify({ url: 'https://example.com' }),
		});

		// Should return cached response or graceful error
		expect(response.status).toBeLessThan(500);
	});

	test('should fallback to cached data when database is unavailable', async () => {
		// Simulate database outage
		mockDatabaseDown();

		const response = await app.request('/my/links');

		// Should serve from cache
		expect(response.status).toBe(200);
		expect(response.headers.get('x-served-from')).toBe('cache');
	});
});

2. Load Testing

Performance Benchmarks

#!/bin/bash
# scripts/load-test.sh

# Test link creation under load
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
   -p link-payload.json -T application/json \
   https://ulo.ad/api/links

# Test redirect performance
ab -n 10000 -c 50 https://ulo.ad/test-link

# Test concurrent user scenarios
k6 run performance-tests/user-journey.js

Implementation Roadmap

Phase 1 (Immediate - 1 Week)

  • Analyze the current architecture
  • Implement improved error handling
  • Extend health check endpoints
  • Set up monitoring (Uptime Robot)
  • Create backup scripts

Phase 2 (2-4 Weeks)

  • Redis for rate limiting & caching
  • Load balancer setup
  • Service separation (PocketBase)
  • CI/CD pipeline with health checks
  • Disaster recovery scripts

Phase 3 (1-3 Months)

  • Evaluate PostgreSQL migration
  • CDN integration (Cloudflare)
  • Microservices architecture
  • Chaos engineering tests
  • Multi-region deployment

Cost-Benefit Analysis

Additional Infrastructure Costs

  • Redis server: €5-10/month
  • Backup server: €5-15/month
  • Monitoring tools: €0-20/month (Uptime Robot free tier, Grafana Cloud)
  • CDN: €0-50/month (Cloudflare free tier)

Total additional cost: €10-95/month

Benefits

  • 99.9% uptime (vs. currently ~95%)
  • Automatic recovery from outages
  • Better performance through caching
  • Proactive monitoring before problems escalate
  • Data integrity through backups

Metrics & KPIs

Availability

  • Target: 99.9% uptime
  • MTTR (Mean Time To Recovery): < 5 minutes
  • MTBF (Mean Time Between Failures): > 30 days
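For context, the availability targets translate into concrete downtime budgets (a 30-day month is assumed):

```typescript
// Downtime budget implied by the availability figures above.
const minutesPerMonth = 30 * 24 * 60; // 43,200 minutes
const budgetAt999 = minutesPerMonth * (1 - 0.999); // ~43 minutes/month (target)
const budgetAt95 = minutesPerMonth * (1 - 0.95);   // ~36 hours/month (status quo)
console.log(budgetAt999.toFixed(1), budgetAt95.toFixed(0));
```

In other words, moving from ~95% to 99.9% shrinks acceptable monthly downtime from roughly 36 hours to about 43 minutes, which is why a sub-5-minute MTTR matters.
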

Performance

  • Response Time: < 200ms (95th percentile)
  • Redirect Time: < 50ms
  • Error Rate: < 0.1%

Monitoring

  • Alert Response Time: < 2 minutes
  • Backup Success Rate: 100%
  • Health Check Success: > 99.5%

Conclusion

The current single-point-of-failure setup poses significant risks to uLoad's availability. With the measures proposed here, the infrastructure can be made considerably more robust and fault-tolerant.

Recommended priorities:

  1. Immediate: improved error handling and monitoring
  2. Short term: service separation and backup strategy
  3. Medium term: load balancing and caching
  4. Long term: microservices and multi-region deployment

The plan takes a staged approach that increases resilience step by step without blocking ongoing development.