# Downtime Prevention Plan for uLoad

## Problem Analysis

The uLoad project recently experienced a complete outage, which caused critical problems:

- Main application unreachable
- Redirects not working
- User experience severely degraded
- Potential data loss/inconsistency
## Current Architecture Analysis

### Technology Stack

- Frontend: SvelteKit 2.22 with Svelte 5.0
- Backend: PocketBase (https://pb.ulo.ad)
- Hosting: Hetzner VPS with Coolify
- Database: PocketBase SQLite with a persistent volume
- Deployment: Docker with Supervisor (multi-service container)
### Critical Single Points of Failure

1. **PocketBase dependency**
   - Entire application depends on PocketBase availability
   - No fallback mechanisms implemented
   - Timeout configuration too aggressive (5 seconds)
2. **Single server setup**
   - One Hetzner VPS for the entire infrastructure
   - No redundancy or load balancing
   - Coolify as a single point of failure
3. **Container architecture**
   - SvelteKit and PocketBase in one container
   - Supervisor as process manager
   - No health checks between services
4. **Rate limiting**
   - In-memory store (loses data on restart)
   - No Redis backend for persistence
   - Potential blocking of legitimate traffic
## Immediate Measures (Quick Wins)

### 1. Improved Error Handling & Fallbacks

#### PocketBase Connection Resilience
```ts
// src/lib/pocketbase-resilient.ts
class ResilientPocketBase {
	private maxRetries = 3;
	private backoffMs = 1000;

	// Retry an operation with exponential backoff before giving up.
	async withRetry<T>(operation: () => Promise<T>): Promise<T> {
		for (let i = 0; i <= this.maxRetries; i++) {
			try {
				return await operation();
			} catch (error) {
				if (i === this.maxRetries) throw error;
				await this.delay(this.backoffMs * Math.pow(2, i));
			}
		}
		throw new Error('Max retries exceeded');
	}

	private delay(ms: number): Promise<void> {
		return new Promise((resolve) => setTimeout(resolve, ms));
	}
}
```
#### Graceful Degradation

- Cache-based fallbacks for critical data (sketched below)
- Offline mode for basic functionality
- Error boundaries in all components
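
A minimal sketch of the cache-based fallback idea, complementing the retry wrapper above. The `staleCache` map and the `withFallback` helper are illustrative names, not existing project code:

```ts
// src/lib/fallback.ts (illustrative sketch)
// Serve the last known-good value when the live fetch keeps failing.
const staleCache = new Map<string, unknown>();

export async function withFallback<T>(key: string, load: () => Promise<T>): Promise<T> {
	try {
		const fresh = await load();
		staleCache.set(key, fresh); // remember the last good value
		return fresh;
	} catch (error) {
		if (staleCache.has(key)) {
			// Degrade gracefully instead of failing the whole request.
			return staleCache.get(key) as T;
		}
		throw error; // nothing cached yet, let an error boundary handle it
	}
}
```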
### 2. Enhanced Monitoring

#### Improved Health Checks
```ts
// src/routes/health/+server.ts (extended)
import { json } from '@sveltejs/kit';
import { building } from '$app/environment';
import type { RequestHandler } from './$types';

// checkPocketBaseDetailed, checkDatabaseHealth, testLinkCreation, etc. are project helpers (not shown).
export const GET: RequestHandler = async () => {
	const health = {
		status: 'healthy',
		timestamp: new Date().toISOString(),
		environment: building ? 'build' : 'runtime',
		services: {
			sveltekit: 'running',
			pocketbase: await checkPocketBaseDetailed(),
			database: await checkDatabaseHealth(),
		},
		process: {
			memory: process.memoryUsage(),
			uptime: process.uptime(),
		},
		checks: {
			canCreateLink: await testLinkCreation(),
			canAuthenticate: await testAuthentication(),
			canServeStatic: await testStaticFiles(),
		},
	};

	// Only the service statuses decide healthy vs. degraded; process stats are informational.
	const overallStatus = Object.values(health.services).every((s) => s === 'running')
		? 'healthy'
		: 'degraded';

	return json({ ...health, status: overallStatus }, {
		status: overallStatus === 'healthy' ? 200 : 503,
	});
};
```
#### External Monitoring Setup

- Uptime Robot/Pingdom for external monitoring
- Slack/Discord webhooks for alerts (a webhook sketch follows this list)
- Grafana dashboard for metrics
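
A minimal sketch of the alert side, assuming a `SLACK_WEBHOOK_URL` environment variable (hypothetical name) and the standard incoming-webhook JSON payload:

```ts
// src/lib/monitoring/alert.ts (illustrative sketch)
// Post a plain-text alert to a Slack/Discord-compatible incoming webhook.
export async function sendAlert(message: string): Promise<void> {
	const url = process.env.SLACK_WEBHOOK_URL; // hypothetical env var
	if (!url) return; // alerting stays optional in local/dev environments

	await fetch(url, {
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify({ text: message }),
	});
}
```

The health check endpoint above can call `sendAlert()` whenever its status flips to `degraded`.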
### 3. Improved Rate Limiting

#### Redis Backend for Rate Limiting
```ts
// src/lib/server/redis-rate-limiter.ts
import Redis from 'ioredis';

class RedisRateLimiter {
	private redis: Redis;

	constructor() {
		this.redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
	}

	// Fixed-window counter: the key's TTL is set once per window instead of on every hit,
	// so a steady stream of requests cannot keep the window alive forever.
	async checkLimit(key: string, limit: number, windowMs: number): Promise<boolean> {
		const count = await this.redis.incr(key);
		if (count === 1) {
			await this.redis.expire(key, Math.ceil(windowMs / 1000));
		}
		return count <= limit;
	}
}
```
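
A usage sketch for the limiter in the SvelteKit server hook; it assumes the class is exported, and the per-IP key scheme and limits are illustrative:

```ts
// src/hooks.server.ts (illustrative sketch)
import type { Handle } from '@sveltejs/kit';
import { RedisRateLimiter } from '$lib/server/redis-rate-limiter'; // assumes the class is exported

const limiter = new RedisRateLimiter();

export const handle: Handle = async ({ event, resolve }) => {
	// Illustrative limit: 60 requests per minute per client IP for API routes.
	if (event.url.pathname.startsWith('/api/')) {
		const key = `rl:${event.getClientAddress()}`;
		const allowed = await limiter.checkLimit(key, 60, 60_000);
		if (!allowed) {
			return new Response('Too Many Requests', { status: 429 });
		}
	}
	return resolve(event);
};
```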
### 4. Database Backup Strategy

#### Automated PocketBase Backups
```bash
#!/bin/bash
# scripts/backup-pocketbase.sh

DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/app/backups"
PB_DATA="/app/pb_data"

mkdir -p "$BACKUP_DIR"

# Create backup
tar -czf "$BACKUP_DIR/pb_backup_$DATE.tar.gz" -C "$PB_DATA" .

# Keep only last 7 days
find "$BACKUP_DIR" -name "pb_backup_*.tar.gz" -mtime +7 -delete

# Upload to S3/Object Storage (optional)
if [ -n "$S3_BUCKET" ]; then
  aws s3 cp "$BACKUP_DIR/pb_backup_$DATE.tar.gz" "s3://$S3_BUCKET/backups/"
fi
```
## Medium-Term Measures (1-4 Weeks)

### 1. Infrastructure Redundancy

#### Load Balancer Setup
```yaml
# docker-compose.prod-ha.yml
version: '3.8'
services:
  nginx-lb:
    image: nginx:alpine
    ports:
      - '80:80'
      - '443:443'
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - app1
      - app2
  app1:
    build: .
    environment:
      - INSTANCE_ID=app1
    volumes:
      - pb_data:/app/pb_data
  app2:
    build: .
    environment:
      - INSTANCE_ID=app2
    volumes:
      - pb_data:/app/pb_data
volumes:
  pb_data:
```
#### Multi-Region Deployment

- Primary server: Hetzner, Germany
- Backup server: AWS/DigitalOcean in another region
- DNS failover with a low TTL (60 seconds)
### 2. Separated Services Architecture

#### PocketBase as a Separate Service
```yaml
# docker-compose.services.yml
services:
  pocketbase:
    image: spectado/pocketbase:latest
    volumes:
      - ./pb_data:/pb/pb_data
    ports:
      - '8090:8090'
    restart: unless-stopped
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8090/api/health']
      interval: 30s
      timeout: 10s
      retries: 3
  app:
    build: .
    environment:
      - POCKETBASE_URL=http://pocketbase:8090
    depends_on:
      pocketbase:
        condition: service_healthy
    restart: unless-stopped
```
### 3. Caching Layer

#### Redis for Caching & Sessions
```ts
// src/lib/cache/redis-cache.ts
import Redis from 'ioredis';
import { pb } from '$lib/pocketbase'; // project PocketBase client (import path assumed)
import type { Link } from '$lib/types'; // project Link type (import path assumed)

export class CacheManager {
	private redis: Redis;

	constructor() {
		this.redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');
	}

	async get<T>(key: string): Promise<T | null> {
		const cached = await this.redis.get(key);
		return cached ? JSON.parse(cached) : null;
	}

	async set(key: string, value: any, ttlSeconds: number = 3600): Promise<void> {
		await this.redis.setex(key, ttlSeconds, JSON.stringify(value));
	}

	// Cache frequently requested data
	async getCachedLinks(userId: string): Promise<Link[]> {
		const cacheKey = `user:${userId}:links`;
		let links = await this.get<Link[]>(cacheKey);
		if (!links) {
			links = await pb.collection('links').getFullList({ filter: `user_id="${userId}"` });
			await this.set(cacheKey, links, 300); // 5-minute cache
		}
		return links;
	}
}
```
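
Cached lists should also be invalidated on writes, so the next `getCachedLinks()` call rebuilds them from PocketBase instead of waiting out the TTL. A minimal sketch, using a separate Redis connection and the same key scheme as above:

```ts
// src/lib/cache/invalidate.ts (illustrative sketch)
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');

// Call this after creating, updating, or deleting a link.
export async function invalidateUserLinks(userId: string): Promise<void> {
	await redis.del(`user:${userId}:links`);
}
```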
## Long-Term Measures (1-3 Months)

### 1. Database Migration Strategy

#### PostgreSQL as the Primary Database
```ts
// Alternative to PocketBase for better scalability
// src/lib/database/postgresql.ts
import { drizzle } from 'drizzle-orm/postgres-js';
import postgres from 'postgres';

const connectionString = process.env.DATABASE_URL!;
const client = postgres(connectionString);
export const db = drizzle(client);

// Migrating to PostgreSQL provides:
// - Better performance under high load
// - Replication support
// - Backup/recovery tools
// - Connection pooling
```
#### Database Cluster Setup

- Master-slave replication
- Read replicas for analytics (a read/write split sketch follows this list)
- Automated failover
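
A rough sketch of how a read/write split could sit on top of the drizzle setup above; the `DATABASE_REPLICA_URL` variable is a hypothetical name, and routing is done manually rather than through any specific library feature:

```ts
// src/lib/database/replicas.ts (illustrative sketch)
import { drizzle } from 'drizzle-orm/postgres-js';
import postgres from 'postgres';

// Writes always go to the primary; reads (analytics, link listings) can hit a replica.
const primary = drizzle(postgres(process.env.DATABASE_URL!));
const replica = drizzle(postgres(process.env.DATABASE_REPLICA_URL ?? process.env.DATABASE_URL!));

export const writeDb = primary;
export const readDb = replica;
```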
### 2. CDN Integration

#### Cloudflare Setup

- Extend `src/app.html` as needed
- DNS-level protection against DDoS
- Edge caching for static assets
- SSL/TLS termination
- Rate limiting at the edge
### 3. Microservices Architecture

#### Service Separation

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Frontend     │     │   API Gateway   │     │  Auth Service   │
│   (SvelteKit)   │     │  (Kong/Nginx)   │     │    (Custom)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Link Service   │     │  Analytics Svc  │     │  Redirect Svc   │
│  (Create/CRUD)  │     │   (Tracking)    │     │ (Core Feature)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
## Deployment & Operations

### 1. CI/CD Pipeline Improvements

#### GitHub Actions Workflow
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Tests
        run: |
          npm ci
          npm run test
          npm run test:e2e

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Primary
        run: |
          # Coolify deployment
          curl -X POST ${{ secrets.COOLIFY_WEBHOOK }}
      - name: Health Check
        run: |
          # Wait for the deployment to come up healthy
          for i in {1..30}; do
            if curl -f https://ulo.ad/health; then
              echo "Deployment successful"
              exit 0
            fi
            sleep 10
          done
          exit 1
      - name: Rollback on Failure
        if: failure()
        run: |
          # Automatic rollback on failure
          curl -X POST ${{ secrets.COOLIFY_ROLLBACK_WEBHOOK }}
```
### 2. Monitoring & Alerting

#### Prometheus + Grafana Setup
```yaml
# monitoring/docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - '3001:3000'
    volumes:
      - grafana-storage:/var/lib/grafana
  alertmanager:
    image: prom/alertmanager
    ports:
      - '9093:9093'
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
volumes:
  grafana-storage:
```
#### Custom Metrics
```ts
// src/lib/metrics/prometheus.ts
import { register, collectDefaultMetrics, Counter, Histogram } from 'prom-client';

collectDefaultMetrics();

export const httpRequestsTotal = new Counter({
	name: 'http_requests_total',
	help: 'Total number of HTTP requests',
	labelNames: ['method', 'route', 'status_code'],
});

export const httpRequestDuration = new Histogram({
	name: 'http_request_duration_seconds',
	help: 'Duration of HTTP requests in seconds',
	labelNames: ['method', 'route'],
});

export const linkRedirects = new Counter({
	name: 'link_redirects_total',
	help: 'Total number of link redirects',
	labelNames: ['short_code', 'success'],
});
```
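
A minimal sketch of wiring these metrics into request handling; the route labelling via `event.route.id` is illustrative:

```ts
// src/hooks.server.ts (illustrative sketch)
import type { Handle } from '@sveltejs/kit';
import { httpRequestsTotal, httpRequestDuration } from '$lib/metrics/prometheus';

export const handle: Handle = async ({ event, resolve }) => {
	const route = event.route.id ?? 'unknown';
	const end = httpRequestDuration.startTimer({ method: event.request.method, route });

	const response = await resolve(event);

	// Count the request and record its duration once the response is ready.
	httpRequestsTotal.inc({
		method: event.request.method,
		route,
		status_code: String(response.status),
	});
	end();

	return response;
};
```

In practice this would be combined with the rate-limiting hook shown earlier, e.g. via `sequence()` from `@sveltejs/kit/hooks`, and a `/metrics` endpoint can return the output of `register.metrics()` for Prometheus to scrape.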
### 3. Disaster Recovery Plan

#### Automated Recovery Scripts
```bash
#!/bin/bash
# scripts/disaster-recovery.sh

# 1. Check service status
check_services() {
  if ! curl -f https://ulo.ad/health; then
    echo "Primary service down, starting recovery..."
    return 1
  fi
}

# 2. Switch to backup server
activate_backup() {
  # Update DNS to point to backup server
  curl -X PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$DNS_RECORD_ID" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data '{"content":"'$BACKUP_SERVER_IP'"}'
}

# 3. Restore from backup
restore_from_backup() {
  # Download latest backup
  aws s3 cp s3://$BACKUP_BUCKET/latest.tar.gz /tmp/restore.tar.gz
  # Extract and restore
  tar -xzf /tmp/restore.tar.gz -C /app/pb_data/
  # Restart services
  docker-compose restart
}

# Main recovery flow
if ! check_services; then
  activate_backup
  restore_from_backup
  # Send alert
  curl -X POST $SLACK_WEBHOOK -d '{"text":"Disaster recovery activated for ulo.ad"}'
fi
```
## Testing & Validation

### 1. Chaos Engineering

#### Fault Injection Tests
```ts
// tests/chaos/network-failures.test.ts
describe('Network Failure Scenarios', () => {
	test('should handle PocketBase timeout gracefully', async () => {
		// Simulate PocketBase timeout
		const mockPb = mockPocketBaseTimeout();
		const response = await app.request('/api/links', {
			method: 'POST',
			body: JSON.stringify({ url: 'https://example.com' }),
		});
		// Should return cached response or graceful error
		expect(response.status).toBeLessThan(500);
	});

	test('should fallback to cached data when database is unavailable', async () => {
		// Simulate database outage
		mockDatabaseDown();
		const response = await app.request('/my/links');
		// Should serve from cache
		expect(response.status).toBe(200);
		expect(response.headers.get('x-served-from')).toBe('cache');
	});
});
```
### 2. Load Testing

#### Performance Benchmarks
```bash
#!/bin/bash
# scripts/load-test.sh

# Test link creation under load
ab -n 1000 -c 10 -H "Authorization: Bearer $TOKEN" \
  -p link-payload.json -T application/json \
  https://ulo.ad/api/links

# Test redirect performance
ab -n 10000 -c 50 https://ulo.ad/test-link

# Test concurrent user scenarios
k6 run performance-tests/user-journey.js
```
## Implementation Roadmap

### Phase 1 (Immediate: 1 Week)

- Analyze the current architecture
- Implement improved error handling
- Extend the health check endpoints
- Set up monitoring (Uptime Robot)
- Create backup scripts
### Phase 2 (2-4 Weeks)

- Redis for rate limiting & caching
- Load balancer setup
- Service separation (PocketBase)
- CI/CD pipeline with health checks
- Disaster recovery scripts
### Phase 3 (1-3 Months)

- Evaluate a PostgreSQL migration
- CDN integration (Cloudflare)
- Microservices architecture
- Chaos engineering tests
- Multi-region deployment
## Cost-Benefit Analysis

### Additional Infrastructure Costs

- Redis server: €5-10/month
- Backup server: €5-15/month
- Monitoring tools: €0-20/month (Uptime Robot free tier, Grafana Cloud)
- CDN: €0-50/month (Cloudflare free tier)

Total additional cost: €10-95/month
### Benefits

- 99.9% uptime (vs. roughly 95% today)
- Automatic recovery from outages
- Better performance through caching
- Proactive monitoring before problems escalate
- Data integrity through backups
## Metrics & KPIs

### Availability

- Target: 99.9% uptime
- MTTR (Mean Time To Recovery): < 5 minutes
- MTBF (Mean Time Between Failures): > 30 days

### Performance

- Response time: < 200 ms (95th percentile)
- Redirect time: < 50 ms
- Error rate: < 0.1%

### Monitoring

- Alert response time: < 2 minutes
- Backup success rate: 100%
- Health check success rate: > 99.5%
## Conclusion

The current single-point-of-failure setup poses a significant risk to uLoad's availability. With the measures proposed here, the infrastructure can be made considerably more robust and fault-tolerant.

Recommended priorities:

- Immediately: improved error handling and monitoring
- Short term: service separation and a backup strategy
- Medium term: load balancing and caching
- Long term: microservices and multi-region deployment

The plan offers a staged approach to increasing resilience step by step without blocking ongoing development.