mirror of https://github.com/Memo-2023/mana-monorepo.git synced 2026-05-14 21:41:09 +02:00

CI/CD Implementation Plan

Last Updated: 2025-11-27
Status: Design Complete → Implementation Pending
Estimated Timeline: 5-7 days (2-person team)

📋 Plan Overview

This document outlines the complete plan for implementing CI/CD infrastructure for the manacore-monorepo, from initial setup to production deployment.

🎯 Goals & Success Criteria

Primary Goals

Automate deployments - Every merge to main deploys to staging automatically; production follows after manual approval
Zero-downtime updates - Blue-green deployment strategy
Enforce quality - Automated testing with 80% coverage
Cost efficiency - At least 81% savings vs traditional PaaS ($56/month vs $300+)
Team productivity - Reduce deployment time from 2+ hours to < 10 minutes

Success Criteria

✅ Staging auto-deploys on merge to main
✅ Production deploys take < 10 minutes
✅ Rollback can be executed in < 5 minutes
✅ Test coverage enforced at 80% minimum
✅ All 35 listed services deployed and healthy
✅ Monitoring and alerting operational
✅ Team can confidently deploy without assistance

🏗️ Architecture Overview

Infrastructure Stack

Platform: Docker Compose orchestration
Hosting: Hetzner Cloud VPS (German data centers)
Container Runtime: Docker + Docker Compose
CI/CD: GitHub Actions
Monitoring: Prometheus + Grafana + Loki
Error Tracking: Sentry
CDN: Cloudflare

Service Inventory (35 Services Total)

Authentication:

mana-core-auth (NestJS) - Central authentication service

Chat Project (4 services):

chat-backend (NestJS)
chat-web (SvelteKit)
chat-mobile (Expo - OTA updates)
chat-landing (Astro)

Maerchenzauber Project (4 services):

maerchenzauber-backend (NestJS)
maerchenzauber-web (SvelteKit)
maerchenzauber-mobile (Expo)
maerchenzauber-landing (Astro)

Manadeck Project (4 services):

manadeck-backend (NestJS)
manadeck-web (SvelteKit)
manadeck-mobile (Expo)
manadeck-landing (Astro)

Memoro Project (3 services):

memoro-web (SvelteKit)
memoro-mobile (Expo)
memoro-landing (Astro)

Picture Project (3 services):

picture-web (SvelteKit)
picture-mobile (Expo)
picture-landing (Astro)

Wisekeep Project (4 services):

wisekeep-backend (NestJS)
wisekeep-web (SvelteKit)
wisekeep-mobile (Expo)
wisekeep-landing (Astro)

Quote Project (4 services):

quote-backend (NestJS)
quote-web (SvelteKit)
quote-mobile (Expo)
quote-landing (Astro)

Nutriphi Project (2 services):

nutriphi-backend (NestJS)
nutriphi-web (SvelteKit)

Uload Project (1 service):

uload-web (SvelteKit)

Bauntown Project (1 service):

bauntown-landing (Astro)

Manacore Project (2 services):

manacore-web (SvelteKit)
manacore-mobile (Expo)

Shared Infrastructure (2 services):

postgres (PostgreSQL 16)
redis (Redis 7)
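
Tallying the entries above by technology is a quick way to cross-check the headline total and the per-technology Dockerfile counts used in the timeline. The following is a minimal sketch: the service names are transcribed by hand from the list, and the grouping keys are illustrative labels rather than names used by the project.

```python
# Service names transcribed by hand from the inventory above.
# The grouping keys are illustrative labels, not project identifiers.
INVENTORY = {
    "NestJS backend": [
        "mana-core-auth", "chat-backend", "maerchenzauber-backend",
        "manadeck-backend", "wisekeep-backend", "quote-backend",
        "nutriphi-backend",
    ],
    "SvelteKit web": [
        "chat-web", "maerchenzauber-web", "manadeck-web", "memoro-web",
        "picture-web", "wisekeep-web", "quote-web", "nutriphi-web",
        "uload-web", "manacore-web",
    ],
    "Expo mobile": [
        "chat-mobile", "maerchenzauber-mobile", "manadeck-mobile",
        "memoro-mobile", "picture-mobile", "wisekeep-mobile",
        "quote-mobile", "manacore-mobile",
    ],
    "Astro landing": [
        "chat-landing", "maerchenzauber-landing", "manadeck-landing",
        "memoro-landing", "picture-landing", "wisekeep-landing",
        "quote-landing", "bauntown-landing",
    ],
    "Shared infrastructure": ["postgres", "redis"],
}


def count_by_type(inventory):
    """Return {group: number of services} for each group."""
    return {group: len(names) for group, names in inventory.items()}


counts = count_by_type(INVENTORY)
total = sum(counts.values())

for group, n in counts.items():
    print(f"{group}: {n}")
print(f"Total: {total}")
```

With the list as transcribed here, the tally is 7 + 10 + 8 + 8 + 2 = 35 entries.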

📅 Implementation Timeline

Week 1: Foundation (Days 1-2)

Goal: Infrastructure setup and first deployment

Day 1 Morning (2-3 hours):

Set up Hetzner account
Provision staging server (CCX32)
Install Docker & Docker Compose
Configure GitHub Container Registry

Day 1 Afternoon (3-4 hours):

Configure GitHub secrets (staging)
Create first Dockerfile (mana-core-auth)
Test CI/CD pipeline with test PR
Deploy mana-core-auth to staging

Day 2 (6-8 hours):

Create Dockerfiles for remaining backends (6 services)
Deploy all backends to staging
Verify health checks
Test inter-service communication

Week 1: Web Apps (Days 3-4)

Goal: Deploy web apps and landing pages

Day 3 (6-8 hours):

Create SvelteKit Dockerfiles (10 services)
Test builds locally
Deploy to staging
Configure reverse proxy/domains

Day 4 (6-8 hours):

Create Astro Dockerfiles (8 services)
Deploy landing pages
Set up SSL/TLS (Let's Encrypt)
Test all web apps end-to-end

Week 2: Testing & Production (Days 5-7)

Goal: Implement testing and deploy to production

Day 5 (6-8 hours):

Write critical path tests (auth, payments) - 100% coverage
Configure test frameworks
Enable coverage enforcement in CI
Fix any failing tests

Day 6 (6-8 hours):

Provision production server
Configure production secrets
Set up GitHub environments (approval gates)
Deploy mana-core-auth to production

Day 7 (6-8 hours):

Deploy all services to production
Configure DNS for all domains
Set up monitoring (Prometheus + Grafana)
Verify everything works in production

Week 2-3: Monitoring & Optimization (Days 8-10+)

Goal: Set up monitoring and optimize

Day 8 (4-6 hours):

Install Loki for logging
Configure Grafana dashboards
Set up alerting (Prometheus Alertmanager)
Integrate Sentry for error tracking

Day 9 (4-6 hours):

Set up automated backups
Test backup restoration
Perform disaster recovery drill
Document procedures

Day 10+ (ongoing):

Write remaining tests (80% coverage target)
Performance optimization (caching, CDN)
Team training
Documentation updates

🔄 Development Workflow

Developer Workflow

1. Create feature branch
   ↓
2. Write code + tests
   ↓
3. Push to GitHub
   ↓
4. GitHub Actions runs:
   - Lint
   - Type check
   - Build
   - Tests (with coverage)
   ↓
5. PR approved + merged to main
   ↓
6. GitHub Actions builds Docker images
   ↓
7. Images pushed to ghcr.io
   ↓
8. Auto-deploy to staging
   ↓
9. (Optional) Manual deploy to production

Deployment Workflow

Staging (Automatic):
  Merge to main → Build → Push → Deploy → Health Check → Done

Production (Manual Approval):
  Manual trigger → Approval gate → Backup → Deploy → Health Check →
  Monitor 5 min → Done (or Rollback)
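
The production flow above can be read as a short sequence of gates, each of which either advances or triggers a rollback. The sketch below models that ordering only. Every callable passed in (backup, deploy, health_check, monitor, rollback) is a hypothetical stand-in for whatever script or workflow step actually performs the work; none of them is taken from the repository.

```python
from typing import Callable, List


def run_production_deploy(
    approved: bool,
    backup: Callable[[], None],
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    monitor: Callable[[], bool],
    rollback: Callable[[], None],
) -> List[str]:
    """Run the production stages in order and return the stages reached."""
    log = ["trigger"]
    if not approved:
        # The approval gate stops everything before any change is made.
        log.append("rejected")
        return log
    log.append("approved")

    backup()
    log.append("backup")

    deploy()
    log.append("deploy")

    if not health_check():
        rollback()
        log.append("rollback")
        return log
    log.append("healthy")

    # Stand-in for the 5-minute observation window after the health check.
    if not monitor():
        rollback()
        log.append("rollback")
        return log

    log.append("done")
    return log
```

Passing the steps in as callables keeps the ordering logic testable without touching a server; a failed health check or a failed monitoring window both end in the same rollback path.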

🐳 Docker Strategy

Multi-Stage Builds

All Dockerfiles use multi-stage builds for optimization:

Stage 1: Dependencies

Install pnpm and dependencies
Uses layer caching

Stage 2: Build

Build application
Generate production artifacts

Stage 3: Runtime

Alpine Linux base (minimal)
Copy only production artifacts
Non-root user
Health checks configured

Image Naming Convention

ghcr.io/wuesteon/mana-core-auth:latest
ghcr.io/wuesteon/mana-core-auth:main
ghcr.io/wuesteon/mana-core-auth:main-abc1234

ghcr.io/wuesteon/chat-backend:latest
ghcr.io/wuesteon/chat-backend:main
ghcr.io/wuesteon/chat-backend:main-abc1234

Tags:

latest - Most recent build from main
main - Branch-based tag
main-abc1234 - Git commit SHA (for rollbacks)
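
As an illustration of the naming convention, the helper below derives the three tags from an owner, a service name, a branch, and a commit SHA. It is a sketch under assumptions: the function name is hypothetical, the SHA is abbreviated to 7 characters to match the `abc1234` example, and characters that are not valid in a Docker tag are replaced with hyphens.

```python
import re
from typing import List


def image_tags(
    owner: str,
    service: str,
    branch: str,
    commit_sha: str,
    registry: str = "ghcr.io",
    default_branch: str = "main",
) -> List[str]:
    """Build the tag list for one image following the convention above."""
    repo = f"{registry}/{owner}/{service}"
    # Docker tags allow letters, digits, '_', '.', and '-' only,
    # so a branch such as "feature/login" becomes "feature-login".
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    short_sha = commit_sha[:7]  # assumption: 7-character abbreviation

    tags = [f"{repo}:{safe_branch}", f"{repo}:{safe_branch}-{short_sha}"]
    if branch == default_branch:
        # "latest" only ever points at the most recent build from main.
        tags.insert(0, f"{repo}:latest")
    return tags


print(image_tags("wuesteon", "mana-core-auth", "main", "abc1234def5678"))
```

Because the SHA-suffixed tag is never overwritten, it is the one to reference when rolling back to a specific build.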

🧪 Testing Strategy

Coverage Targets

Critical Paths: 100% coverage required
  - Authentication (@manacore/shared-auth)
  - Payment/credit system
  - Data integrity (migrations, RLS)

General Code: 80% coverage minimum
  - Backend services
  - Frontend apps
  - Shared packages

Test Types

Unit Tests:

All services and components
Frameworks: Jest (backend/mobile), Vitest (web/shared)

Integration Tests:

API endpoints with test database
Service interactions

E2E Tests (Phase 2):

Playwright for web apps
Detox/Maestro for mobile apps

CI/CD Integration

Run on every PR
Enforce coverage thresholds
Block merge if tests fail or coverage below 80%
Parallel execution for speed
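
The two-tier coverage rule (100% for critical paths, 80% elsewhere) can be expressed as a small gate that a CI step runs over a per-file coverage report. This is a sketch only: the path prefixes are hypothetical placeholders, and in practice Jest and Vitest can enforce thresholds through their own configuration.

```python
from typing import Dict, List

# Hypothetical path prefixes for critical code; adjust to the real layout.
CRITICAL_PREFIXES = ("packages/shared-auth/", "services/payments/")

CRITICAL_THRESHOLD = 100.0
GENERAL_THRESHOLD = 80.0


def required_coverage(path: str) -> float:
    """Return the minimum line coverage (percent) required for a file."""
    if path.startswith(CRITICAL_PREFIXES):
        return CRITICAL_THRESHOLD
    return GENERAL_THRESHOLD


def coverage_failures(report: Dict[str, float]) -> List[str]:
    """List every file whose coverage is below its required threshold."""
    failures = []
    for path, pct in sorted(report.items()):
        needed = required_coverage(path)
        if pct < needed:
            failures.append(f"{path}: {pct:.1f}% < {needed:.0f}%")
    return failures


report = {
    "packages/shared-auth/jwt.ts": 97.5,
    "apps/chat/backend/src/chat.service.ts": 84.0,
    "apps/chat/web/src/lib/store.ts": 62.0,
}
failures = coverage_failures(report)
for line in failures:
    print(line)
# A CI step would exit non-zero when `failures` is non-empty to block the merge.
```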

🚀 Deployment Strategy

Blue-Green Deployment

Current (Blue):    New (Green):
    v1.0    →      v1.1 (deploying)
                      ↓
                   Health check
                      ↓
                   Tests pass
                      ↓
Traffic → Blue → Switch traffic → Green
                      ↓
                   Monitor 1 hour
                      ↓
              Decommission Blue

Benefits:

Zero downtime
Instant rollback (switch back to blue)
Test new version before full cutover
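
The switch-and-rollback behaviour can be modelled with two slots and a pointer to the live one. The class below is a simplified in-memory sketch, not the project's deployment code: `BlueGreenRouter` and its methods are hypothetical names, and the `healthy` callable stands in for the real health check and smoke tests.

```python
from typing import Callable, Dict, Optional


class BlueGreenRouter:
    """Track which of two slots receives traffic."""

    def __init__(self, live_version: str) -> None:
        self.slots: Dict[str, Optional[str]] = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    @property
    def live_version(self) -> Optional[str]:
        return self.slots[self.live]

    def deploy(self, version: str, healthy: Callable[[str], bool]) -> bool:
        """Deploy to the idle slot; switch traffic only if it is healthy."""
        target = self.idle
        self.slots[target] = version
        if not healthy(version):
            # Live traffic was never touched; discard the failed deployment.
            self.slots[target] = None
            return False
        self.live = target
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous version, if one is still kept."""
        if self.slots[self.idle] is None:
            raise RuntimeError("no previous version available to roll back to")
        self.live = self.idle


router = BlueGreenRouter("v1.0")
router.deploy("v1.1", healthy=lambda version: True)
print(router.live, router.live_version)
router.rollback()
print(router.live, router.live_version)
```

Rollback is instant only while the old slot is still running, which is why the previous version is kept alive during the monitoring window before it is decommissioned.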

Rollback Procedure

Detect issue (monitoring alerts or manual detection)
Run scripts/deploy/rollback.sh
Switch traffic back to previous version
Restore database from backup (if needed)
Total time: < 5 minutes for the traffic switch (a database restore, if needed, adds to this)

📊 Monitoring Strategy

Metrics Collection (Prometheus)

Application Metrics:

Request rate (requests/second)
Error rate (% of failed requests)
Response time (p50, p95, p99)
Active connections

Infrastructure Metrics:

CPU usage per service
Memory usage per service
Disk usage
Network I/O

Logging (Loki + Grafana)

Log Aggregation:

All containers → stdout/stderr → Loki → Grafana
Structured JSON logs
Correlation IDs for tracing
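
To show what a structured log line with a correlation ID might look like, here is a minimal sketch using Python's standard `logging` module. The field names (`ts`, `level`, `service`, `correlation_id`, `message`) are assumptions chosen for illustration; the actual services are NestJS and SvelteKit apps and would use their own logging libraries.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            # Set per call through `extra={"correlation_id": ...}`.
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def make_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


# In a container, `stream` would be sys.stdout so the log driver can collect it.
buffer = io.StringIO()
log = make_logger("chat-backend", buffer)
log.info("user logged in", extra={"correlation_id": "req-123"})
entry = json.loads(buffer.getvalue())
print(entry)
```

Carrying the same correlation ID through every service that handles a request lets a single Grafana query over Loki pull up the whole request path.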

Log Retention:

7 days online (searchable)
30 days archived (backup)

Error Tracking (Sentry)

What's Tracked:

Application errors and exceptions
Source maps for better stack traces
User context (anonymized)
Performance metrics

Alerting (Prometheus Alertmanager)

Alert Rules:

Service down (health check fails for 2 minutes)
High error rate (> 5% of requests failing)
High CPU usage (> 80% for 5 minutes)
High memory usage (> 90% for 5 minutes)
Disk space low (< 10% free)
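
The rules above combine a threshold with an optional hold time ("for 5 minutes"), so a brief spike does not fire an alert. The sketch below imitates that behaviour over a list of per-minute samples. It is a simplification for illustration, not Alertmanager configuration: the metric keys and rule names are assumptions, and "for N minutes" is approximated as "the last N samples all breach".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    breached: Callable[[Sample], bool]
    for_minutes: int  # 0 means fire on the first breaching sample


RULES = [
    Rule("ServiceDown", lambda m: m["healthy"] == 0, 2),
    Rule("HighErrorRate", lambda m: m["error_rate_pct"] > 5, 0),
    Rule("HighCpu", lambda m: m["cpu_pct"] > 80, 5),
    Rule("HighMemory", lambda m: m["memory_pct"] > 90, 5),
    Rule("DiskSpaceLow", lambda m: m["disk_free_pct"] < 10, 0),
]


def firing(samples: List[Sample]) -> List[str]:
    """Return the names of rules that fire, given one sample per minute (oldest first)."""
    alerts = []
    for rule in RULES:
        needed = max(rule.for_minutes, 1)
        window = samples[-needed:]
        # Fire only when the condition held for the whole window.
        if len(window) == needed and all(rule.breached(s) for s in window):
            alerts.append(rule.name)
    return alerts


normal = {"healthy": 1, "error_rate_pct": 0.5, "cpu_pct": 40,
          "memory_pct": 50, "disk_free_pct": 60}
hot = dict(normal, cpu_pct=85)

print(firing([hot] * 5))             # CPU above 80% for five samples
print(firing([normal] + [hot] * 4))  # only four breaching samples
```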

Notification Channels:

Slack (all alerts)
PagerDuty (critical alerts only)
Email (daily summary)

💰 Cost Breakdown

Infrastructure Costs (Monthly)

Phase 1: Single Server (Recommended Start)

Item	Cost	Notes
Hetzner CCX32	$50	8 vCPU, 32 GB RAM, 240 GB SSD
Domains (6x)	$6	$12/year each
Cloudflare CDN	$0	Free tier
GitHub Actions	$0	Within free tier
GitHub Container Registry	$0	500 MB free
Total	$56

Phase 2: Multi-Server (Production Scale)

Item	Cost	Notes
Staging (CCX22)	$25	4 vCPU, 16 GB RAM
Production (CCX42)	$100	16 vCPU, 64 GB RAM
Monitoring (CX32)	$15	4 vCPU, 8 GB RAM
Domains	$6	Same as above
CDN, GitHub	$0	Free tiers
Total	$146

Cost Savings (Phase 1 at $56/month):

vs AWS/Azure: $500-1,000/month (89-94% savings)
vs Heroku/Railway: $300-500/month (81-89% savings)
vs DigitalOcean: $150-300/month (63-81% savings)
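
Savings percentages like these are easy to sanity-check by computing them directly from the monthly totals. The sketch below does so for both phases; the comparison price ranges are the ones quoted in this section, and the function names are illustrative.

```python
from typing import Dict, Tuple

# Monthly price ranges (USD) quoted above for the alternatives.
ALTERNATIVES: Dict[str, Tuple[int, int]] = {
    "AWS/Azure": (500, 1000),
    "Heroku/Railway": (300, 500),
    "DigitalOcean": (150, 300),
}


def savings_pct(our_cost: float, their_cost: float) -> float:
    """Percentage saved by paying our_cost instead of their_cost."""
    return (1 - our_cost / their_cost) * 100


def savings_ranges(our_cost: float) -> Dict[str, Tuple[int, int]]:
    """Return {provider: (savings at low price, savings at high price)}, rounded."""
    return {
        name: (round(savings_pct(our_cost, low)), round(savings_pct(our_cost, high)))
        for name, (low, high) in ALTERNATIVES.items()
    }


phase1 = savings_ranges(56)    # single server
phase2 = savings_ranges(146)   # multi-server

for name in ALTERNATIVES:
    print(f"{name}: Phase 1 {phase1[name]}, Phase 2 {phase2[name]}")
```

The Phase 2 figures are noticeably lower, so any savings claim should state which phase it refers to.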

Resource Allocation (Per Service)

Service Type	CPU (cores per instance)	RAM (per instance)	Instances	Total
NestJS Backend	0.5	512 MB	10	5 CPU, 5 GB RAM
SvelteKit Web	0.25	256 MB	9	2.25 CPU, 2.25 GB RAM
Astro Landing	0.1	128 MB	9	0.9 CPU, 1.1 GB RAM
PostgreSQL	1	2 GB	1	1 CPU, 2 GB RAM
Redis	0.25	256 MB	1	0.25 CPU, 256 MB RAM
Monitoring	1	2 GB	1	1 CPU, 2 GB RAM
Total				~10.5 CPU, ~12.5 GB RAM

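Conclusion: CCX32 (8 vCPU, 32 GB RAM) leaves ample memory headroom (~12.5 GB of 32 GB allocated). The CPU figures sum to ~10.5, which exceeds the 8 available vCPUs, so they fit only if treated as per-service ceilings rather than guaranteed reservations; sustained load near those ceilings would call for a larger server.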
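
The totals in the table can be reproduced, and compared against the server's capacity, with a few lines of arithmetic. The sketch below uses the per-instance figures and instance counts exactly as given in the table; whether the CPU numbers are hard limits or reservations is not stated there, so the comparison is indicative only.

```python
# (service type, CPU cores per instance, RAM in MB per instance, instances)
ALLOCATIONS = [
    ("NestJS Backend", 0.5, 512, 10),
    ("SvelteKit Web", 0.25, 256, 9),
    ("Astro Landing", 0.1, 128, 9),
    ("PostgreSQL", 1.0, 2048, 1),
    ("Redis", 0.25, 256, 1),
    ("Monitoring", 1.0, 2048, 1),
]

SERVER_VCPU = 8      # CCX32
SERVER_RAM_GB = 32   # CCX32


def totals(allocations):
    """Return (total CPU cores, total RAM in GB) across all instances."""
    cpu = sum(cores * n for _, cores, _, n in allocations)
    ram_gb = sum(ram_mb * n for _, _, ram_mb, n in allocations) / 1024
    return cpu, ram_gb


total_cpu, total_ram_gb = totals(ALLOCATIONS)
cpu_fits = total_cpu <= SERVER_VCPU
ram_fits = total_ram_gb <= SERVER_RAM_GB

print(f"CPU: {total_cpu:.1f} of {SERVER_VCPU} vCPU (fits: {cpu_fits})")
print(f"RAM: {total_ram_gb:.1f} of {SERVER_RAM_GB} GB (fits: {ram_fits})")
```

Memory fits comfortably, while the summed CPU figures exceed the vCPU count, which is workable only because most services are idle most of the time.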

🔐 Security Measures

Infrastructure Security

Firewall rules (only ports 22, 80, 443 exposed)
SSH key-based authentication (no passwords)
Non-root Docker containers
Read-only filesystems where possible
Network segmentation (frontend, backend, data layers)
Automatic security updates

Application Security

Environment variable encryption (GitHub Secrets)
SSL/TLS for all services (Let's Encrypt)
JWT-based authentication (@manacore/shared-auth)
Row-Level Security (Supabase RLS policies)
Input validation and sanitization
CORS policies enforced

CI/CD Security

Weekly dependency audits (Dependabot)
Docker image scanning (Trivy)
No secrets in code
Branch protection rules
Required code reviews
Signed commits (recommended)

Compliance

GDPR compliance (Hetzner EU data centers)
ISO 27001 certified infrastructure
SOC 2 Type II (Supabase)
Automated backup retention policies
Audit logs (GitHub Actions, Supabase)

🔄 Backup & Disaster Recovery

Backup Strategy

What's Backed Up:

PostgreSQL databases (daily)
Redis data (daily)
Docker volumes
Environment configurations
Deployment manifests

Backup Schedule:

Daily automated backups at 2 AM UTC
Retention: 30 days for databases, 7 days for Redis
Storage: Cloudflare R2 or Hetzner Storage Box
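
The retention rule (30 days for databases, 7 days for Redis) amounts to a pruning pass over the stored backups. The sketch below selects which backups to delete. It assumes each backup is identified by a kind and the date it was taken, and that a backup is pruned once its age reaches the retention period; both are illustrative choices, not details from the plan.

```python
from datetime import date, timedelta
from typing import List, Tuple

# Retention periods in days, as stated in the schedule above.
RETENTION_DAYS = {"postgres": 30, "redis": 7}

Backup = Tuple[str, date]  # (kind, date taken)


def expired_backups(backups: List[Backup], today: date) -> List[Backup]:
    """Return the backups whose age has reached their retention period."""
    return [
        (kind, taken)
        for kind, taken in backups
        if (today - taken).days >= RETENTION_DAYS[kind]
    ]


today = date(2025, 11, 27)
backups = [
    ("postgres", today - timedelta(days=29)),
    ("postgres", today - timedelta(days=30)),
    ("redis", today - timedelta(days=6)),
    ("redis", today - timedelta(days=7)),
]
to_delete = expired_backups(backups, today)
for kind, taken in to_delete:
    print(f"delete {kind} backup from {taken.isoformat()}")
```

With one backup per day, this keeps the 30 most recent database backups and the 7 most recent Redis backups.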

Backup Verification:

Weekly automated restoration tests
Monthly manual restoration drills

Disaster Recovery

Recovery Time Objective (RTO):

Service restart: < 1 hour
Full server restore: < 2 hours

Recovery Point Objective (RPO):

< 24 hours (daily backups)
Supabase PITR available for point-in-time recovery

Recovery Procedures:

Service Failure: Restart container (automated)
Data Corruption: Restore from latest backup
Server Failure: Provision new server, restore from backup
Region Failure: Failover to secondary region (future phase)

📚 Documentation Strategy

For Developers

Quick start guide (30 minutes to first deployment)
Testing guide (how to write and run tests)
Troubleshooting guide (common issues)
Contributing guide (standards and patterns)

For DevOps

Architecture documentation (complete system design)
Deployment runbooks (step-by-step procedures)
Monitoring guide (dashboards and alerts)
Incident response playbooks

For Management

Cost analysis and projections
Success metrics and KPIs
Timeline and milestones
Risk assessment and mitigation

🎯 Phase Gates

Phase 1 Complete When:

Hetzner account created
Staging server provisioned and Docker installed
GitHub secrets configured
First service deployed to staging
CI/CD pipeline tested end-to-end

Phase 2 Complete When:

All backend services deployed
All web apps deployed
All landing pages deployed
SSL/TLS configured for all domains
Health checks passing for all services

Phase 3 Complete When:

Critical path tests at 100% coverage
General code at 80% coverage
Coverage enforcement in CI
All tests passing consistently

Phase 4 Complete When:

Production server provisioned
All services deployed to production
Monitoring operational (Prometheus + Grafana + Loki)
Alerting configured and tested
Backups automated and verified

🚧 Risk Management

Identified Risks

Risk 1: Budget Overruns

Likelihood: Low
Impact: Medium
Mitigation: Start with single server ($56/month), scale only when needed
Contingency: Downgrade server size, optimize resource usage

Risk 2: Deployment Failures

Likelihood: Medium (during initial rollout)
Impact: High
Mitigation: Blue-green deployment, automated rollback, comprehensive testing
Contingency: Rollback procedures documented and tested

Risk 3: Service Outages

Likelihood: Low
Impact: High
Mitigation: Health checks, monitoring, automated restarts
Contingency: Incident response playbooks, 24/7 monitoring

Risk 4: Data Loss

Likelihood: Very Low
Impact: Critical
Mitigation: Daily backups, Supabase PITR, backup verification
Contingency: Multiple backup locations, disaster recovery drills

Risk 5: Security Breaches

Likelihood: Low
Impact: Critical
Mitigation: Security best practices, automated audits, minimal attack surface
Contingency: Incident response plan, security patches, audit logs

Risk 6: Migration Complexity

Likelihood: Medium (now addressed - migration complete)
Impact: Medium
Mitigation: Completed migration from Coolify to Docker Compose, removed legacy artifacts
Contingency: Docker Compose provides simpler, more maintainable deployment

📈 Success Metrics & KPIs

Deployment Metrics

Deployment Frequency: Target > 5/week (currently < 1/week)
Deployment Duration: Target < 10 minutes (currently 2+ hours manual)
Deployment Success Rate: Target > 95%
Rollback Time: Target < 5 minutes

Quality Metrics

Test Coverage: Target 80% minimum (currently ~5%)
Critical Path Coverage: Target 100% (currently ~0%)
Build Success Rate: Target > 95%
Code Review Turnaround: Target < 24 hours

Reliability Metrics

Uptime: Target 99.9% (at most ~43 minutes of downtime per 30-day month)
Mean Time to Recovery (MTTR): Target < 1 hour
Mean Time Between Failures (MTBF): Target > 30 days
Backup Success Rate: Target 100%
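
The uptime, MTTR, and MTBF targets are related, and a short calculation shows how. The sketch below converts an uptime percentage into a monthly downtime budget and estimates availability from MTBF and MTTR using the common steady-state formula MTBF / (MTBF + MTTR). The 30-day month is an assumption.

```python
def downtime_budget_minutes(uptime_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period at the given uptime target."""
    return (1 - uptime_pct / 100) * days * 24 * 60


def availability_pct(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability implied by MTBF and MTTR, in percent."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100


budget = downtime_budget_minutes(99.9)
# Evaluate the targets at their boundary: one failure per 30 days, 1 hour to recover.
boundary = availability_pct(mtbf_hours=30 * 24, mttr_hours=1)

print(f"Downtime budget at 99.9%: {budget:.1f} minutes per 30 days")
print(f"Availability at MTBF = 30 days, MTTR = 1 hour: {boundary:.2f}%")
```

At the boundary values the implied availability is about 99.86%, slightly below 99.9%, so meeting the uptime target requires beating the MTTR or MTBF targets with some margin rather than just reaching them.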

Cost Metrics

Infrastructure Cost: Target < $100/month (achieved: $56/month)
Cost per Service: Target < $5/month
Cost Reduction: at least 81% vs traditional PaaS ($56/month vs $300+)

🎓 Training & Knowledge Transfer

Developer Training (2-3 hours)

Session 1: CI/CD basics and GitHub Actions
Session 2: Writing and running tests
Session 3: Docker and deployment
Session 4: Troubleshooting and debugging

DevOps Training (4-8 hours)

Session 1: Architecture deep dive
Session 2: Infrastructure setup (hands-on)
Session 3: CI/CD operations
Session 4: Incident response and recovery

Documentation

All procedures documented in cicd/ folder
Video tutorials (optional, future)
Regular knowledge sharing sessions

🔮 Future Enhancements

Short-Term (3-6 months)

Canary deployments (gradual traffic shifting)
Feature flags (LaunchDarkly/Unleash)
Visual regression testing (Percy/Chromatic)
Load testing (k6/Artillery)
Mobile E2E testing (Detox/Maestro)

Long-Term (6-12 months)

Kubernetes migration (when scale demands)
Multi-region deployment
Global load balancing
Database replication
Advanced observability (distributed tracing)

✅ Plan Approval

Created by: Hive Mind Collective Intelligence
Reviewed by: ______
Approved by: ______
Approval Date: ______

Next Steps:

Review this plan with the team
Get budget approval ($56-146/month)
Start implementation following TODO.md
Track progress in CHANGELOG.md

Last Updated: 2025-11-27
Version: 1.0
Status: Ready for Implementation ✅
