CI/CD Implementation Plan
Last Updated: 2025-11-27
Status: Design Complete → Implementation Pending
Estimated Timeline: 5-7 days (2-person team)
📋 Plan Overview
This document outlines the complete plan for implementing CI/CD infrastructure for the manacore-monorepo, from initial setup to production deployment.
🎯 Goals & Success Criteria
Primary Goals
- Automate deployments - Deploy with a single commit to main
- Zero-downtime updates - Blue-green deployment strategy
- Enforce quality - Automated testing with 80% coverage
- Cost efficiency - At least 81% savings vs traditional PaaS ($56/month vs $300+)
- Team productivity - Reduce deployment time from 2+ hours to < 10 minutes
Success Criteria
- ✅ Staging auto-deploys on merge to main
- ✅ Production deploys take < 10 minutes
- ✅ Rollback can be executed in < 5 minutes
- ✅ Test coverage enforced at 80% minimum
- ✅ All 39 services deployed and healthy
- ✅ Monitoring and alerting operational
- ✅ Team can confidently deploy without assistance
🏗️ Architecture Overview
Infrastructure Stack
- Platform: Docker Compose orchestration
- Hosting: Hetzner Cloud VPS (German data centers)
- Container Runtime: Docker + Docker Compose
- CI/CD: GitHub Actions
- Monitoring: Prometheus + Grafana + Loki
- Error Tracking: Sentry
- CDN: Cloudflare
Service Inventory (39 Services Total)
Authentication:
- mana-core-auth (NestJS) - Central authentication service
Chat Project (4 services):
- chat-backend (NestJS)
- chat-web (SvelteKit)
- chat-mobile (Expo - OTA updates)
- chat-landing (Astro)
Maerchenzauber Project (4 services):
- maerchenzauber-backend (NestJS)
- maerchenzauber-web (SvelteKit)
- maerchenzauber-mobile (Expo)
- maerchenzauber-landing (Astro)
Manadeck Project (4 services):
- manadeck-backend (NestJS)
- manadeck-web (SvelteKit)
- manadeck-mobile (Expo)
- manadeck-landing (Astro)
Memoro Project (3 services):
- memoro-web (SvelteKit)
- memoro-mobile (Expo)
- memoro-landing (Astro)
Picture Project (3 services):
- picture-web (SvelteKit)
- picture-mobile (Expo)
- picture-landing (Astro)
Wisekeep Project (4 services):
- wisekeep-backend (NestJS)
- wisekeep-web (SvelteKit)
- wisekeep-mobile (Expo)
- wisekeep-landing (Astro)
Quote Project (4 services):
- quote-backend (NestJS)
- quote-web (SvelteKit)
- quote-mobile (Expo)
- quote-landing (Astro)
Nutriphi Project (2 services):
- nutriphi-backend (NestJS)
- nutriphi-web (SvelteKit)
Uload Project (1 service):
- uload-web (SvelteKit)
Bauntown Project (1 service):
- bauntown-landing (Astro)
Manacore Project (2 services):
- manacore-web (SvelteKit)
- manacore-mobile (Expo)
Shared Infrastructure (2 services):
- postgres (PostgreSQL 16)
- redis (Redis 7)
📅 Implementation Timeline
Week 1: Foundation (Days 1-2)
Goal: Infrastructure setup and first deployment
Day 1 Morning (2-3 hours):
- Set up Hetzner account
- Provision staging server (CCX32)
- Install Docker & Docker Compose
- Configure GitHub Container Registry
Day 1 Afternoon (3-4 hours):
- Configure GitHub secrets (staging)
- Create first Dockerfile (mana-core-auth)
- Test CI/CD pipeline with test PR
- Deploy mana-core-auth to staging
Day 2 (6-8 hours):
- Create Dockerfiles for remaining backends (6 services)
- Deploy all backends to staging
- Verify health checks
- Test inter-service communication
Week 1: Web Apps (Days 3-4)
Goal: Deploy web apps and landing pages
Day 3 (6-8 hours):
- Create SvelteKit Dockerfiles (9 services)
- Test builds locally
- Deploy to staging
- Configure reverse proxy/domains
Day 4 (6-8 hours):
- Create Astro Dockerfiles (9 services)
- Deploy landing pages
- Set up SSL/TLS (Let's Encrypt)
- Test all web apps end-to-end
Week 2: Testing & Production (Days 5-7)
Goal: Implement testing and deploy to production
Day 5 (6-8 hours):
- Write critical path tests (auth, payments) - 100% coverage
- Configure test frameworks
- Enable coverage enforcement in CI
- Fix any failing tests
Day 6 (6-8 hours):
- Provision production server
- Configure production secrets
- Set up GitHub environments (approval gates)
- Deploy mana-core-auth to production
Day 7 (6-8 hours):
- Deploy all services to production
- Configure DNS for all domains
- Set up monitoring (Prometheus + Grafana)
- Verify everything works in production
Week 2-3: Monitoring & Optimization (Days 8-10+)
Goal: Set up monitoring and optimize
Day 8 (4-6 hours):
- Install Loki for logging
- Configure Grafana dashboards
- Set up alerting (Prometheus Alertmanager)
- Integrate Sentry for error tracking
Day 9 (4-6 hours):
- Set up automated backups
- Test backup restoration
- Perform disaster recovery drill
- Document procedures
Day 10+ (ongoing):
- Write remaining tests (80% coverage target)
- Performance optimization (caching, CDN)
- Team training
- Documentation updates
🔄 Development Workflow
Developer Workflow
1. Create feature branch
↓
2. Write code + tests
↓
3. Push to GitHub
↓
4. GitHub Actions runs:
- Lint
- Type check
- Build
- Tests (with coverage)
↓
5. PR approved + merged to main
↓
6. GitHub Actions builds Docker images
↓
7. Images pushed to ghcr.io
↓
8. Auto-deploy to staging
↓
9. (Optional) Manual deploy to production
Deployment Workflow
Staging (Automatic):
Merge to main → Build → Push → Deploy → Health Check → Done
Production (Manual Approval):
Manual trigger → Approval gate → Backup → Deploy → Health Check →
Monitor 5 min → Done (or Rollback)
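The production flow above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the project's deployment script: the four callables (`backup`, `deploy`, `health_check`, `rollback`) are hypothetical hooks, and `monitor_checks` stands in for the 5-minute monitoring window without actually waiting.

```python
from typing import Callable


def deploy_to_production(
    backup: Callable[[], None],
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
    monitor_checks: int = 5,
) -> str:
    """Run Backup -> Deploy -> Health Check -> Monitor, rolling back on failure.

    All four callables are hypothetical hooks; in a real pipeline they would
    wrap shell scripts or API calls. `monitor_checks` models the monitoring
    window as a number of repeated health checks (for example one per minute).
    """
    backup()
    deploy()
    # One initial health check, then the checks of the monitoring window.
    for _ in range(1 + monitor_checks):
        if not health_check():
            rollback()
            return "rolled_back"
    return "done"
```

Taking the backup before the deploy step means the rollback hook always has a restore point, which is what the "Done (or Rollback)" branch of the flow relies on.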
🐳 Docker Strategy
Multi-Stage Builds
All Dockerfiles use multi-stage builds for optimization:
Stage 1: Dependencies
- Install pnpm and dependencies
- Uses layer caching
Stage 2: Build
- Build application
- Generate production artifacts
Stage 3: Runtime
- Alpine Linux base (minimal)
- Copy only production artifacts
- Non-root user
- Health checks configured
Image Naming Convention
ghcr.io/wuesteon/mana-core-auth:latest
ghcr.io/wuesteon/mana-core-auth:main
ghcr.io/wuesteon/mana-core-auth:main-abc1234
ghcr.io/wuesteon/chat-backend:latest
ghcr.io/wuesteon/chat-backend:main
ghcr.io/wuesteon/chat-backend:main-abc1234
Tags:
- latest - Most recent build from main
- main - Branch-based tag
- main-abc1234 - Git commit SHA (for rollbacks)
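As a rough sketch of the naming convention, the three tags for one image can be derived from the service name, branch, and commit SHA. The function name is hypothetical, and the 7-character short SHA is an assumption inferred from the `abc1234` example.

```python
from typing import List


def image_tags(owner: str, service: str, branch: str, commit_sha: str) -> List[str]:
    """Build the tags for one image: latest (main only), branch, branch-shortsha."""
    short_sha = commit_sha[:7]  # assumption: 7-character short SHA, as in "abc1234"
    repo = f"ghcr.io/{owner}/{service}"
    tags = [f"{repo}:{branch}", f"{repo}:{branch}-{short_sha}"]
    if branch == "main":
        # "latest" tracks the most recent build from main.
        tags.insert(0, f"{repo}:latest")
    return tags


for tag in image_tags("wuesteon", "mana-core-auth", "main", "abc1234def5678"):
    print(tag)
```

The SHA-suffixed tag is the only one that never moves, which is why it is the one to reference when rolling back.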
🧪 Testing Strategy
Coverage Targets
- Critical Paths: 100% coverage required
  - Authentication (@manacore/shared-auth)
  - Payment/credit system
  - Data integrity (migrations, RLS)
- General Code: 80% coverage minimum
  - Backend services
  - Frontend apps
  - Shared packages
Test Types
Unit Tests:
- All services and components
- Frameworks: Jest (backend/mobile), Vitest (web/shared)
Integration Tests:
- API endpoints with test database
- Service interactions
E2E Tests (Phase 2):
- Playwright for web apps
- Detox/Maestro for mobile apps
CI/CD Integration
- Run on every PR
- Enforce coverage thresholds
- Block merge if tests fail or coverage below 80%
- Parallel execution for speed
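The merge gate can be illustrated with a small check that applies the 100% threshold to critical packages and 80% to everything else. This is a sketch only: in practice Jest and Vitest enforce thresholds through their own configuration, and the function name and input shape here are assumptions.

```python
from typing import Dict, List, Set

GENERAL_THRESHOLD = 80.0    # minimum for general code
CRITICAL_THRESHOLD = 100.0  # required for critical paths


def coverage_failures(coverage: Dict[str, float], critical: Set[str]) -> List[str]:
    """Return one message per package whose coverage is below its threshold."""
    failures = []
    for package, percent in sorted(coverage.items()):
        required = CRITICAL_THRESHOLD if package in critical else GENERAL_THRESHOLD
        if percent < required:
            failures.append(f"{package}: {percent:.1f}% < {required:.0f}%")
    return failures


report = {"@manacore/shared-auth": 99.5, "chat-backend": 82.0, "chat-web": 79.9}
for message in coverage_failures(report, critical={"@manacore/shared-auth"}):
    print(message)
```

A CI step would block the merge whenever the returned list is non-empty.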
🚀 Deployment Strategy
Blue-Green Deployment
Blue (current, v1.0) serves traffic while Green (new, v1.1) deploys
↓
Health check on Green
↓
Tests pass on Green
↓
Switch traffic from Blue to Green
↓
Monitor 1 hour
↓
Decommission Blue
Benefits:
- Zero downtime
- Instant rollback (switch back to blue)
- Test new version before full cutover
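The cutover logic can be modeled in a few lines. This is a minimal sketch, assuming a hypothetical `BlueGreenRouter` class that only tracks which environment is live; a real setup would update the reverse proxy configuration instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreenRouter:
    """Tracks which of the two environments currently receives traffic."""

    live: str = "blue"

    @property
    def idle(self) -> str:
        """The environment that is not receiving traffic."""
        return "green" if self.live == "blue" else "blue"

    def cut_over(self, is_healthy: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment only if it is healthy."""
        candidate = self.idle
        if not is_healthy(candidate):
            return False  # traffic stays on the current environment
        self.live = candidate
        return True

    def roll_back(self) -> None:
        """Point traffic back at the previous environment, which is still running."""
        self.live = self.idle
```

Because the old environment keeps running until it is decommissioned, `roll_back` is a pointer switch rather than a redeploy, which is what makes the rollback near-instant.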
Rollback Procedure
- Detect issue (monitoring alerts or manual detection)
- Run scripts/deploy/rollback.sh
- Switch traffic back to previous version
- Restore database from backup (if needed)
- Total time: < 5 minutes
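The contents of `scripts/deploy/rollback.sh` are not shown in this plan. As a hedged illustration of the one decision any rollback has to make, the sketch below picks the commit-SHA tag that was live before the current one. The `deployed_tags` history list is a hypothetical input, not something the plan defines.

```python
from typing import List


def rollback_tag(deployed_tags: List[str]) -> str:
    """Return the tag that was live before the current deployment.

    `deployed_tags` is a hypothetical deployment history, oldest first,
    holding commit-SHA tags such as "main-abc1234".
    """
    if len(deployed_tags) < 2:
        raise ValueError("no previous deployment to roll back to")
    return deployed_tags[-2]


print(rollback_tag(["main-abc1234", "main-def5678"]))
```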
📊 Monitoring Strategy
Metrics Collection (Prometheus)
Application Metrics:
- Request rate (requests/second)
- Error rate (% of failed requests)
- Response time (p50, p95, p99)
- Active connections
Infrastructure Metrics:
- CPU usage per service
- Memory usage per service
- Disk usage
- Network I/O
Logging (Loki + Grafana)
Log Aggregation:
- All containers → stdout/stderr → Loki → Grafana
- Structured JSON logs
- Correlation IDs for tracing
Log Retention:
- 7 days online (searchable)
- 30 days archived (backup)
Error Tracking (Sentry)
What's Tracked:
- Application errors and exceptions
- Source maps for better stack traces
- User context (anonymized)
- Performance metrics
Alerting (Prometheus Alertmanager)
Alert Rules:
- Service down (health check fails for 2 minutes)
- High error rate (> 5% of requests failing)
- High CPU usage (> 80% for 5 minutes)
- High memory usage (> 90% for 5 minutes)
- Disk space low (< 10% free)
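In Prometheus these rules would normally be written as alerting rules with a `for:` duration. The Python sketch below only illustrates the thresholds themselves; the metric key names are hypothetical, and the sample is assumed to already carry how long each condition has held.

```python
from typing import Dict, List


def firing_alerts(m: Dict[str, float]) -> List[str]:
    """Return the names of alert rules whose conditions hold for sample `m`.

    Hypothetical keys: minutes the health check has been failing, error rate
    in percent, minutes CPU and memory have stayed above their thresholds,
    and percent of disk space still free.
    """
    rules = {
        "service_down": m["health_failing_minutes"] >= 2,
        "high_error_rate": m["error_rate_percent"] > 5,
        "high_cpu": m["cpu_over_80_minutes"] >= 5,
        "high_memory": m["memory_over_90_minutes"] >= 5,
        "disk_space_low": m["disk_free_percent"] < 10,
    }
    return [name for name, firing in rules.items() if firing]


sample = {
    "health_failing_minutes": 0,
    "error_rate_percent": 6.2,
    "cpu_over_80_minutes": 1,
    "memory_over_90_minutes": 0,
    "disk_free_percent": 8,
}
print(firing_alerts(sample))
```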
Notification Channels:
- Slack (all alerts)
- PagerDuty (critical alerts only)
- Email (daily summary)
💰 Cost Breakdown
Infrastructure Costs (Monthly)
Phase 1: Single Server (Recommended Start)
| Item | Cost | Notes |
|---|---|---|
| Hetzner CCX32 | $50 | 8 vCPU, 32 GB RAM, 240 GB SSD |
| Domains (6x) | $6 | $12/year each |
| Cloudflare CDN | $0 | Free tier |
| GitHub Actions | $0 | Within free tier |
| GitHub Container Registry | $0 | 500 MB free |
| Total | $56 | |
Phase 2: Multi-Server (Production Scale)
| Item | Cost | Notes |
|---|---|---|
| Staging (CCX22) | $25 | 4 vCPU, 16 GB RAM |
| Production (CCX42) | $100 | 16 vCPU, 64 GB RAM |
| Monitoring (CX32) | $15 | 4 vCPU, 8 GB RAM |
| Domains | $6 | Same as above |
| CDN, GitHub | $0 | Free tiers |
| Total | $146 | |
Cost Savings (Phase 1 at $56/month):
- vs AWS/Azure: $500-1,000/month (89-94% savings)
- vs Heroku/Railway: $300-500/month (81-89% savings)
- vs DigitalOcean: $150-300/month (63-81% savings)
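Savings figures of this kind follow from a one-line formula. The sketch below assumes the Phase 1 cost of $56/month as the baseline; the function name is illustrative.

```python
def savings_percent(our_cost: float, alternative_cost: float) -> float:
    """Percentage saved by paying `our_cost` instead of `alternative_cost`."""
    return round((1 - our_cost / alternative_cost) * 100, 1)


# Phase 1 ($56/month) against the low and high end of a $300-500/month range.
print(savings_percent(56, 300))  # 81.3
print(savings_percent(56, 500))  # 88.8
```

Running the same function with the Phase 2 cost of $146/month shows how much the savings narrow once the setup scales to multiple servers.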
Resource Allocation (Per Service)
| Service Type | CPU | RAM | Instances | Total |
|---|---|---|---|---|
| NestJS Backend | 0.5 | 512 MB | 10 | 5 CPU, 5 GB RAM |
| SvelteKit Web | 0.25 | 256 MB | 9 | 2.25 CPU, 2.25 GB RAM |
| Astro Landing | 0.1 | 128 MB | 9 | 0.9 CPU, 1.1 GB RAM |
| PostgreSQL | 1 | 2 GB | 1 | 1 CPU, 2 GB RAM |
| Redis | 0.25 | 256 MB | 1 | 0.25 CPU, 256 MB RAM |
| Monitoring | 1 | 2 GB | 1 | 1 CPU, 2 GB RAM |
| Total | | | | ~10.5 CPU, ~12.5 GB RAM |
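Conclusion: The CCX32's 32 GB RAM covers the ~12.5 GB allocation with headroom for growth. The ~10.5 CPU allocation exceeds its 8 vCPUs, so the single-server setup assumes services rarely reach their CPU allocations at the same time.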
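The totals can be re-derived from the per-service rows and compared with the server's capacity. This is a small sketch using the table's figures; the variable and function names are illustrative.

```python
# (CPU cores, RAM in MB, instances) per service type, copied from the table.
ALLOCATIONS = {
    "NestJS Backend": (0.5, 512, 10),
    "SvelteKit Web": (0.25, 256, 9),
    "Astro Landing": (0.1, 128, 9),
    "PostgreSQL": (1.0, 2048, 1),
    "Redis": (0.25, 256, 1),
    "Monitoring": (1.0, 2048, 1),
}

# Hetzner CCX32 capacity as stated in this plan.
SERVER_VCPUS = 8
SERVER_RAM_GB = 32


def total_allocation(allocations):
    """Sum CPU cores and RAM (in GB) across all service types."""
    cpu = sum(cores * count for cores, _, count in allocations.values())
    ram_gb = sum(mb * count for _, mb, count in allocations.values()) / 1024
    return round(cpu, 1), round(ram_gb, 1)


cpu, ram_gb = total_allocation(ALLOCATIONS)
print(cpu, ram_gb)                                    # 10.4 12.6
print(cpu <= SERVER_VCPUS, ram_gb <= SERVER_RAM_GB)   # False True
```

RAM fits with room to spare, while the summed CPU allocation is above the vCPU count, so the per-service CPU figures are best read as upper limits rather than sustained usage.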
🔐 Security Measures
Infrastructure Security
- Firewall rules (only ports 22, 80, 443 exposed)
- SSH key-based authentication (no passwords)
- Non-root Docker containers
- Read-only filesystems where possible
- Network segmentation (frontend, backend, data layers)
- Automatic security updates
Application Security
- Environment variable encryption (GitHub Secrets)
- SSL/TLS for all services (Let's Encrypt)
- JWT-based authentication (@manacore/shared-auth)
- Row-Level Security (Supabase RLS policies)
- Input validation and sanitization
- CORS policies enforced
CI/CD Security
- Weekly dependency audits (Dependabot)
- Docker image scanning (Trivy)
- No secrets in code
- Branch protection rules
- Required code reviews
- Signed commits (recommended)
Compliance
- GDPR compliance (Hetzner EU data centers)
- ISO 27001 certified infrastructure
- SOC 2 Type II (Supabase)
- Automated backup retention policies
- Audit logs (GitHub Actions, Supabase)
🔄 Backup & Disaster Recovery
Backup Strategy
What's Backed Up:
- PostgreSQL databases (daily)
- Redis data (daily)
- Docker volumes
- Environment configurations
- Deployment manifests
Backup Schedule:
- Daily automated backups at 2 AM UTC
- Retention: 30 days for databases, 7 days for Redis
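The retention rule can be expressed as a small pruning helper. This is a sketch under assumptions: the `RETENTION_DAYS` mapping and function name are hypothetical, and a real job would list and delete objects in the chosen storage backend rather than work on a list of dates.

```python
from datetime import date, timedelta
from typing import List

# Retention windows from the schedule above.
RETENTION_DAYS = {"postgres": 30, "redis": 7}


def expired_backups(kind: str, backup_dates: List[date], today: date) -> List[date]:
    """Return the backup dates that fall outside the retention window for `kind`."""
    cutoff = today - timedelta(days=RETENTION_DAYS[kind])
    return [d for d in backup_dates if d < cutoff]


today = date(2025, 11, 27)
backups = [today - timedelta(days=n) for n in (0, 7, 8, 30, 31)]
print(expired_backups("redis", backups, today))     # the 8-, 30- and 31-day-old backups
print(expired_backups("postgres", backups, today))  # only the 31-day-old backup
```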
- Storage: Cloudflare R2 or Hetzner Storage Box
Backup Verification:
- Weekly automated restoration tests
- Monthly manual restoration drills
Disaster Recovery
Recovery Time Objective (RTO):
- Service restart: < 1 hour
- Full server restore: < 2 hours
Recovery Point Objective (RPO):
- < 24 hours (daily backups)
- Supabase PITR available for point-in-time recovery
Recovery Procedures:
- Service Failure: Restart container (automated)
- Data Corruption: Restore from latest backup
- Server Failure: Provision new server, restore from backup
- Region Failure: Failover to secondary region (future phase)
📚 Documentation Strategy
For Developers
- Quick start guide (30 minutes to first deployment)
- Testing guide (how to write and run tests)
- Troubleshooting guide (common issues)
- Contributing guide (standards and patterns)
For DevOps
- Architecture documentation (complete system design)
- Deployment runbooks (step-by-step procedures)
- Monitoring guide (dashboards and alerts)
- Incident response playbooks
For Management
- Cost analysis and projections
- Success metrics and KPIs
- Timeline and milestones
- Risk assessment and mitigation
🎯 Phase Gates
Phase 1 Complete When:
- Hetzner account created
- Staging server provisioned and Docker installed
- GitHub secrets configured
- First service deployed to staging
- CI/CD pipeline tested end-to-end
Phase 2 Complete When:
- All backend services deployed
- All web apps deployed
- All landing pages deployed
- SSL/TLS configured for all domains
- Health checks passing for all services
Phase 3 Complete When:
- Critical path tests at 100% coverage
- General code at 80% coverage
- Coverage enforcement in CI
- All tests passing consistently
Phase 4 Complete When:
- Production server provisioned
- All services deployed to production
- Monitoring operational (Prometheus + Grafana + Loki)
- Alerting configured and tested
- Backups automated and verified
🚧 Risk Management
Identified Risks
Risk 1: Budget Overruns
- Likelihood: Low
- Impact: Medium
- Mitigation: Start with single server ($56/month), scale only when needed
- Contingency: Downgrade server size, optimize resource usage
Risk 2: Deployment Failures
- Likelihood: Medium (during initial rollout)
- Impact: High
- Mitigation: Blue-green deployment, automated rollback, comprehensive testing
- Contingency: Rollback procedures documented and tested
Risk 3: Service Outages
- Likelihood: Low
- Impact: High
- Mitigation: Health checks, monitoring, automated restarts
- Contingency: Incident response playbooks, 24/7 monitoring
Risk 4: Data Loss
- Likelihood: Very Low
- Impact: Critical
- Mitigation: Daily backups, Supabase PITR, backup verification
- Contingency: Multiple backup locations, disaster recovery drills
Risk 5: Security Breaches
- Likelihood: Low
- Impact: Critical
- Mitigation: Security best practices, automated audits, minimal attack surface
- Contingency: Incident response plan, security patches, audit logs
Risk 6: Migration Complexity
- Likelihood: Medium (now addressed - migration complete)
- Impact: Medium
- Mitigation: Completed migration from Coolify to Docker Compose, removed legacy artifacts
- Contingency: Docker Compose provides simpler, more maintainable deployment
📈 Success Metrics & KPIs
Deployment Metrics
- Deployment Frequency: Target > 5/week (currently < 1/week)
- Deployment Duration: Target < 10 minutes (currently 2+ hours manual)
- Deployment Success Rate: Target > 95%
- Rollback Time: Target < 5 minutes
Quality Metrics
- Test Coverage: Target 80% minimum (currently ~5%)
- Critical Path Coverage: Target 100% (currently ~0%)
- Build Success Rate: Target > 95%
- Code Review Turnaround: Target < 24 hours
Reliability Metrics
- Uptime: Target 99.9% (43 minutes downtime/month)
- Mean Time to Recovery (MTTR): Target < 1 hour
- Mean Time Between Failures (MTBF): Target > 30 days
- Backup Success Rate: Target 100%
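The downtime figure attached to the uptime target comes from simple arithmetic, shown here as a minimal sketch that assumes a 30-day month.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes over `days` for a given uptime target."""
    total_minutes = days * 24 * 60
    return round(total_minutes * (1 - uptime_percent / 100), 1)


print(downtime_budget_minutes(99.9))  # 43.2
```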
Cost Metrics
- Infrastructure Cost: Target < $100/month (achieved: $56/month)
- Cost per Service: Target < $5/month
- Cost Reduction: At least 81% vs traditional PaaS ($56/month vs $300+)
🎓 Training & Knowledge Transfer
Developer Training (2-3 hours)
- Session 1: CI/CD basics and GitHub Actions
- Session 2: Writing and running tests
- Session 3: Docker and deployment
- Session 4: Troubleshooting and debugging
DevOps Training (4-8 hours)
- Session 1: Architecture deep dive
- Session 2: Infrastructure setup (hands-on)
- Session 3: CI/CD operations
- Session 4: Incident response and recovery
Documentation
- All procedures documented in cicd/ folder
- Video tutorials (optional, future)
- Regular knowledge sharing sessions
🔮 Future Enhancements
Short-Term (3-6 months)
- Canary deployments (gradual traffic shifting)
- Feature flags (LaunchDarkly/Unleash)
- Visual regression testing (Percy/Chromatic)
- Load testing (k6/Artillery)
- Mobile E2E testing (Detox/Maestro)
Long-Term (6-12 months)
- Kubernetes migration (when scale demands)
- Multi-region deployment
- Global load balancing
- Database replication
- Advanced observability (distributed tracing)
✅ Plan Approval
Created by: Hive Mind Collective Intelligence
Reviewed by: **_**
Approved by: **_**
Approval Date: **_**
Next Steps:
- Review this plan with the team
- Get budget approval ($56-146/month)
- Start implementation following TODO.md
- Track progress in CHANGELOG.md
Last Updated: 2025-11-27
Version: 1.0
Status: Ready for Implementation ✅