# CI/CD Implementation Plan

**Last Updated**: 2025-11-27
**Status**: Design Complete → Implementation Pending
**Estimated Timeline**: 5-7 days (2-person team)

---

## 📋 Plan Overview

This document outlines the complete plan for implementing CI/CD infrastructure for the manacore-monorepo, from initial setup to production deployment.

---

## 🎯 Goals & Success Criteria

### Primary Goals

1. **Automate deployments** - Deploy with a single commit to main
2. **Zero-downtime updates** - Blue-green deployment strategy
3. **Enforce quality** - Automated testing with 80% coverage
4. **Cost efficiency** - 92% savings vs traditional PaaS ($56/month vs $300+)
5. **Team productivity** - Reduce deployment time from 2+ hours to < 10 minutes

### Success Criteria

- ✅ Staging auto-deploys on merge to main
- ✅ Production deploys take < 10 minutes
- ✅ Rollback can be executed in < 5 minutes
- ✅ Test coverage enforced at 80% minimum
- ✅ All 39 services deployed and healthy
- ✅ Monitoring and alerting operational
- ✅ Team can confidently deploy without assistance

---

## 🏗️ Architecture Overview

### Infrastructure Stack

- **Platform**: Coolify (open-source PaaS)
- **Hosting**: Hetzner Cloud (German data centers)
- **Container Runtime**: Docker + Docker Compose
- **CI/CD**: GitHub Actions
- **Monitoring**: Prometheus + Grafana + Loki
- **Error Tracking**: Sentry
- **CDN**: Cloudflare

### Service Inventory (39 Services Total)

**Authentication**:

- mana-core-auth (NestJS) - Central authentication service

**Chat Project** (4 services):

- chat-backend (NestJS)
- chat-web (SvelteKit)
- chat-mobile (Expo - OTA updates)
- chat-landing (Astro)

**Maerchenzauber Project** (4 services):

- maerchenzauber-backend (NestJS)
- maerchenzauber-web (SvelteKit)
- maerchenzauber-mobile (Expo)
- maerchenzauber-landing (Astro)

**Manadeck Project** (4 services):

- manadeck-backend (NestJS)
- manadeck-web (SvelteKit)
- manadeck-mobile (Expo)
- manadeck-landing (Astro)

**Memoro Project** (3
services):

- memoro-web (SvelteKit)
- memoro-mobile (Expo)
- memoro-landing (Astro)

**Picture Project** (3 services):

- picture-web (SvelteKit)
- picture-mobile (Expo)
- picture-landing (Astro)

**Wisekeep Project** (4 services):

- wisekeep-backend (NestJS)
- wisekeep-web (SvelteKit)
- wisekeep-mobile (Expo)
- wisekeep-landing (Astro)

**Quote Project** (4 services):

- quote-backend (NestJS)
- quote-web (SvelteKit)
- quote-mobile (Expo)
- quote-landing (Astro)

**Nutriphi Project** (2 services):

- nutriphi-backend (NestJS)
- nutriphi-web (SvelteKit)

**Uload Project** (1 service):

- uload-web (SvelteKit)

**Bauntown Project** (1 service):

- bauntown-landing (Astro)

**Manacore Project** (2 services):

- manacore-web (SvelteKit)
- manacore-mobile (Expo)

**Shared Infrastructure** (2 services):

- postgres (PostgreSQL 16)
- redis (Redis 7)

---

## 📅 Implementation Timeline

### Week 1: Foundation (Days 1-2)

**Goal**: Infrastructure setup and first deployment

**Day 1 Morning** (2-3 hours):

- Set up Hetzner account
- Provision staging server (CCX32)
- Install Coolify
- Configure GitHub Container Registry

**Day 1 Afternoon** (3-4 hours):

- Configure GitHub secrets (staging)
- Create first Dockerfile (mana-core-auth)
- Test CI/CD pipeline with test PR
- Deploy mana-core-auth to staging

**Day 2** (6-8 hours):

- Create Dockerfiles for remaining backends (6 services)
- Deploy all backends to staging
- Verify health checks
- Test inter-service communication

---

### Week 1: Web Apps (Days 3-4)

**Goal**: Deploy web apps and landing pages

**Day 3** (6-8 hours):

- Create SvelteKit Dockerfiles (9 services)
- Test builds locally
- Deploy to staging
- Configure reverse proxy/domains

**Day 4** (6-8 hours):

- Create Astro Dockerfiles (9 services)
- Deploy landing pages
- Set up SSL/TLS (Let's Encrypt)
- Test all web apps end-to-end

---

### Week 2: Testing & Production (Days 5-7)

**Goal**: Implement testing and deploy to production

**Day 5** (6-8 hours):

- Write critical path tests (auth,
  payments) - 100% coverage
- Configure test frameworks
- Enable coverage enforcement in CI
- Fix any failing tests

**Day 6** (6-8 hours):

- Provision production server
- Configure production secrets
- Set up GitHub environments (approval gates)
- Deploy mana-core-auth to production

**Day 7** (6-8 hours):

- Deploy all services to production
- Configure DNS for all domains
- Set up monitoring (Prometheus + Grafana)
- Verify everything works in production

---

### Week 2-3: Monitoring & Optimization (Days 8-10+)

**Goal**: Set up monitoring and optimize

**Day 8** (4-6 hours):

- Install Loki for logging
- Configure Grafana dashboards
- Set up alerting (Prometheus Alertmanager)
- Integrate Sentry for error tracking

**Day 9** (4-6 hours):

- Set up automated backups
- Test backup restoration
- Perform disaster recovery drill
- Document procedures

**Day 10+** (ongoing):

- Write remaining tests (80% coverage target)
- Performance optimization (caching, CDN)
- Team training
- Documentation updates

---

## 🔄 Development Workflow

### Developer Workflow

```
1. Create feature branch
   ↓
2. Write code + tests
   ↓
3. Push to GitHub
   ↓
4. GitHub Actions runs:
   - Lint
   - Type check
   - Build
   - Tests (with coverage)
   ↓
5. PR approved + merged to main
   ↓
6. GitHub Actions builds Docker images
   ↓
7. Images pushed to ghcr.io
   ↓
8. Auto-deploy to staging
   ↓
9. (Optional) Manual deploy to production
```

### Deployment Workflow

```
Staging (Automatic):
  Merge to main → Build → Push → Deploy → Health Check → Done

Production (Manual Approval):
  Manual trigger → Approval gate → Backup → Deploy → Health Check → Monitor 5 min → Done (or Rollback)
```

---

## 🐳 Docker Strategy

### Multi-Stage Builds

All Dockerfiles use multi-stage builds for optimization:

**Stage 1: Dependencies**

- Install pnpm and dependencies
- Uses layer caching

**Stage 2: Build**

- Build application
- Generate production artifacts

**Stage 3: Runtime**

- Alpine Linux base (minimal)
- Copy only production artifacts
- Non-root user
- Health checks configured

### Image Naming Convention

```
ghcr.io/wuesteon/mana-core-auth:latest
ghcr.io/wuesteon/mana-core-auth:main
ghcr.io/wuesteon/mana-core-auth:main-abc1234

ghcr.io/wuesteon/chat-backend:latest
ghcr.io/wuesteon/chat-backend:main
ghcr.io/wuesteon/chat-backend:main-abc1234
```

**Tags**:

- `latest` - Most recent build from main
- `main` - Branch-based tag
- `main-abc1234` - Git commit SHA (for rollbacks)

---

## 🧪 Testing Strategy

### Coverage Targets

- **Critical Paths**: 100% coverage required
  - Authentication (`@manacore/shared-auth`)
  - Payment/credit system
  - Data integrity (migrations, RLS)
- **General Code**: 80% coverage minimum
  - Backend services
  - Frontend apps
  - Shared packages

### Test Types

**Unit Tests**:

- All services and components
- Frameworks: Jest (backend/mobile), Vitest (web/shared)

**Integration Tests**:

- API endpoints with test database
- Service interactions

**E2E Tests** (Phase 2):

- Playwright for web apps
- Detox/Maestro for mobile apps

### CI/CD Integration

- Run on every PR
- Enforce coverage thresholds
- Block merge if tests fail or coverage below 80%
- Parallel execution for speed

---

## 🚀 Deployment Strategy

### Blue-Green Deployment

```
Current (Blue):        New (Green):
   v1.0          →     v1.1 (deploying)
                         ↓
                       Health check
                         ↓
                       Tests pass
                         ↓
Traffic → Blue   →     Switch traffic → Green
                         ↓
                       Monitor 1 hour
                         ↓
                       Decommission Blue
```

**Benefits**:

- Zero downtime
- Instant rollback (switch back to blue)
- Test new version before full cutover

### Rollback Procedure

1. Detect issue (monitoring alerts or manual detection)
2. Run `scripts/deploy/rollback.sh`
3. Switch traffic back to previous version
4. Restore database from backup (if needed)
5. Total time: < 5 minutes

---

## 📊 Monitoring Strategy

### Metrics Collection (Prometheus)

**Application Metrics**:

- Request rate (requests/second)
- Error rate (% of failed requests)
- Response time (p50, p95, p99)
- Active connections

**Infrastructure Metrics**:

- CPU usage per service
- Memory usage per service
- Disk usage
- Network I/O

### Logging (Loki + Grafana)

**Log Aggregation**:

- All containers → stdout/stderr → Loki → Grafana
- Structured JSON logs
- Correlation IDs for tracing

**Log Retention**:

- 7 days online (searchable)
- 30 days archived (backup)

### Error Tracking (Sentry)

**What's Tracked**:

- Application errors and exceptions
- Source maps for better stack traces
- User context (anonymized)
- Performance metrics

### Alerting (Prometheus Alertmanager)

**Alert Rules**:

- Service down (health check fails for 2 minutes)
- High error rate (> 5% of requests failing)
- High CPU usage (> 80% for 5 minutes)
- High memory usage (> 90% for 5 minutes)
- Disk space low (< 10% free)

**Notification Channels**:

- Slack (all alerts)
- PagerDuty (critical alerts only)
- Email (daily summary)

---

## 💰 Cost Breakdown

### Infrastructure Costs (Monthly)

**Phase 1: Single Server (Recommended Start)**

| Item | Cost | Notes |
|------|------|-------|
| Hetzner CCX32 | $50 | 8 vCPU, 32 GB RAM, 240 GB SSD |
| Domains (6x) | $6 | $12/year each |
| Cloudflare CDN | $0 | Free tier |
| GitHub Actions | $0 | Within free tier |
| GitHub Container Registry | $0 | 500 MB free |
| **Total** | **$56** | |

**Phase 2: Multi-Server (Production Scale)**

| Item | Cost | Notes |
|------|------|-------|
| Staging (CCX22) | $25 | 4 vCPU, 16 GB RAM |
| Production (CCX42) | $100 | 16 vCPU, 64 GB RAM |
| Monitoring (CX32) | $15 | 4 vCPU, 8 GB RAM |
| Domains | $6 | Same as above |
| CDN, GitHub | $0 | Free tiers |
| **Total** | **$146** | |

**Cost Savings**:

- vs AWS/Azure: $500-1,000/month (89-95% savings)
- vs Heroku/Railway: $300-500/month (71-83% savings)
- vs DigitalOcean: $150-300/month (51-71% savings)

### Resource Allocation (Per Service)

| Service Type   | CPU  | RAM    | Instances | Total                       |
| -------------- | ---- | ------ | --------- | --------------------------- |
| NestJS Backend | 0.5  | 512 MB | 10        | 5 CPU, 5 GB RAM             |
| SvelteKit Web  | 0.25 | 256 MB | 9         | 2.25 CPU, 2.25 GB RAM       |
| Astro Landing  | 0.1  | 128 MB | 9         | 0.9 CPU, 1.1 GB RAM         |
| PostgreSQL     | 1    | 2 GB   | 1         | 1 CPU, 2 GB RAM             |
| Redis          | 0.25 | 256 MB | 1         | 0.25 CPU, 256 MB RAM        |
| Monitoring     | 1    | 2 GB   | 1         | 1 CPU, 2 GB RAM             |
| **Total**      |      |        |           | **~10.5 CPU, ~12.5 GB RAM** |

**Conclusion**: CCX32 (8 vCPU, 32 GB RAM) covers the memory budget with headroom for growth (~12.5 GB of 32 GB). The ~10.5 CPU total exceeds the 8 available vCPUs, so the single-server setup fits only if these CPU figures are treated as per-service limits that are not all reached at the same time.
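
As a cross-check of the resource table, here is a minimal Python sketch that re-adds the per-service figures and compares them with the CCX32 capacity. The `Allocation` class, the `totals` helper, and the variable names are illustrative assumptions and not part of the repository; the numbers are copied from the tables in this plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """One row of the resource table (hypothetical helper type)."""
    name: str
    cpu: float       # CPU cores per instance
    ram_mb: int      # memory per instance, in MB
    instances: int


# Figures copied from the Resource Allocation table.
ALLOCATIONS = [
    Allocation("NestJS Backend", 0.5, 512, 10),
    Allocation("SvelteKit Web", 0.25, 256, 9),
    Allocation("Astro Landing", 0.1, 128, 9),
    Allocation("PostgreSQL", 1.0, 2048, 1),
    Allocation("Redis", 0.25, 256, 1),
    Allocation("Monitoring", 1.0, 2048, 1),
]

# CCX32 capacity as listed in the Phase 1 cost table.
SERVER_VCPU = 8
SERVER_RAM_GB = 32


def totals(allocations):
    """Return (total CPU cores, total RAM in GB) across all instances."""
    cpu = sum(a.cpu * a.instances for a in allocations)
    ram_gb = sum(a.ram_mb * a.instances for a in allocations) / 1024
    return cpu, ram_gb


total_cpu, total_ram_gb = totals(ALLOCATIONS)
print(f"CPU: {total_cpu:.2f} of {SERVER_VCPU} vCPU")
print(f"RAM: {total_ram_gb:.2f} of {SERVER_RAM_GB} GB")
print("RAM fits:", total_ram_gb <= SERVER_RAM_GB)
print("CPU fits without oversubscription:", total_cpu <= SERVER_VCPU)
```

With the figures above this comes to roughly 10.4 CPU and 12.6 GB RAM. Memory fits with room to spare, while the CPU sum sits above 8 vCPUs, which only matters if every service reaches its CPU figure at once.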
---

## 🔐 Security Measures

### Infrastructure Security

- [x] Firewall rules (only ports 22, 80, 443 exposed)
- [x] SSH key-based authentication (no passwords)
- [x] Non-root Docker containers
- [x] Read-only filesystems where possible
- [x] Network segmentation (frontend, backend, data layers)
- [x] Automatic security updates

### Application Security

- [x] Environment variable encryption (GitHub Secrets)
- [x] SSL/TLS for all services (Let's Encrypt)
- [x] JWT-based authentication (@manacore/shared-auth)
- [x] Row-Level Security (Supabase RLS policies)
- [x] Input validation and sanitization
- [x] CORS policies enforced

### CI/CD Security

- [x] Weekly dependency audits (Dependabot)
- [x] Docker image scanning (Trivy)
- [x] No secrets in code
- [x] Branch protection rules
- [x] Required code reviews
- [x] Signed commits (recommended)

### Compliance

- [x] GDPR compliance (Hetzner EU data centers)
- [x] ISO 27001 certified infrastructure
- [x] SOC 2 Type II (Supabase)
- [x] Automated backup retention policies
- [x] Audit logs (GitHub Actions, Coolify, Supabase)

---

## 🔄 Backup & Disaster Recovery

### Backup Strategy

**What's Backed Up**:

- PostgreSQL databases (daily)
- Redis data (daily)
- Docker volumes
- Environment configurations
- Deployment manifests

**Backup Schedule**:

- Daily automated backups at 2 AM UTC
- Retention: 30 days for databases, 7 days for Redis
- Storage: Cloudflare R2 or Hetzner Storage Box

**Backup Verification**:

- Weekly automated restoration tests
- Monthly manual restoration drills

### Disaster Recovery

**Recovery Time Objective (RTO)**:

- Service restart: < 1 hour
- Full server restore: < 2 hours

**Recovery Point Objective (RPO)**:

- < 24 hours (daily backups)
- Supabase PITR available for point-in-time recovery

**Recovery Procedures**:

1. **Service Failure**: Restart container (automated)
2. **Data Corruption**: Restore from latest backup
3. **Server Failure**: Provision new server, restore from backup
4. **Region Failure**: Failover to secondary region (future phase)

---

## 📚 Documentation Strategy

### For Developers

- Quick start guide (30 minutes to first deployment)
- Testing guide (how to write and run tests)
- Troubleshooting guide (common issues)
- Contributing guide (standards and patterns)

### For DevOps

- Architecture documentation (complete system design)
- Deployment runbooks (step-by-step procedures)
- Monitoring guide (dashboards and alerts)
- Incident response playbooks

### For Management

- Cost analysis and projections
- Success metrics and KPIs
- Timeline and milestones
- Risk assessment and mitigation

---

## 🎯 Phase Gates

### Phase 1 Complete When:

- [x] Hetzner account created
- [x] Staging server provisioned and Coolify installed
- [x] GitHub secrets configured
- [x] First service deployed to staging
- [x] CI/CD pipeline tested end-to-end

### Phase 2 Complete When:

- [x] All backend services deployed
- [x] All web apps deployed
- [x] All landing pages deployed
- [x] SSL/TLS configured for all domains
- [x] Health checks passing for all services

### Phase 3 Complete When:

- [x] Critical path tests at 100% coverage
- [x] General code at 80% coverage
- [x] Coverage enforcement in CI
- [x] All tests passing consistently

### Phase 4 Complete When:

- [x] Production server provisioned
- [x] All services deployed to production
- [x] Monitoring operational (Prometheus + Grafana + Loki)
- [x] Alerting configured and tested
- [x] Backups automated and verified

---

## 🚧 Risk Management

### Identified Risks

**Risk 1: Budget Overruns**

- **Likelihood**: Low
- **Impact**: Medium
- **Mitigation**: Start with single server ($56/month), scale only when needed
- **Contingency**: Downgrade server size, optimize resource usage

**Risk 2: Deployment Failures**

- **Likelihood**: Medium (during initial rollout)
- **Impact**: High
- **Mitigation**: Blue-green deployment, automated rollback, comprehensive testing
- **Contingency**: Rollback procedures documented
  and tested

**Risk 3: Service Outages**

- **Likelihood**: Low
- **Impact**: High
- **Mitigation**: Health checks, monitoring, automated restarts
- **Contingency**: Incident response playbooks, 24/7 monitoring

**Risk 4: Data Loss**

- **Likelihood**: Very Low
- **Impact**: Critical
- **Mitigation**: Daily backups, Supabase PITR, backup verification
- **Contingency**: Multiple backup locations, disaster recovery drills

**Risk 5: Security Breaches**

- **Likelihood**: Low
- **Impact**: Critical
- **Mitigation**: Security best practices, automated audits, minimal attack surface
- **Contingency**: Incident response plan, security patches, audit logs

---

## 📈 Success Metrics & KPIs

### Deployment Metrics

- **Deployment Frequency**: Target > 5/week (currently < 1/week)
- **Deployment Duration**: Target < 10 minutes (currently 2+ hours manual)
- **Deployment Success Rate**: Target > 95%
- **Rollback Time**: Target < 5 minutes

### Quality Metrics

- **Test Coverage**: Target 80% minimum (currently ~5%)
- **Critical Path Coverage**: Target 100% (currently ~0%)
- **Build Success Rate**: Target > 95%
- **Code Review Turnaround**: Target < 24 hours

### Reliability Metrics

- **Uptime**: Target 99.9% (43 minutes downtime/month)
- **Mean Time to Recovery (MTTR)**: Target < 1 hour
- **Mean Time Between Failures (MTBF)**: Target > 30 days
- **Backup Success Rate**: Target 100%

### Cost Metrics

- **Infrastructure Cost**: Target < $100/month (achieved: $56/month)
- **Cost per Service**: Target < $5/month
- **Cost Reduction**: 92% vs traditional PaaS

---

## 🎓 Training & Knowledge Transfer

### Developer Training (2-3 hours)

- **Session 1**: CI/CD basics and GitHub Actions
- **Session 2**: Writing and running tests
- **Session 3**: Docker and deployment
- **Session 4**: Troubleshooting and debugging

### DevOps Training (4-8 hours)

- **Session 1**: Architecture deep dive
- **Session 2**: Infrastructure setup (hands-on)
- **Session 3**: CI/CD operations
- **Session 4**: Incident response
  and recovery

### Documentation

- All procedures documented in `cicd/` folder
- Video tutorials (optional, future)
- Regular knowledge sharing sessions

---

## 🔮 Future Enhancements

### Short-Term (3-6 months)

- [ ] Canary deployments (gradual traffic shifting)
- [ ] Feature flags (LaunchDarkly/Unleash)
- [ ] Visual regression testing (Percy/Chromatic)
- [ ] Load testing (k6/Artillery)
- [ ] Mobile E2E testing (Detox/Maestro)

### Long-Term (6-12 months)

- [ ] Kubernetes migration (when scale demands)
- [ ] Multi-region deployment
- [ ] Global load balancing
- [ ] Database replication
- [ ] Advanced observability (distributed tracing)

---

## ✅ Plan Approval

**Created by**: Hive Mind Collective Intelligence
**Reviewed by**: ________
**Approved by**: ________
**Approval Date**: ________

**Next Steps**:

1. Review this plan with the team
2. Get budget approval ($56-146/month)
3. Start implementation following `TODO.md`
4. Track progress in `CHANGELOG.md`

---

**Version**: 1.0
**Status**: Ready for Implementation ✅