fix(cicd): docker paths, formatting config, and documentation

- Fix Docker build paths in maerchenzauber and manadeck backends
- Add comprehensive CI/CD documentation (private repo solution, type analysis)
- Configure Prettier with proper plugins for Astro/Svelte
- Update .gitignore to exclude .hive-mind and .claude-flow
- Fix Turbo config for Presi app

Related to cicd/integration branch - Priority 1 & 2 fixes
2026-05-14 20:21:09 +02:00 · 2025-11-27 18:33:08 +01:00 · 2025-11-27 18:33:08 +01:00 · 0241f5554c
commit 0241f5554c
parent f55962e135
16 changed files with 2173 additions and 187 deletions
--- a/cicd/PLAN.md
+++ b/cicd/PLAN.md
@ -15,6 +15,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ## 🎯 Goals & Success Criteria

 ### Primary Goals
+
 1. **Automate deployments** - Deploy with a single commit to main
 2. **Zero-downtime updates** - Blue-green deployment strategy
 3. **Enforce quality** - Automated testing with 80% coverage
@ -22,6 +23,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 5. **Team productivity** - Reduce deployment time from 2+ hours to < 10 minutes

 ### Success Criteria
+
 - ✅ Staging auto-deploys on merge to main
 - ✅ Production deploys take < 10 minutes
 - ✅ Rollback can be executed in < 5 minutes
@ -35,6 +37,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ## 🏗️ Architecture Overview

 ### Infrastructure Stack
+
 - **Platform**: Coolify (open-source PaaS)
 - **Hosting**: Hetzner Cloud (German data centers)
 - **Container Runtime**: Docker + Docker Compose
@ -46,63 +49,76 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ### Service Inventory (35 Services Total)

 **Authentication**:
+
 - mana-core-auth (NestJS) - Central authentication service

 **Chat Project** (4 services):
+
 - chat-backend (NestJS)
 - chat-web (SvelteKit)
 - chat-mobile (Expo - OTA updates)
 - chat-landing (Astro)

 **Maerchenzauber Project** (4 services):
+
 - maerchenzauber-backend (NestJS)
 - maerchenzauber-web (SvelteKit)
 - maerchenzauber-mobile (Expo)
 - maerchenzauber-landing (Astro)

 **Manadeck Project** (4 services):
+
 - manadeck-backend (NestJS)
 - manadeck-web (SvelteKit)
 - manadeck-mobile (Expo)
 - manadeck-landing (Astro)

 **Memoro Project** (3 services):
+
 - memoro-web (SvelteKit)
 - memoro-mobile (Expo)
 - memoro-landing (Astro)

 **Picture Project** (3 services):
+
 - picture-web (SvelteKit)
 - picture-mobile (Expo)
 - picture-landing (Astro)

 **Wisekeep Project** (4 services):
+
 - wisekeep-backend (NestJS)
 - wisekeep-web (SvelteKit)
 - wisekeep-mobile (Expo)
 - wisekeep-landing (Astro)

 **Quote Project** (4 services):
+
 - quote-backend (NestJS)
 - quote-web (SvelteKit)
 - quote-mobile (Expo)
 - quote-landing (Astro)

 **Nutriphi Project** (2 services):
+
 - nutriphi-backend (NestJS)
 - nutriphi-web (SvelteKit)

 **Uload Project** (1 service):
+
 - uload-web (SvelteKit)

 **Bauntown Project** (1 service):
+
 - bauntown-landing (Astro)

 **Manacore Project** (2 services):
+
 - manacore-web (SvelteKit)
 - manacore-mobile (Expo)

 **Shared Infrastructure** (2 services):
+
 - postgres (PostgreSQL 16)
 - redis (Redis 7)

@ -111,21 +127,25 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ## 📅 Implementation Timeline

 ### Week 1: Foundation (Days 1-2)
+
 **Goal**: Infrastructure setup and first deployment

 **Day 1 Morning** (2-3 hours):
+
 - Set up Hetzner account
 - Provision staging server (CCX32)
 - Install Coolify
 - Configure GitHub Container Registry

 **Day 1 Afternoon** (3-4 hours):
+
 - Configure GitHub secrets (staging)
 - Create first Dockerfile (mana-core-auth)
 - Test CI/CD pipeline with test PR
 - Deploy mana-core-auth to staging

 **Day 2** (6-8 hours):
+
 - Create Dockerfiles for remaining backends (6 services)
 - Deploy all backends to staging
 - Verify health checks
@ -134,15 +154,18 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ---

 ### Week 1: Web Apps (Days 3-4)
+
 **Goal**: Deploy web apps and landing pages

 **Day 3** (6-8 hours):
+
 - Create SvelteKit Dockerfiles (9 services)
 - Test builds locally
 - Deploy to staging
 - Configure reverse proxy/domains

 **Day 4** (6-8 hours):
+
 - Create Astro Dockerfiles (9 services)
 - Deploy landing pages
 - Set up SSL/TLS (Let's Encrypt)
@ -151,21 +174,25 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ---

 ### Week 2: Testing & Production (Days 5-7)
+
 **Goal**: Implement testing and deploy to production

 **Day 5** (6-8 hours):
+
 - Write critical path tests (auth, payments) - 100% coverage
 - Configure test frameworks
 - Enable coverage enforcement in CI
 - Fix any failing tests

 **Day 6** (6-8 hours):
+
 - Provision production server
 - Configure production secrets
 - Set up GitHub environments (approval gates)
 - Deploy mana-core-auth to production

 **Day 7** (6-8 hours):
+
 - Deploy all services to production
 - Configure DNS for all domains
 - Set up monitoring (Prometheus + Grafana)
@ -174,21 +201,25 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ---

 ### Week 2-3: Monitoring & Optimization (Days 8-10+)
+
 **Goal**: Set up monitoring and optimize

 **Day 8** (4-6 hours):
+
 - Install Loki for logging
 - Configure Grafana dashboards
 - Set up alerting (Prometheus Alertmanager)
 - Integrate Sentry for error tracking

 **Day 9** (4-6 hours):
+
 - Set up automated backups
 - Test backup restoration
 - Perform disaster recovery drill
 - Document procedures

 **Day 10+** (ongoing):
+
 - Write remaining tests (80% coverage target)
 - Performance optimization (caching, CDN)
 - Team training
@ -199,6 +230,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ## 🔄 Development Workflow

 ### Developer Workflow
+
 ```
 1. Create feature branch
   ↓
@ -224,6 +256,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
 ```

 ### Deployment Workflow
+
 ```
 Staging (Automatic):
  Merge to main → Build → Push → Deploy → Health Check → Done
@ -238,23 +271,28 @@ Production (Manual Approval):
 ## 🐳 Docker Strategy

 ### Multi-Stage Builds
+
 All Dockerfiles use multi-stage builds for optimization:

 **Stage 1: Dependencies**
+
 - Install pnpm and dependencies
 - Uses layer caching

 **Stage 2: Build**
+
 - Build application
 - Generate production artifacts

 **Stage 3: Runtime**
+
 - Alpine Linux base (minimal)
 - Copy only production artifacts
 - Non-root user
 - Health checks configured

 ### Image Naming Convention
+
 ```
 ghcr.io/wuesteon/mana-core-auth:latest
 ghcr.io/wuesteon/mana-core-auth:main
@ -266,6 +304,7 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
 ```

 **Tags**:
+
 - `latest` - Most recent build from main
 - `main` - Branch-based tag
 - `main-abc1234` - Git commit SHA (for rollbacks)
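
The tagging scheme above can be sketched as a small helper. This is an illustration only, not the pipeline's actual tagging step: the function name `image_tags`, the 7-character short SHA, and the replacement of `/` in branch names are assumptions made for the example.

```python
def image_tags(service: str, branch: str, commit_sha: str,
               registry: str = "ghcr.io", owner: str = "wuesteon") -> list[str]:
    """Build the image references for one build of one service."""
    repo = f"{registry}/{owner}/{service}"
    # Assumption: image tags may not contain "/", so feature branches are flattened.
    safe_branch = branch.replace("/", "-")
    # Assumption: the short SHA is the first 7 characters, as in "main-abc1234".
    short_sha = commit_sha[:7]

    tags = [safe_branch, f"{safe_branch}-{short_sha}"]
    if branch == "main":
        # "latest" only ever points at the most recent build from main.
        tags.insert(0, "latest")
    return [f"{repo}:{tag}" for tag in tags]


if __name__ == "__main__":
    for ref in image_tags("mana-core-auth", "main", "abc1234def5678"):
        print(ref)
```

Keeping the SHA-suffixed tag immutable is what makes it usable for rollbacks, since `latest` and `main` move with every build.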
@ -275,6 +314,7 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
 ## 🧪 Testing Strategy

 ### Coverage Targets
+
 - **Critical Paths**: 100% coverage required
  - Authentication (`@manacore/shared-auth`)
  - Payment/credit system
@ -286,19 +326,24 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
  - Shared packages

 ### Test Types
+
 **Unit Tests**:
+
 - All services and components
 - Frameworks: Jest (backend/mobile), Vitest (web/shared)

 **Integration Tests**:
+
 - API endpoints with test database
 - Service interactions

 **E2E Tests** (Phase 2):
+
 - Playwright for web apps
 - Detox/Maestro for mobile apps

 ### CI/CD Integration
+
 - Run on every PR
 - Enforce coverage thresholds
 - Block merge if tests fail or coverage below 80%
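
A minimal sketch of the merge gate described above, assuming per-package coverage percentages are already available from the test run. The function name `coverage_gate` and the input shape are hypothetical; in practice the thresholds would more likely live in the Jest or Vitest configuration.

```python
CRITICAL_THRESHOLD = 100.0  # critical paths: 100% coverage required
GENERAL_THRESHOLD = 80.0    # everything else: 80% minimum


def coverage_gate(coverage_by_package: dict[str, float],
                  critical: set[str]) -> list[str]:
    """Return one message per package below its threshold (empty list = pass)."""
    failures = []
    for package, pct in sorted(coverage_by_package.items()):
        required = CRITICAL_THRESHOLD if package in critical else GENERAL_THRESHOLD
        if pct < required:
            failures.append(f"{package}: {pct:.1f}% < {required:.0f}%")
    return failures


if __name__ == "__main__":
    report = {"@manacore/shared-auth": 99.5, "chat-backend": 82.0, "chat-web": 79.9}
    problems = coverage_gate(report, critical={"@manacore/shared-auth"})
    for line in problems:
        print(line)
    # A CI job would exit non-zero here to block the merge.
    raise SystemExit(1 if problems else 0)
```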
@ -309,6 +354,7 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
 ## 🚀 Deployment Strategy

 ### Blue-Green Deployment
+
 ```
 Current (Blue):    New (Green):
    v1.0    →      v1.1 (deploying)
@ -325,11 +371,13 @@ Traffic → Blue → Switch traffic → Green
 ```

 **Benefits**:
+
 - Zero downtime
 - Instant rollback (switch back to blue)
 - Test new version before full cutover

 ### Rollback Procedure
+
 1. Detect issue (monitoring alerts or manual detection)
 2. Run `scripts/deploy/rollback.sh`
 3. Switch traffic back to previous version
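
The cutover and rollback logic can be sketched as follows. This is not the content of `scripts/deploy/rollback.sh`; `Router` is a hypothetical stand-in for whatever the reverse proxy exposes, and the health check is passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Router:
    """Hypothetical traffic switch: records which colour receives traffic."""
    live: str = "blue"
    previous: Optional[str] = None

    def switch_to(self, colour: str) -> None:
        self.previous, self.live = self.live, colour

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.live, self.previous = self.previous, self.live


def deploy(router: Router, is_healthy: Callable[[str], bool]) -> bool:
    """Cut traffic over to the idle colour only if its health check passes."""
    target = "green" if router.live == "blue" else "blue"
    if not is_healthy(target):
        return False  # traffic never left the current version
    router.switch_to(target)
    return True
```

Because the old version keeps running until the next deploy, rolling back is only a traffic switch, which is what makes the sub-5-minute rollback target plausible.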
@ -341,37 +389,47 @@ Traffic → Blue → Switch traffic → Green
 ## 📊 Monitoring Strategy

 ### Metrics Collection (Prometheus)
+
 **Application Metrics**:
+
 - Request rate (requests/second)
 - Error rate (% of failed requests)
 - Response time (p50, p95, p99)
 - Active connections

 **Infrastructure Metrics**:
+
 - CPU usage per service
 - Memory usage per service
 - Disk usage
 - Network I/O

 ### Logging (Loki + Grafana)
+
 **Log Aggregation**:
+
 - All containers → stdout/stderr → Loki → Grafana
 - Structured JSON logs
 - Correlation IDs for tracing

 **Log Retention**:
+
 - 7 days online (searchable)
 - 30 days archived (backup)

 ### Error Tracking (Sentry)
+
 **What's Tracked**:
+
 - Application errors and exceptions
 - Source maps for better stack traces
 - User context (anonymized)
 - Performance metrics

 ### Alerting (Prometheus Alertmanager)
+
 **Alert Rules**:
+
 - Service down (health check fails for 2 minutes)
 - High error rate (> 5% of requests failing)
 - High CPU usage (> 80% for 5 minutes)
@ -379,6 +437,7 @@ Traffic → Blue → Switch traffic → Green
 - Disk space low (< 10% free)

 **Notification Channels**:
+
 - Slack (all alerts)
 - PagerDuty (critical alerts only)
 - Email (daily summary)
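
The alert thresholds and routing above can be expressed as plain logic. This is a Python illustration of the rules, not Prometheus or Alertmanager configuration; the `Snapshot` fields, the alert names, and the choice of which alerts count as critical are assumptions, and only the rules visible in this hunk are covered.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    seconds_unhealthy: float     # how long the health check has been failing
    error_rate: float            # fraction of failed requests, 0.0 to 1.0
    seconds_cpu_above_80: float  # how long CPU usage has stayed above 80%
    disk_free: float             # fraction of disk space free, 0.0 to 1.0


def firing_alerts(s: Snapshot) -> list[str]:
    """Return the names of the alert rules that currently fire."""
    alerts = []
    if s.seconds_unhealthy >= 2 * 60:
        alerts.append("service_down")
    if s.error_rate > 0.05:
        alerts.append("high_error_rate")
    if s.seconds_cpu_above_80 >= 5 * 60:
        alerts.append("high_cpu")
    if s.disk_free < 0.10:
        alerts.append("disk_space_low")
    return alerts


# Assumption: only an outage pages someone; the plan does not say which alerts are critical.
CRITICAL = {"service_down"}


def channels_for(alert: str) -> list[str]:
    """Slack receives every alert; PagerDuty receives critical ones only."""
    return ["slack", "pagerduty"] if alert in CRITICAL else ["slack"]
```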
@ -410,20 +469,22 @@ Traffic → Blue → Switch traffic → Green
 | **Total** | **$146** | |

 **Cost Savings**:
+
 - vs AWS/Azure: $500-1,000/month (89-95% savings)
 - vs Heroku/Railway: $300-500/month (71-83% savings)
 - vs DigitalOcean: $150-300/month (51-71% savings)

 ### Resource Allocation (Per Service)
-| Service Type | CPU | RAM | Instances | Total |
-|--------------|-----|-----|-----------|-------|
-| NestJS Backend | 0.5 | 512 MB | 10 | 5 CPU, 5 GB RAM |
-| SvelteKit Web | 0.25 | 256 MB | 9 | 2.25 CPU, 2.25 GB RAM |
-| Astro Landing | 0.1 | 128 MB | 9 | 0.9 CPU, 1.1 GB RAM |
-| PostgreSQL | 1 | 2 GB | 1 | 1 CPU, 2 GB RAM |
-| Redis | 0.25 | 256 MB | 1 | 0.25 CPU, 256 MB RAM |
-| Monitoring | 1 | 2 GB | 1 | 1 CPU, 2 GB RAM |
-| **Total** | | | | **~10.5 CPU, ~12.5 GB RAM** |
+
+| Service Type   | CPU  | RAM    | Instances | Total                       |
+| -------------- | ---- | ------ | --------- | --------------------------- |
+| NestJS Backend | 0.5  | 512 MB | 10        | 5 CPU, 5 GB RAM             |
+| SvelteKit Web  | 0.25 | 256 MB | 9         | 2.25 CPU, 2.25 GB RAM       |
+| Astro Landing  | 0.1  | 128 MB | 9         | 0.9 CPU, 1.1 GB RAM         |
+| PostgreSQL     | 1    | 2 GB   | 1         | 1 CPU, 2 GB RAM             |
+| Redis          | 0.25 | 256 MB | 1         | 0.25 CPU, 256 MB RAM        |
+| Monitoring     | 1    | 2 GB   | 1         | 1 CPU, 2 GB RAM             |
+| **Total**      |      |        |           | **~10.5 CPU, ~12.5 GB RAM** |

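 **Conclusion**: CCX32 (8 vCPU, 32 GB RAM) covers the ~12.5 GB RAM total with ample headroom for growth. The ~10.5 CPU total exceeds 8 vCPU, so it fits only if the per-service CPU figures are treated as limits that are not all reached at the same time.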
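
The totals in the table can be re-derived with a few lines of arithmetic. The figures below are copied from the table; the 8 vCPU and 32 GB server size comes from the conclusion. Treating the per-service numbers as simple sums is an assumption of this sketch.

```python
# (service type, CPU per instance, RAM in MB per instance, instances)
ALLOCATIONS = [
    ("NestJS Backend", 0.5, 512, 10),
    ("SvelteKit Web", 0.25, 256, 9),
    ("Astro Landing", 0.1, 128, 9),
    ("PostgreSQL", 1.0, 2048, 1),
    ("Redis", 0.25, 256, 1),
    ("Monitoring", 1.0, 2048, 1),
]

SERVER_VCPU = 8
SERVER_RAM_GB = 32


def totals(allocations):
    """Sum CPU shares and RAM (converted to GB) across all service types."""
    cpu = sum(cpu_each * count for _, cpu_each, _, count in allocations)
    ram_gb = sum(ram_mb * count for _, _, ram_mb, count in allocations) / 1024
    return cpu, ram_gb


if __name__ == "__main__":
    cpu, ram_gb = totals(ALLOCATIONS)
    print(f"CPU: {cpu:.2f} of {SERVER_VCPU} vCPU")
    print(f"RAM: {ram_gb:.2f} of {SERVER_RAM_GB} GB")
```

The RAM sum sits well inside 32 GB. The CPU sum is above 8 vCPU, so the sizing works only when the CPU column is read as per-container limits rather than reservations.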

@ -432,6 +493,7 @@ Traffic → Blue → Switch traffic → Green
 ## 🔐 Security Measures

 ### Infrastructure Security
+
 - [x] Firewall rules (only ports 22, 80, 443 exposed)
 - [x] SSH key-based authentication (no passwords)
 - [x] Non-root Docker containers
@ -440,6 +502,7 @@ Traffic → Blue → Switch traffic → Green
 - [x] Automatic security updates

 ### Application Security
+
 - [x] Environment variable encryption (GitHub Secrets)
 - [x] SSL/TLS for all services (Let's Encrypt)
 - [x] JWT-based authentication (@manacore/shared-auth)
@ -448,6 +511,7 @@ Traffic → Blue → Switch traffic → Green
 - [x] CORS policies enforced

 ### CI/CD Security
+
 - [x] Weekly dependency audits (Dependabot)
 - [x] Docker image scanning (Trivy)
 - [x] No secrets in code
@ -456,6 +520,7 @@ Traffic → Blue → Switch traffic → Green
 - [x] Signed commits (recommended)

 ### Compliance
+
 - [x] GDPR compliance (Hetzner EU data centers)
 - [x] ISO 27001 certified infrastructure
 - [x] SOC 2 Type II (Supabase)
@ -467,7 +532,9 @@ Traffic → Blue → Switch traffic → Green
 ## 🔄 Backup & Disaster Recovery

 ### Backup Strategy
+
 **What's Backed Up**:
+
 - PostgreSQL databases (daily)
 - Redis data (daily)
 - Docker volumes
@ -475,24 +542,30 @@ Traffic → Blue → Switch traffic → Green
 - Deployment manifests

 **Backup Schedule**:
+
 - Daily automated backups at 2 AM UTC
 - Retention: 30 days for databases, 7 days for Redis
 - Storage: Cloudflare R2 or Hetzner Storage Box
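
The retention rule above amounts to a pruning step after each nightly run. This is a minimal sketch, assuming each backup is recorded as a `(kind, date taken)` pair; the names `RETENTION_DAYS` and `expired_backups` are hypothetical, and the actual storage calls for R2 or a Storage Box are left out.

```python
from datetime import date, timedelta

# Retention from the schedule: 30 days for databases, 7 days for Redis.
RETENTION_DAYS = {"postgres": 30, "redis": 7}


def expired_backups(backups: list[tuple[str, date]],
                    today: date) -> list[tuple[str, date]]:
    """Return the backups that are older than their retention window."""
    return [
        (kind, taken)
        for kind, taken in backups
        if today - taken > timedelta(days=RETENTION_DAYS[kind])
    ]
```

A backup taken exactly 30 (or 7) days ago is kept in this sketch; whether the boundary day is inclusive is a choice the plan does not specify.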

 **Backup Verification**:
+
 - Weekly automated restoration tests
 - Monthly manual restoration drills

 ### Disaster Recovery
+
 **Recovery Time Objective (RTO)**:
+
 - Service restart: < 1 hour
 - Full server restore: < 2 hours

 **Recovery Point Objective (RPO)**:
+
 - < 24 hours (daily backups)
 - Supabase PITR available for point-in-time recovery

 **Recovery Procedures**:
+
 1. **Service Failure**: Restart container (automated)
 2. **Data Corruption**: Restore from latest backup
 3. **Server Failure**: Provision new server, restore from backup
@ -503,18 +576,21 @@ Traffic → Blue → Switch traffic → Green
 ## 📚 Documentation Strategy

 ### For Developers
+
 - Quick start guide (30 minutes to first deployment)
 - Testing guide (how to write and run tests)
 - Troubleshooting guide (common issues)
 - Contributing guide (standards and patterns)

 ### For DevOps
+
 - Architecture documentation (complete system design)
 - Deployment runbooks (step-by-step procedures)
 - Monitoring guide (dashboards and alerts)
 - Incident response playbooks

 ### For Management
+
 - Cost analysis and projections
 - Success metrics and KPIs
 - Timeline and milestones
@ -525,6 +601,7 @@ Traffic → Blue → Switch traffic → Green
 ## 🎯 Phase Gates

 ### Phase 1 Complete When:
+
 - [x] Hetzner account created
 - [x] Staging server provisioned and Coolify installed
 - [x] GitHub secrets configured
@ -532,6 +609,7 @@ Traffic → Blue → Switch traffic → Green
 - [x] CI/CD pipeline tested end-to-end

 ### Phase 2 Complete When:
+
 - [x] All backend services deployed
 - [x] All web apps deployed
 - [x] All landing pages deployed
@ -539,12 +617,14 @@ Traffic → Blue → Switch traffic → Green
 - [x] Health checks passing for all services

 ### Phase 3 Complete When:
+
 - [x] Critical path tests at 100% coverage
 - [x] General code at 80% coverage
 - [x] Coverage enforcement in CI
 - [x] All tests passing consistently

 ### Phase 4 Complete When:
+
 - [x] Production server provisioned
 - [x] All services deployed to production
 - [x] Monitoring operational (Prometheus + Grafana + Loki)
@ -558,30 +638,35 @@ Traffic → Blue → Switch traffic → Green
 ### Identified Risks

 **Risk 1: Budget Overruns**
+
 - **Likelihood**: Low
 - **Impact**: Medium
 - **Mitigation**: Start with single server ($56/month), scale only when needed
 - **Contingency**: Downgrade server size, optimize resource usage

 **Risk 2: Deployment Failures**
+
 - **Likelihood**: Medium (during initial rollout)
 - **Impact**: High
 - **Mitigation**: Blue-green deployment, automated rollback, comprehensive testing
 - **Contingency**: Rollback procedures documented and tested

 **Risk 3: Service Outages**
+
 - **Likelihood**: Low
 - **Impact**: High
 - **Mitigation**: Health checks, monitoring, automated restarts
 - **Contingency**: Incident response playbooks, 24/7 monitoring

 **Risk 4: Data Loss**
+
 - **Likelihood**: Very Low
 - **Impact**: Critical
 - **Mitigation**: Daily backups, Supabase PITR, backup verification
 - **Contingency**: Multiple backup locations, disaster recovery drills

 **Risk 5: Security Breaches**
+
 - **Likelihood**: Low
 - **Impact**: Critical
 - **Mitigation**: Security best practices, automated audits, minimal attack surface
@ -592,24 +677,28 @@ Traffic → Blue → Switch traffic → Green
 ## 📈 Success Metrics & KPIs

 ### Deployment Metrics
+
 - **Deployment Frequency**: Target > 5/week (currently < 1/week)
 - **Deployment Duration**: Target < 10 minutes (currently 2+ hours manual)
 - **Deployment Success Rate**: Target > 95%
 - **Rollback Time**: Target < 5 minutes

 ### Quality Metrics
+
 - **Test Coverage**: Target 80% minimum (currently ~5%)
 - **Critical Path Coverage**: Target 100% (currently ~0%)
 - **Build Success Rate**: Target > 95%
 - **Code Review Turnaround**: Target < 24 hours

 ### Reliability Metrics
+
 - **Uptime**: Target 99.9% (43 minutes downtime/month)
 - **Mean Time to Recovery (MTTR)**: Target < 1 hour
 - **Mean Time Between Failures (MTBF)**: Target > 30 days
 - **Backup Success Rate**: Target 100%
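
The downtime figure attached to the uptime target follows from a one-line calculation. The sketch assumes a 30-day month; the function name is hypothetical.

```python
def allowed_downtime_minutes(uptime_target_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given uptime percentage."""
    return (1 - uptime_target_pct / 100) * days * 24 * 60


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.9):.1f} minutes per 30-day month")
```

At 99.9% this comes to roughly 43 minutes per month, so a single incident that runs to the 1-hour MTTR target would already use up more than that month's budget.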

 ### Cost Metrics
+
 - **Infrastructure Cost**: Target < $100/month (achieved: $56/month)
 - **Cost per Service**: Target < $5/month
 - **Cost Reduction**: 92% vs traditional PaaS
@ -619,18 +708,21 @@ Traffic → Blue → Switch traffic → Green
 ## 🎓 Training & Knowledge Transfer

 ### Developer Training (2-3 hours)
+
 - **Session 1**: CI/CD basics and GitHub Actions
 - **Session 2**: Writing and running tests
 - **Session 3**: Docker and deployment
 - **Session 4**: Troubleshooting and debugging

 ### DevOps Training (4-8 hours)
+
 - **Session 1**: Architecture deep dive
 - **Session 2**: Infrastructure setup (hands-on)
 - **Session 3**: CI/CD operations
 - **Session 4**: Incident response and recovery

 ### Documentation
+
 - All procedures documented in `cicd/` folder
 - Video tutorials (optional, future)
 - Regular knowledge sharing sessions
@ -640,6 +732,7 @@ Traffic → Blue → Switch traffic → Green
 ## 🔮 Future Enhancements

 ### Short-Term (3-6 months)
+
 - [ ] Canary deployments (gradual traffic shifting)
 - [ ] Feature flags (LaunchDarkly/Unleash)
 - [ ] Visual regression testing (Percy/Chromatic)
@ -647,6 +740,7 @@ Traffic → Blue → Switch traffic → Green
 - [ ] Mobile E2E testing (Detox/Maestro)

 ### Long-Term (6-12 months)
+
 - [ ] Kubernetes migration (when scale demands)
 - [ ] Multi-region deployment
 - [ ] Global load balancing
@ -658,11 +752,12 @@ Traffic → Blue → Switch traffic → Green
 ## ✅ Plan Approval

 **Created by**: Hive Mind Collective Intelligence
-**Reviewed by**: _________
-**Approved by**: _________
-**Approval Date**: _________
+**Reviewed by**: \***\*\_\*\***
+**Approved by**: \***\*\_\*\***
+**Approval Date**: \***\*\_\*\***

 **Next Steps**:
+
 1. Review this plan with the team
 2. Get budget approval ($56-146/month)
 3. Start implementation following `TODO.md`