fix(cicd): docker paths, formatting config, and documentation

  - Fix Docker build paths in maerchenzauber and manadeck backends
  - Add comprehensive CI/CD documentation (private repo solution, type analysis)
  - Configure Prettier with proper plugins for Astro/Svelte
  - Update .gitignore to exclude .hive-mind and .claude-flow
  - Fix Turbo config for Presi app

  Related to cicd/integration branch - Priority 1 & 2 fixes
Wuesteon 2025-11-27 18:33:08 +01:00
parent f55962e135
commit 0241f5554c
16 changed files with 2173 additions and 187 deletions

@ -15,6 +15,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
## 🎯 Goals & Success Criteria
### Primary Goals
1. **Automate deployments** - Deploy with a single commit to main
2. **Zero-downtime updates** - Blue-green deployment strategy
3. **Enforce quality** - Automated testing with 80% coverage
@ -22,6 +23,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
5. **Team productivity** - Reduce deployment time from 2+ hours to < 10 minutes
### Success Criteria
- ✅ Staging auto-deploys on merge to main
- ✅ Production deploys take < 10 minutes
- ✅ Rollback can be executed in < 5 minutes
@ -35,6 +37,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
## 🏗️ Architecture Overview
### Infrastructure Stack
- **Platform**: Coolify (open-source PaaS)
- **Hosting**: Hetzner Cloud (German data centers)
- **Container Runtime**: Docker + Docker Compose
@ -46,63 +49,76 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
### Service Inventory (39 Services Total)
**Authentication**:
- mana-core-auth (NestJS) - Central authentication service
**Chat Project** (4 services):
- chat-backend (NestJS)
- chat-web (SvelteKit)
- chat-mobile (Expo - OTA updates)
- chat-landing (Astro)
**Maerchenzauber Project** (4 services):
- maerchenzauber-backend (NestJS)
- maerchenzauber-web (SvelteKit)
- maerchenzauber-mobile (Expo)
- maerchenzauber-landing (Astro)
**Manadeck Project** (4 services):
- manadeck-backend (NestJS)
- manadeck-web (SvelteKit)
- manadeck-mobile (Expo)
- manadeck-landing (Astro)
**Memoro Project** (3 services):
- memoro-web (SvelteKit)
- memoro-mobile (Expo)
- memoro-landing (Astro)
**Picture Project** (3 services):
- picture-web (SvelteKit)
- picture-mobile (Expo)
- picture-landing (Astro)
**Wisekeep Project** (4 services):
- wisekeep-backend (NestJS)
- wisekeep-web (SvelteKit)
- wisekeep-mobile (Expo)
- wisekeep-landing (Astro)
**Quote Project** (4 services):
- quote-backend (NestJS)
- quote-web (SvelteKit)
- quote-mobile (Expo)
- quote-landing (Astro)
**Nutriphi Project** (2 services):
- nutriphi-backend (NestJS)
- nutriphi-web (SvelteKit)
**Uload Project** (1 service):
- uload-web (SvelteKit)
**Bauntown Project** (1 service):
- bauntown-landing (Astro)
**Manacore Project** (2 services):
- manacore-web (SvelteKit)
- manacore-mobile (Expo)
**Shared Infrastructure** (2 services):
- postgres (PostgreSQL 16)
- redis (Redis 7)
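
Tallying the inventory as it is listed here gives 35 entries (7 NestJS, 10 SvelteKit, 8 Expo, 8 Astro, 2 infrastructure) rather than the 39 in the heading. The sketch below makes that count reproducible. The data structure and function names are illustrative assumptions, not part of the repository.

```python
from collections import Counter

# Framework implied by each service suffix in the list above.
SUFFIX_FRAMEWORK = {
    "backend": "NestJS",
    "web": "SvelteKit",
    "mobile": "Expo",
    "landing": "Astro",
}

# Project -> service suffixes, transcribed from the inventory.
PROJECTS = {
    "chat": ["backend", "web", "mobile", "landing"],
    "maerchenzauber": ["backend", "web", "mobile", "landing"],
    "manadeck": ["backend", "web", "mobile", "landing"],
    "memoro": ["web", "mobile", "landing"],
    "picture": ["web", "mobile", "landing"],
    "wisekeep": ["backend", "web", "mobile", "landing"],
    "quote": ["backend", "web", "mobile", "landing"],
    "nutriphi": ["backend", "web"],
    "uload": ["web"],
    "bauntown": ["landing"],
    "manacore": ["web", "mobile"],
}

# Services that do not follow the <project>-<suffix> pattern.
STANDALONE = {
    "mana-core-auth": "NestJS",
    "postgres": "PostgreSQL 16",
    "redis": "Redis 7",
}


def build_inventory():
    """Return a mapping of service name -> framework/runtime."""
    services = dict(STANDALONE)
    for project, suffixes in PROJECTS.items():
        for suffix in suffixes:
            services[f"{project}-{suffix}"] = SUFFIX_FRAMEWORK[suffix]
    return services


def count_by_framework(services):
    """Count services per framework/runtime."""
    return Counter(services.values())


if __name__ == "__main__":
    inventory = build_inventory()
    print(len(inventory), dict(count_by_framework(inventory)))
```

Keeping the list in one structure like this also makes it easier to check the per-type instance counts used later for resource planning.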
@ -111,21 +127,25 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
## 📅 Implementation Timeline
### Week 1: Foundation (Days 1-2)
**Goal**: Infrastructure setup and first deployment
**Day 1 Morning** (2-3 hours):
- Set up Hetzner account
- Provision staging server (CCX32)
- Install Coolify
- Configure GitHub Container Registry
**Day 1 Afternoon** (3-4 hours):
- Configure GitHub secrets (staging)
- Create first Dockerfile (mana-core-auth)
- Test CI/CD pipeline with test PR
- Deploy mana-core-auth to staging
**Day 2** (6-8 hours):
- Create Dockerfiles for remaining backends (6 services)
- Deploy all backends to staging
- Verify health checks
@ -134,15 +154,18 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
---
### Week 1: Web Apps (Days 3-4)
**Goal**: Deploy web apps and landing pages
**Day 3** (6-8 hours):
- Create SvelteKit Dockerfiles (9 services)
- Test builds locally
- Deploy to staging
- Configure reverse proxy/domains
**Day 4** (6-8 hours):
- Create Astro Dockerfiles (9 services)
- Deploy landing pages
- Set up SSL/TLS (Let's Encrypt)
@ -151,21 +174,25 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
---
### Week 2: Testing & Production (Days 5-7)
**Goal**: Implement testing and deploy to production
**Day 5** (6-8 hours):
- Write critical path tests (auth, payments) - 100% coverage
- Configure test frameworks
- Enable coverage enforcement in CI
- Fix any failing tests
**Day 6** (6-8 hours):
- Provision production server
- Configure production secrets
- Set up GitHub environments (approval gates)
- Deploy mana-core-auth to production
**Day 7** (6-8 hours):
- Deploy all services to production
- Configure DNS for all domains
- Set up monitoring (Prometheus + Grafana)
@ -174,21 +201,25 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
---
### Week 2-3: Monitoring & Optimization (Days 8-10+)
**Goal**: Set up monitoring and optimize
**Day 8** (4-6 hours):
- Install Loki for logging
- Configure Grafana dashboards
- Set up alerting (Prometheus Alertmanager)
- Integrate Sentry for error tracking
**Day 9** (4-6 hours):
- Set up automated backups
- Test backup restoration
- Perform disaster recovery drill
- Document procedures
**Day 10+** (ongoing):
- Write remaining tests (80% coverage target)
- Performance optimization (caching, CDN)
- Team training
@ -199,6 +230,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
## 🔄 Development Workflow
### Developer Workflow
```
1. Create feature branch
@ -224,6 +256,7 @@ This document outlines the complete plan for implementing CI/CD infrastructure f
```
### Deployment Workflow
```
Staging (Automatic):
Merge to main → Build → Push → Deploy → Health Check → Done
@ -238,23 +271,28 @@ Production (Manual Approval):
## 🐳 Docker Strategy
### Multi-Stage Builds
All Dockerfiles use multi-stage builds for optimization:
**Stage 1: Dependencies**
- Install pnpm and dependencies
- Uses layer caching
**Stage 2: Build**
- Build application
- Generate production artifacts
**Stage 3: Runtime**
- Alpine Linux base (minimal)
- Copy only production artifacts
- Non-root user
- Health checks configured
### Image Naming Convention
```
ghcr.io/wuesteon/mana-core-auth:latest
ghcr.io/wuesteon/mana-core-auth:main
@ -266,6 +304,7 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
```
**Tags**:
- `latest` - Most recent build from main
- `main` - Branch-based tag
- `main-abc1234` - Git commit SHA (for rollbacks)
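
As a minimal sketch of the naming convention above, the helper below derives the three tags for one build. It assumes a 7-character short SHA (matching the `abc1234` example) and that `latest` is only applied to builds from `main`; the function name and parameters are hypothetical.

```python
def image_tags(owner, service, branch, commit_sha, registry="ghcr.io"):
    """Return the image references to push for one build.

    Assumption: `latest` is only published for the main branch, and the
    SHA tag uses the first 7 characters of the commit hash.
    """
    base = f"{registry}/{owner}/{service}"
    tags = [f"{base}:{branch}", f"{base}:{branch}-{commit_sha[:7]}"]
    if branch == "main":
        tags.insert(0, f"{base}:latest")
    return tags


if __name__ == "__main__":
    for tag in image_tags("wuesteon", "chat-backend", "main", "abc1234def5678"):
        print(tag)
```

Rolling back then amounts to redeploying an earlier `main-<sha>` tag, which is why the SHA tag is kept alongside the moving `latest` and `main` tags.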
@ -275,6 +314,7 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
## 🧪 Testing Strategy
### Coverage Targets
- **Critical Paths**: 100% coverage required
- Authentication (`@manacore/shared-auth`)
- Payment/credit system
@ -286,19 +326,24 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
- Shared packages
### Test Types
**Unit Tests**:
- All services and components
- Frameworks: Jest (backend/mobile), Vitest (web/shared)
**Integration Tests**:
- API endpoints with test database
- Service interactions
**E2E Tests** (Phase 2):
- Playwright for web apps
- Detox/Maestro for mobile apps
### CI/CD Integration
- Run on every PR
- Enforce coverage thresholds
- Block merge if tests fail or coverage below 80%
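
The merge-blocking rule can be sketched as a small gate that applies the 100% threshold to critical-path packages and 80% to everything else. This is an illustration under assumptions: the report format (package name mapped to a coverage percentage) and the function name are hypothetical, and a real pipeline would read the numbers from the Jest or Vitest coverage output.

```python
def coverage_gate(reports, critical_packages, general_min=80.0, critical_min=100.0):
    """Return a list of failure messages; an empty list means the PR may merge.

    reports: mapping of package name -> line coverage in percent.
    critical_packages: names that must meet the critical-path threshold.
    """
    failures = []
    for package, pct in sorted(reports.items()):
        required = critical_min if package in critical_packages else general_min
        if pct < required:
            failures.append(f"{package}: {pct:.1f}% < {required:.1f}%")
    return failures


if __name__ == "__main__":
    example = {"@manacore/shared-auth": 100.0, "chat-backend": 79.5}
    problems = coverage_gate(example, critical_packages={"@manacore/shared-auth"})
    # A CI step would exit non-zero when `problems` is not empty.
    print(problems or "coverage OK")
```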
@ -309,6 +354,7 @@ ghcr.io/wuesteon/chat-backend:main-abc1234
## 🚀 Deployment Strategy
### Blue-Green Deployment
```
Current (Blue): New (Green):
v1.0 → v1.1 (deploying)
@ -325,11 +371,13 @@ Traffic → Blue → Switch traffic → Green
```
**Benefits**:
- Zero downtime
- Instant rollback (switch back to blue)
- Test new version before full cutover
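
The switch-and-rollback behaviour described above can be modelled in a few lines. This is a conceptual sketch only: the class and method names are assumptions, and it does not reflect how Coolify or the reverse proxy actually implements the cutover.

```python
class BlueGreen:
    """Toy model of two deployment slots with one of them serving traffic."""

    def __init__(self, live_version):
        self.slots = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        """Name of the slot that is not currently serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version, health_check):
        """Deploy to the idle slot and switch traffic only if it is healthy."""
        target = self.idle
        self.slots[target] = version
        if not health_check(version):
            return False  # traffic never left the live slot
        self.live = target
        return True

    def rollback(self):
        """Point traffic back at the other slot, which still holds the old version."""
        if self.slots[self.idle] is None:
            raise RuntimeError("no previous version to roll back to")
        self.live = self.idle


if __name__ == "__main__":
    bg = BlueGreen("v1.0")
    bg.deploy("v1.1", health_check=lambda version: True)
    print(bg.live, bg.slots[bg.live])
    bg.rollback()
    print(bg.live, bg.slots[bg.live])
```

Because the old slot is left untouched until the next deployment, rollback is a traffic switch rather than a rebuild.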
### Rollback Procedure
1. Detect issue (monitoring alerts or manual detection)
2. Run `scripts/deploy/rollback.sh`
3. Switch traffic back to previous version
@ -341,37 +389,47 @@ Traffic → Blue → Switch traffic → Green
## 📊 Monitoring Strategy
### Metrics Collection (Prometheus)
**Application Metrics**:
- Request rate (requests/second)
- Error rate (% of failed requests)
- Response time (p50, p95, p99)
- Active connections
**Infrastructure Metrics**:
- CPU usage per service
- Memory usage per service
- Disk usage
- Network I/O
### Logging (Loki + Grafana)
**Log Aggregation**:
- All containers → stdout/stderr → Loki → Grafana
- Structured JSON logs
- Correlation IDs for tracing
**Log Retention**:
- 7 days online (searchable)
- 30 days archived (backup)
### Error Tracking (Sentry)
**What's Tracked**:
- Application errors and exceptions
- Source maps for better stack traces
- User context (anonymized)
- Performance metrics
### Alerting (Prometheus Alertmanager)
**Alert Rules**:
- Service down (health check fails for 2 minutes)
- High error rate (> 5% of requests failing)
- High CPU usage (> 80% for 5 minutes)
@ -379,6 +437,7 @@ Traffic → Blue → Switch traffic → Green
- Disk space low (< 10% free)
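
To make the thresholds concrete, the sketch below evaluates the alert rules visible above against a single metrics snapshot. In practice these would be expressed as Prometheus alerting rules; the Python version only mirrors the thresholds, and the snapshot keys and alert names are hypothetical.

```python
# Thresholds copied from the alert rules listed above.
HEALTH_FAIL_MINUTES = 2
ERROR_RATE_PCT = 5.0
CPU_PCT = 80.0
CPU_SUSTAINED_MINUTES = 5
DISK_FREE_PCT = 10.0


def firing_alerts(snapshot):
    """Return the names of the rules that fire for one service snapshot.

    The snapshot keys are assumptions made for this example; a real setup
    would query these values from Prometheus instead of passing a dict.
    """
    alerts = []
    if snapshot["health_failing_minutes"] >= HEALTH_FAIL_MINUTES:
        alerts.append("service_down")
    if snapshot["error_rate_pct"] > ERROR_RATE_PCT:
        alerts.append("high_error_rate")
    if (snapshot["cpu_pct"] > CPU_PCT
            and snapshot["cpu_high_minutes"] >= CPU_SUSTAINED_MINUTES):
        alerts.append("high_cpu")
    if snapshot["disk_free_pct"] < DISK_FREE_PCT:
        alerts.append("disk_space_low")
    return alerts


if __name__ == "__main__":
    print(firing_alerts({
        "health_failing_minutes": 0,
        "error_rate_pct": 7.5,
        "cpu_pct": 91.0,
        "cpu_high_minutes": 6,
        "disk_free_pct": 42.0,
    }))
```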
**Notification Channels**:
- Slack (all alerts)
- PagerDuty (critical alerts only)
- Email (daily summary)
@ -410,20 +469,22 @@ Traffic → Blue → Switch traffic → Green
| **Total** | **$146** | |
**Cost Savings**:
- vs AWS/Azure: $500-1,000/month (89-95% savings)
- vs Heroku/Railway: $300-500/month (71-83% savings)
- vs DigitalOcean: $150-300/month (51-71% savings)
### Resource Allocation (Per Service)
| Service Type | CPU | RAM | Instances | Total |
| -------------- | ---- | ------ | --------- | --------------------------- |
| NestJS Backend | 0.5 | 512 MB | 10 | 5 CPU, 5 GB RAM |
| SvelteKit Web | 0.25 | 256 MB | 9 | 2.25 CPU, 2.25 GB RAM |
| Astro Landing | 0.1 | 128 MB | 9 | 0.9 CPU, 1.1 GB RAM |
| PostgreSQL | 1 | 2 GB | 1 | 1 CPU, 2 GB RAM |
| Redis | 0.25 | 256 MB | 1 | 0.25 CPU, 256 MB RAM |
| Monitoring | 1 | 2 GB | 1 | 1 CPU, 2 GB RAM |
| **Total** | | | | **~10.5 CPU, ~12.5 GB RAM** |
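**Conclusion**: CCX32 (8 vCPU, 32 GB RAM) covers the ~12.5 GB RAM requirement with ample headroom for growth. The summed CPU allocation (~10.5) exceeds the 8 available vCPUs, so the server is sufficient only if the per-service CPU figures are treated as upper limits that are not all reached at the same time.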
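
The totals in the table can be re-derived from the per-service figures. The sketch below assumes 1 GB = 1024 MB and uses the CCX32 capacity quoted in the conclusion; the constant and function names are illustrative.

```python
# (service type, CPU per instance, RAM per instance in MB, instances)
ALLOCATIONS = [
    ("NestJS Backend", 0.5, 512, 10),
    ("SvelteKit Web", 0.25, 256, 9),
    ("Astro Landing", 0.1, 128, 9),
    ("PostgreSQL", 1.0, 2048, 1),
    ("Redis", 0.25, 256, 1),
    ("Monitoring", 1.0, 2048, 1),
]

# Server capacity as quoted in the conclusion above.
SERVER_VCPU = 8
SERVER_RAM_GB = 32


def totals(rows):
    """Return (total CPU, total RAM in GB) across all rows."""
    cpu = sum(cpu_each * count for _, cpu_each, _, count in rows)
    ram_mb = sum(ram_each * count for _, _, ram_each, count in rows)
    return cpu, ram_mb / 1024


def headroom(rows, vcpu=SERVER_VCPU, ram_gb=SERVER_RAM_GB):
    """Return (spare CPU, spare RAM in GB); negative means oversubscribed."""
    cpu, ram = totals(rows)
    return vcpu - cpu, ram_gb - ram


if __name__ == "__main__":
    cpu, ram = totals(ALLOCATIONS)
    spare_cpu, spare_ram = headroom(ALLOCATIONS)
    print(f"allocated: {cpu:.2f} CPU, {ram:.2f} GB RAM")
    print(f"headroom:  {spare_cpu:.2f} CPU, {spare_ram:.2f} GB RAM")
```

The sums come to 10.4 CPU and roughly 12.6 GB RAM, in line with the rounded totals in the table.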
@ -432,6 +493,7 @@ Traffic → Blue → Switch traffic → Green
## 🔐 Security Measures
### Infrastructure Security
- [x] Firewall rules (only ports 22, 80, 443 exposed)
- [x] SSH key-based authentication (no passwords)
- [x] Non-root Docker containers
@ -440,6 +502,7 @@ Traffic → Blue → Switch traffic → Green
- [x] Automatic security updates
### Application Security
- [x] Environment variable encryption (GitHub Secrets)
- [x] SSL/TLS for all services (Let's Encrypt)
- [x] JWT-based authentication (@manacore/shared-auth)
@ -448,6 +511,7 @@ Traffic → Blue → Switch traffic → Green
- [x] CORS policies enforced
### CI/CD Security
- [x] Weekly dependency audits (Dependabot)
- [x] Docker image scanning (Trivy)
- [x] No secrets in code
@ -456,6 +520,7 @@ Traffic → Blue → Switch traffic → Green
- [x] Signed commits (recommended)
### Compliance
- [x] GDPR compliance (Hetzner EU data centers)
- [x] ISO 27001 certified infrastructure
- [x] SOC 2 Type II (Supabase)
@ -467,7 +532,9 @@ Traffic → Blue → Switch traffic → Green
## 🔄 Backup & Disaster Recovery
### Backup Strategy
**What's Backed Up**:
- PostgreSQL databases (daily)
- Redis data (daily)
- Docker volumes
@ -475,24 +542,30 @@ Traffic → Blue → Switch traffic → Green
- Deployment manifests
**Backup Schedule**:
- Daily automated backups at 2 AM UTC
- Retention: 30 days for databases, 7 days for Redis
- Storage: Cloudflare R2 or Hetzner Storage Box
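
The retention rule can be sketched as a pruning check. This assumes a backup is kept while its age is at most the retention period and that each backup is described by a kind and a date; the names and data shape are hypothetical and independent of the storage backend.

```python
from datetime import date

# Retention periods from the schedule above, in days.
RETENTION_DAYS = {"postgres": 30, "redis": 7}


def expired_backups(backups, today):
    """Return the backups that are older than their retention period.

    backups: iterable of (kind, backup_date) tuples, where kind is a key
    of RETENTION_DAYS. A backup exactly at the limit is still kept.
    """
    return [
        (kind, taken)
        for kind, taken in backups
        if (today - taken).days > RETENTION_DAYS[kind]
    ]


if __name__ == "__main__":
    today = date(2025, 11, 27)
    backups = [
        ("postgres", date(2025, 10, 20)),  # 38 days old
        ("postgres", date(2025, 11, 1)),   # 26 days old
        ("redis", date(2025, 11, 19)),     # 8 days old
        ("redis", date(2025, 11, 25)),     # 2 days old
    ]
    for kind, taken in expired_backups(backups, today):
        print("prune", kind, taken.isoformat())
```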
**Backup Verification**:
- Weekly automated restoration tests
- Monthly manual restoration drills
### Disaster Recovery
**Recovery Time Objective (RTO)**:
- Service restart: < 1 hour
- Full server restore: < 2 hours
**Recovery Point Objective (RPO)**:
- < 24 hours (daily backups)
- Supabase PITR available for point-in-time recovery
**Recovery Procedures**:
1. **Service Failure**: Restart container (automated)
2. **Data Corruption**: Restore from latest backup
3. **Server Failure**: Provision new server, restore from backup
@ -503,18 +576,21 @@ Traffic → Blue → Switch traffic → Green
## 📚 Documentation Strategy
### For Developers
- Quick start guide (30 minutes to first deployment)
- Testing guide (how to write and run tests)
- Troubleshooting guide (common issues)
- Contributing guide (standards and patterns)
### For DevOps
- Architecture documentation (complete system design)
- Deployment runbooks (step-by-step procedures)
- Monitoring guide (dashboards and alerts)
- Incident response playbooks
### For Management
- Cost analysis and projections
- Success metrics and KPIs
- Timeline and milestones
@ -525,6 +601,7 @@ Traffic → Blue → Switch traffic → Green
## 🎯 Phase Gates
### Phase 1 Complete When:
- [x] Hetzner account created
- [x] Staging server provisioned and Coolify installed
- [x] GitHub secrets configured
@ -532,6 +609,7 @@ Traffic → Blue → Switch traffic → Green
- [x] CI/CD pipeline tested end-to-end
### Phase 2 Complete When:
- [x] All backend services deployed
- [x] All web apps deployed
- [x] All landing pages deployed
@ -539,12 +617,14 @@ Traffic → Blue → Switch traffic → Green
- [x] Health checks passing for all services
### Phase 3 Complete When:
- [x] Critical path tests at 100% coverage
- [x] General code at 80% coverage
- [x] Coverage enforcement in CI
- [x] All tests passing consistently
### Phase 4 Complete When:
- [x] Production server provisioned
- [x] All services deployed to production
- [x] Monitoring operational (Prometheus + Grafana + Loki)
@ -558,30 +638,35 @@ Traffic → Blue → Switch traffic → Green
### Identified Risks
**Risk 1: Budget Overruns**
- **Likelihood**: Low
- **Impact**: Medium
- **Mitigation**: Start with single server ($56/month), scale only when needed
- **Contingency**: Downgrade server size, optimize resource usage
**Risk 2: Deployment Failures**
- **Likelihood**: Medium (during initial rollout)
- **Impact**: High
- **Mitigation**: Blue-green deployment, automated rollback, comprehensive testing
- **Contingency**: Rollback procedures documented and tested
**Risk 3: Service Outages**
- **Likelihood**: Low
- **Impact**: High
- **Mitigation**: Health checks, monitoring, automated restarts
- **Contingency**: Incident response playbooks, 24/7 monitoring
**Risk 4: Data Loss**
- **Likelihood**: Very Low
- **Impact**: Critical
- **Mitigation**: Daily backups, Supabase PITR, backup verification
- **Contingency**: Multiple backup locations, disaster recovery drills
**Risk 5: Security Breaches**
- **Likelihood**: Low
- **Impact**: Critical
- **Mitigation**: Security best practices, automated audits, minimal attack surface
@ -592,24 +677,28 @@ Traffic → Blue → Switch traffic → Green
## 📈 Success Metrics & KPIs
### Deployment Metrics
- **Deployment Frequency**: Target > 5/week (currently < 1/week)
- **Deployment Duration**: Target < 10 minutes (currently 2+ hours manual)
- **Deployment Success Rate**: Target > 95%
- **Rollback Time**: Target < 5 minutes
### Quality Metrics
- **Test Coverage**: Target 80% minimum (currently ~5%)
- **Critical Path Coverage**: Target 100% (currently ~0%)
- **Build Success Rate**: Target > 95%
- **Code Review Turnaround**: Target < 24 hours
### Reliability Metrics
- **Uptime**: Target 99.9% (43 minutes downtime/month)
- **Mean Time to Recovery (MTTR)**: Target < 1 hour
- **Mean Time Between Failures (MTBF)**: Target > 30 days
- **Backup Success Rate**: Target 100%
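
The downtime figure attached to the uptime target follows from simple arithmetic. The sketch below assumes a 30-day month; the function name is illustrative.

```python
def downtime_budget_minutes(uptime_pct, days=30):
    """Minutes of allowed downtime for a given uptime target over `days` days."""
    total_minutes = days * 24 * 60
    return (100.0 - uptime_pct) / 100.0 * total_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes per 30 days")
```

For the 99.9% target this gives about 43.2 minutes per 30-day month, matching the figure quoted above.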
### Cost Metrics
- **Infrastructure Cost**: Target < $100/month (achieved: $56/month)
- **Cost per Service**: Target < $5/month
- **Cost Reduction**: 92% vs traditional PaaS
@ -619,18 +708,21 @@ Traffic → Blue → Switch traffic → Green
## 🎓 Training & Knowledge Transfer
### Developer Training (2-3 hours)
- **Session 1**: CI/CD basics and GitHub Actions
- **Session 2**: Writing and running tests
- **Session 3**: Docker and deployment
- **Session 4**: Troubleshooting and debugging
### DevOps Training (4-8 hours)
- **Session 1**: Architecture deep dive
- **Session 2**: Infrastructure setup (hands-on)
- **Session 3**: CI/CD operations
- **Session 4**: Incident response and recovery
### Documentation
- All procedures documented in `cicd/` folder
- Video tutorials (optional, future)
- Regular knowledge sharing sessions
@ -640,6 +732,7 @@ Traffic → Blue → Switch traffic → Green
## 🔮 Future Enhancements
### Short-Term (3-6 months)
- [ ] Canary deployments (gradual traffic shifting)
- [ ] Feature flags (LaunchDarkly/Unleash)
- [ ] Visual regression testing (Percy/Chromatic)
@ -647,6 +740,7 @@ Traffic → Blue → Switch traffic → Green
- [ ] Mobile E2E testing (Detox/Maestro)
### Long-Term (6-12 months)
- [ ] Kubernetes migration (when scale demands)
- [ ] Multi-region deployment
- [ ] Global load balancing
@ -658,11 +752,12 @@ Traffic → Blue → Switch traffic → Green
## ✅ Plan Approval
**Created by**: Hive Mind Collective Intelligence
**Reviewed by**: _________
**Approved by**: _________
**Approval Date**: _________
**Next Steps**:
1. Review this plan with the team
2. Get budget approval ($56-146/month)
3. Start implementation following `TODO.md`