mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-14 20:21:09 +02:00
docs: add staging deployment troubleshooting guide
Comprehensive documentation of the staging deployment journey including: - Problem 1: GitHub workflow file extensions (.yml.bak to disable) - Problem 2: chat-backend health check path (/api/v1/health not /api/health) - Problem 3: SvelteKit static env imports (use runtime patterns for Docker) - Problem 4: Orphan Docker containers Also fixes the cd-staging.yml health check path for chat-backend to match the actual NestJS endpoint at /api/v1/health. Includes checklists, debugging commands, and lessons learned. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
parent
4a56c888b0
commit
0c05097459
2 changed files with 296 additions and 1 deletions
2
.github/workflows/cd-staging.yml
vendored
2
.github/workflows/cd-staging.yml
vendored
|
|
@ -216,7 +216,7 @@ jobs:
|
||||||
|
|
||||||
# Check chat-backend
|
# Check chat-backend
|
||||||
echo "Checking chat-backend..."
|
echo "Checking chat-backend..."
|
||||||
if docker compose exec -T chat-backend wget -q -O - http://localhost:3002/api/health > /dev/null 2>&1; then
|
if docker compose exec -T chat-backend wget -q -O - http://localhost:3002/api/v1/health > /dev/null 2>&1; then
|
||||||
echo "✅ chat-backend is healthy"
|
echo "✅ chat-backend is healthy"
|
||||||
else
|
else
|
||||||
echo "❌ chat-backend health check failed"
|
echo "❌ chat-backend health check failed"
|
||||||
|
|
|
||||||
|
|
@ -8,6 +8,11 @@ Common issues and solutions for the manacore-monorepo.
|
||||||
- [Build Issues](#build-issues)
|
- [Build Issues](#build-issues)
|
||||||
- [Linting Issues](#linting-issues)
|
- [Linting Issues](#linting-issues)
|
||||||
- [NestJS Dependency Injection](#nestjs-dependency-injection)
|
- [NestJS Dependency Injection](#nestjs-dependency-injection)
|
||||||
|
- [Staging Deployment Issues](#staging-deployment-issues)
|
||||||
|
- [GitHub Running Disabled Workflows](#problem-1-github-running-disabled-workflows)
|
||||||
|
- [chat-backend Container Unhealthy](#problem-2-chat-backend-container-unhealthy)
|
||||||
|
- [SvelteKit Static Environment Variable Imports](#problem-3-sveltekit-static-environment-variable-imports)
|
||||||
|
- [Orphan Docker Containers](#problem-4-orphan-docker-containers)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -405,6 +410,296 @@ docker run --rm --entrypoint cat test /app/dist/ai/ai.service.js
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Staging Deployment Issues
|
||||||
|
|
||||||
|
### Overview
|
||||||
|
|
||||||
|
This section documents the complete troubleshooting journey for deploying mana-core-auth + chat (backend + web) to staging. It covers GitHub Actions CI/CD simplification, Docker health checks, database setup, and SvelteKit environment variables.
|
||||||
|
|
||||||
|
### Problem 1: GitHub Running Disabled Workflows
|
||||||
|
|
||||||
|
**Symptoms:**
|
||||||
|
|
||||||
|
- Workflows with `.full.yml` extension were still running
|
||||||
|
- `test.full.yml` was being recognized as a valid workflow
|
||||||
|
- Multiple unnecessary workflows running on every push
|
||||||
|
|
||||||
|
**What We Tried:**
|
||||||
|
|
||||||
|
1. ❌ Renaming to `.disabled` extension → Still ran
|
||||||
|
2. ❌ Renaming to `.full.yml` extension → Still ran (GitHub recognizes any `.yml` in `.github/workflows/`)
|
||||||
|
|
||||||
|
**Solution:**
|
||||||
|
|
||||||
|
- ✅ Rename to `.yml.bak` extension (GitHub ignores non-`.yml` files)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Disable a workflow
|
||||||
|
mv .github/workflows/test.yml .github/workflows/test.yml.bak
|
||||||
|
|
||||||
|
# Re-enable a workflow
|
||||||
|
mv .github/workflows/test.yml.bak .github/workflows/test.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
**Files Changed:**
|
||||||
|
|
||||||
|
- `test.yml` → `test.yml.bak`
|
||||||
|
- `test-coverage.yml` → `test-coverage.yml.bak`
|
||||||
|
- `ci-pull-request.yml` → `ci-pull-request.yml.bak`
|
||||||
|
- `dependency-update.yml` → `dependency-update.yml.bak`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Problem 2: chat-backend Container Unhealthy
|
||||||
|
|
||||||
|
**Symptoms:**
|
||||||
|
|
||||||
|
- Deployment failed with: `dependency failed to start: container chat-backend-staging is unhealthy`
|
||||||
|
- chat-web wouldn't start because it depends on chat-backend being healthy
|
||||||
|
|
||||||
|
**Debugging Steps:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Connect to staging server
|
||||||
|
ssh -i ~/.ssh/hetzner_deploy_key deploy@46.224.108.214
|
||||||
|
|
||||||
|
# Check container status
|
||||||
|
cd ~/manacore-staging
|
||||||
|
docker compose ps
|
||||||
|
|
||||||
|
# Check logs for the failing container
|
||||||
|
docker compose logs chat-backend --tail=100
|
||||||
|
|
||||||
|
# Test health endpoint manually from inside container
|
||||||
|
docker compose exec chat-backend wget -q -O - http://localhost:3002/api/v1/health
|
||||||
|
```
|
||||||
|
|
||||||
|
**Root Cause 1: Missing Database**
|
||||||
|
|
||||||
|
The logs showed:
|
||||||
|
|
||||||
|
```
|
||||||
|
error: database "chat" does not exist
|
||||||
|
```
|
||||||
|
|
||||||
|
**Fix:** Create the database manually:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose exec -T postgres psql -U postgres -c "CREATE DATABASE chat;"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Root Cause 2: Wrong Health Check Path**
|
||||||
|
|
||||||
|
The `docker-compose.staging.yml` had:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
healthcheck:
|
||||||
|
test: ['CMD', 'wget', '...', 'http://localhost:3002/api/health'] # ❌ WRONG
|
||||||
|
```
|
||||||
|
|
||||||
|
But NestJS health endpoint is at `/api/v1/health`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
healthcheck:
|
||||||
|
test: ['CMD', 'wget', '...', 'http://localhost:3002/api/v1/health'] # ✅ CORRECT
|
||||||
|
```
|
||||||
|
|
||||||
|
**How to Verify Health Endpoints:**
|
||||||
|
|
||||||
|
| Service | Port | Health Endpoint |
|
||||||
|
| -------------- | ---- | ---------------- |
|
||||||
|
| mana-core-auth | 3001 | `/api/v1/health` |
|
||||||
|
| chat-backend | 3002 | `/api/v1/health` |
|
||||||
|
| chat-web | 3000 | `/health` |
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test from outside the server
|
||||||
|
curl http://46.224.108.214:3001/api/v1/health
|
||||||
|
curl http://46.224.108.214:3002/api/v1/health
|
||||||
|
curl http://46.224.108.214:3000/health
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Problem 3: SvelteKit Static Environment Variable Imports
|
||||||
|
|
||||||
|
**Symptoms:**
|
||||||
|
|
||||||
|
- Docker build failed with: `PUBLIC_MANA_CORE_AUTH_URL is not exported by $env/static/public`
|
||||||
|
- Build error during `npm run build` in Docker
|
||||||
|
|
||||||
|
**Root Cause:**
|
||||||
|
|
||||||
|
SvelteKit's `$env/static/public` imports are resolved at **build time**, not runtime. When building in Docker, these environment variables don't exist.
|
||||||
|
|
||||||
|
**❌ WRONG - Static Import (Build Time):**
|
||||||
|
|
||||||
|
```typescript
|
||||||
|
// apps/chat/apps/web/src/lib/stores/auth.svelte.ts
|
||||||
|
import { PUBLIC_MANA_CORE_AUTH_URL } from '$env/static/public'; // ❌ Fails in Docker
|
||||||
|
|
||||||
|
const authUrl = PUBLIC_MANA_CORE_AUTH_URL;
|
||||||
|
```
|
||||||
|
|
||||||
|
**✅ CORRECT - Runtime Environment Variable:**
|
||||||
|
|
||||||
|
```typescript
|
||||||
|
// apps/chat/apps/web/src/lib/stores/auth.svelte.ts
|
||||||
|
import { browser } from '$app/environment';
|
||||||
|
|
||||||
|
function getAuthUrl(): string {
|
||||||
|
if (browser && typeof window !== 'undefined') {
|
||||||
|
// Client-side: check for injected env or use default
|
||||||
|
return (
|
||||||
|
(window as unknown as { __PUBLIC_MANA_CORE_AUTH_URL__?: string })
|
||||||
|
.__PUBLIC_MANA_CORE_AUTH_URL__ ||
|
||||||
|
import.meta.env.PUBLIC_MANA_CORE_AUTH_URL ||
|
||||||
|
'http://localhost:3001'
|
||||||
|
);
|
||||||
|
}
|
||||||
|
// Server-side: use process.env or default
|
||||||
|
return process.env.PUBLIC_MANA_CORE_AUTH_URL || 'http://localhost:3001';
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**The Pattern:**
|
||||||
|
|
||||||
|
1. Check if running in browser
|
||||||
|
2. Try window-injected variable (for runtime injection)
|
||||||
|
3. Try `import.meta.env` (for Vite build-time)
|
||||||
|
4. Fall back to `process.env` (for SSR)
|
||||||
|
5. Use localhost default for development
|
||||||
|
|
||||||
|
**Files Fixed:**
|
||||||
|
|
||||||
|
- `apps/chat/apps/web/src/lib/stores/auth.svelte.ts`
|
||||||
|
- `apps/chat/apps/web/src/lib/services/feedback.ts`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Problem 4: Orphan Docker Containers
|
||||||
|
|
||||||
|
**Symptoms:**
|
||||||
|
|
||||||
|
- Old containers from previous deployments still running
|
||||||
|
- `docker compose ps` shows unexpected services
|
||||||
|
|
||||||
|
**Fix:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Remove orphan containers
|
||||||
|
docker compose down --remove-orphans
|
||||||
|
|
||||||
|
# Bring up fresh
|
||||||
|
docker compose up -d
|
||||||
|
|
||||||
|
# Manually remove specific orphans
|
||||||
|
docker rm -f manadeck-backend-staging manacore-nginx-staging
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Complete Staging Deployment Checklist
|
||||||
|
|
||||||
|
#### Before Deployment
|
||||||
|
|
||||||
|
- [ ] Verify `docker-compose.staging.yml` has correct health check paths
|
||||||
|
- [ ] Verify CI/CD workflow (`cd-staging.yml`) has matching health check paths
|
||||||
|
- [ ] Check that required databases exist or CI creates them
|
||||||
|
|
||||||
|
#### During Deployment Failure
|
||||||
|
|
||||||
|
1. **SSH to server:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ssh -i ~/.ssh/hetzner_deploy_key deploy@46.224.108.214
|
||||||
|
cd ~/manacore-staging
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Check container status:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose ps
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Check logs for failing container:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose logs <container-name> --tail=100
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **Common fixes:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Create missing database
|
||||||
|
docker compose exec -T postgres psql -U postgres -c "CREATE DATABASE <dbname>;"
|
||||||
|
|
||||||
|
# Restart a service
|
||||||
|
docker compose restart <service-name>
|
||||||
|
|
||||||
|
# Force recreate
|
||||||
|
docker compose up -d --force-recreate <service-name>
|
||||||
|
```
|
||||||
|
|
||||||
|
5. **Verify health:**
|
||||||
|
```bash
|
||||||
|
curl http://localhost:3001/api/v1/health # mana-core-auth
|
||||||
|
curl http://localhost:3002/api/v1/health # chat-backend
|
||||||
|
curl http://localhost:3000/health # chat-web
|
||||||
|
```
|
||||||
|
|
||||||
|
#### After Deployment
|
||||||
|
|
||||||
|
- [ ] Verify all health endpoints respond
|
||||||
|
- [ ] Check container logs for errors
|
||||||
|
- [ ] Test actual functionality (login, API calls)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Key Files for Staging Deployment
|
||||||
|
|
||||||
|
| File | Purpose |
|
||||||
|
| ---------------------------------- | ------------------------------------- |
|
||||||
|
| `docker-compose.staging.yml` | Service definitions and health checks |
|
||||||
|
| `.github/workflows/cd-staging.yml` | CI/CD deployment workflow |
|
||||||
|
| `.github/workflows/ci-main.yml` | Docker image builds on push to main |
|
||||||
|
|
||||||
|
### Health Check Patterns
|
||||||
|
|
||||||
|
**Docker Compose (`docker-compose.staging.yml`):**
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
healthcheck:
|
||||||
|
test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:PORT/ENDPOINT']
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
start_period: 40s
|
||||||
|
```
|
||||||
|
|
||||||
|
**CI/CD Workflow (`cd-staging.yml`):**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check from inside container
|
||||||
|
docker compose exec -T chat-backend wget -q -O - http://localhost:3002/api/v1/health
|
||||||
|
```
|
||||||
|
|
||||||
|
### Lessons Learned
|
||||||
|
|
||||||
|
1. **GitHub Workflows:** Only files ending in `.yml` or `.yaml` in `.github/workflows/` are recognized. Use `.bak` extension to disable.
|
||||||
|
|
||||||
|
2. **NestJS Health Endpoints:** All NestJS backends use `/api/v1/health`, not `/api/health`.
|
||||||
|
|
||||||
|
3. **Docker Compose Dependencies:** When using `depends_on: condition: service_healthy`, the dependent service won't start until the health check passes.
|
||||||
|
|
||||||
|
4. **Database Creation:** Must happen AFTER PostgreSQL is healthy but BEFORE dependent services run migrations.
|
||||||
|
|
||||||
|
5. **SvelteKit Environment Variables:** Use runtime patterns (`process.env`, `import.meta.env`) instead of `$env/static/public` for Docker builds.
|
||||||
|
|
||||||
|
6. **Verify Before Commit:** Always check both `docker-compose.staging.yml` AND CI/CD workflows for matching paths.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
- [CLAUDE.md - Turborepo Configuration](./CLAUDE.md#turborepo-configuration)
|
- [CLAUDE.md - Turborepo Configuration](./CLAUDE.md#turborepo-configuration)
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue