managarten/TROUBLESHOOTING.md
Wuesteon 0fe397504c fix(cd): use drizzle-kit push for schema migration
2025-12-05 04:16:32 +01:00


Troubleshooting Guide

Common issues and solutions for the manacore-monorepo.


Recursive Turbo Calls

Problem: Infinite Loop / Tasks Running Forever

Symptoms:

  • pnpm run build runs for 10+ minutes without completing
  • pnpm run lint hangs indefinitely
  • pnpm run type-check shows thousands of duplicate task entries
  • CI/CD pipelines timeout after 10+ minutes

Root Cause:

Parent workspace packages (e.g., apps/zitare/package.json, apps/presi/package.json) have scripts that call turbo run <task>, creating an infinite recursion loop.

How It Happens

Root turbo → finds "build" script in apps/zitare/package.json
  → runs "turbo run build" in zitare
    → finds "build" script again
      → runs "turbo run build" again
        → (infinite loop!)

WRONG - Causes Infinite Recursion

// apps/zitare/package.json - DON'T DO THIS!
{
	"scripts": {
		"build": "turbo run build", // ❌ WRONG
		"lint": "turbo run lint", // ❌ WRONG
		"type-check": "turbo run type-check", // ❌ WRONG
		"clean": "turbo run clean" // ❌ WRONG
	}
}

// apps/picture/package.json - DON'T DO THIS!
{
	"scripts": {
		"build": "pnpm run --recursive build", // ❌ WRONG
		"lint": "pnpm --filter '@picture/*' run lint" // ❌ WRONG
	}
}

CORRECT - Let Root Turbo Handle Orchestration

// apps/zitare/package.json - CORRECT
{
	"scripts": {
		"dev": "turbo run dev", // ✅ OK for dev (persistent task, scoped)
		// No build, lint, type-check scripts - handled by root turbo
		"db:push": "pnpm --filter @zitare/backend db:push", // ✅ OK
		"db:studio": "pnpm --filter @zitare/backend db:studio" // ✅ OK
	}
}

Why dev is the Exception

Using turbo run dev in parent packages is acceptable because:

  1. It's typically run directly on that package (scoped: pnpm zitare:dev)
  2. Dev tasks are persistent (long-running) and turbo handles them differently
  3. Root never orchestrates dev across all packages simultaneously

The Rule

Parent workspace packages must NEVER have scripts that call turbo run <task> for tasks that turbo orchestrates from the root.

Tasks orchestrated from root (defined in turbo.json):

  • build - Root handles this
  • lint - Root handles this
  • type-check - Root handles this
  • test - Root handles this
  • clean - Root handles this
  • dev - Exception (scoped usage is fine)

How to Fix

If you added a recursive script:

  1. Open the parent package.json (e.g., apps/myapp/package.json)
  2. Remove the problematic script entirely:
  {
    "scripts": {
      "dev": "turbo run dev",
-     "build": "turbo run build",
-     "lint": "turbo run lint",
-     "type-check": "turbo run type-check",
      "db:push": "pnpm --filter @myapp/backend db:push"
    }
  }
  3. The root turbo.json already handles orchestration for these tasks

Affected Locations

Parent packages are located at:

  • apps/*/package.json (e.g., apps/zitare/package.json)
  • games/*/package.json (e.g., games/mana-games/package.json)

Do NOT add turbo scripts here!

Child packages (these are fine):

  • apps/*/apps/*/package.json (e.g., apps/zitare/apps/backend/package.json)
  • packages/*/package.json (e.g., packages/shared-theme/package.json)

Build Issues

Build Fails with "ELIFECYCLE Command failed"

Check for:

  1. Recursive turbo calls (see above)
  2. Missing dependencies in a package
  3. TypeScript errors in source code
  4. Import/export mismatches

Debugging:

# Run build and capture full output
pnpm run build 2>&1 | tee build.log

# Search for actual error (not just ELIFECYCLE)
grep -A10 "error during build" build.log

# Build specific package to isolate issue
pnpm --filter @zitare/backend build

Build Times Out in CI

Symptoms:

  • CI runs for 10+ minutes
  • Timeout before completion
  • "No output has been received in the last 10m0s"

Solution:

This is almost always caused by recursive turbo calls. See the Recursive Turbo Calls section above.

Quick check:

# Locally, check if build completes in reasonable time
time pnpm run build

# Should complete in < 2 minutes for clean build
# Should complete in < 30 seconds for cached build

If it takes longer than 2-3 minutes, you most likely have recursive scripts.


Linting Issues

Lint Hangs or Runs Forever

Same issue as build - recursive turbo calls!

WRONG:

// apps/presi/package.json - DON'T DO THIS!
{
	"scripts": {
		"lint": "pnpm --filter '@presi/*' run lint" // ❌ Recursive
	}
}

CORRECT:

// apps/presi/package.json - Remove the lint script
{
	"scripts": {
		"dev": "pnpm --filter '@presi/*' run dev"
		// No lint script - root turbo handles it
	}
}

Run lint from root:

# Lint all packages
pnpm run lint

# Lint specific package
pnpm --filter @presi/backend lint

# Lint specific project
pnpm turbo run lint --filter=presi

ESLint Errors

Common issues:

  1. Missing eslint config

    # Add shared config
    pnpm add -D @manacore/eslint-config --filter @myapp/backend
    
  2. Incompatible ESLint versions

    # Check versions
    pnpm ls eslint
    
    # Update (replace "latest" with the root's ESLint version if the root is pinned)
    pnpm add -D eslint@latest --filter @myapp/backend
    

Prevention Checklist

When creating a new app or package:

  • DO NOT add build, lint, type-check, or test scripts to parent packages
  • DO add these scripts to child packages (apps/myapp/apps/backend/package.json)
  • DO use project-specific scripts (e.g., db:push, db:studio)
  • DO test locally: pnpm run build should complete in < 2 minutes
  • DO refer to CLAUDE.md for patterns

Quick Validation

# Check for problematic patterns in parent packages
for pkg in apps/*/package.json games/*/package.json; do
  if grep -q '"build".*turbo run build' "$pkg" 2>/dev/null; then
    echo "❌ RECURSIVE SCRIPT FOUND: $pkg"
  fi
done
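
The shell loop above only catches a build script that calls turbo run build. A broader check could be sketched in TypeScript as follows. The names findRecursiveScripts and ROOT_TASKS are hypothetical, and the pnpm pattern is an assumption derived from the WRONG examples earlier in this guide, so treat it as a starting point rather than a complete linter.

```typescript
// Tasks the root turbo.json orchestrates; "dev" is the documented exception.
const ROOT_TASKS = ['build', 'lint', 'type-check', 'test', 'clean'];

// Hypothetical helper: returns the names of scripts in a parent package.json
// that re-invoke a root-orchestrated task through turbo or pnpm.
function findRecursiveScripts(scripts: Record<string, string>): string[] {
	return ROOT_TASKS.filter((task) => {
		const command = scripts[task];
		if (command === undefined) return false;
		const callsTurbo = command.indexOf(`turbo run ${task}`) !== -1;
		// Assumption: any pnpm fan-out (--recursive, -r, --filter) counts as recursive.
		const callsPnpm = /\bpnpm\b.*(--recursive|-r\b|--filter)/.test(command);
		return callsTurbo || callsPnpm;
	});
}

const offending = findRecursiveScripts({
	dev: 'turbo run dev',
	build: 'turbo run build',
	lint: "pnpm --filter '@picture/*' run lint",
});
console.log(offending); // ['build', 'lint']
```

Scripts such as dev or db:push are never reported because only the root-orchestrated task names are inspected.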

NestJS Dependency Injection

Problem: "Nest can't resolve dependencies" Error

Symptoms:

  • NestJS fails to start with error: Nest can't resolve dependencies of the XService (?)
  • Error mentions "argument Function at index [0] is available"
  • The module imports look correct but service still won't inject

Root Cause:

Using type-only imports (import { type X } or import type { X }) for classes that need to be injected. TypeScript erases type-only imports at compile time, so the actual class is not available at runtime for dependency injection.

WRONG - Type-Only Import

// services/mana-core-auth/src/ai/ai.service.ts - DON'T DO THIS!
import { Injectable } from '@nestjs/common';
import { type ConfigService } from '@nestjs/config'; // ❌ Type-only import

@Injectable()
export class AiService {
	constructor(private configService: ConfigService) {
		// NestJS can't inject ConfigService because it was type-only imported!
	}
}

What happens:

  1. TypeScript compiles the code
  2. The type keyword tells TypeScript to erase the import at compile time
  3. The compiled JS has NO import for ConfigService
  4. At runtime, NestJS can't find the ConfigService class to inject
  5. Error: "Nest can't resolve dependencies of the AiService (?)"
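
The sequence above can be imitated with a toy container. This is not NestJS: TinyContainer and its paramTypes argument are illustrative stand-ins for the constructor metadata a DI framework reads at runtime, and Object is used here as an assumed placeholder for what remains once a class reference has been erased.

```typescript
// Minimal stand-in for a DI container. All names are illustrative.
type Ctor<T = unknown> = new (...args: any[]) => T;

class ConfigService {
	get(key: string): string {
		return `value-of-${key}`;
	}
}

class AiService {
	constructor(public readonly configService: ConfigService) {}
}

class TinyContainer {
	private providers = new Set<Ctor>();

	register(provider: Ctor): void {
		this.providers.add(provider);
	}

	// paramTypes plays the role of the constructor metadata emitted by the compiler.
	resolve<T>(target: Ctor<T>, paramTypes: unknown[]): T {
		const args = paramTypes.map((param, index) => {
			if (typeof param !== 'function' || !this.providers.has(param as Ctor)) {
				throw new Error(`Can't resolve dependency at index [${index}] of ${target.name}`);
			}
			return new (param as Ctor)();
		});
		return new target(...args);
	}
}

const container = new TinyContainer();
container.register(ConfigService);

// Regular import: the class reference survives compilation and can be looked up.
const resolved = container.resolve(AiService, [ConfigService]);
console.log(resolved.configService.get('AI_KEY')); // value-of-AI_KEY

// Type-only import: the class reference is gone, so the lookup fails.
try {
	container.resolve(AiService, [Object]);
} catch (error) {
	console.log((error as Error).message); // Can't resolve dependency at index [0] of AiService
}
```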

CORRECT - Regular Import

// services/mana-core-auth/src/ai/ai.service.ts - CORRECT
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config'; // ✅ Regular import

@Injectable()
export class AiService {
	constructor(private configService: ConfigService) {
		// ConfigService is properly imported and can be injected
	}
}

The Rule

For NestJS dependency injection, NEVER use type-only imports (import { type X } or import type { X }) for classes you need to inject.

  • import { ConfigService } - Regular import (works)
  • import { type ConfigService } - Type-only import (breaks DI)
  • import type { MyInterface } - Type-only for interfaces (fine, not injected)
  • import { type MyType, MyClass } - Mixed (MyType erased, MyClass available)
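
To audit a file against this rule, a rough scanner can list every name that is imported type-only. The sketch below is regex-based and findTypeOnlyImports is a hypothetical helper; it assumes single-statement named imports and does not handle default or namespace imports. Any reported name that also appears as a constructor parameter type is a candidate for the DI error.

```typescript
// Hypothetical helper: lists names that a source file imports type-only.
function findTypeOnlyImports(source: string): string[] {
	const names: string[] = [];
	// Group 1 captures "import type", group 2 captures the specifier list.
	const importPattern = /import\s+(type\s+)?\{([^}]*)\}\s+from/g;
	let match: RegExpExecArray | null;
	while ((match = importPattern.exec(source)) !== null) {
		const wholeImportIsTypeOnly = match[1] !== undefined;
		for (const rawSpecifier of (match[2] ?? '').split(',')) {
			const specifier = rawSpecifier.trim();
			if (specifier === '') continue;
			const inlineType = /^type\s+/.test(specifier);
			if (wholeImportIsTypeOnly || inlineType) {
				names.push(specifier.replace(/^type\s+/, ''));
			}
		}
	}
	return names;
}

const sample = `
import { Injectable } from '@nestjs/common';
import { type ConfigService } from '@nestjs/config';
import type { MyInterface } from './types';
import { type MyType, MyClass } from './mixed';
`;
console.log(findTypeOnlyImports(sample)); // ['ConfigService', 'MyInterface', 'MyType']
```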

How to Fix

  1. Find the service with the DI error
  2. Check all imports for classes used in the constructor
  3. Remove the type keyword from class imports:
  import { Injectable } from '@nestjs/common';
- import { type ConfigService } from '@nestjs/config';
+ import { ConfigService } from '@nestjs/config';

  @Injectable()
  export class AiService {
    constructor(private configService: ConfigService) {}
  }
  4. Rebuild and test:
pnpm --filter mana-core-auth build
pnpm --filter mana-core-auth start:dev

Debugging

If you're still getting DI errors after removing type-only imports:

  1. Check the module imports the provider's dependencies:
@Module({
	imports: [ConfigModule], // ← ConfigService needs ConfigModule
	providers: [AiService],
	exports: [AiService],
})
export class AiModule {}
  2. Verify the compiled JavaScript:
# Build the service
pnpm --filter mana-core-auth build

# Check the compiled output
cat services/mana-core-auth/dist/ai/ai.service.js | grep "require"

# Should see:
# const config_1 = require("@nestjs/config");  ✅ Good
# If no require line for @nestjs/config appears, the import was erased  ❌ Bad (type-only import)
  3. Check Docker builds:

If the error only happens in Docker but not locally:

# Build Docker image without cache
docker build --no-cache -f services/mana-core-auth/Dockerfile -t test .

# Check the compiled code in the image
docker run --rm --entrypoint cat test /app/dist/ai/ai.service.js

Related:

  • Commit d69cc607 - Fixed type-only ConfigService import in AiService
  • TypeScript import type { X } vs import { type X } - both erase at compile time
  • Docker layer caching can hide fixes if source wasn't properly copied

Staging Deployment Issues

Overview

This section documents the complete troubleshooting journey for deploying mana-core-auth + chat (backend + web) to staging. It covers GitHub Actions CI/CD simplification, Docker health checks, database setup, and SvelteKit environment variables.

Problem 1: GitHub Running Disabled Workflows

Symptoms:

  • Workflows with .full.yml extension were still running
  • test.full.yml was being recognized as a valid workflow
  • Multiple unnecessary workflows running on every push

What We Tried:

  1. Renaming to .disabled extension → Still ran
  2. Renaming to .full.yml extension → Still ran (GitHub recognizes any .yml in .github/workflows/)

Solution:

  • Rename to .yml.bak extension (GitHub ignores non-.yml files)
# Disable a workflow
mv .github/workflows/test.yml .github/workflows/test.yml.bak

# Re-enable a workflow
mv .github/workflows/test.yml.bak .github/workflows/test.yml

Files Changed:

  • test.yml → test.yml.bak
  • test-coverage.yml → test-coverage.yml.bak
  • ci-pull-request.yml → ci-pull-request.yml.bak
  • dependency-update.yml → dependency-update.yml.bak
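
The filename rule behind this fix can be expressed in a few lines. This is a sketch of the behaviour described in this section, not GitHub's actual implementation, and isActiveWorkflow is a hypothetical name.

```typescript
// A file in .github/workflows/ is treated as a workflow only when its
// name ends in .yml or .yaml (rule as described in this guide).
function isActiveWorkflow(fileName: string): boolean {
	return /\.ya?ml$/.test(fileName);
}

for (const fileName of ['test.yml', 'test.full.yml', 'test.yml.bak']) {
	console.log(fileName, isActiveWorkflow(fileName) ? 'runs' : 'ignored');
}
// test.yml runs
// test.full.yml runs
// test.yml.bak ignored
```

This is why test.full.yml kept running: only the final extension matters, not what precedes it.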

Problem 2: chat-backend Container Unhealthy

Symptoms:

  • Deployment failed with: dependency failed to start: container chat-backend-staging is unhealthy
  • chat-web wouldn't start because it depends on chat-backend being healthy

Debugging Steps:

# Connect to staging server
ssh -i ~/.ssh/hetzner_deploy_key deploy@46.224.108.214

# Check container status
cd ~/manacore-staging
docker compose ps

# Check logs for the failing container
docker compose logs chat-backend --tail=100

# Test health endpoint manually from inside container
docker compose exec chat-backend wget -q -O - http://localhost:3002/api/v1/health

Root Cause 1: Missing Database

The logs showed:

error: database "chat" does not exist

Fix: Create the database manually:

docker compose exec -T postgres psql -U postgres -c "CREATE DATABASE chat;"

Root Cause 2: Wrong Health Check Path

The docker-compose.staging.yml had:

healthcheck:
  test: ['CMD', 'wget', '...', 'http://localhost:3002/api/health'] # ❌ WRONG

But NestJS health endpoint is at /api/v1/health:

healthcheck:
  test: ['CMD', 'wget', '...', 'http://localhost:3002/api/v1/health'] # ✅ CORRECT

How to Verify Health Endpoints:

| Service        | Port | Health Endpoint |
| -------------- | ---- | --------------- |
| mana-core-auth | 3001 | /api/v1/health  |
| chat-backend   | 3002 | /api/v1/health  |
| chat-web       | 3000 | /health         |

# Test from outside the server
curl http://46.224.108.214:3001/api/v1/health
curl http://46.224.108.214:3002/api/v1/health
curl http://46.224.108.214:3000/health
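
The same checks can be scripted so that one call reports every failing service. The sketch below takes the ports and paths from the table above; ServiceHealth, healthUrl, findUnhealthy and the injected probe function are hypothetical names, and the probe is passed in so the example does not assume a particular HTTP client.

```typescript
interface ServiceHealth {
	name: string;
	port: number;
	path: string;
}

// Ports and paths as listed in the health endpoint table.
const SERVICES: ServiceHealth[] = [
	{ name: 'mana-core-auth', port: 3001, path: '/api/v1/health' },
	{ name: 'chat-backend', port: 3002, path: '/api/v1/health' },
	{ name: 'chat-web', port: 3000, path: '/health' },
];

function healthUrl(host: string, service: ServiceHealth): string {
	return `http://${host}:${service.port}${service.path}`;
}

// A probe resolves to the HTTP status code for a URL, or rejects on network errors.
type Probe = (url: string) => Promise<number>;

async function findUnhealthy(host: string, probe: Probe): Promise<string[]> {
	const unhealthy: string[] = [];
	for (const service of SERVICES) {
		try {
			const status = await probe(healthUrl(host, service));
			if (status < 200 || status >= 300) unhealthy.push(service.name);
		} catch {
			unhealthy.push(service.name); // connection refused, timeout, etc.
		}
	}
	return unhealthy;
}
```

With Node 18 or later, a probe could be as small as async (url) => (await fetch(url)).status, and an empty result means all three endpoints answered with a 2xx status.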

Problem 3: SvelteKit Static Environment Variable Imports

Symptoms:

  • Docker build failed with: PUBLIC_MANA_CORE_AUTH_URL is not exported by $env/static/public
  • Build error during npm run build in Docker

Root Cause:

SvelteKit's $env/static/public imports are resolved at build time, not runtime. When building in Docker, these environment variables don't exist.

WRONG - Static Import (Build Time):

// apps/chat/apps/web/src/lib/stores/auth.svelte.ts
import { PUBLIC_MANA_CORE_AUTH_URL } from '$env/static/public'; // ❌ Fails in Docker

const authUrl = PUBLIC_MANA_CORE_AUTH_URL;

CORRECT - Runtime Environment Variable:

// apps/chat/apps/web/src/lib/stores/auth.svelte.ts
import { browser } from '$app/environment';

function getAuthUrl(): string {
	if (browser && typeof window !== 'undefined') {
		// Client-side: check for injected env or use default
		return (
			(window as unknown as { __PUBLIC_MANA_CORE_AUTH_URL__?: string })
				.__PUBLIC_MANA_CORE_AUTH_URL__ ||
			import.meta.env.PUBLIC_MANA_CORE_AUTH_URL ||
			'http://localhost:3001'
		);
	}
	// Server-side: use process.env or default
	return process.env.PUBLIC_MANA_CORE_AUTH_URL || 'http://localhost:3001';
}

The Pattern:

  1. Check if running in browser
  2. Try window-injected variable (for runtime injection)
  3. Try import.meta.env (for Vite build-time)
  4. Fall back to process.env (for SSR)
  5. Use localhost default for development
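
Stripped of the SvelteKit specifics, the lookup order above reduces to a small pure function. This is a simplified sketch: UrlSources and resolveUrl are hypothetical names, and the caller is assumed to supply the window, import.meta.env and process.env values itself.

```typescript
interface UrlSources {
	isBrowser: boolean;
	injected?: string; // value placed on window at runtime
	buildTime?: string; // value baked in by the bundler
	serverEnv?: string; // value read from process.env during SSR
	fallback: string; // development default
}

// Empty strings are treated as "not set", matching the || chain in getAuthUrl.
function resolveUrl(sources: UrlSources): string {
	if (sources.isBrowser) {
		return sources.injected || sources.buildTime || sources.fallback;
	}
	return sources.serverEnv || sources.fallback;
}

// Browser with an injected value: the injected value wins.
console.log(
	resolveUrl({ isBrowser: true, injected: 'http://46.224.108.214:3001', fallback: 'http://localhost:3001' })
);
// Browser with nothing injected: falls back to the development default.
console.log(resolveUrl({ isBrowser: true, injected: '', fallback: 'http://localhost:3001' }));
```

Keeping the order in one function makes it easy to test each branch without a browser or a running server.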

Files Fixed:

  • apps/chat/apps/web/src/lib/stores/auth.svelte.ts
  • apps/chat/apps/web/src/lib/services/feedback.ts

Problem 4: Orphan Docker Containers

Symptoms:

  • Old containers from previous deployments still running
  • docker compose ps shows unexpected services

Fix:

# Remove orphan containers
docker compose down --remove-orphans

# Bring up fresh
docker compose up -d

# Manually remove specific orphans
docker rm -f manadeck-backend-staging manacore-nginx-staging

Problem 5: Client-Side Calling localhost Instead of Public IP

Symptoms:

  • Browser console shows: POST http://localhost:3001/api/v1/auth/register net::ERR_CONNECTION_REFUSED
  • API calls work from server but fail from browser
  • The injected window.__PUBLIC_MANA_CORE_AUTH_URL__ is empty or undefined

Root Cause:

SvelteKit's environment variables work differently on server vs client:

  • Server-side (SSR): Has access to process.env
  • Client-side (browser): Does NOT have access to process.env - needs explicit injection

The initial fix using process.env only worked for SSR. Browser code falls back to localhost.

Solution - Runtime Environment Injection:

  1. Add client URLs to docker-compose.staging.yml:
chat-web:
  environment:
    # Server-side URLs (Docker internal network)
    PUBLIC_BACKEND_URL: http://chat-backend:3002
    PUBLIC_MANA_CORE_AUTH_URL: http://mana-core-auth:3001
    # Client-side URLs (browser access via public IP)
    PUBLIC_BACKEND_URL_CLIENT: http://46.224.108.214:3002
    PUBLIC_MANA_CORE_AUTH_URL_CLIENT: http://46.224.108.214:3001
  2. Inject into HTML via hooks.server.ts:
// apps/chat/apps/web/src/hooks.server.ts
import type { Handle } from '@sveltejs/kit';

const PUBLIC_MANA_CORE_AUTH_URL_CLIENT =
	process.env.PUBLIC_MANA_CORE_AUTH_URL_CLIENT || process.env.PUBLIC_MANA_CORE_AUTH_URL || '';

export const handle: Handle = async ({ event, resolve }) => {
	return resolve(event, {
		transformPageChunk: ({ html }) => {
			const envScript = `<script>
window.__PUBLIC_MANA_CORE_AUTH_URL__ = "${PUBLIC_MANA_CORE_AUTH_URL_CLIENT}";
</script>`;
			return html.replace('<head>', `<head>${envScript}`);
		},
	});
};
  3. Read from window in client code:
// apps/chat/apps/web/src/lib/stores/auth.svelte.ts
import { browser } from '$app/environment';

function getAuthUrl(): string {
	if (browser && typeof window !== 'undefined') {
		const injectedUrl = (window as unknown as { __PUBLIC_MANA_CORE_AUTH_URL__?: string })
			.__PUBLIC_MANA_CORE_AUTH_URL__;
		return injectedUrl || 'http://localhost:3001';
	}
	return process.env.PUBLIC_MANA_CORE_AUTH_URL || 'http://localhost:3001';
}

How to Verify:

Open browser DevTools (F12) → Console:

window.__PUBLIC_MANA_CORE_AUTH_URL__;
// Should show: "http://46.224.108.214:3001"
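
When more than one variable needs to reach the browser, the script tag can be generated from a map instead of a hand-written template string. The helper below is a sketch under that assumption; buildEnvScript is a hypothetical name and is not part of the existing hooks.server.ts. It also serializes values with JSON.stringify so that quotes or angle brackets in a value cannot break out of the script tag.

```typescript
// Builds a <script> tag that assigns each entry to window.__NAME__.
function buildEnvScript(vars: Record<string, string>): string {
	const assignments = Object.keys(vars).map((name) => {
		// JSON.stringify quotes and escapes the value; "<" is escaped as well
		// so a value can never contain a literal closing script tag.
		const literal = JSON.stringify(vars[name]).replace(/</g, '\\u003c');
		return `window.__${name}__ = ${literal};`;
	});
	return `<script>${assignments.join('')}</script>`;
}

console.log(buildEnvScript({ PUBLIC_MANA_CORE_AUTH_URL: 'http://46.224.108.214:3001' }));
// <script>window.__PUBLIC_MANA_CORE_AUTH_URL__ = "http://46.224.108.214:3001";</script>
```

The result could be passed to html.replace('<head>', ...) inside transformPageChunk in the same way as the envScript string shown in step 2.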

Problem 6: CORS Blocking Cross-Origin Requests

Symptoms:

  • Browser console shows: Access to fetch at 'http://46.224.108.214:3001/...' from origin 'http://46.224.108.214:3000' has been blocked by CORS policy
  • API calls work via curl but fail from browser
  • Preflight OPTIONS requests fail

Root Cause:

The browser's same-origin policy blocks cross-origin requests unless the server explicitly allows them, and a different port counts as a different origin:

  • chat-web: http://46.224.108.214:3000
  • mana-core-auth: http://46.224.108.214:3001

Even though they're on the same IP, different ports = different origins = CORS blocked.

Solution:

Add CORS_ORIGINS environment variable to mana-core-auth in docker-compose.staging.yml:

mana-core-auth:
  environment:
    # ... other env vars ...
    # CORS - Allow chat-web and other staging origins
    CORS_ORIGINS: http://46.224.108.214:3000,http://46.224.108.214:3002,http://localhost:3000

CORS Configuration in mana-core-auth:

The service reads CORS_ORIGINS from environment:

// services/mana-core-auth/src/config/configuration.ts
cors: {
  origin: process.env.CORS_ORIGINS?.split(',') || [
    'http://localhost:3000',
    'http://localhost:8081',
  ],
}

// services/mana-core-auth/src/main.ts
const corsOrigins = configService.get<string[]>('cors.origin') || [];
app.enableCors({
  origin: corsOrigins,
  credentials: true,
});
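
The plain split(',') above keeps any whitespace around the commas, so a value such as "http://a:3000, http://b:3000" would produce an origin with a leading space that never matches. A more tolerant parser could look like the sketch below; parseCorsOrigins and isOriginAllowed are hypothetical helpers, not functions that exist in mana-core-auth.

```typescript
// Splits a comma-separated origin list, trimming whitespace and dropping empty entries.
function parseCorsOrigins(raw: string | undefined, defaults: string[]): string[] {
	if (!raw) return defaults;
	const origins = raw
		.split(',')
		.map((origin) => origin.trim())
		.filter((origin) => origin.length > 0);
	return origins.length > 0 ? origins : defaults;
}

// Origins are compared as exact strings, so scheme, host and port must all match.
function isOriginAllowed(origin: string, allowed: string[]): boolean {
	return allowed.indexOf(origin) !== -1;
}

const allowed = parseCorsOrigins('http://46.224.108.214:3000, http://localhost:3000,', [
	'http://localhost:3000',
	'http://localhost:8081',
]);
console.log(allowed); // ['http://46.224.108.214:3000', 'http://localhost:3000']
console.log(isOriginAllowed('http://46.224.108.214:3000', allowed)); // true
console.log(isOriginAllowed('http://46.224.108.214:3001', allowed)); // false
```

The last line shows the port rule in action: the same IP on port 3001 is a different origin and is rejected.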

How to Verify:

# Test CORS preflight
curl -X OPTIONS http://46.224.108.214:3001/api/v1/auth/register \
  -H "Origin: http://46.224.108.214:3000" \
  -H "Access-Control-Request-Method: POST" \
  -v

# Should see in response headers:
# Access-Control-Allow-Origin: http://46.224.108.214:3000

Problem 7: Missing Database Schema

Symptoms:

  • API returns: {"statusCode": 500, "message": "relation \"auth.users\" does not exist"}
  • Registration/login endpoints fail with 500 error
  • Health check passes but auth endpoints fail

Root Cause:

The database exists but the schema hasn't been pushed. Drizzle ORM needs to run db:push to create:

  • auth schema with tables: users, accounts, sessions, passwords, verification, etc.
  • credits schema with tables: balances, transactions, packages, etc.

Why It Happened:

The CD workflow was calling pnpm run db:migrate but that script doesn't exist in the package.json. The correct script is db:push which runs drizzle-kit push.

Solution:

  1. Manual fix (immediate):
ssh -i ~/.ssh/hetzner_deploy_key deploy@46.224.108.214
cd ~/manacore-staging

# Push schema to database (--force skips interactive confirmation)
docker compose exec -T mana-core-auth npx drizzle-kit push --force
  2. Fix CD workflow (permanent):
# .github/workflows/cd-staging.yml - BEFORE
docker compose exec -T mana-core-auth pnpm run db:migrate || echo "Auth migrations skipped"

# .github/workflows/cd-staging.yml - AFTER
docker compose exec -T mana-core-auth npx drizzle-kit push --force || echo "Auth schema push skipped"

How to Verify:

# Check if auth schema exists
docker compose exec -T postgres psql -U postgres -d manacore_auth -c '\dt auth.*'

# Should show 12 tables:
# auth | accounts, invitations, jwks, members, organizations,
#        passwords, security_events, sessions, two_factor_auth,
#        user_settings, users, verification
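
Comparing the \dt output with the expected list by eye is error-prone, so the check can be automated. The sketch below assumes psql's default aligned output format (columns separated by |) and uses the 12 table names listed above; parseTableNames and missingTables are hypothetical helpers.

```typescript
// The 12 auth tables listed in this guide.
const EXPECTED_AUTH_TABLES = [
	'accounts', 'invitations', 'jwks', 'members', 'organizations', 'passwords',
	'security_events', 'sessions', 'two_factor_auth', 'user_settings', 'users', 'verification',
];

// Extracts table names for one schema from aligned "\dt" output.
function parseTableNames(psqlOutput: string, schema: string): string[] {
	const names: string[] = [];
	for (const line of psqlOutput.split('\n')) {
		const cells = line.split('|').map((cell) => cell.trim());
		// Data rows look like: schema | name | type | owner
		if (cells.length >= 3 && cells[0] === schema && cells[2] === 'table') {
			names.push(cells[1]);
		}
	}
	return names;
}

function missingTables(psqlOutput: string): string[] {
	const present = parseTableNames(psqlOutput, 'auth');
	return EXPECTED_AUTH_TABLES.filter((table) => present.indexOf(table) === -1);
}

const partialOutput = [
	' Schema |   Name   | Type  |  Owner',
	'--------+----------+-------+----------',
	' auth   | sessions | table | postgres',
	' auth   | users    | table | postgres',
].join('\n');
console.log(missingTables(partialOutput).length); // 10
```

An empty result means the schema push created every expected table.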

Files Changed:

  • .github/workflows/cd-staging.yml - Line 253: db:migrate → drizzle-kit push --force

Complete Staging Deployment Checklist

Before Deployment

  • Verify docker-compose.staging.yml has correct health check paths
  • Verify CI/CD workflow (cd-staging.yml) has matching health check paths
  • Check that required databases exist or CI creates them
  • Verify CD workflow runs drizzle-kit push --force to create schemas (not db:migrate)
  • Verify CORS_ORIGINS includes all frontend origins
  • Verify PUBLIC_*_CLIENT env vars have correct public IPs for browser access

During Deployment Failure

  1. SSH to server:

    ssh -i ~/.ssh/hetzner_deploy_key deploy@46.224.108.214
    cd ~/manacore-staging
    
  2. Check container status:

    docker compose ps
    
  3. Check logs for failing container:

    docker compose logs <container-name> --tail=100
    
  4. Common fixes:

    # Create missing database
    docker compose exec -T postgres psql -U postgres -c "CREATE DATABASE <dbname>;"
    
    # Restart a service
    docker compose restart <service-name>
    
    # Force recreate
    docker compose up -d --force-recreate <service-name>
    
  5. Verify health:

    curl http://localhost:3001/api/v1/health  # mana-core-auth
    curl http://localhost:3002/api/v1/health  # chat-backend
    curl http://localhost:3000/health         # chat-web
    

After Deployment

  • Verify all health endpoints respond
  • Check container logs for errors
  • Test actual functionality (login, API calls)

Key Files for Staging Deployment

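| File                             | Purpose                               |
| -------------------------------- | ------------------------------------- |
| docker-compose.staging.yml       | Service definitions and health checks |
| .github/workflows/cd-staging.yml | CI/CD deployment workflow             |
| .github/workflows/ci-main.yml    | Docker image builds on push to main   |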

Health Check Patterns

Docker Compose (docker-compose.staging.yml):

healthcheck:
  test: ['CMD', 'wget', '--no-verbose', '--tries=1', '--spider', 'http://localhost:PORT/ENDPOINT']
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

CI/CD Workflow (cd-staging.yml):

# Check from inside container
docker compose exec -T chat-backend wget -q -O - http://localhost:3002/api/v1/health

Lessons Learned

  1. GitHub Workflows: Only files ending in .yml or .yaml in .github/workflows/ are recognized. Use .bak extension to disable.

  2. NestJS Health Endpoints: All NestJS backends use /api/v1/health, not /api/health.

  3. Docker Compose Dependencies: When using depends_on: condition: service_healthy, the dependent service won't start until the health check passes.

  4. Database Creation: Must happen AFTER PostgreSQL is healthy but BEFORE dependent services run migrations.

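  5. SvelteKit Environment Variables: $env/static/public is resolved at build time, so Docker builds fail when the variables are not set. Read values at runtime instead (process.env on the server, a window-injected value in the browser), with import.meta.env only as a build-time fallback.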

  6. Verify Before Commit: Always check both docker-compose.staging.yml AND CI/CD workflows for matching paths.

  7. Server vs Client URLs: Docker internal URLs (e.g., http://mana-core-auth:3001) only work server-side. Browsers need public IPs. Use separate _CLIENT env vars for browser access.

  8. SvelteKit Runtime Injection: Use hooks.server.ts with transformPageChunk to inject environment variables into HTML at runtime. This was the approach that reliably passed server env vars to client code in this deployment.

  9. CORS for Multi-Service Apps: When frontend and backend are on different ports, configure CORS on the backend. Port differences count as different origins (e.g., :3000 vs :3001).

  10. Environment Variable Flow:

    docker-compose.yml → Container env → process.env (SSR) → hooks.server.ts → window.__VAR__ (browser)
    
  11. Database Schema vs Database: Creating a database (CREATE DATABASE) is not enough - Drizzle needs db:push to create schemas and tables. Health checks may pass with empty database, but API calls will fail with "relation does not exist".

  12. Drizzle Kit Interactive Mode: drizzle-kit push prompts for confirmation. Use --force flag in CI/CD to skip interactive mode.


Getting Help

If you encounter an issue not covered here:

  1. Check the GitHub Issues
  2. Review recent commits that may have introduced the issue
  3. Run pnpm clean and pnpm install to reset
  4. Create a new issue with full error logs