Centralized Error Tracking System

Design document for a centralized error tracking solution across all ManaCore applications.

Overview

A centralized error tracking system that allows all ManaCore applications (backends and frontends) to report errors to a single database table in mana-core-auth. This enables unified error monitoring, analysis, and debugging across the entire ecosystem.

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   chat-backend  │    │  picture-web    │    │  zitare-mobile  │
│                 │    │                 │    │                 │
│ ErrorTracking   │    │ errorTracker    │    │ errorTracker    │
│    Filter       │    │  .captureError  │    │  .captureError  │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                    POST /api/v1/errors
                                │
                    ┌───────────▼───────────┐
                    │   mana-core-auth      │
                    │   ErrorLogsController │
                    │          │            │
                    │   ErrorLogsService    │
                    │          │            │
                    │   error_logs table    │
                    └───────────────────────┘

Components

1. Database Schema

Location: services/mana-core-auth/src/db/schema/error-logs.schema.ts

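import { integer, jsonb, pgSchema, text, timestamp, uuid } from 'drizzle-orm/pg-core';
// `users` and the three enums (errorSourceTypeEnum, errorEnvironmentEnum,
// errorSeverityEnum) are not shown here; they are assumed to be defined
// elsewhere in the schema package.

export const errorLogsSchema = pgSchema('error_logs');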

export const errorLogs = errorLogsSchema.table('error_logs', {
  // Primary key
  id: uuid('id').primaryKey().defaultRandom(),

  // Error identification
  errorCode: text('error_code').notNull(),      // e.g., 'VALIDATION_FAILED'
  errorType: text('error_type').notNull(),      // e.g., 'AppError', 'TypeError'
  message: text('message').notNull(),
  stackTrace: text('stack_trace'),

  // Source identification
  appId: text('app_id').notNull(),              // 'chat', 'picture', 'zitare'
  sourceType: errorSourceTypeEnum('source_type'), // 'backend', 'frontend_web', 'frontend_mobile'
  serviceName: text('service_name'),            // 'chat-backend', 'picture-web'

  // User context (optional)
  userId: text('user_id').references(() => users.id, { onDelete: 'set null' }),
  sessionId: text('session_id'),

  // Request metadata (backend errors)
  requestUrl: text('request_url'),
  requestMethod: text('request_method'),
  requestHeaders: jsonb('request_headers'),     // Sanitized - no auth tokens
  requestBody: jsonb('request_body'),           // Sanitized - no passwords
  responseStatusCode: integer('response_status_code'),

  // Classification
  environment: errorEnvironmentEnum('environment'), // 'development', 'staging', 'production'
  severity: errorSeverityEnum('severity'),          // 'debug', 'info', 'warning', 'error', 'critical'

  // Additional context
  context: jsonb('context').default({}),
  fingerprint: text('fingerprint'),             // For error grouping/deduplication

  // Browser/device info (frontend errors)
  userAgent: text('user_agent'),
  browserInfo: jsonb('browser_info'),
  deviceInfo: jsonb('device_info'),

  // Timestamps
  occurredAt: timestamp('occurred_at', { withTimezone: true }).notNull(),
  createdAt: timestamp('created_at', { withTimezone: true }).defaultNow().notNull(),
});

Indexes:

  • appId - Filter by application
  • userId - Find user-specific errors
  • environment - Filter by environment
  • severity - Filter by severity level
  • occurredAt - Time-based queries
  • errorCode - Group by error code
  • fingerprint - Deduplicate similar errors

2. REST API

Endpoint: POST /api/v1/errors

Authentication: Optional (uses OptionalAuthGuard)

Headers:

  • X-App-Id: Application identifier (used as a fallback when appId is not in the body)
  • Authorization: Bearer token (optional, for user context)

Request Body:

interface CreateErrorLogDto {
  // Required
  errorCode: string;      // Max 100 chars
  errorType: string;      // Max 100 chars
  message: string;        // Max 5000 chars

  // Optional
  stackTrace?: string;    // Max 50000 chars
  appId?: string;
  sourceType?: 'backend' | 'frontend_web' | 'frontend_mobile';
  serviceName?: string;
  userId?: string;
  sessionId?: string;
  requestUrl?: string;
  requestMethod?: string;
  requestHeaders?: Record<string, unknown>;
  requestBody?: Record<string, unknown>;
  responseStatusCode?: number;
  environment?: 'development' | 'staging' | 'production';
  severity?: 'debug' | 'info' | 'warning' | 'error' | 'critical';
  context?: Record<string, unknown>;
  fingerprint?: string;
  browserInfo?: Record<string, unknown>;
  deviceInfo?: Record<string, unknown>;
  occurredAt?: string;    // ISO 8601 timestamp
}
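The length limits in the comments above can be checked on the client before a payload is sent. The following is a minimal sketch of that idea, not the service's actual validation: `findLimitViolations` and `truncateToLimits` are hypothetical names, and this document does not say whether the server rejects or truncates oversized fields.

```typescript
// Limits taken from the CreateErrorLogDto comments above.
const LIMITS = {
  errorCode: 100,
  errorType: 100,
  message: 5000,
  stackTrace: 50000,
} as const;

type LimitedField = keyof typeof LIMITS;

interface LimitedFields {
  errorCode: string;
  errorType: string;
  message: string;
  stackTrace?: string;
}

/** Returns the names of the fields whose value exceeds its limit. */
function findLimitViolations(dto: LimitedFields): LimitedField[] {
  return (Object.keys(LIMITS) as LimitedField[]).filter((field) => {
    const value = dto[field];
    return value !== undefined && value.length > LIMITS[field];
  });
}

/** Hypothetical client-side alternative: cut oversized fields instead of failing. */
function truncateToLimits<T extends LimitedFields>(dto: T): T {
  const result = { ...dto };
  for (const field of Object.keys(LIMITS) as LimitedField[]) {
    const value = result[field];
    if (typeof value === 'string' && value.length > LIMITS[field]) {
      (result as LimitedFields)[field] = value.slice(0, LIMITS[field]);
    }
  }
  return result;
}
```

Truncating on the client keeps an oversized stack trace from causing the whole report to be dropped, at the cost of losing the tail of the trace.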

Response:

// Success
{ success: true, id: string }

// Failure (never throws - always returns)
{ success: false, error: string }
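A caller can mirror the "never throws" contract on its side, so that a failed report never turns into a second error. The sketch below is an illustration under assumptions, not the shared package's client: `postErrorLog`, `ReportOptions`, and the injected `fetchImpl` are hypothetical, and only the endpoint path, the two headers, and the response shape come from this document.

```typescript
// Minimal subset of the fetch API used here, so that a stub can be injected.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

type ReportResult = { success: true; id: string } | { success: false; error: string };

interface ReportOptions {
  baseUrl: string;            // e.g. the value of MANA_CORE_AUTH_URL
  appId: string;              // sent as X-App-Id
  authToken?: string | null;  // optional bearer token for user context
  fetchImpl: FetchLike;       // pass the global fetch in real code
}

/** Posts one error log and always resolves, even on network or parse failures. */
async function postErrorLog(
  payload: Record<string, unknown>,
  options: ReportOptions,
): Promise<ReportResult> {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
    'X-App-Id': options.appId,
  };
  if (options.authToken) headers['Authorization'] = `Bearer ${options.authToken}`;

  try {
    const response = await options.fetchImpl(`${options.baseUrl}/api/v1/errors`, {
      method: 'POST',
      headers,
      body: JSON.stringify(payload),
    });
    if (!response.ok) return { success: false, error: `HTTP ${response.status}` };

    const body = (await response.json()) as Partial<{ success: boolean; id: string; error: string }>;
    if (body.success === true && typeof body.id === 'string') {
      return { success: true, id: body.id };
    }
    return { success: false, error: body.error ?? 'Unexpected response shape' };
  } catch (cause) {
    // Network errors and invalid JSON end up here instead of propagating.
    return { success: false, error: cause instanceof Error ? cause.message : String(cause) };
  }
}
```

Injecting `fetchImpl` keeps the function testable with a stub; in an application the global `fetch` would be passed in.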

Batch Endpoint: POST /api/v1/errors/batch

// Request
{ errors: CreateErrorLogDto[] }

// Response
{ success: true, total: number, succeeded: number, failed: number }
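A client that queues errors may want to split the queue into several batch requests and combine the responses. The sketch below assumes a caller-chosen `maxBatchSize`; this document does not specify a server-side batch limit, and `toBatches` and `mergeBatchResponses` are hypothetical helper names.

```typescript
/** Splits queued items into consecutive batches of at most `maxBatchSize` items. */
function toBatches<T>(items: readonly T[], maxBatchSize: number): T[][] {
  if (!Number.isInteger(maxBatchSize) || maxBatchSize < 1) {
    throw new RangeError('maxBatchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += maxBatchSize) {
    batches.push(items.slice(start, start + maxBatchSize));
  }
  return batches;
}

// Shape of the batch response documented above.
interface BatchResponse {
  success: boolean;
  total: number;
  succeeded: number;
  failed: number;
}

/** Sums the per-request responses into one overall result. */
function mergeBatchResponses(responses: readonly BatchResponse[]): BatchResponse {
  const merged: BatchResponse = { success: true, total: 0, succeeded: 0, failed: 0 };
  for (const response of responses) {
    merged.success = merged.success && response.success;
    merged.total += response.total;
    merged.succeeded += response.succeeded;
    merged.failed += response.failed;
  }
  return merged;
}
```

Each batch returned by `toBatches` would be sent as the `errors` array of one POST /api/v1/errors/batch request.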

3. Shared NestJS Package

Package: @manacore/shared-error-tracking

Installation:

pnpm add @manacore/shared-error-tracking

Exports:

// NestJS module and components
import {
  ErrorTrackingModule,
  ErrorTrackingService,
  ErrorTrackingFilter
} from '@manacore/shared-error-tracking/nestjs';

// Frontend clients
import {
  createErrorTracker,
  createSvelteErrorHandler,
  createExpoErrorHandler,
  setupGlobalErrorHandler
} from '@manacore/shared-error-tracking/frontend';

// Type definitions
import type {
  ErrorLogPayload,
  ErrorTrackingConfig
} from '@manacore/shared-error-tracking/types';

NestJS Integration

Module Registration:

// app.module.ts
import { Module } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import { ErrorTrackingModule } from '@manacore/shared-error-tracking/nestjs';

@Module({
  imports: [
    ErrorTrackingModule.forRootAsync({
      useFactory: (configService: ConfigService) => ({
        errorTrackingUrl: configService.get('MANA_CORE_AUTH_URL'),
        appId: 'chat',
        serviceName: 'chat-backend',
        enableLocalLogging: configService.get('NODE_ENV') !== 'production',
      }),
      inject: [ConfigService],
    }),
  ],
})
export class AppModule {}

Global Exception Filter:

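// main.ts
import { NestFactory } from '@nestjs/core';
import { ErrorTrackingFilter } from '@manacore/shared-error-tracking/nestjs';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Resolve the filter from the DI container so its dependencies are injected
  const errorTrackingFilter = app.get(ErrorTrackingFilter);
  app.useGlobalFilters(errorTrackingFilter);

  await app.listen(3002);
}

bootstrap();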

Manual Error Reporting:

import { Injectable } from '@nestjs/common';
import { ErrorTrackingService } from '@manacore/shared-error-tracking/nestjs';

@Injectable()
export class SomeService {
  constructor(private readonly errorTracking: ErrorTrackingService) {}

  async riskyOperation() {
    try {
      // ... operation
    } catch (error) {
      // Report the non-critical error without rethrowing it
      this.errorTracking.reportError({
        errorCode: 'SYNC_WARNING',
        errorType: 'OperationWarning',
        message: 'Non-critical sync failed',
        severity: 'warning',
        context: {
          operationType: 'background-sync',
          cause: error instanceof Error ? error.message : String(error),
        },
      });
    }
  }
}

4. Frontend Clients

SvelteKit Integration

Setup:

// src/lib/error-tracking.ts
import { createErrorTracker } from '@manacore/shared-error-tracking/frontend';
import { PUBLIC_MANA_CORE_AUTH_URL } from '$env/static/public';
// authStore is the app's own auth store; its import is omitted because the path is app-specific.

export const errorTracker = createErrorTracker({
  errorTrackingUrl: PUBLIC_MANA_CORE_AUTH_URL,
  appId: 'chat',
  serviceName: 'chat-web',
  environment: import.meta.env.MODE === 'production' ? 'production' : 'development',
  getAuthToken: async () => {
    // Return JWT token if user is authenticated
    return authStore.getToken();
  },
});

SvelteKit Hooks:

// src/hooks.client.ts
import { createSvelteErrorHandler, setupGlobalErrorHandler } from '@manacore/shared-error-tracking/frontend';
import { errorTracker } from '$lib/error-tracking';

// Capture unhandled errors and promise rejections
if (typeof window !== 'undefined') {
  setupGlobalErrorHandler(errorTracker);
}

// Export for SvelteKit
export const handleError = createSvelteErrorHandler(errorTracker);

Manual Error Capture:

import { errorTracker } from '$lib/error-tracking';

async function loadData() {
  try {
    const response = await fetch('/api/data');
    if (!response.ok) throw new Error('Failed to load data');
    return response.json();
  } catch (error) {
    errorTracker.captureError(error, {
      component: 'DataLoader',
      action: 'loadData',
    });
    throw error; // Re-throw for UI error boundary
  }
}
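In a catch block the caught value is of type `unknown`, so a tracker has to cope with non-Error values before it can fill `errorType`, `message`, and `stackTrace`. The sketch below shows one way to do that; `normalizeThrown` and the `'StringError'`/`'UnknownError'` labels are hypothetical and not taken from the shared package.

```typescript
interface NormalizedError {
  errorType: string;
  message: string;
  stackTrace?: string;
}

/** Maps any thrown value to the error fields of an error log payload. */
function normalizeThrown(thrown: unknown): NormalizedError {
  if (thrown instanceof Error) {
    // `name` is 'TypeError', 'RangeError', ... or the name set by a custom error class.
    return {
      errorType: thrown.name || 'Error',
      message: thrown.message,
      stackTrace: thrown.stack,
    };
  }
  if (typeof thrown === 'string') {
    return { errorType: 'StringError', message: thrown };
  }
  let message: string;
  try {
    // JSON.stringify returns undefined for values such as `undefined` or functions.
    message = JSON.stringify(thrown) ?? String(thrown);
  } catch {
    // Circular structures cannot be serialized; fall back to String().
    message = String(thrown);
  }
  return { errorType: 'UnknownError', message };
}
```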

Expo/React Native Integration

Setup:

// src/lib/error-tracking.ts
import { createErrorTracker, createExpoErrorHandler } from '@manacore/shared-error-tracking/frontend';
// authStore is the app's own auth store; its import is omitted because the path is app-specific.

export const errorTracker = createErrorTracker({
  errorTrackingUrl: process.env.EXPO_PUBLIC_MANA_CORE_AUTH_URL!,
  appId: 'chat',
  serviceName: 'chat-mobile',
  environment: __DEV__ ? 'development' : 'production',
  getAuthToken: async () => authStore.getToken(),
});

export const { errorHandler } = createExpoErrorHandler(errorTracker);

Error Boundary:

// App.tsx
import ErrorBoundary from 'react-native-error-boundary';
import { errorHandler } from '@/lib/error-tracking';

export default function App() {
  return (
    <ErrorBoundary onError={errorHandler}>
      <RootNavigator />
    </ErrorBoundary>
  );
}

Configuration

Environment Variables

mana-core-auth:

# No additional config needed - uses existing DATABASE_URL

Backend apps:

MANA_CORE_AUTH_URL=http://localhost:3001

Frontend apps (SvelteKit):

PUBLIC_MANA_CORE_AUTH_URL=http://localhost:3001

Mobile apps (Expo):

EXPO_PUBLIC_MANA_CORE_AUTH_URL=http://localhost:3001

Error Tracking Config Options

interface ErrorTrackingConfig {
  /** URL of mana-core-auth service */
  errorTrackingUrl: string;

  /** App identifier (e.g., 'chat', 'picture') */
  appId: string;

  /** Service name for identification */
  serviceName?: string;

  /** Default environment if not detected */
  environment?: 'development' | 'staging' | 'production';

  /** Log errors locally as well (default: true in dev) */
  enableLocalLogging?: boolean;

  /** Custom headers for requests */
  customHeaders?: Record<string, string>;

  /** Function to get auth token (optional) */
  getAuthToken?: () => Promise<string | null>;
}
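The defaults described in the comments above could be resolved once at startup. The sketch below is one possible reading, not the package's behavior: `resolveConfig` is a hypothetical name, and the `'development'` fallback, the `serviceName` fallback to `appId`, and the trailing-slash trimming are assumptions.

```typescript
type Environment = 'development' | 'staging' | 'production';

// Subset of ErrorTrackingConfig that has defaults to resolve.
interface ConfigInput {
  errorTrackingUrl: string;
  appId: string;
  serviceName?: string;
  environment?: Environment;
  enableLocalLogging?: boolean;
}

/**
 * Fills in defaults. `detected` stands for whatever environment detection the
 * host app performs; the configured `environment` is only used when detection
 * yields nothing.
 */
function resolveConfig(input: ConfigInput, detected?: Environment): Required<ConfigInput> {
  const environment = detected ?? input.environment ?? 'development'; // assumed final fallback
  return {
    errorTrackingUrl: input.errorTrackingUrl.replace(/\/+$/, ''), // avoid '//' when joining paths
    appId: input.appId,
    serviceName: input.serviceName ?? input.appId, // assumed fallback
    environment,
    // "default: true in dev": local logging is on unless explicitly configured
    enableLocalLogging: input.enableLocalLogging ?? (environment === 'development'),
  };
}
```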

Security Considerations

Automatic Sanitization

The system automatically sanitizes sensitive data before storage:

Headers sanitized:

  • authorization
  • cookie
  • x-api-key
  • api-key

Body fields sanitized:

  • password
  • token
  • secret
  • apikey
  • api_key

Data Retention

Consider implementing:

  • Automatic cleanup of old errors (e.g., older than 30 days)
  • Aggregation of repeated errors
  • Storage limits per app

Error Grouping

Errors are grouped by fingerprint, which is auto-generated from:

  • errorCode
  • errorType
  • appId
  • requestUrl (path only, no query params)
  • requestMethod

This makes recurring issues easy to spot and shows whether a fix actually reduced how often an error occurs.
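A fingerprint built from the five fields listed above could look like the sketch below. The separator, the upper-casing of the method, and the FNV-1a-style hash are assumptions chosen to keep the example dependency-free; this document does not state which hash the service uses, and `fingerprintSource`, `pathOnly`, and `fnv1aHex` are hypothetical names.

```typescript
interface FingerprintInput {
  errorCode: string;
  errorType: string;
  appId: string;
  requestUrl?: string;
  requestMethod?: string;
}

/** Keeps only the path of a URL: drops query string, fragment, scheme, and host. */
function pathOnly(requestUrl: string): string {
  const [withoutQuery = ''] = requestUrl.split(/[?#]/, 1);
  return withoutQuery.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]*/i, '') || '/';
}

/** Joins the grouping fields into one canonical string. */
function fingerprintSource(input: FingerprintInput): string {
  return [
    input.errorCode,
    input.errorType,
    input.appId,
    input.requestUrl ? pathOnly(input.requestUrl) : '',
    (input.requestMethod ?? '').toUpperCase(),
  ].join('|');
}

/** 32-bit FNV-1a-style hash over UTF-16 code units, rendered as 8 hex characters. */
function fnv1aHex(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

function fingerprint(input: FingerprintInput): string {
  return fnv1aHex(fingerprintSource(input));
}
```

Because the query string is dropped, two failures of the same endpoint with different query parameters get the same fingerprint. A 32-bit hash is short and can collide at high volume, so a production system would more likely use a cryptographic digest.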

Querying Errors

Example Queries

Recent errors by app:

SELECT * FROM error_logs.error_logs
WHERE app_id = 'chat'
  AND occurred_at > NOW() - INTERVAL '24 hours'
ORDER BY occurred_at DESC
LIMIT 100;

Error frequency by type:

SELECT error_code, COUNT(*) as count
FROM error_logs.error_logs
WHERE occurred_at > NOW() - INTERVAL '7 days'
GROUP BY error_code
ORDER BY count DESC;

User-specific errors:

SELECT * FROM error_logs.error_logs
WHERE user_id = 'user_123'
ORDER BY occurred_at DESC
LIMIT 50;

Errors by fingerprint (grouped):

-- Group on fingerprint only: messages can differ between occurrences of the same error
SELECT fingerprint,
       MIN(error_code) AS error_code,
       MIN(message) AS sample_message,
       COUNT(*) AS occurrences,
       MIN(occurred_at) AS first_seen,
       MAX(occurred_at) AS last_seen
FROM error_logs.error_logs
WHERE environment = 'production'
  AND occurred_at > NOW() - INTERVAL '24 hours'
GROUP BY fingerprint
ORDER BY occurrences DESC
LIMIT 20;

Future Enhancements

  • Dashboard UI - Web interface for viewing/filtering errors
  • Alerting - Slack/email notifications for critical errors
  • Rate Limiting - Prevent error flooding
  • Sampling - Sample high-volume errors in production
  • Source Maps - Frontend stack trace deobfuscation
  • Metrics - Error rate trends and SLI tracking