managarten/services/mana-crawler/CLAUDE.md
Till-JS 4a3295d1d0 feat(mana-crawler): add web crawler service
NestJS-based web crawler service for structured content extraction.

Features:
- Depth-controlled crawling with URL pattern filtering
- robots.txt compliance
- HTML/PDF/Markdown content extraction
- BullMQ job queue for async processing
- Redis caching layer
- Prometheus metrics
2026-01-29 22:00:36 +01:00

8 KiB

Mana Crawler Service

Web crawler microservice for systematic website crawling and content extraction.

Overview

  • Port: 3023
  • Technology: NestJS + BullMQ + Cheerio + PostgreSQL + Redis
  • Purpose: Crawl websites, extract structured content, and queue-based processing

Architecture

┌─────────────────────────────────────────────────────────────┐
│              mana-crawler (Port 3023)                       │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Crawl API   │  │ Queue       │  │ Parser      │         │
│  │ Controller  │──│ Service     │──│ Service     │         │
│  └─────────────┘  │ (BullMQ)    │  │ (Cheerio)   │         │
│                   └─────────────┘  └─────────────┘         │
│                         │                │                 │
│                   ┌─────┴────────────────┴─────┐           │
│                   │     Storage Service         │           │
│                   │  (PostgreSQL + Redis)       │           │
│                   └─────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘

Quick Start

Development

# 1. Start Redis and PostgreSQL (from monorepo root)
pnpm docker:up

# 2. Install dependencies
pnpm install

# 3. Push database schema
pnpm db:push

# 4. Start in development mode
pnpm dev

Production

pnpm build
pnpm start

API Endpoints

Crawl Jobs

Method Endpoint Description
POST /api/v1/crawl Start a new crawl job
GET /api/v1/crawl/:jobId Get job status
GET /api/v1/crawl/:jobId/results Get crawl results (paginated)
DELETE /api/v1/crawl/:jobId Cancel a crawl job
POST /api/v1/crawl/:jobId/pause Pause a running job
POST /api/v1/crawl/:jobId/resume Resume a paused job

Instant Extract

Method Endpoint Description
POST /api/v1/extract Extract single page (proxy to mana-search)

System

Method Endpoint Description
GET /health Health check
GET /metrics Prometheus metrics
GET /queue/dashboard Bull Board dashboard

Usage Examples

Start a Crawl Job

curl -X POST http://localhost:3023/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "config": {
      "maxDepth": 3,
      "maxPages": 500,
      "respectRobots": true,
      "rateLimit": 2,
      "includePatterns": ["/docs/*"],
      "excludePatterns": ["/api/*", "*.pdf"],
      "selectors": {
        "content": "article.main-content",
        "title": "h1.page-title"
      },
      "output": {
        "format": "markdown",
        "includeScreenshots": false
      }
    }
  }'

# Response:
# {
#   "jobId": "uuid",
#   "status": "pending",
#   "estimatedPages": 500,
#   "queuePosition": 3
# }

Check Job Status

curl http://localhost:3023/api/v1/crawl/{jobId}

# Response:
# {
#   "jobId": "uuid",
#   "status": "running",
#   "progress": {
#     "discovered": 245,
#     "crawled": 127,
#     "failed": 3,
#     "queued": 115
#   },
#   "startedAt": "2024-01-29T12:00:00Z",
#   "averagePageTime": 450
# }

Get Results

curl "http://localhost:3023/api/v1/crawl/{jobId}/results?page=1&limit=50"

# Response:
# {
#   "results": [...],
#   "pagination": {
#     "page": 1,
#     "limit": 50,
#     "total": 127
#   }
# }

Environment Variables

Variable Default Description
PORT 3023 API port
DATABASE_URL - PostgreSQL connection URL
REDIS_HOST localhost Redis host
REDIS_PORT 6379 Redis port
CRAWLER_USER_AGENT ManaCoreCrawler/1.0 Crawler user agent
CRAWLER_DEFAULT_RATE_LIMIT 2 Default requests/second
CRAWLER_DEFAULT_MAX_DEPTH 3 Default max crawl depth
CRAWLER_DEFAULT_MAX_PAGES 100 Default max pages per job
CRAWLER_TIMEOUT 30000 Request timeout (ms)
MANA_SEARCH_URL http://localhost:3021 mana-search URL (for extract fallback)

Development Commands

# Install dependencies
pnpm install

# Start development server
pnpm dev

# Build for production
pnpm build

# Start production server
pnpm start

# Type checking
pnpm type-check

# Linting
pnpm lint

# Database commands
pnpm db:push       # Push schema to database
pnpm db:generate   # Generate migrations
pnpm db:migrate    # Run migrations
pnpm db:studio     # Open Drizzle Studio

Database Schema

The crawler uses its own schema (crawler) in the shared ManaCore database:

  • crawler.crawl_jobs - Crawl job configuration and status
  • crawler.crawl_results - Individual page results

Queue System

Uses BullMQ with Redis for job processing:

  • Queue Name: crawl
  • Concurrency: Configurable (default: 5)
  • Retry: 3 attempts with exponential backoff
  • Dashboard: Available at /queue/dashboard

Robots.txt Compliance

The crawler respects robots.txt by default:

  • Checks robots.txt before crawling each domain
  • Caches robots.txt rules in Redis (24h TTL)
  • Can be disabled per-job with respectRobots: false

Rate Limiting

Built-in rate limiting to be a good citizen:

  • Per-domain rate limiting
  • Configurable delay between requests
  • Default: 2 requests/second/domain

Project Structure

services/mana-crawler/
├── src/
│   ├── main.ts                 # Application entry point
│   ├── app.module.ts           # Root module
│   ├── config/
│   │   └── configuration.ts    # App configuration
│   ├── db/
│   │   ├── schema/             # Drizzle schemas
│   │   ├── database.module.ts  # Database provider
│   │   └── connection.ts       # DB connection
│   ├── crawler/                # Crawl job management
│   │   ├── crawler.controller.ts
│   │   ├── crawler.service.ts
│   │   └── dto/
│   ├── queue/                  # BullMQ queue processing
│   │   ├── queue.module.ts
│   │   └── processors/
│   ├── parser/                 # HTML parsing (Cheerio)
│   ├── robots/                 # robots.txt handling
│   ├── cache/                  # Redis caching
│   ├── metrics/                # Prometheus metrics
│   └── health/                 # Health check
├── drizzle.config.ts
├── package.json
├── tsconfig.json
└── Dockerfile

Integration with Other Services

The crawler can use mana-search for single-page extraction as a fallback:

POST http://mana-search:3021/api/v1/extract

mana-api-gateway

The crawler can be exposed via the API gateway for monetization:

POST /v1/crawler/start    → 5 Credits/Job + 1 Credit/100 pages
GET  /v1/crawler/:id      → 0 Credits

Troubleshooting

Redis connection issues

# Check Redis
docker exec mana-redis redis-cli ping

# Check queue status
curl http://localhost:3023/queue/dashboard

Jobs stuck in pending

Check that:

  1. Redis is running
  2. The queue processor is active
  3. No rate limit issues

High memory usage

The crawler loads pages into memory for parsing. For large crawls:

  • Reduce maxPages per job
  • Increase job concurrency instead
  • Monitor with /metrics