mirror of https://github.com/Memo-2023/mana-monorepo.git synced 2026-05-16 04:59:41 +02:00

Till-JS 4a3295d1d0 ✨ feat(mana-crawler): add web crawler service

NestJS-based web crawler service for structured content extraction.

Features:
- Depth-controlled crawling with URL pattern filtering
- robots.txt compliance
- HTML/PDF/Markdown content extraction
- BullMQ job queue for async processing
- Redis caching layer
- Prometheus metrics

2026-01-29 22:00:36 +01:00

8 KiB

Raw Blame History

Mana Crawler Service

Web crawler microservice for systematic website crawling and content extraction.

Overview

Port: 3023
Technology: NestJS + BullMQ + Cheerio + PostgreSQL + Redis
Purpose: Crawl websites, extract structured content, and queue-based processing

Architecture

┌─────────────────────────────────────────────────────────────┐
│              mana-crawler (Port 3023)                       │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Crawl API   │  │ Queue       │  │ Parser      │         │
│  │ Controller  │──│ Service     │──│ Service     │         │
│  └─────────────┘  │ (BullMQ)    │  │ (Cheerio)   │         │
│                   └─────────────┘  └─────────────┘         │
│                         │                │                 │
│                   ┌─────┴────────────────┴─────┐           │
│                   │     Storage Service         │           │
│                   │  (PostgreSQL + Redis)       │           │
│                   └─────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘

Quick Start

Development

# 1. Start Redis and PostgreSQL (from monorepo root)
pnpm docker:up

# 2. Install dependencies
pnpm install

# 3. Push database schema
pnpm db:push

# 4. Start in development mode
pnpm dev

Production

pnpm build
pnpm start

API Endpoints

Crawl Jobs

Method	Endpoint	Description
POST	`/api/v1/crawl`	Start a new crawl job
GET	`/api/v1/crawl/:jobId`	Get job status
GET	`/api/v1/crawl/:jobId/results`	Get crawl results (paginated)
DELETE	`/api/v1/crawl/:jobId`	Cancel a crawl job
POST	`/api/v1/crawl/:jobId/pause`	Pause a running job
POST	`/api/v1/crawl/:jobId/resume`	Resume a paused job

Instant Extract

Method	Endpoint	Description
POST	`/api/v1/extract`	Extract single page (proxy to mana-search)

System

Method	Endpoint	Description
GET	`/health`	Health check
GET	`/metrics`	Prometheus metrics
GET	`/queue/dashboard`	Bull Board dashboard

Usage Examples

Start a Crawl Job

curl -X POST http://localhost:3023/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "config": {
      "maxDepth": 3,
      "maxPages": 500,
      "respectRobots": true,
      "rateLimit": 2,
      "includePatterns": ["/docs/*"],
      "excludePatterns": ["/api/*", "*.pdf"],
      "selectors": {
        "content": "article.main-content",
        "title": "h1.page-title"
      },
      "output": {
        "format": "markdown",
        "includeScreenshots": false
      }
    }
  }'

# Response:
# {
#   "jobId": "uuid",
#   "status": "pending",
#   "estimatedPages": 500,
#   "queuePosition": 3
# }

Check Job Status

curl http://localhost:3023/api/v1/crawl/{jobId}

# Response:
# {
#   "jobId": "uuid",
#   "status": "running",
#   "progress": {
#     "discovered": 245,
#     "crawled": 127,
#     "failed": 3,
#     "queued": 115
#   },
#   "startedAt": "2024-01-29T12:00:00Z",
#   "averagePageTime": 450
# }

Get Results

curl "http://localhost:3023/api/v1/crawl/{jobId}/results?page=1&limit=50"

# Response:
# {
#   "results": [...],
#   "pagination": {
#     "page": 1,
#     "limit": 50,
#     "total": 127
#   }
# }

Environment Variables

Variable	Default	Description
`PORT`	3023	API port
`DATABASE_URL`	-	PostgreSQL connection URL
`REDIS_HOST`	localhost	Redis host
`REDIS_PORT`	6379	Redis port
`CRAWLER_USER_AGENT`	ManaCoreCrawler/1.0	Crawler user agent
`CRAWLER_DEFAULT_RATE_LIMIT`	2	Default requests/second
`CRAWLER_DEFAULT_MAX_DEPTH`	3	Default max crawl depth
`CRAWLER_DEFAULT_MAX_PAGES`	100	Default max pages per job
`CRAWLER_TIMEOUT`	30000	Request timeout (ms)
`MANA_SEARCH_URL`	http://localhost:3021	mana-search URL (for extract fallback)

Development Commands

# Install dependencies
pnpm install

# Start development server
pnpm dev

# Build for production
pnpm build

# Start production server
pnpm start

# Type checking
pnpm type-check

# Linting
pnpm lint

# Database commands
pnpm db:push       # Push schema to database
pnpm db:generate   # Generate migrations
pnpm db:migrate    # Run migrations
pnpm db:studio     # Open Drizzle Studio

Database Schema

The crawler uses its own schema (crawler) in the shared ManaCore database:

crawler.crawl_jobs - Crawl job configuration and status
crawler.crawl_results - Individual page results

Queue System

Uses BullMQ with Redis for job processing:

Queue Name: crawl
Concurrency: Configurable (default: 5)
Retry: 3 attempts with exponential backoff
Dashboard: Available at /queue/dashboard

Robots.txt Compliance

The crawler respects robots.txt by default:

Checks robots.txt before crawling each domain
Caches robots.txt rules in Redis (24h TTL)
Can be disabled per-job with respectRobots: false

Rate Limiting

Built-in rate limiting to be a good citizen:

Per-domain rate limiting
Configurable delay between requests
Default: 2 requests/second/domain

Project Structure

services/mana-crawler/
├── src/
│   ├── main.ts                 # Application entry point
│   ├── app.module.ts           # Root module
│   ├── config/
│   │   └── configuration.ts    # App configuration
│   ├── db/
│   │   ├── schema/             # Drizzle schemas
│   │   ├── database.module.ts  # Database provider
│   │   └── connection.ts       # DB connection
│   ├── crawler/                # Crawl job management
│   │   ├── crawler.controller.ts
│   │   ├── crawler.service.ts
│   │   └── dto/
│   ├── queue/                  # BullMQ queue processing
│   │   ├── queue.module.ts
│   │   └── processors/
│   ├── parser/                 # HTML parsing (Cheerio)
│   ├── robots/                 # robots.txt handling
│   ├── cache/                  # Redis caching
│   ├── metrics/                # Prometheus metrics
│   └── health/                 # Health check
├── drizzle.config.ts
├── package.json
├── tsconfig.json
└── Dockerfile

Integration with Other Services

mana-search

The crawler can use mana-search for single-page extraction as a fallback:

POST http://mana-search:3021/api/v1/extract

mana-api-gateway

The crawler can be exposed via the API gateway for monetization:

POST /v1/crawler/start    → 5 Credits/Job + 1 Credit/100 pages
GET  /v1/crawler/:id      → 0 Credits

Troubleshooting

Redis connection issues

# Check Redis
docker exec mana-redis redis-cli ping

# Check queue status
curl http://localhost:3023/queue/dashboard

Jobs stuck in pending

Check that:

Redis is running
The queue processor is active
No rate limit issues

High memory usage

The crawler loads pages into memory for parsing. For large crawls:

Reduce maxPages per job
Increase job concurrency instead
Monitor with /metrics

8 KiB Raw Blame History