managarten/services/mana-crawler/CLAUDE.md

# Mana Crawler Service

Web crawler microservice for systematic website crawling and content extraction.

## Overview

- **Port**: 3023
- **Technology**: NestJS + BullMQ + Cheerio + PostgreSQL + Redis
- **Purpose**: Crawl websites, extract structured content, and queue-based processing

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│              mana-crawler (Port 3023)                       │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Crawl API   │  │ Queue       │  │ Parser      │         │
│  │ Controller  │──│ Service     │──│ Service     │         │
│  └─────────────┘  │ (BullMQ)    │  │ (Cheerio)   │         │
│                   └─────────────┘  └─────────────┘         │
│                         │                │                 │
│                   ┌─────┴────────────────┴─────┐           │
│                   │     Storage Service         │           │
│                   │  (PostgreSQL + Redis)       │           │
│                   └─────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘
```

## Quick Start

### Development

```bash
# 1. Start Redis and PostgreSQL (from monorepo root)
pnpm docker:up

# 2. Install dependencies
pnpm install

# 3. Push database schema
pnpm db:push

# 4. Start in development mode
pnpm dev
```

### Production

```bash
pnpm build
pnpm start
```

## API Endpoints

### Crawl Jobs

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/v1/crawl` | Start a new crawl job |
| GET | `/api/v1/crawl/:jobId` | Get job status |
| GET | `/api/v1/crawl/:jobId/results` | Get crawl results (paginated) |
| DELETE | `/api/v1/crawl/:jobId` | Cancel a crawl job |
| POST | `/api/v1/crawl/:jobId/pause` | Pause a running job |
| POST | `/api/v1/crawl/:jobId/resume` | Resume a paused job |

### Instant Extract

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/v1/extract` | Extract single page (proxy to mana-search) |

### System

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/health` | Health check |
| GET | `/metrics` | Prometheus metrics |
| GET | `/queue/dashboard` | Bull Board dashboard |

## Usage Examples

### Start a Crawl Job

```bash
curl -X POST http://localhost:3023/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "config": {
      "maxDepth": 3,
      "maxPages": 500,
      "respectRobots": true,
      "rateLimit": 2,
      "includePatterns": ["/docs/*"],
      "excludePatterns": ["/api/*", "*.pdf"],
      "selectors": {
        "content": "article.main-content",
        "title": "h1.page-title"
      },
      "output": {
        "format": "markdown",
        "includeScreenshots": false
      }
    }
  }'

# Response:
# {
#   "jobId": "uuid",
#   "status": "pending",
#   "estimatedPages": 500,
#   "queuePosition": 3
# }
```

### Check Job Status

```bash
curl http://localhost:3023/api/v1/crawl/{jobId}

# Response:
# {
#   "jobId": "uuid",
#   "status": "running",
#   "progress": {
#     "discovered": 245,
#     "crawled": 127,
#     "failed": 3,
#     "queued": 115
#   },
#   "startedAt": "2024-01-29T12:00:00Z",
#   "averagePageTime": 450
# }
```

### Get Results

```bash
curl "http://localhost:3023/api/v1/crawl/{jobId}/results?page=1&limit=50"

# Response:
# {
#   "results": [...],
#   "pagination": {
#     "page": 1,
#     "limit": 50,
#     "total": 127
#   }
# }
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | 3023 | API port |
| `DATABASE_URL` | - | PostgreSQL connection URL |
| `REDIS_HOST` | localhost | Redis host |
| `REDIS_PORT` | 6379 | Redis port |
| `CRAWLER_USER_AGENT` | ManaCoreCrawler/1.0 | Crawler user agent |
| `CRAWLER_DEFAULT_RATE_LIMIT` | 2 | Default requests/second |
| `CRAWLER_DEFAULT_MAX_DEPTH` | 3 | Default max crawl depth |
| `CRAWLER_DEFAULT_MAX_PAGES` | 100 | Default max pages per job |
| `CRAWLER_TIMEOUT` | 30000 | Request timeout (ms) |
| `MANA_SEARCH_URL` | http://localhost:3021 | mana-search URL (for extract fallback) |

## Development Commands

```bash
# Install dependencies
pnpm install

# Start development server
pnpm dev

# Build for production
pnpm build

# Start production server
pnpm start

# Type checking
pnpm type-check

# Linting
pnpm lint

# Database commands
pnpm db:push       # Push schema to database
pnpm db:generate   # Generate migrations
pnpm db:migrate    # Run migrations
pnpm db:studio     # Open Drizzle Studio
```

## Database Schema

The crawler uses its own schema (`crawler`) in the shared ManaCore database:

- `crawler.crawl_jobs` - Crawl job configuration and status
- `crawler.crawl_results` - Individual page results

## Queue System

Uses BullMQ with Redis for job processing:

- **Queue Name**: `crawl`
- **Concurrency**: Configurable (default: 5)
- **Retry**: 3 attempts with exponential backoff
- **Dashboard**: Available at `/queue/dashboard`

## Robots.txt Compliance

The crawler respects robots.txt by default:
- Checks robots.txt before crawling each domain
- Caches robots.txt rules in Redis (24h TTL)
- Can be disabled per-job with `respectRobots: false`

## Rate Limiting

Built-in rate limiting to be a good citizen:
- Per-domain rate limiting
- Configurable delay between requests
- Default: 2 requests/second/domain

## Project Structure

```
services/mana-crawler/
├── src/
│   ├── main.ts                 # Application entry point
│   ├── app.module.ts           # Root module
│   ├── config/
│   │   └── configuration.ts    # App configuration
│   ├── db/
│   │   ├── schema/             # Drizzle schemas
│   │   ├── database.module.ts  # Database provider
│   │   └── connection.ts       # DB connection
│   ├── crawler/                # Crawl job management
│   │   ├── crawler.controller.ts
│   │   ├── crawler.service.ts
│   │   └── dto/
│   ├── queue/                  # BullMQ queue processing
│   │   ├── queue.module.ts
│   │   └── processors/
│   ├── parser/                 # HTML parsing (Cheerio)
│   ├── robots/                 # robots.txt handling
│   ├── cache/                  # Redis caching
│   ├── metrics/                # Prometheus metrics
│   └── health/                 # Health check
├── drizzle.config.ts
├── package.json
├── tsconfig.json
└── Dockerfile
```

## Integration with Other Services

### mana-search
The crawler can use mana-search for single-page extraction as a fallback:
```typescript
POST http://mana-search:3021/api/v1/extract
```

### mana-api-gateway
The crawler can be exposed via the API gateway for monetization:
```
POST /v1/crawler/start    → 5 Credits/Job + 1 Credit/100 pages
GET  /v1/crawler/:id      → 0 Credits
```

## Troubleshooting

### Redis connection issues

```bash
# Check Redis
docker exec mana-redis redis-cli ping

# Check queue status
curl http://localhost:3023/queue/dashboard
```

### Jobs stuck in pending

Check that:
1. Redis is running
2. The queue processor is active
3. No rate limit issues

### High memory usage

The crawler loads pages into memory for parsing. For large crawls:
- Reduce `maxPages` per job
- Increase job concurrency instead
- Monitor with `/metrics`