mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-20 00:01:24 +02:00
NestJS-based web crawler service for structured content extraction. Features: - Depth-controlled crawling with URL pattern filtering - robots.txt compliance - HTML/PDF/Markdown content extraction - BullMQ job queue for async processing - Redis caching layer - Prometheus metrics
# Mana Crawler Service

Web crawler microservice for systematic website crawling and content extraction.

## Overview

- **Port**: 3023
- **Technology**: NestJS + BullMQ + Cheerio + PostgreSQL + Redis
- **Purpose**: Crawl websites, extract structured content, and process jobs through a queue

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                  mana-crawler (Port 3023)                   │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Crawl API  │  │    Queue    │  │   Parser    │          │
│  │ Controller  │──│   Service   │──│   Service   │          │
│  └─────────────┘  │  (BullMQ)   │  │  (Cheerio)  │          │
│                   └─────────────┘  └─────────────┘          │
│                          │                │                 │
│                    ┌─────┴────────────────┴─────┐           │
│                    │      Storage Service       │           │
│                    │    (PostgreSQL + Redis)    │           │
│                    └────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘
```

## Quick Start

### Development

```bash
# 1. Start Redis and PostgreSQL (from monorepo root)
pnpm docker:up

# 2. Install dependencies
pnpm install

# 3. Push database schema
pnpm db:push

# 4. Start in development mode
pnpm dev
```

### Production

```bash
pnpm build
pnpm start
```

## API Endpoints

### Crawl Jobs

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/v1/crawl` | Start a new crawl job |
| GET | `/api/v1/crawl/:jobId` | Get job status |
| GET | `/api/v1/crawl/:jobId/results` | Get crawl results (paginated) |
| DELETE | `/api/v1/crawl/:jobId` | Cancel a crawl job |
| POST | `/api/v1/crawl/:jobId/pause` | Pause a running job |
| POST | `/api/v1/crawl/:jobId/resume` | Resume a paused job |

### Instant Extract

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/v1/extract` | Extract a single page (proxied to mana-search) |

### System

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/health` | Health check |
| GET | `/metrics` | Prometheus metrics |
| GET | `/queue/dashboard` | Bull Board dashboard |

## Usage Examples

### Start a Crawl Job

```bash
curl -X POST http://localhost:3023/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "config": {
      "maxDepth": 3,
      "maxPages": 500,
      "respectRobots": true,
      "rateLimit": 2,
      "includePatterns": ["/docs/*"],
      "excludePatterns": ["/api/*", "*.pdf"],
      "selectors": {
        "content": "article.main-content",
        "title": "h1.page-title"
      },
      "output": {
        "format": "markdown",
        "includeScreenshots": false
      }
    }
  }'

# Response:
# {
#   "jobId": "uuid",
#   "status": "pending",
#   "estimatedPages": 500,
#   "queuePosition": 3
# }
```
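
How `includePatterns` and `excludePatterns` are evaluated is not spelled out in this README. One plausible reading, sketched below, treats each pattern as a glob matched against the URL path, where `*` matches any run of characters, excludes win over includes, and an empty include list admits every path. `globToRegExp` and `shouldCrawl` are hypothetical names used for illustration, not functions of the service.

```typescript
// Hypothetical helpers: the service's real matching rules may differ.

/** Convert a simple glob ("*" = any run of characters) into an anchored RegExp. */
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + "$");
}

/** Decide whether a URL passes the include/exclude filters (assumed semantics). */
function shouldCrawl(
  url: string,
  includePatterns: string[],
  excludePatterns: string[],
): boolean {
  const path = new URL(url).pathname;
  const matches = (patterns: string[]): boolean =>
    patterns.some((pattern) => globToRegExp(pattern).test(path));

  if (matches(excludePatterns)) return false; // excludes take priority
  return includePatterns.length === 0 || matches(includePatterns);
}

// Patterns taken from the request body above.
const include = ["/docs/*"];
const exclude = ["/api/*", "*.pdf"];

console.log(shouldCrawl("https://docs.example.com/docs/intro", include, exclude)); // true
console.log(shouldCrawl("https://docs.example.com/docs/guide.pdf", include, exclude)); // false
console.log(shouldCrawl("https://docs.example.com/blog/post", include, exclude)); // false
```

Giving excludes priority means a broad include such as `/docs/*` can be narrowed without rewriting it.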

### Check Job Status

```bash
curl http://localhost:3023/api/v1/crawl/{jobId}

# Response:
# {
#   "jobId": "uuid",
#   "status": "running",
#   "progress": {
#     "discovered": 245,
#     "crawled": 127,
#     "failed": 3,
#     "queued": 115
#   },
#   "startedAt": "2024-01-29T12:00:00Z",
#   "averagePageTime": 450
# }
```

### Get Results

```bash
curl "http://localhost:3023/api/v1/crawl/{jobId}/results?page=1&limit=50"

# Response:
# {
#   "results": [...],
#   "pagination": {
#     "page": 1,
#     "limit": 50,
#     "total": 127
#   }
# }
```
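
Because results are paginated, a client usually has to walk the pages until `total` items have been collected. The sketch below shows one way to do that, assuming the response shape from the example above. `PageFetcher` and `fetchAllResults` are hypothetical names, and the HTTP call is replaced by an in-memory stub so the block runs on its own.

```typescript
// Response shape taken from the example above; result items are left opaque.
interface ResultsPage {
  results: unknown[];
  pagination: { page: number; limit: number; total: number };
}

// Hypothetical abstraction: any function that returns one page of results.
type PageFetcher = (page: number, limit: number) => Promise<ResultsPage>;

/** Collect all results by requesting pages until `total` items have been seen. */
async function fetchAllResults(
  fetchPage: PageFetcher,
  limit: number = 50,
): Promise<unknown[]> {
  const all: unknown[] = [];
  let page = 1;
  while (true) {
    const { results, pagination } = await fetchPage(page, limit);
    all.push(...results);
    // Stop once everything is collected, or on an empty page to avoid looping forever.
    if (results.length === 0 || all.length >= pagination.total) break;
    page += 1;
  }
  return all;
}

// In-memory stub standing in for GET /api/v1/crawl/{jobId}/results.
const stubFetcher: PageFetcher = async (page, limit) => {
  const total = 127;
  const start = (page - 1) * limit;
  const count = Math.max(0, Math.min(limit, total - start));
  const results = Array.from({ length: count }, (_, i) => ({ index: start + i }));
  return { results, pagination: { page, limit, total } };
};

fetchAllResults(stubFetcher, 50).then((all) => console.log(all.length)); // 127
```

In a real client, `fetchPage` would wrap an HTTP request to the results endpoint with the `page` and `limit` query parameters shown above.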

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | 3023 | API port |
| `DATABASE_URL` | - | PostgreSQL connection URL |
| `REDIS_HOST` | localhost | Redis host |
| `REDIS_PORT` | 6379 | Redis port |
| `CRAWLER_USER_AGENT` | ManaCoreCrawler/1.0 | Crawler user agent |
| `CRAWLER_DEFAULT_RATE_LIMIT` | 2 | Default requests/second |
| `CRAWLER_DEFAULT_MAX_DEPTH` | 3 | Default max crawl depth |
| `CRAWLER_DEFAULT_MAX_PAGES` | 100 | Default max pages per job |
| `CRAWLER_TIMEOUT` | 30000 | Request timeout (ms) |
| `MANA_SEARCH_URL` | http://localhost:3021 | mana-search URL (for extract fallback) |
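
The table above can be read as a typed configuration object with defaults. The following is a minimal sketch of such a loader; it is not the content of `src/config/configuration.ts`, and `CrawlerConfig`, `intFromEnv`, and `loadConfig` are hypothetical names. The environment is passed in as a plain object so the function can be exercised without touching the real process environment.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical shape; field names are illustrative.
interface CrawlerConfig {
  port: number;
  databaseUrl: string | undefined; // no default in the table
  redis: { host: string; port: number };
  crawler: {
    userAgent: string;
    defaultRateLimit: number; // requests/second
    defaultMaxDepth: number;
    defaultMaxPages: number;
    timeoutMs: number;
  };
  manaSearchUrl: string;
}

/** Read an integer variable, falling back to the documented default. */
function intFromEnv(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!isFinite(value) || Math.floor(value) !== value) {
    throw new Error(key + ' must be an integer, got "' + raw + '"');
  }
  return value;
}

/** Build the configuration from environment variables (defaults from the table above). */
function loadConfig(env: Env): CrawlerConfig {
  return {
    port: intFromEnv(env, "PORT", 3023),
    databaseUrl: env["DATABASE_URL"],
    redis: {
      host: env["REDIS_HOST"] || "localhost",
      port: intFromEnv(env, "REDIS_PORT", 6379),
    },
    crawler: {
      userAgent: env["CRAWLER_USER_AGENT"] || "ManaCoreCrawler/1.0",
      defaultRateLimit: intFromEnv(env, "CRAWLER_DEFAULT_RATE_LIMIT", 2),
      defaultMaxDepth: intFromEnv(env, "CRAWLER_DEFAULT_MAX_DEPTH", 3),
      defaultMaxPages: intFromEnv(env, "CRAWLER_DEFAULT_MAX_PAGES", 100),
      timeoutMs: intFromEnv(env, "CRAWLER_TIMEOUT", 30000),
    },
    manaSearchUrl: env["MANA_SEARCH_URL"] || "http://localhost:3021",
  };
}

console.log(loadConfig({}).port); // 3023
console.log(loadConfig({ PORT: "4000" }).port); // 4000
```

In a running service the argument would typically be `process.env`. Failing fast on a non-integer value surfaces a misconfigured deployment at startup instead of during a crawl.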

## Development Commands

```bash
# Install dependencies
pnpm install

# Start development server
pnpm dev

# Build for production
pnpm build

# Start production server
pnpm start

# Type checking
pnpm type-check

# Linting
pnpm lint

# Database commands
pnpm db:push      # Push schema to database
pnpm db:generate  # Generate migrations
pnpm db:migrate   # Run migrations
pnpm db:studio    # Open Drizzle Studio
```

## Database Schema

The crawler uses its own schema (`crawler`) in the shared ManaCore database:

- `crawler.crawl_jobs` - Crawl job configuration and status
- `crawler.crawl_results` - Individual page results

## Queue System

Uses BullMQ with Redis for job processing:

- **Queue Name**: `crawl`
- **Concurrency**: Configurable (default: 5)
- **Retry**: 3 attempts with exponential backoff
- **Dashboard**: Available at `/queue/dashboard`
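
To make the retry policy concrete, the sketch below expresses "3 attempts with exponential backoff" as a job-options object and computes the resulting waits. The `attempts` and `backoff` fields mirror BullMQ's documented job options, but the interface is declared locally so the block has no dependencies. The 1000 ms base delay is an assumption, since this README does not state one, and `retryDelayMs` is a hypothetical helper.

```typescript
// Local stand-in for the relevant subset of BullMQ job options.
interface RetryOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
}

const crawlJobOptions: RetryOptions = {
  attempts: 3, // from the list above
  backoff: { type: "exponential", delay: 1000 }, // base delay is assumed, not documented here
};

/** Wait before retry number `retry` (1-based); the wait doubles with each retry. */
function retryDelayMs(retry: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, retry - 1);
}

// With 3 attempts there are at most 2 retries after the first failure.
const delays: number[] = [];
for (let retry = 1; retry < crawlJobOptions.attempts; retry++) {
  delays.push(retryDelayMs(retry, crawlJobOptions.backoff.delay));
}
console.log(delays); // [ 1000, 2000 ]
```

Doubling the wait gives a struggling target site progressively more time to recover before the job is marked as failed.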

## Robots.txt Compliance

The crawler respects robots.txt by default:
- Checks robots.txt before crawling each domain
- Caches robots.txt rules in Redis (24h TTL)
- Can be disabled per-job with `respectRobots: false`
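
The check itself can be illustrated with a deliberately simplified sketch. It covers only the matching step (no Redis caching, no `*` or `$` wildcards inside paths, no `Crawl-delay`), picks the group for the crawler's user agent with a fallback to `*`, and lets the longest matching rule win. `parseRobots` and `isAllowed` are hypothetical names; the service's actual logic in `src/robots/` may differ.

```typescript
// Simplified sketch: prefix matching only.
interface RobotsRule {
  allow: boolean;
  path: string;
}

/** Return the rules for `userAgent`, falling back to the "*" group. */
function parseRobots(robotsTxt: string, userAgent: string): RobotsRule[] {
  const groups: Record<string, RobotsRule[]> = {};
  let currentAgents: string[] = [];
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) currentAgents = [];
      const agent = value.toLowerCase();
      currentAgents.push(agent);
      if (!groups[agent]) groups[agent] = [];
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    // An empty Disallow value means "no restriction" and adds no rule.
    if ((field === "allow" || field === "disallow") && value !== "") {
      for (const agent of currentAgents) {
        groups[agent].push({ allow: field === "allow", path: value });
      }
    }
  }

  const ua = userAgent.toLowerCase();
  const specific = Object.keys(groups).filter(
    (agent) => agent !== "*" && ua.indexOf(agent) !== -1,
  )[0];
  return specific !== undefined ? groups[specific] : (groups["*"] || []);
}

/** Longest matching rule wins (Allow on ties); with no match the path is allowed. */
function isAllowed(rules: RobotsRule[], path: string): boolean {
  let best: RobotsRule | null = null;
  for (const rule of rules) {
    if (path.indexOf(rule.path) !== 0) continue; // not a prefix of the path
    const longer = best === null || rule.path.length > best.path.length;
    const tieAllow = best !== null && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best === null ? true : best.allow;
}

const robotsTxt = [
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /private/press/",
  "",
  "User-agent: ManaCoreCrawler",
  "Disallow: /drafts/",
].join("\n");

const rules = parseRobots(robotsTxt, "ManaCoreCrawler/1.0");
console.log(isAllowed(rules, "/drafts/post")); // false
console.log(isAllowed(rules, "/private/x")); // true: the specific group replaces the "*" group
```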

## Rate Limiting

Built-in rate limiting keeps the crawler a good citizen:
- Per-domain rate limiting
- Configurable delay between requests
- Default: 2 requests/second/domain
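
Per-domain limiting can be pictured as a table of "next free slot" timestamps, one per domain. The sketch below is a minimal in-memory illustration of that idea, not the service's implementation; `DomainRateLimiter` and `reserve` are hypothetical names, and the clock is passed in explicitly so the behaviour is deterministic.

```typescript
/** Tracks, per domain, the earliest time (ms) at which the next request may start. */
class DomainRateLimiter {
  private nextSlot: Record<string, number> = {};
  private intervalMs: number;

  constructor(requestsPerSecond: number) {
    this.intervalMs = 1000 / requestsPerSecond;
  }

  /** Return how long to wait (ms) before requesting from `domain`, and reserve that slot. */
  reserve(domain: string, nowMs: number): number {
    const slot = Math.max(nowMs, this.nextSlot[domain] || 0);
    this.nextSlot[domain] = slot + this.intervalMs;
    return slot - nowMs;
  }
}

const limiter = new DomainRateLimiter(2); // 2 requests/second/domain, the documented default

console.log(limiter.reserve("docs.example.com", 0)); // 0
console.log(limiter.reserve("docs.example.com", 0)); // 500
console.log(limiter.reserve("other.example.org", 0)); // 0
console.log(limiter.reserve("docs.example.com", 1200)); // 0
```

A caller would wait for the returned number of milliseconds before issuing the request. With several workers, the slot table would need to live in shared storage such as Redis instead of process memory.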

## Project Structure

```
services/mana-crawler/
├── src/
│   ├── main.ts                    # Application entry point
│   ├── app.module.ts              # Root module
│   ├── config/
│   │   └── configuration.ts       # App configuration
│   ├── db/
│   │   ├── schema/                # Drizzle schemas
│   │   ├── database.module.ts     # Database provider
│   │   └── connection.ts          # DB connection
│   ├── crawler/                   # Crawl job management
│   │   ├── crawler.controller.ts
│   │   ├── crawler.service.ts
│   │   └── dto/
│   ├── queue/                     # BullMQ queue processing
│   │   ├── queue.module.ts
│   │   └── processors/
│   ├── parser/                    # HTML parsing (Cheerio)
│   ├── robots/                    # robots.txt handling
│   ├── cache/                     # Redis caching
│   ├── metrics/                   # Prometheus metrics
│   └── health/                    # Health check
├── drizzle.config.ts
├── package.json
├── tsconfig.json
└── Dockerfile
```

## Integration with Other Services

### mana-search
The crawler can use mana-search for single-page extraction as a fallback:
```
POST http://mana-search:3021/api/v1/extract
```

### mana-api-gateway
The crawler can be exposed via the API gateway for monetization:
```
POST /v1/crawler/start  → 5 Credits/Job + 1 Credit/100 pages
GET  /v1/crawler/:id    → 0 Credits
```
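
As a worked example of the pricing above, the sketch below computes the credits for one job. How a partial block of 100 pages is billed is not stated here; the code assumes it is rounded up, which is an assumption, not documented gateway behaviour. `crawlCostCredits` is a hypothetical helper.

```typescript
const CREDITS_PER_JOB = 5;
const CREDITS_PER_100_PAGES = 1;

/** Credits for one crawl job. Assumes partial blocks of 100 pages are rounded up. */
function crawlCostCredits(pagesCrawled: number): number {
  if (pagesCrawled < 0 || Math.floor(pagesCrawled) !== pagesCrawled) {
    throw new RangeError("pagesCrawled must be a non-negative integer");
  }
  return CREDITS_PER_JOB + Math.ceil(pagesCrawled / 100) * CREDITS_PER_100_PAGES;
}

console.log(crawlCostCredits(0)); // 5
console.log(crawlCostCredits(127)); // 7
console.log(crawlCostCredits(500)); // 10
```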

## Troubleshooting

### Redis connection issues

```bash
# Check Redis
docker exec mana-redis redis-cli ping

# Check queue status
curl http://localhost:3023/queue/dashboard
```

### Jobs stuck in pending

Check that:
1. Redis is running
2. The queue processor is active
3. Rate limiting is not holding the job back

### High memory usage

The crawler loads pages into memory for parsing. For large crawls:
- Reduce `maxPages` per job
- Split the work into several smaller jobs and increase job concurrency instead
- Monitor with `/metrics`