mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-16 13:59:40 +02:00
NestJS-based web crawler service for structured content extraction. Features:
- Depth-controlled crawling with URL pattern filtering
- robots.txt compliance
- HTML/PDF/Markdown content extraction
- BullMQ job queue for async processing
- Redis caching layer
- Prometheus metrics
Mana Crawler Service
Web crawler microservice for systematic website crawling and content extraction.
Overview
- Port: 3023
- Technology: NestJS + BullMQ + Cheerio + PostgreSQL + Redis
- Purpose: Crawl websites, extract structured content, and process crawl jobs through a queue
Architecture
┌─────────────────────────────────────────────────────────────┐
│                  mana-crawler (Port 3023)                   │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Crawl API  │  │    Queue    │  │   Parser    │          │
│  │ Controller  │──│   Service   │──│   Service   │          │
│  └─────────────┘  │  (BullMQ)   │  │  (Cheerio)  │          │
│                   └─────────────┘  └─────────────┘          │
│                          │                │                 │
│                    ┌─────┴────────────────┴─────┐           │
│                    │      Storage Service       │           │
│                    │    (PostgreSQL + Redis)    │           │
│                    └────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘
Quick Start
Development
# 1. Start Redis and PostgreSQL (from monorepo root)
pnpm docker:up
# 2. Install dependencies
pnpm install
# 3. Push database schema
pnpm db:push
# 4. Start in development mode
pnpm dev
Production
pnpm build
pnpm start
API Endpoints
Crawl Jobs
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/crawl | Start a new crawl job |
| GET | /api/v1/crawl/:jobId | Get job status |
| GET | /api/v1/crawl/:jobId/results | Get crawl results (paginated) |
| DELETE | /api/v1/crawl/:jobId | Cancel a crawl job |
| POST | /api/v1/crawl/:jobId/pause | Pause a running job |
| POST | /api/v1/crawl/:jobId/resume | Resume a paused job |
Instant Extract
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/extract | Extract single page (proxy to mana-search) |
System
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /queue/dashboard | Bull Board dashboard |
Usage Examples
Start a Crawl Job
curl -X POST http://localhost:3023/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://docs.example.com",
    "config": {
      "maxDepth": 3,
      "maxPages": 500,
      "respectRobots": true,
      "rateLimit": 2,
      "includePatterns": ["/docs/*"],
      "excludePatterns": ["/api/*", "*.pdf"],
      "selectors": {
        "content": "article.main-content",
        "title": "h1.page-title"
      },
      "output": {
        "format": "markdown",
        "includeScreenshots": false
      }
    }
  }'
# Response:
# {
#   "jobId": "uuid",
#   "status": "pending",
#   "estimatedPages": 500,
#   "queuePosition": 3
# }
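The semantics of `includePatterns` and `excludePatterns` are not spelled out here. As a rough sketch, assuming they are glob-style patterns matched against the URL path, that `*` stands for any run of characters, and that an exclude match always wins, the filter could work along these lines. `globToRegExp` and `shouldCrawl` are hypothetical names, not part of the service's API.

```typescript
// Hypothetical sketch of glob-style URL filtering (not the service's actual code).
// Assumptions: patterns are matched against the URL path, "*" matches any run of
// characters, and excludePatterns take precedence over includePatterns.

function globToRegExp(pattern: string): RegExp {
  // Escape regex metacharacters in each literal piece, then join pieces with ".*".
  const parts = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"));
  return new RegExp("^" + parts.join(".*") + "$");
}

function shouldCrawl(url: string, include: string[], exclude: string[]): boolean {
  const path = new URL(url).pathname;
  const matches = (patterns: string[]): boolean =>
    patterns.some((p) => globToRegExp(p).test(path));
  if (matches(exclude)) return false;
  // An empty include list is treated as "allow everything" (an assumption).
  return include.length === 0 || matches(include);
}

const include = ["/docs/*"];
const exclude = ["/api/*", "*.pdf"];
console.log(shouldCrawl("https://docs.example.com/docs/intro", include, exclude));     // true
console.log(shouldCrawl("https://docs.example.com/docs/guide.pdf", include, exclude)); // false
console.log(shouldCrawl("https://docs.example.com/blog/post", include, exclude));      // false
```

Under this reading, `/docs/*` does not match the bare path `/docs`, so a start URL without a trailing path segment would need its own include pattern.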
Check Job Status
curl http://localhost:3023/api/v1/crawl/{jobId}
# Response:
# {
#   "jobId": "uuid",
#   "status": "running",
#   "progress": {
#     "discovered": 245,
#     "crawled": 127,
#     "failed": 3,
#     "queued": 115
#   },
#   "startedAt": "2024-01-29T12:00:00Z",
#   "averagePageTime": 450
# }
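The `progress` counters can be reduced to a single completion figure on the client side. The sketch below assumes that `discovered` counts every URL seen so far and that a page is finished once it is either crawled or failed, which matches the example numbers (127 + 3 + 115 = 245). The `CrawlProgress` type and `summarizeProgress` helper are illustrative names only.

```typescript
// Hypothetical client-side helper; field names follow the example response above.
interface CrawlProgress {
  discovered: number;
  crawled: number;
  failed: number;
  queued: number;
}

function summarizeProgress(p: CrawlProgress): { finished: number; percent: number } {
  // Assumption: a page counts as finished once it was crawled or has failed.
  const finished = p.crawled + p.failed;
  // Guard against division by zero before the first URL is discovered.
  const percent =
    p.discovered === 0 ? 0 : Math.round((finished / p.discovered) * 1000) / 10;
  return { finished, percent };
}

const summary = summarizeProgress({ discovered: 245, crawled: 127, failed: 3, queued: 115 });
console.log(summary); // { finished: 130, percent: 53.1 }
```

Because `discovered` keeps growing while the crawl runs, a percentage computed this way can temporarily go down between polls.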
Get Results
curl "http://localhost:3023/api/v1/crawl/{jobId}/results?page=1&limit=50"
# Response:
# {
#   "results": [...],
#   "pagination": {
#     "page": 1,
#     "limit": 50,
#     "total": 127
#   }
# }
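Once the first page has been fetched, the `pagination` object is enough to work out which further requests are needed. This is a minimal sketch, assuming pages are numbered from 1 and `total` counts results rather than pages. `remainingPageUrls` is a hypothetical helper name.

```typescript
// Hypothetical helper: derive the URLs of the pages that still need fetching.
// Assumptions: pages are 1-based and `total` is the number of results, not pages.
interface Pagination {
  page: number;
  limit: number;
  total: number;
}

function remainingPageUrls(baseUrl: string, jobId: string, pagination: Pagination): string[] {
  const lastPage = Math.ceil(pagination.total / pagination.limit);
  const urls: string[] = [];
  for (let page = pagination.page + 1; page <= lastPage; page++) {
    urls.push(
      `${baseUrl}/api/v1/crawl/${encodeURIComponent(jobId)}/results?page=${page}&limit=${pagination.limit}`
    );
  }
  return urls;
}

// With the example response (page 1, limit 50, total 127) two more pages remain.
const urls = remainingPageUrls("http://localhost:3023", "uuid", { page: 1, limit: 50, total: 127 });
console.log(urls.length); // 2
```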
Environment Variables
| Variable | Default | Description |
|---|---|---|
| PORT | 3023 | API port |
| DATABASE_URL | - | PostgreSQL connection URL |
| REDIS_HOST | localhost | Redis host |
| REDIS_PORT | 6379 | Redis port |
| CRAWLER_USER_AGENT | ManaCoreCrawler/1.0 | Crawler user agent |
| CRAWLER_DEFAULT_RATE_LIMIT | 2 | Default requests/second |
| CRAWLER_DEFAULT_MAX_DEPTH | 3 | Default max crawl depth |
| CRAWLER_DEFAULT_MAX_PAGES | 100 | Default max pages per job |
| CRAWLER_TIMEOUT | 30000 | Request timeout (ms) |
| MANA_SEARCH_URL | http://localhost:3021 | mana-search URL (for extract fallback) |
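To make the defaults concrete, here is one way the variables above could be read into a typed configuration object. The contents of the real `src/config/configuration.ts` are not shown in this README, so the `CrawlerConfig` shape and the `loadConfig` and `numberFrom` helpers are assumptions for illustration only.

```typescript
// Hypothetical config loader built from the table above (not the actual configuration.ts).
type Env = { [key: string]: string | undefined };

interface CrawlerConfig {
  port: number;
  databaseUrl: string | undefined; // no default; must be provided
  redis: { host: string; port: number };
  crawler: {
    userAgent: string;
    rateLimit: number; // requests per second
    maxDepth: number;
    maxPages: number;
    timeoutMs: number;
  };
  manaSearchUrl: string;
}

function numberFrom(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (isNaN(parsed)) throw new Error(`${key} must be numeric, got "${raw}"`);
  return parsed;
}

function loadConfig(env: Env): CrawlerConfig {
  return {
    port: numberFrom(env, "PORT", 3023),
    databaseUrl: env["DATABASE_URL"],
    redis: {
      host: env["REDIS_HOST"] || "localhost",
      port: numberFrom(env, "REDIS_PORT", 6379),
    },
    crawler: {
      userAgent: env["CRAWLER_USER_AGENT"] || "ManaCoreCrawler/1.0",
      rateLimit: numberFrom(env, "CRAWLER_DEFAULT_RATE_LIMIT", 2),
      maxDepth: numberFrom(env, "CRAWLER_DEFAULT_MAX_DEPTH", 3),
      maxPages: numberFrom(env, "CRAWLER_DEFAULT_MAX_PAGES", 100),
      timeoutMs: numberFrom(env, "CRAWLER_TIMEOUT", 30000),
    },
    manaSearchUrl: env["MANA_SEARCH_URL"] || "http://localhost:3021",
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.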
Development Commands
# Install dependencies
pnpm install
# Start development server
pnpm dev
# Build for production
pnpm build
# Start production server
pnpm start
# Type checking
pnpm type-check
# Linting
pnpm lint
# Database commands
pnpm db:push # Push schema to database
pnpm db:generate # Generate migrations
pnpm db:migrate # Run migrations
pnpm db:studio # Open Drizzle Studio
Database Schema
The crawler uses its own schema (crawler) in the shared ManaCore database:
- crawler.crawl_jobs - Crawl job configuration and status
- crawler.crawl_results - Individual page results
Queue System
Uses BullMQ with Redis for job processing:
- Queue Name: crawl
- Concurrency: Configurable (default: 5)
- Retry: 3 attempts with exponential backoff
- Dashboard: Available at /queue/dashboard
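In BullMQ, a retry policy like the one listed above is usually expressed through the `attempts` and `backoff` job options. The base delay is not stated in this README, so the 1000 ms below is an assumption. The `retryDelayMs` helper mirrors the doubling rule that BullMQ documents for its built-in exponential strategy (delay × 2^(attempts − 1)), and should be checked against the installed BullMQ version, since newer releases may add jitter.

```typescript
// Shape of the retry-related job options as a plain object (no BullMQ import needed here).
const retryOptions = {
  attempts: 3,
  backoff: { type: "exponential", delay: 1000 }, // 1000 ms base delay is an assumption
};

// Approximate wait before the next retry, after `attemptsMade` failed attempts.
function retryDelayMs(attemptsMade: number, baseDelayMs: number): number {
  return Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs);
}

// With 3 attempts in total there are at most 2 retries.
const delays: number[] = [];
for (let attemptsMade = 1; attemptsMade < retryOptions.attempts; attemptsMade++) {
  delays.push(retryDelayMs(attemptsMade, retryOptions.backoff.delay));
}
console.log(delays); // [ 1000, 2000 ]
```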
Robots.txt Compliance
The crawler respects robots.txt by default:
- Checks robots.txt before crawling each domain
- Caches robots.txt rules in Redis (24h TTL)
- Can be disabled per-job with respectRobots: false
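The check-and-cache flow described above can be sketched as follows. This is a simplified illustration, not the service's `robots/` module: it only reads the `User-agent: *` group, uses plain prefix matching without wildcards, and replaces Redis with an in-memory object and the HTTP fetch with a synchronous callback. All names (`parseRobots`, `isAllowed`, `cachedRules`) are hypothetical.

```typescript
// Simplified robots.txt handling: "*" group only, prefix rules, longest match wins.
interface RobotsRules {
  allow: string[];
  disallow: string[];
}

function parseRobots(text: string): RobotsRules {
  const rules: RobotsRules = { allow: [], disallow: [] };
  let applies = false; // does the current group apply to us ("*")?
  let inAgentLines = false; // consecutive User-agent lines share one group
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].replace(/#.*$/, "").trim();
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      if (!inAgentLines) applies = false; // a new group starts here
      if (value === "*") applies = true;
      inAgentLines = true;
      continue;
    }
    inAgentLines = false;
    if (!applies || value === "") continue; // an empty Disallow means "allow all"
    if (field === "disallow") rules.disallow.push(value);
    if (field === "allow") rules.allow.push(value);
  }
  return rules;
}

function isAllowed(path: string, rules: RobotsRules): boolean {
  const longest = (prefixes: string[]): number => {
    let best = -1;
    for (let i = 0; i < prefixes.length; i++) {
      if (path.indexOf(prefixes[i]) === 0 && prefixes[i].length > best) best = prefixes[i].length;
    }
    return best;
  };
  // The longest matching rule wins; ties and "no rule matched" resolve to allowed.
  return longest(rules.allow) >= longest(rules.disallow);
}

// In-memory stand-in for the Redis cache with the 24h TTL mentioned above.
const TTL_MS = 24 * 60 * 60 * 1000;
const robotsCache: { [host: string]: { rules: RobotsRules; expiresAt: number } } =
  Object.create(null);

function cachedRules(host: string, fetchText: () => string, now: number): RobotsRules {
  const hit = robotsCache[host];
  if (hit && hit.expiresAt > now) return hit.rules;
  const rules = parseRobots(fetchText());
  robotsCache[host] = { rules, expiresAt: now + TTL_MS };
  return rules;
}
```

A real crawler would also need to handle wildcards, agent-specific groups, and fetch failures, which this sketch leaves out.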
Rate Limiting
The crawler limits its own request rate so that it does not overload target sites:
- Per-domain rate limiting
- Configurable delay between requests
- Default: 2 requests/second/domain
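Per-domain limiting can be pictured as each domain having its own schedule of request slots. The sketch below is one possible approach, assuming a fixed minimum interval of 1000 / rate milliseconds between requests to the same hostname. It is not taken from the service's code, and `DomainRateLimiter` is a hypothetical name.

```typescript
// Hypothetical per-domain limiter: each hostname gets evenly spaced request slots.
class DomainRateLimiter {
  private intervalMs: number;
  private nextSlot: { [domain: string]: number } = Object.create(null);

  constructor(requestsPerSecond: number) {
    this.intervalMs = 1000 / requestsPerSecond;
  }

  // Reserves the next free slot for the URL's domain and returns how many
  // milliseconds the caller should wait before sending the request.
  reserve(url: string, now: number): number {
    const domain = new URL(url).hostname;
    const earliest = this.nextSlot[domain];
    const start = earliest === undefined ? now : Math.max(now, earliest);
    this.nextSlot[domain] = start + this.intervalMs;
    return start - now;
  }
}

const limiter = new DomainRateLimiter(2); // the documented default of 2 requests/second
console.log(limiter.reserve("https://docs.example.com/a", 0));  // 0
console.log(limiter.reserve("https://docs.example.com/b", 0));  // 500
console.log(limiter.reserve("https://other.example.com/a", 0)); // 0
```

Passing the current time in as `now` (for example `Date.now()`) keeps the limiter deterministic in tests. Requests to different domains never delay each other.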
Project Structure
services/mana-crawler/
├── src/
│ ├── main.ts # Application entry point
│ ├── app.module.ts # Root module
│ ├── config/
│ │ └── configuration.ts # App configuration
│ ├── db/
│ │ ├── schema/ # Drizzle schemas
│ │ ├── database.module.ts # Database provider
│ │ └── connection.ts # DB connection
│ ├── crawler/ # Crawl job management
│ │ ├── crawler.controller.ts
│ │ ├── crawler.service.ts
│ │ └── dto/
│ ├── queue/ # BullMQ queue processing
│ │ ├── queue.module.ts
│ │ └── processors/
│ ├── parser/ # HTML parsing (Cheerio)
│ ├── robots/ # robots.txt handling
│ ├── cache/ # Redis caching
│ ├── metrics/ # Prometheus metrics
│ └── health/ # Health check
├── drizzle.config.ts
├── package.json
├── tsconfig.json
└── Dockerfile
Integration with Other Services
mana-search
The crawler can use mana-search for single-page extraction as a fallback:
POST http://mana-search:3021/api/v1/extract
mana-api-gateway
The crawler can be exposed via the API gateway for monetization:
POST /v1/crawler/start → 5 Credits/Job + 1 Credit/100 pages
GET /v1/crawler/:id → 0 Credits
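As a worked example of the pricing above, the cost of a job can be estimated from its page count. How partial blocks of 100 pages are billed is not stated, so rounding up is an assumption here, and `crawlJobCredits` is a hypothetical helper rather than gateway code.

```typescript
// Hypothetical cost estimate: 5 credits per job plus 1 credit per 100 pages.
// Assumption: a partially filled block of 100 pages is billed as a full block.
function crawlJobCredits(pagesCrawled: number): number {
  const baseCredits = 5;
  const creditsPer100Pages = 1;
  return baseCredits + Math.ceil(pagesCrawled / 100) * creditsPer100Pages;
}

console.log(crawlJobCredits(127)); // 7
console.log(crawlJobCredits(500)); // 10
```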
Troubleshooting
Redis connection issues
# Check Redis
docker exec mana-redis redis-cli ping
# Check queue status
curl http://localhost:3023/queue/dashboard
Jobs stuck in pending
Check that:
- Redis is running
- The queue processor is active
- The job is not being held back by rate limiting
High memory usage
The crawler loads pages into memory for parsing. For large crawls:
- Reduce maxPages per job
- Increase job concurrency instead
- Monitor with /metrics