Mirror of https://github.com/Memo-2023/mana-monorepo.git (synced 2026-05-21 15:26:43 +02:00)
Goroutine-based crawler replacing NestJS mana-crawler:
- goquery for HTML parsing (title, content, links, metadata)
- robots.txt checker with 24h cache
- Worker pool with configurable concurrency + rate limiting
- PostgreSQL for job/result storage
- Same API surface: POST/GET/DELETE /api/v1/crawl
11 MB binary, ~15 MB Docker image vs ~200 MB NestJS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mana-crawler (Go)
Go web crawler that replaces the NestJS mana-crawler. It uses a goroutine-based worker pool instead of BullMQ.
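To make the worker-pool idea concrete, here is a minimal sketch of how a goroutine pool with channels, a simple rate limit, and cancellation could fit together. It uses only the standard library. The names `Pool`, `Result`, `NewPool`, `Run`, and `Cancel` are hypothetical and not taken from the mana-crawler source, and the `fetch` callback stands in for the real page fetcher.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Result is what one worker reports for one URL.
type Result struct {
	URL  string
	Body string
	Err  error
}

// Pool is a hypothetical crawl job: a fixed set of workers plus a cancel hook.
type Pool struct {
	workers  int
	interval time.Duration // minimum gap between job starts; 0 disables limiting
	ctx      context.Context
	cancel   context.CancelFunc
}

func NewPool(workers int, interval time.Duration) *Pool {
	if workers < 1 {
		workers = 1
	}
	ctx, cancel := context.WithCancel(context.Background())
	return &Pool{workers: workers, interval: interval, ctx: ctx, cancel: cancel}
}

// Cancel stops handing out new URLs; fetches already in flight finish normally.
func (p *Pool) Cancel() { p.cancel() }

// Run fetches the URLs with the pool's workers and returns the results in
// completion order. fetch is an assumed stand-in for the real fetcher.
func (p *Pool) Run(urls []string, fetch func(url string) (string, error)) []Result {
	jobs := make(chan string)
	results := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < p.workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body, err := fetch(u)
				results <- Result{URL: u, Body: body, Err: err}
			}
		}()
	}

	// Producer: hands out one URL per tick (if limited) until done or cancelled.
	go func() {
		defer close(jobs)
		var tick <-chan time.Time
		if p.interval > 0 {
			t := time.NewTicker(p.interval)
			defer t.Stop()
			tick = t.C
		}
		for _, u := range urls {
			if p.ctx.Err() != nil {
				return
			}
			if tick != nil {
				select {
				case <-tick:
				case <-p.ctx.Done():
					return
				}
			}
			select {
			case jobs <- u:
			case <-p.ctx.Done():
				return
			}
		}
	}()

	// Close results once every worker has drained the jobs channel.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []Result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	p := NewPool(2, 10*time.Millisecond)
	fetch := func(u string) (string, error) { return "fetched " + u, nil }
	for _, r := range p.Run([]string{"https://example.com/a", "https://example.com/b"}, fetch) {
		fmt.Println(r.URL, r.Body, r.Err)
	}
}
```

In this sketch the channel plays the role of the queue, so there is no external broker to run. The trade-off is that queued work lives only in process memory, which is presumably why job and result state is kept in PostgreSQL.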
Architecture
- Language: Go 1.25
- HTML Parsing: goquery (jQuery-like selectors)
- Robots.txt: temoto/robotstxt with 24h cache
- Job Queue: Goroutine worker pool + channels (replaces BullMQ)
- Database: PostgreSQL (pgx v5)
- Port: 3023
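The 24h robots.txt cache from the list above can be illustrated with a small per-host TTL cache. This is only a sketch under assumptions: `TTLCache`, `NewTTLCache`, and `DefaultTTL` are hypothetical names, and the `load` callback stands in for whatever fetches and parses robots.txt (in the real service presumably via temoto/robotstxt). The example stores the raw file body as a string to stay within the standard library.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DefaultTTL mirrors the 24h cache lifetime listed in the architecture notes.
const DefaultTTL = 24 * time.Hour

type entry[V any] struct {
	value   V
	expires time.Time
}

// TTLCache keeps one value per host until its TTL runs out.
type TTLCache[V any] struct {
	mu      sync.Mutex
	ttl     time.Duration
	load    func(host string) (V, error) // assumed loader, e.g. fetch + parse robots.txt
	entries map[string]entry[V]
}

func NewTTLCache[V any](ttl time.Duration, load func(host string) (V, error)) *TTLCache[V] {
	return &TTLCache[V]{ttl: ttl, load: load, entries: make(map[string]entry[V])}
}

// Get returns the cached value for host, reloading it when missing or expired.
// The lock is held during load, which keeps the sketch short but serializes
// loads across hosts.
func (c *TTLCache[V]) Get(host string) (V, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	now := time.Now()
	if e, ok := c.entries[host]; ok && now.Before(e.expires) {
		return e.value, nil
	}
	v, err := c.load(host)
	if err != nil {
		var zero V
		return zero, err // failures are not cached, so the next call retries
	}
	c.entries[host] = entry[V]{value: v, expires: now.Add(c.ttl)}
	return v, nil
}

func main() {
	loads := 0
	cache := NewTTLCache(DefaultTTL, func(host string) (string, error) {
		loads++
		return "User-agent: *\nDisallow: /private\n", nil
	})
	cache.Get("example.com")
	cache.Get("example.com") // served from the cache, no second load
	fmt.Println("loads:", loads)
}
```

Keying by host means each site's robots.txt is fetched at most once per TTL window no matter how many of its pages are crawled.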
Endpoints
- POST /api/v1/crawl: Start crawl job
- GET /api/v1/crawl: List jobs
- GET /api/v1/crawl/{jobId}: Job status
- GET /api/v1/crawl/{jobId}/results: Paginated results
- DELETE /api/v1/crawl/{jobId}: Cancel job
- GET /health: Health check
- GET /metrics: Prometheus metrics
Commands
go run ./cmd/server # Dev
go build ./cmd/server # Build
go test ./... # Test