news-ingester
Pulls public RSS/JSON feeds into news.curated_articles for the News Hub
module in the unified Mana app. The unified mana-api reads from the
same table to serve GET /api/v1/news/feed.
Tech Stack
| Layer | Technology |
|---|---|
| Runtime | Bun |
| Framework | Hono (only for health/status/manual trigger) |
| Database | PostgreSQL + Drizzle ORM (schema news in mana_platform) |
| Parsing | rss-parser for RSS/Atom, @mozilla/readability + jsdom for full-text fallback |
Port: 3066
What it does
On startup and every TICK_INTERVAL_MS (default 15 min):
- For each source in src/sources.ts, fetch the feed (RSS or HN JSON).
- Normalize items and dedupe by sha256(originalUrl) against the url_hash unique index, so re-runs are safe.
- If the feed body has fewer than 200 words, fall back to Mozilla Readability against the original URL to get the full article text.
- Insert into news.curated_articles with topic + source slug from the source definition. Topic classification is static (per-source); we do not run any content classifier.
- Prune rows older than 30 days at the end of each tick.
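The per-item decisions in the list above can be sketched as a few pure functions. This is a minimal illustration, not the service's actual code: the function and constant names are hypothetical, counting words by whitespace is an assumption about how "200 words" is measured, and the in-memory Set only stands in for the url_hash unique index that the database enforces.

```typescript
import { createHash } from "node:crypto";

// Thresholds taken from the list above; the constant names are illustrative.
const MIN_FEED_WORDS = 200;
const RETENTION_DAYS = 30;

// Hex-encoded SHA-256 of the article URL, the value compared against url_hash.
function urlHash(originalUrl: string): string {
  return createHash("sha256").update(originalUrl).digest("hex");
}

// Whitespace-delimited word count (assumption: this is how "words" are counted).
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// True when the feed body is thin enough to warrant a full-text fetch.
function needsReadabilityFallback(feedBody: string): boolean {
  return countWords(feedBody) < MIN_FEED_WORDS;
}

// Rows older than this timestamp are candidates for pruning.
function retentionCutoff(now: Date): Date {
  return new Date(now.getTime() - RETENTION_DAYS * 24 * 60 * 60 * 1000);
}

// In-memory stand-in for the unique index: keeps the first item per URL hash.
function dedupeByUrl<T extends { originalUrl: string }>(items: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const item of items) {
    const hash = urlHash(item.originalUrl);
    if (seen.has(hash)) continue;
    seen.add(hash);
    unique.push(item);
  }
  return unique;
}
```

Hashing the URL rather than indexing it directly keeps the unique index at a fixed width regardless of URL length, which is one plausible reason for the url_hash column.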
API
| Method | Path | Description |
|---|---|---|
| GET | /health | Healthcheck; returns 503 if Postgres is unreachable |
| GET | /status | Last tick result (sources, counts, duration) |
| POST | /ingest/run | Trigger an ingest tick now (returns immediately) |
No auth — service is internal-only behind the docker network.
Adding a source
- Append to SOURCES in src/sources.ts with a stable slug, type (rss or hn), URL, topic, and language.
- Mirror the slug + name into the unified web app's onboarding picker at apps/mana/apps/web/src/lib/modules/news/sources-meta.ts so users can opt out of it. Slugs must match because user blocklists reference them.
- Restart the container and run curl -X POST http://localhost:3066/ingest/run to populate immediately.
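A source entry might look roughly like the sketch below. The field names (slug, name, type, url, topic, language), the example feed URL, and the findDuplicateSlugs helper are assumptions made for illustration; the real shape is whatever src/sources.ts defines.

```typescript
// The seven shipped topics, as listed in the Topics section.
type Topic =
  | "tech"
  | "wissenschaft"
  | "weltgeschehen"
  | "wirtschaft"
  | "kultur"
  | "gesundheit"
  | "politik";

// Hypothetical shape of one source definition.
interface Source {
  slug: string; // stable identifier; user blocklists reference it
  name: string; // display name mirrored into sources-meta.ts
  type: "rss" | "hn";
  url: string;
  topic: Topic; // static, per-source classification
  language: string; // assumed to be a short code such as "de" or "en"
}

const SOURCES: Source[] = [
  {
    slug: "example-tech-blog", // placeholder, not a real shipped source
    name: "Example Tech Blog",
    type: "rss",
    url: "https://example.com/feed.xml",
    topic: "tech",
    language: "en",
  },
];

// Returns every slug that appears more than once, so a collision can be
// caught before it silently merges two sources in user blocklists.
function findDuplicateSlugs(sources: Source[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const source of sources) {
    if (seen.has(source.slug)) dupes.add(source.slug);
    seen.add(source.slug);
  }
  return Array.from(dupes);
}
```

Running a check like findDuplicateSlugs(SOURCES) in a unit test or at startup is one cheap way to enforce the "stable, unique slug" requirement.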
Topics
The seven shipped topics are: tech, wissenschaft, weltgeschehen,
wirtschaft, kultur, gesundheit, politik. Adding a new topic
means updating the Topic union in src/sources.ts AND the matching
type in the unified web app's news/types.ts.
Database
Schema: news in mana_platform. Single table curated_articles,
indexed on (topic, published_at), (language, published_at),
source_slug, and ingested_at.
bun run db:push pushes the schema. The schema is intentionally NOT
referenced from apps/api — apps/api/src/modules/news/routes.ts
queries the table via raw SQL to keep the API service free of a Drizzle
schema dependency on this service.
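A raw-SQL read against the table, without importing any Drizzle schema, could be built along these lines. This is a sketch under stated assumptions: the filter names (topics, lang, since, limit, offset), the default and maximum page size, and the $1-style placeholders (as used by node-postgres-like drivers) are illustrative, and only the column names topic, language, and published_at come from the index list above.

```typescript
// Hypothetical filter shape for the feed endpoint.
interface FeedFilter {
  topics?: string[];
  lang?: string;
  since?: Date;
  limit?: number;
  offset?: number;
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

// Truncates to an integer and clamps to [min, max]; falls back on bad input.
function clampInt(value: number | undefined, fallback: number, min: number, max: number): number {
  if (value === undefined || !Number.isFinite(value)) return fallback;
  return Math.min(Math.max(Math.trunc(value), min), max);
}

// Builds a parameterized query; user input only ever lands in `values`.
function buildFeedQuery(filter: FeedFilter): SqlQuery {
  const where: string[] = [];
  const values: unknown[] = [];
  const param = (value: unknown): string => {
    values.push(value);
    return `$${values.length}`;
  };

  if (filter.topics && filter.topics.length > 0) {
    where.push(`topic = ANY(${param(filter.topics)}::text[])`);
  }
  if (filter.lang) where.push(`language = ${param(filter.lang)}`);
  if (filter.since) where.push(`published_at >= ${param(filter.since)}`);

  const limit = clampInt(filter.limit, 50, 1, 200); // assumed page-size bounds
  const offset = clampInt(filter.offset, 0, 0, Number.MAX_SAFE_INTEGER);

  const text = [
    "SELECT * FROM news.curated_articles",
    where.length > 0 ? `WHERE ${where.join(" AND ")}` : "",
    "ORDER BY published_at DESC",
    `LIMIT ${param(limit)} OFFSET ${param(offset)}`,
  ]
    .filter((part) => part !== "")
    .join(" ");

  return { text, values };
}
```

Filtering on topic or language while ordering by published_at lines up with the (topic, published_at) and (language, published_at) indexes, so a query of this shape should be able to use them.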
Environment Variables
PORT=3066
DATABASE_URL=postgresql://mana:devpassword@localhost:5432/mana_platform
TICK_INTERVAL_MS=900000 # 15 minutes
RUN_ON_STARTUP=true
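Reading these variables with the defaults shown above might look like the following sketch. The loadConfig and parsePositiveInt names are hypothetical, and two behaviors are assumptions rather than documented facts: that DATABASE_URL is required, and that RUN_ON_STARTUP is treated as enabled unless it is the literal string false.

```typescript
interface IngesterConfig {
  port: number;
  databaseUrl: string;
  tickIntervalMs: number;
  runOnStartup: boolean;
}

// Parses a positive integer, using the fallback when the variable is unset.
function parsePositiveInt(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

function loadConfig(env: Record<string, string | undefined>): IngesterConfig {
  const databaseUrl = env.DATABASE_URL;
  if (!databaseUrl) throw new Error("DATABASE_URL is required"); // assumption

  return {
    port: parsePositiveInt(env.PORT, 3066, "PORT"),
    databaseUrl,
    // 15 minutes, matching the documented default.
    tickIntervalMs: parsePositiveInt(env.TICK_INTERVAL_MS, 15 * 60 * 1000, "TICK_INTERVAL_MS"),
    // Assumption: only the literal "false" disables the startup tick.
    runOnStartup: env.RUN_ON_STARTUP !== "false",
  };
}
```

In the service this would be called once at boot, for example as loadConfig(process.env), so a malformed value fails fast instead of producing a NaN timer interval.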
Local Dev
cd services/news-ingester
bun install
bun run db:push # creates news.curated_articles
bun run dev # starts on :3066, ticks immediately
curl -X POST http://localhost:3066/ingest/run
curl http://localhost:3066/status | jq
Privacy / Legal
Only public RSS feeds intended for syndication are ingested. The
User-Agent is ManaNewsIngester/1.0 (+https://mana.how/news) so site
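owners can identify and contact us. There is no explicit per-source rate
limit; it is implied by the tick interval. In the worst case, assuming
every one of ~30 items per source needs a Readability fetch, that is
~30 requests per 15 min tick, or about 2 req/min/source.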
User reading behavior is not tracked here. Personalization happens client-side in the unified Mana app's local IndexedDB; the ingester only knows what was published, not what was read.