news-ingester: marked as DEPRECATED (cutover to mana-news-pool)

CLAUDE.md rewritten: the service description had been misleading since
the 2026-05-17 cutover (it referred to container :3066, which no longer
runs, and said 'unified mana-api reads from the same table', where an
HTTP proxy now sits).

Clear drop conditions documented for the whole directory:
- mana-news-pool stable for 30 days (~2026-06-17)
- old news.curated_articles schema dropped
Until then, do not touch it; the source tree serves as a reference for
the last managarten-owned source list.
Till JS 2026-05-17 18:17:38 +02:00
parent 5c47de8dd2
commit 501055a76c

@@ -1,100 +1,37 @@
# news-ingester
# news-ingester — DEPRECATED 2026-05-17
Pulls public RSS/JSON feeds into `news.curated_articles` for the News Hub
module in the unified Mana app. The unified `mana-api` reads from the
same table to serve `GET /api/v1/news/feed`.
> **This service was replaced on 2026-05-17 by
> [`mana-news-pool`](https://git.mana.how/mana/mana) (platform port 3079,
> own DB `mana_news_pool`, schema `pool.curated_articles`).**
## Tech Stack
The `news-ingester:3066` container no longer runs.
`managarten/apps/api/src/modules/news/routes.ts` has been an HTTP proxy
to `MANA_NEWS_POOL_URL=http://mana-news-pool:3079` since commit `ad97c5362`.
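
To illustrate the proxy idea, here is a minimal sketch of how an incoming request URL could be mapped onto the pool service. The helper name `toUpstreamUrl` is hypothetical, and the sketch assumes the upstream accepts the same path and query string as the incoming request; the actual `routes.ts` may rewrite paths differently.

```typescript
// Hypothetical helper: maps an incoming API URL onto the base URL given in
// MANA_NEWS_POOL_URL. Assumption: path and query are forwarded unchanged.
function toUpstreamUrl(incomingUrl: string, poolBaseUrl: string): string {
  const incoming = new URL(incomingUrl);
  // Resolve path + query against the pool base; the original host is discarded.
  const upstream = new URL(incoming.pathname + incoming.search, poolBaseUrl);
  return upstream.toString();
}

// Example call using the feed route and the pool URL named in this file.
const target = toUpstreamUrl(
  "http://localhost:3000/api/v1/news/feed?topic=tech",
  "http://mana-news-pool:3079",
);
console.log(target);
```

A real handler would then pass `target` to `fetch` and return the upstream response; error handling and header forwarding are left out here.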
| Layer | Technology |
|-------|------------|
| Runtime | Bun |
| Framework | Hono (only for health/status/manual trigger) |
| Database | PostgreSQL + Drizzle ORM (schema `news` in `mana_platform`) |
| Parsing | `rss-parser` for RSS/Atom, `@mozilla/readability` + `jsdom` for full-text fallback |
The source list, ingest logic, and conventions now live in:
`mana/services/mana-news-pool/` (see `CLAUDE.md` there).
## Port: 3066
## What is still here, and why
## What it does
- **Source tree as reference**: `services/news-ingester/src/sources.ts`
is the source list as of 2026-05-16. Anyone who wants to check the drift
between old and new sources in retrospect will find the last
managarten-owned version here.
- **Dockerfile + package.json**: document the old pattern. Can be
dropped during sprint cleanup.
On startup and every `TICK_INTERVAL_MS` (default 15 min):
## Drop plan
1. For each source in `src/sources.ts`, fetch the feed (RSS or HN JSON).
2. Normalize items and dedupe by `sha256(originalUrl)` against the
`url_hash` unique index — re-runs are safe.
3. If the feed body has fewer than 200 words, fall back to Mozilla
Readability against the original URL to get the full article text.
4. Insert into `news.curated_articles` with topic + source slug from the
source definition. Topic classification is **static** (per-source);
we do not run any content classifier.
5. Prune rows older than 30 days at the end of each tick.
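
The dedupe step in the list above can be sketched as follows. This is an illustration only: the `FeedItem` shape and both function names are hypothetical, and an in-memory `Set` stands in for the `url_hash` unique index that the real service relies on in Postgres.

```typescript
import { createHash } from "node:crypto";

// Hypothetical item shape; the ingester's real types are not shown in this file.
interface FeedItem {
  originalUrl: string;
  title: string;
}

// Hash the URL identically on every run so a repeated URL maps to the same key.
function urlHash(originalUrl: string): string {
  return createHash("sha256").update(originalUrl).digest("hex");
}

// In-memory stand-in for the `url_hash` unique index: keeps the first item
// per hash and skips any item whose hash has been seen before.
function dedupe(items: FeedItem[], seen: Set<string>): FeedItem[] {
  const fresh: FeedItem[] = [];
  for (const item of items) {
    const hash = urlHash(item.originalUrl);
    if (seen.has(hash)) continue;
    seen.add(hash);
    fresh.push(item);
  }
  return fresh;
}
```

Because the key depends only on the URL, running the same feed through `dedupe` a second time with the same `seen` set yields no new items, which is the property that makes re-runs safe.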
This directory can be deleted entirely once:
## API
1. `mana-news-pool` has run stably for 30 days (~2026-06-17).
2. The old `mana_platform.news.curated_articles` schema has been dropped
(see memory `project_news_pool_old_schema_drop`).
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Healthcheck — returns 503 if Postgres unreachable |
| GET | `/status` | Last tick result (sources, counts, duration) |
| POST | `/ingest/run` | Trigger an ingest tick now (returns immediately) |
Until then: do not touch; it documents the cutover path.
No auth — service is internal-only behind the docker network.
## Cross-Refs
## Adding a source
1. Append to `SOURCES` in `src/sources.ts` with a stable `slug`, type
(`rss` or `hn`), URL, topic, and language.
2. Mirror the slug + name into the unified web app's onboarding picker
at `apps/mana/apps/web/src/lib/modules/news/sources-meta.ts` so users
can opt out of it. **Slugs must match** — user blocklists reference
them.
3. Restart container and `curl -X POST http://localhost:3066/ingest/run`
to populate immediately.
## Topics
The seven shipped topics are: `tech`, `wissenschaft`, `weltgeschehen`,
`wirtschaft`, `kultur`, `gesundheit`, `politik`. Adding a new topic
means updating the `Topic` union in `src/sources.ts` AND the matching
type in the unified web app's `news/types.ts`.
## Database
Schema: `news` in `mana_platform`. Single table `curated_articles`,
indexed on `(topic, published_at)`, `(language, published_at)`,
`source_slug`, and `ingested_at`.
`bun run db:push` pushes the schema. The schema is intentionally NOT
referenced from `apps/api`; `apps/api/src/modules/news/routes.ts`
queries the table via raw SQL to keep the API service free of a Drizzle
schema dependency on this service.
## Environment Variables
```env
PORT=3066
DATABASE_URL=postgresql://mana:devpassword@localhost:5432/mana_platform
TICK_INTERVAL_MS=900000 # 15 minutes
RUN_ON_STARTUP=true
```
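
As a rough illustration of how these variables might be read at startup, the sketch below parses them with fallbacks. The `IngesterConfig` type and `loadConfig` function are hypothetical; the port and tick defaults come from this file, while defaulting `RUN_ON_STARTUP` to `true` is an assumption based only on the example value above.

```typescript
// Hypothetical config shape for illustration; not taken from the service code.
interface IngesterConfig {
  port: number;
  databaseUrl: string | undefined;
  tickIntervalMs: number;
  runOnStartup: boolean;
}

function loadConfig(env: Record<string, string | undefined>): IngesterConfig {
  return {
    port: Number(env.PORT ?? 3066),
    // No fallback here: a missing DATABASE_URL should be handled by the caller.
    databaseUrl: env.DATABASE_URL,
    tickIntervalMs: Number(env.TICK_INTERVAL_MS ?? 900_000), // 15 minutes
    // Assumption: the flag defaults to true when unset.
    runOnStartup: (env.RUN_ON_STARTUP ?? "true") === "true",
  };
}
```

Under Bun or Node this would typically be called as `loadConfig(process.env)`.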
## Local Dev
```bash
cd services/news-ingester
bun install
bun run db:push # creates news.curated_articles
bun run dev # starts on :3066, ticks immediately
curl -X POST http://localhost:3066/ingest/run
curl http://localhost:3066/status | jq
```
## Privacy / Legal
Only public RSS feeds intended for syndication are ingested. The
`User-Agent` is `ManaNewsIngester/1.0 (+https://mana.how/news)` so site
owners can identify and contact us. Per-source rate limit is implicit
(15 min interval × ~30 items/source = ~2 req/min/source).
User reading behavior is **not** tracked here. Personalization happens
client-side in the unified Mana app's local IndexedDB; the ingester
only knows what was published, not what was read.
- `mana/services/mana-news-pool/CLAUDE.md`: new service
- `managarten/apps/api/src/modules/news/routes.ts`: proxy implementation
- `mana/docs/MICROSERVICES_KANDIDATEN.md`: Lift-B plan