managarten/apps/api/src/modules/articles/consent-wall.ts
Till JS b297f68ee4 fix(articles, mana-ai): rollout-block hardening for sync_changes projections
Four cross-cutting fixes that make the bulk-import worker safe to run
under real production load. All four were called out as live-rollout
risks in the post-ship review of docs/plans/articles-bulk-import.md.

#1 — Same fieldMetaTime bug fixed in mana-ai
   The articles fix in 054b9e5be hoists the helper to its own file
   `apps/api/src/modules/articles/field-meta.ts`. The same naive
   `rowFM[k] >= localTime` LWW comparison existed in three more
   projections under services/mana-ai (missions-projection,
   snapshot-refresh, agents-projection). Once any F3 stamp lands
   beside a legacy-string stamp, the object side coerces to
   `'[object Object]'` and is compared lexicographically against the
   ISO string, so the result no longer tracks the timestamps and an
   older value can win.
   New `services/mana-ai/src/db/field-meta.ts` — same helper,
   deliberately duplicated (each service treats sync_changes as a
   read-only event log; sharing infra across services is out of
   scope here). All 61 mana-ai bun tests still pass.
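
   A minimal sketch of the kind of normalisation such a helper performs.
   The F3 stamp shape is not shown in this commit, so the `{ at: string }`
   object form and the `incomingWins` wrapper are hypothetical; only the
   name `fieldMetaTime` comes from the text above.

```typescript
// Hypothetical stamp shapes: the real F3 format is not shown in this commit.
type LegacyStamp = string; // ISO-8601 timestamp in UTC
type F3Stamp = { at: string }; // assumed object form; the field name `at` is illustrative
type FieldStamp = LegacyStamp | F3Stamp | undefined;

/** Normalise either stamp shape to an ISO string ('' when the stamp is missing). */
function fieldMetaTime(stamp: FieldStamp): string {
  if (typeof stamp === 'string') return stamp;
  if (stamp && typeof stamp.at === 'string') return stamp.at;
  return '';
}

/** LWW check on normalised times: does the incoming write win over the stored one? */
function incomingWins(stored: FieldStamp, incoming: FieldStamp): boolean {
  // Same-format UTC ISO strings sort lexicographically in time order.
  return fieldMetaTime(incoming) >= fieldMetaTime(stored);
}

const legacy: LegacyStamp = '2026-04-28T10:00:00.000Z';
const f3: F3Stamp = { at: '2026-04-28T09:00:00.000Z' }; // one hour older than `legacy`

// What a naive `f3 >= legacy` does at runtime: the object coerces to
// '[object Object]' and the comparison becomes purely lexicographic.
const naive = String(f3) >= legacy;

// With normalisation the older F3 stamp correctly loses.
const fixed = incomingWins(legacy, f3);
```

   Here `naive` lets the older stamp win because `'['` sorts after any
   digit, while `fixed` compares the actual timestamps.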

#2 — Stale 'extracting' items recycle
   If the worker dies mid-fetch (OOM, pod restart), items stay in
   state='extracting' forever and the job never completes. New sweep
   at the start of `processOneJob`: items whose lastAttemptAt is
   older than 5 minutes get bounced back to 'pending' so the next
   tick re-claims them. STALE_EXTRACTING_MS tuned for the 15s
   shared-rss fetch + JSDOM-parse worst case.
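
   The sweep can be sketched in memory as follows. This is an
   illustration under assumptions, not the worker's code: the item
   shape, the state set, the epoch-millisecond `lastAttemptAt`, and the
   `recycleStaleItems` name are hypothetical; only `STALE_EXTRACTING_MS`
   and the 5-minute window come from the text above.

```typescript
// Illustrative item model; the real table columns are not shown in this commit.
type ItemState = 'pending' | 'extracting' | 'done' | 'failed';

interface ImportItem {
  id: string;
  state: ItemState;
  lastAttemptAt: number; // epoch milliseconds (assumed representation)
}

const STALE_EXTRACTING_MS = 5 * 60 * 1000; // 5 minutes

/** Bounce stale 'extracting' items back to 'pending'; returns how many were recycled. */
function recycleStaleItems(items: ImportItem[], now: number): number {
  let recycled = 0;
  for (const item of items) {
    const age = now - item.lastAttemptAt;
    if (item.state === 'extracting' && age > STALE_EXTRACTING_MS) {
      item.state = 'pending'; // the next tick can re-claim it
      recycled += 1;
    }
  }
  return recycled;
}

const now = Date.parse('2026-04-29T00:00:00Z');
const items: ImportItem[] = [
  { id: 'a', state: 'extracting', lastAttemptAt: now - 6 * 60 * 1000 }, // stale
  { id: 'b', state: 'extracting', lastAttemptAt: now - 30 * 1000 }, // still in flight
  { id: 'c', state: 'done', lastAttemptAt: now - 60 * 60 * 1000 }, // old but finished
];
const recycledCount = recycleStaleItems(items, now);
```

   Only item `a` is recycled: `b` is inside the window and `c` is not in
   the 'extracting' state, so age alone never resets a finished item.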

#3 — Pickup-row GC
   Every 30 ticks (~once per minute) the worker hard-deletes
   articleExtractPickup rows older than 24h. Without this a stuck
   pickup-consumer (all tabs closed, Web-Lock mismatch) would let
   sync_changes accumulate without bound. Logs the row count when
   non-zero so we can spot stuck consumers in the wild.
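
   A rough in-memory sketch of the tick-gated GC described above. The
   row shape, `createPickupGc`, and the injected `log` callback are
   hypothetical stand-ins for the real table and logger; the 30-tick
   cadence and the 24h cutoff are taken from the text.

```typescript
const GC_EVERY_TICKS = 30; // "every 30 ticks"
const PICKUP_MAX_AGE_MS = 24 * 60 * 60 * 1000; // 24h

// Hypothetical row shape; the real articleExtractPickup columns are not shown here.
interface PickupRow {
  id: string;
  createdAt: number; // epoch milliseconds (assumed representation)
}

/** Returns a per-tick callback that hard-deletes old rows on every 30th call. */
function createPickupGc(rows: PickupRow[], log: (msg: string) => void) {
  let tick = 0;
  return function onTick(now: number): number {
    tick += 1;
    if (tick % GC_EVERY_TICKS !== 0) return 0; // not a GC tick

    const cutoff = now - PICKUP_MAX_AGE_MS;
    const keep = rows.filter((row) => row.createdAt >= cutoff);
    const deleted = rows.length - keep.length;

    // Replace the contents in place so callers holding `rows` see the result.
    rows.length = 0;
    rows.push(...keep);

    // Log only when something was removed, so a non-zero count stands out.
    if (deleted > 0) log(`pickup GC: deleted ${deleted} row(s)`);
    return deleted;
  };
}
```

   Gating on a tick counter keeps the delete off the hot path; at
   roughly one GC per minute, 30 ticks implies a tick about every two
   seconds.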

#4 — DRY consent-wall heuristic
   Identical CONSENT_KEYWORDS + threshold lived in routes.ts AND
   import-extractor.ts. Hoisted to
   `apps/api/src/modules/articles/consent-wall.ts`; both call sites
   now share one heuristic.

Plan: docs/plans/articles-bulk-import.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:53:39 +02:00


/**
 * Consent-wall heuristic shared by every server-side article-extract
 * path:
 *   - `/api/v1/articles/extract` and `/extract/html` (single-URL)
 *   - The bulk-import worker's `extractOneItem` (background)
 *
 * When the extracted text is suspiciously short AND contains GDPR /
 * cookie-consent vocabulary, the server's anonymous fetch most likely
 * hit a consent dialog instead of the article itself. The caller can
 * use the flag to nudge the user toward the browser-HTML bookmarklet
 * (which fetches with the user's existing session cookies) rather
 * than silently persisting the GDPR overlay text as the article body.
 */
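const CONSENT_KEYWORDS = [
  'cookies zustimmen',
  'cookie consent',
  'zustimmung',
  'accept all cookies',
  'consent to the use',
  'enable javascript',
  'javascript is disabled',
  'please enable',
  'privacy center',
  // Variant with a soft hyphen (U+00AD) between the compound halves,
  // written as an escape so the invisible character is visible in review.
  'datenschutz\u00ADeinstellungen',
  'datenschutzeinstellungen',
];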

/** Wordcount floor below which the heuristic is considered. Real
 *  articles are typically >300 words; consent dialogs are <50. */
const CONSENT_WORDCOUNT_THRESHOLD = 300;

export function looksLikeConsentWall(content: string, wordCount: number): boolean {
  if (wordCount >= CONSENT_WORDCOUNT_THRESHOLD) return false;
  const haystack = content.toLowerCase();
  return CONSENT_KEYWORDS.some((needle) => haystack.includes(needle));
}
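
A hedged sketch of how a call site might act on the flag, following the doc comment's suggestion to nudge the user toward the bookmarklet instead of persisting the overlay text. `classifyExtract`, `ExtractResult`, `ExtractOutcome`, and the hint wording are hypothetical; the predicate is passed in so the sketch stays self-contained, and `standIn` is only a simplified placeholder for `looksLikeConsentWall`.

```typescript
// Signature matching looksLikeConsentWall(content, wordCount).
type ConsentCheck = (content: string, wordCount: number) => boolean;

// Hypothetical shapes for an extract result and the caller's decision.
interface ExtractResult {
  content: string;
  wordCount: number;
}

type ExtractOutcome =
  | { kind: 'article'; content: string }
  | { kind: 'consent-wall'; hint: string };

/** Decide whether to persist the extract or to nudge the user instead. */
function classifyExtract(result: ExtractResult, looksLikeWall: ConsentCheck): ExtractOutcome {
  if (looksLikeWall(result.content, result.wordCount)) {
    // Do not persist the overlay text as the article body.
    return { kind: 'consent-wall', hint: 'Try the browser-HTML bookmarklet instead.' };
  }
  return { kind: 'article', content: result.content };
}

// Simplified placeholder predicate for the demo; the real one checks the full keyword list.
const standIn: ConsentCheck = (content, wordCount) =>
  wordCount < 300 && content.toLowerCase().includes('cookie consent');

const blocked = classifyExtract(
  { content: 'We value your privacy. Cookie consent required.', wordCount: 7 },
  standIn,
);
const passed = classifyExtract(
  { content: 'A short but genuine note about gardening.', wordCount: 7 },
  standIn,
);
```

Injecting the predicate keeps the decision logic testable without importing the module; in the repository both call sites would pass `looksLikeConsentWall` itself.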