feat(articles): browser-HTML bookmarklet + consent-wall detection + auto-save

Three intertwined improvements so the "save an article" flow actually works on real-world sites, not just bloggy happy-path URLs.

=== Consent-wall detection ===

apps/api/src/modules/articles/routes.ts: the /extract response now includes `warning: 'probable_consent_wall'` when the extracted text is both short (<300 words) AND contains cookie-dialog vocabulary (Cookies zustimmen / cookie consent / Zustimmung / accept all cookies / enable javascript / privacy center / Datenschutzeinstellungen). The server still returns whatever it got so the client can decide; it just flags it as probably-not-the-article.

Frontend surfaces that warning prominently instead of silently persisting a "Cookies zustimmen…" ("Accept cookies…") blob as the article body.

=== Browser-HTML extract path ===

Server-side: new POST /api/v1/articles/extract/html endpoint accepting { url, html }, running @mana/shared-rss's extractFromHtml on the caller-supplied HTML. 10 MiB payload cap. Same response shape as /extract, including the consent-wall warning (in case the bookmarklet fires before the user dismisses the dialog).

Client-side: new extractFromHtml() in api.ts with the same 25s timeout + typed network-error mapping as extractArticle.

AddUrlForm gains a postMessage handshake: when loaded with ?source=bookmarklet, it posts `mana-ready` to window.opener and listens one-shot for `mana-html` with { url, html, title } from the opener's tab. The HTML goes straight to our own /extract/html endpoint, which is same-origin and carries the user's auth cookie. No CORS, no form-submission CSP tango, no cross-origin token smuggling. If nothing arrives within 30s we surface a clear error instead of hanging.

Settings page adds a second "browser-HTML" bookmarklet (marked as "Empfohlen", i.e. "Recommended") alongside the legacy URL bookmarklet. New snippet opens /articles/add?source=bookmarklet in a new tab, waits for mana-ready, then postMessages the tab's documentElement.outerHTML over. 15s safety timeout.
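
The opener-side half of the handshake described above could look roughly like the sketch below. It is not the snippet shipped on the settings page: `sendHtmlWhenReady`, the `ListenerHost` / `PostTarget` interfaces, the `{ type: ... }` message envelope, and the origin check are illustrative assumptions. Only the `mana-ready` / `mana-html` names, the { url, html, title } payload, and the 15s timeout come from the description.

```typescript
// Assumed wire format: the commit does not show the actual message envelope.
type HtmlMessage = { type: 'mana-html'; url: string; html: string; title: string };

// Structural stand-ins for `window` and the opened tab, so the logic can be
// exercised without a browser.
interface MessageEventLike {
	origin: string;
	data: unknown;
}
interface ListenerHost {
	addEventListener(type: 'message', listener: (ev: MessageEventLike) => void): void;
	removeEventListener(type: 'message', listener: (ev: MessageEventLike) => void): void;
}
interface PostTarget {
	postMessage(message: HtmlMessage, targetOrigin: string): void;
}

function isReady(data: unknown): boolean {
	return (
		typeof data === 'object' &&
		data !== null &&
		(data as { type?: unknown }).type === 'mana-ready'
	);
}

// Waits for `mana-ready` from the Mana tab, then posts the page HTML once.
// Resolves true if the HTML was sent, false if the timeout elapsed first.
function sendHtmlWhenReady(
	host: ListenerHost,
	target: PostTarget,
	manaOrigin: string,
	page: { url: string; html: string; title: string },
	timeoutMs = 15_000
): Promise<boolean> {
	return new Promise((resolve) => {
		const onMessage = (ev: MessageEventLike) => {
			// Ignore messages from other origins and unrelated payloads.
			if (ev.origin !== manaOrigin || !isReady(ev.data)) return;
			cleanup();
			target.postMessage({ type: 'mana-html', ...page }, manaOrigin);
			resolve(true);
		};
		const timer = setTimeout(() => {
			cleanup();
			resolve(false);
		}, timeoutMs);
		const cleanup = () => {
			clearTimeout(timer);
			host.removeEventListener('message', onMessage);
		};
		host.addEventListener('message', onMessage);
	});
}
```

In a real bookmarklet `host` would presumably be `window` and `target` the tab returned by `window.open(...)`. Passing an explicit `targetOrigin` rather than `'*'` keeps the page HTML from being delivered to any other origin.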
The browser-HTML path bypasses cookie-consent walls and soft paywalls because the HTML already comes from the user's own authenticated, consented browser tab.

=== Auto-save after successful extract ===

Previously every save path had a two-click UX: preview → confirm. Now on clean extract the preview skips straight to persist + navigate to the reader. The consent-wall warning is the only fallback that pauses the flow: the user gets a "Trotzdem speichern" ("Save anyway") button to opt into saving a teaser anyway.

Button in the manual input row is renamed "Vorschau abrufen" ("Fetch preview") → "Speichern" ("Save") since it's now the commit action, not the inspect action. Loading-block messaging distinguishes "Server extrahiert…" ("Server is extracting…") vs "Speichere in deine Leseliste… Gleich weiter zum Reader." ("Saving to your reading list… On to the reader in a moment.")

Net click count:
  Bookmarklet v1/v2 on working site: 2 clicks → 1 click
  Manual paste:                      2 clicks → 1 click
  Consent-wall fallback:             2 clicks (explicit "Trotzdem", i.e. "anyway")
  Duplicate:                         2 clicks ("Zum gespeicherten Artikel", i.e. "Go to saved article")

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
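
The auto-save rules in the message reduce to a small decision over the extract result. The sketch below only illustrates that decision: `nextStep`, `clicksFor`, `ExtractOutcome`, and the `alreadySaved` flag are hypothetical names, and checking for duplicates before the consent-wall warning is an assumption, not something the commit states.

```typescript
type ExtractOutcome = {
	// Set by the server when the payload looks like a consent dialog.
	warning?: 'probable_consent_wall';
	// Hypothetical flag: true when the URL is already in the reading list.
	alreadySaved: boolean;
};

type NextStep = 'auto_save' | 'confirm_save_anyway' | 'offer_existing';

// Maps an extract result to what the form does next.
function nextStep(outcome: ExtractOutcome): NextStep {
	if (outcome.alreadySaved) return 'offer_existing'; // "Zum gespeicherten Artikel"
	if (outcome.warning === 'probable_consent_wall') {
		return 'confirm_save_anyway'; // "Trotzdem speichern"
	}
	return 'auto_save'; // clean extract: persist and navigate to the reader
}

// Click totals as listed in the commit message's "Net click count" table.
function clicksFor(step: NextStep): number {
	return step === 'auto_save' ? 1 : 2;
}
```

Keeping the decision in a pure function like this would let the one-click and two-click paths be unit-tested without mounting the form.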
2026-05-14 18:41:08 +02:00 · 2026-04-22 15:29:53 +02:00 · 2026-04-22 15:29:53 +02:00 · efe1810b04
commit efe1810b04
parent 86c205ffc5
4 changed files with 590 additions and 92 deletions
--- a/apps/api/src/modules/articles/routes.ts
+++ b/apps/api/src/modules/articles/routes.ts
@@ -1,35 +1,72 @@
 /**
 * Articles module — server-side URL extraction.
 *
- * Thin wrapper around `@mana/shared-rss`'s Readability pipeline. The
- * extracted payload is returned to the client which then encrypts +
- * stores it locally (and syncs via mana-sync). The server keeps no
- * per-user article state — all reading-list data lives in the unified
- * Mana app's IndexedDB.
+ * Two endpoints, both thin wrappers around `@mana/shared-rss`:
 *
- * One endpoint (`POST /extract`), not two. News has a `preview` + `save`
- * split for legacy reasons; here both UI paths (AddUrlForm preview + the
- * direct saveFromUrl path) use the same payload. The client caches the
- * response when the user confirms, avoiding a double server fetch.
+ *   POST /extract         ← server fetches the URL itself, then runs
+ *                           Readability on the HTML it got back. Works
+ *                           for simple sites but fails on anything behind
+ *                           a cookie-consent wall or a paywall — the
+ *                           server has no user session.
+ *   POST /extract/html    ← client already has the rendered HTML (from a
+ *                           browser bookmarklet running in the user's
+ *                           own tab with all their cookies applied).
+ *                           Server just runs Readability on that. This
+ *                           is how we bypass Golem / Spiegel / Zeit /
+ *                           Heise-style consent dialogs: use the user's
+ *                           already-consented session, not the server's
+ *                           anonymous fetch.
+ *
+ * Consent-wall heuristic: when /extract returns a suspiciously short
+ * payload that contains consent-dialog vocabulary we still hand the
+ * extracted text back but flag it with `warning: 'probable_consent_wall'`
+ * so the client can offer the bookmarklet-v2 path instead of pretending
+ * a 4-line "Cookies zustimmen" blob is the article.
 */

 import { Hono } from 'hono';
-import { extractFromUrl } from '@mana/shared-rss';
+import { extractFromUrl, extractFromHtml } from '@mana/shared-rss';

 const routes = new Hono();

+const CONSENT_KEYWORDS = [
+	'cookies zustimmen',
+	'cookie consent',
+	'zustimmung',
+	'accept all cookies',
+	'consent to the use',
+	'enable javascript',
+	'javascript is disabled',
+	'please enable',
+	'privacy center',
+	'datenschutzeinstellungen',
+	'datenschutzeinstellungen',
+];
+const CONSENT_WORDCOUNT_THRESHOLD = 300;
+
+function looksLikeConsentWall(content: string, wordCount: number): boolean {
+	if (wordCount >= CONSENT_WORDCOUNT_THRESHOLD) return false;
+	const haystack = content.toLowerCase();
+	return CONSENT_KEYWORDS.some((needle) => haystack.includes(needle));
+}
+
+function isValidHttpUrl(url: string): boolean {
+	try {
+		const u = new URL(url);
+		return u.protocol === 'http:' || u.protocol === 'https:';
+	} catch {
+		return false;
+	}
+}
+
+// POST /extract — server fetches the URL + extracts. Legacy path.
 routes.post('/extract', async (c) => {
 	const body = await c.req.json<{ url?: string }>().catch(() => ({}) as { url?: string });
 	const url = body.url;
 	if (!url || typeof url !== 'string') {
 		return c.json({ error: 'URL is required' }, 400);
 	}
-
-	// Minimal URL shape check — extractFromUrl will no-op on a bad URL but
-	// the caller deserves a clear 400 vs a generic 502.
-	try {
-		new URL(url);
-	} catch {
+	if (!isValidHttpUrl(url)) {
 		return c.json({ error: 'Invalid URL' }, 400);
 	}

@@ -38,6 +75,10 @@ routes.post('/extract', async (c) => {
 		return c.json({ error: 'Extraction failed' }, 502);
 	}

+	const warning = looksLikeConsentWall(extracted.content, extracted.wordCount)
+		? 'probable_consent_wall'
+		: undefined;
+
 	return c.json({
 		originalUrl: url,
 		title: extracted.title,
@@ -48,6 +89,59 @@ routes.post('/extract', async (c) => {
 		siteName: extracted.siteName,
 		wordCount: extracted.wordCount,
 		readingTimeMinutes: extracted.readingTimeMinutes,
+		...(warning && { warning }),
+	});
+});
+
+// POST /extract/html — client supplies HTML (from the user's browser
+// tab, where cookies + JS rendering already happened). We only run
+// Readability on it. Cap payload to 10 MiB so a pathological site
+// can't exhaust server memory via the bookmarklet — typical rendered
+// article HTML is 200-800 KB.
+const MAX_HTML_BYTES = 10 * 1024 * 1024;
+
+routes.post('/extract/html', async (c) => {
+	const body = await c.req
+		.json<{ url?: string; html?: string }>()
+		.catch(() => ({}) as { url?: string; html?: string });
+	const url = body.url;
+	const html = body.html;
+	if (!url || typeof url !== 'string') {
+		return c.json({ error: 'URL is required' }, 400);
+	}
+	if (!html || typeof html !== 'string') {
+		return c.json({ error: 'HTML is required' }, 400);
+	}
+	if (!isValidHttpUrl(url)) {
+		return c.json({ error: 'Invalid URL' }, 400);
+	}
+	if (html.length > MAX_HTML_BYTES) {
+		return c.json({ error: 'HTML payload too large' }, 413);
+	}
+
+	const extracted = await extractFromHtml(html, url);
+	if (!extracted) {
+		return c.json({ error: 'Extraction failed' }, 502);
+	}
+
+	// The consent-wall heuristic still applies here — a rare case is
+	// that the user bookmarklet-fires BEFORE the consent dialog is
+	// dismissed. Flag it so the client doesn't silently persist garbage.
+	const warning = looksLikeConsentWall(extracted.content, extracted.wordCount)
+		? 'probable_consent_wall'
+		: undefined;
+
+	return c.json({
+		originalUrl: url,
+		title: extracted.title,
+		excerpt: extracted.excerpt,
+		content: extracted.content,
+		htmlContent: extracted.htmlContent,
+		author: extracted.byline,
+		siteName: extracted.siteName,
+		wordCount: extracted.wordCount,
+		readingTimeMinutes: extracted.readingTimeMinutes,
+		...(warning && { warning }),
 	});
 });
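
A note on the size cap in the hunk above: `html.length` counts UTF-16 code units, not bytes, so HTML with many non-ASCII characters can exceed 10 MiB on the wire while still passing the check. The check also runs after `c.req.json()` has already parsed the body. If a byte-accurate cap is wanted, one option is sketched below; `utf8ByteLength` and `exceedsHtmlCap` are hypothetical helpers, not part of this commit.

```typescript
const MAX_HTML_BYTES = 10 * 1024 * 1024;

// Hypothetical helper: size of the string once encoded as UTF-8.
function utf8ByteLength(text: string): number {
	return new TextEncoder().encode(text).length;
}

// Hypothetical helper: true when the HTML is larger than the cap in bytes.
function exceedsHtmlCap(html: string, maxBytes: number = MAX_HTML_BYTES): boolean {
	// Cheap pre-check: every UTF-16 code unit encodes to at least one UTF-8
	// byte, so a string longer than the cap is always over it.
	if (html.length > maxBytes) return true;
	return utf8ByteLength(html) > maxBytes;
}
```

With this variant the handler's guard would read `if (exceedsHtmlCap(html))`. The pre-check avoids encoding obviously oversized payloads a second time.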