managarten/apps/api/src/modules/articles/routes.ts
Till JS efe1810b04 feat(articles): browser-HTML bookmarklet + consent-wall detection + auto-save
Three intertwined improvements so the "save an article" flow actually
works on real-world sites, not just bloggy happy-path URLs.

=== Consent-wall detection ===

apps/api/src/modules/articles/routes.ts: the /extract response now
includes `warning: 'probable_consent_wall'` when the extracted text
is both short (<300 words) AND contains cookie-dialog vocabulary
(Cookies zustimmen / cookie consent / Zustimmung / accept all cookies
/ enable javascript / privacy center / Datenschutzeinstellungen). The
server still returns whatever it got so the client can decide; it just
flags it as probably-not-the-article.
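
To make the rule concrete, here is a minimal sketch of the heuristic on
two inputs. It mirrors looksLikeConsentWall from routes.ts but uses a
shortened keyword list, and countWords is a hypothetical helper that
exists only for this example (the real word count comes from
@mana/shared-rss).

```typescript
// Shortened keyword list for illustration; the full list lives in routes.ts.
const KEYWORDS = ['cookies zustimmen', 'cookie consent', 'accept all cookies'];
const WORDCOUNT_THRESHOLD = 300;

// Hypothetical helper: naive whitespace word count, only for this example.
function countWords(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

// Flag text that is both short and contains consent-dialog vocabulary.
function looksLikeConsentWall(content: string, wordCount: number): boolean {
  if (wordCount >= WORDCOUNT_THRESHOLD) return false;
  const haystack = content.toLowerCase();
  return KEYWORDS.some((needle) => haystack.includes(needle));
}

// A four-word dialog blob versus a 600-word article that merely
// mentions cookie consent.
const dialog = 'Cookies zustimmen um fortzufahren';
const longArticle = 'Ein Absatz über Cookie Consent Banner. '.repeat(100);

const dialogFlagged = looksLikeConsentWall(dialog, countWords(dialog));
const articleFlagged = looksLikeConsentWall(longArticle, countWords(longArticle));
```

The short dialog text is flagged, while the long article is not, even
though it contains a keyword: both conditions have to hold.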

Frontend surfaces that warning prominently instead of silently
persisting a "Cookies zustimmen…" blob as the article body.

=== Browser-HTML extract path ===

Server-side: new POST /api/v1/articles/extract/html endpoint accepting
{ url, html }, running @mana/shared-rss's extractFromHtml on the
caller-supplied HTML. 10 MiB payload cap. Same response shape as
/extract, including the consent-wall warning (in case the bookmarklet
fires before the user dismisses the dialog).

Client-side: new extractFromHtml() in api.ts with the same 25s
timeout + typed network-error mapping as extractArticle.
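
A rough sketch of what such a client call could look like, assuming
fetch with an AbortController for the timeout. The names
extractFromHtmlSketch, mapFetchError, ExtractError and ExtractResult are
hypothetical and not taken from api.ts; only the endpoint path, the
{ url, html } body and the 25s timeout come from this commit.

```typescript
// Hypothetical error and result types for this sketch.
type ExtractError =
  | { kind: 'timeout' }
  | { kind: 'network' }
  | { kind: 'http'; status: number };
type ExtractResult = { ok: true; data: unknown } | { ok: false; error: ExtractError };

// Map whatever fetch() threw to a typed error. A request cancelled via
// AbortController rejects with an error named 'AbortError'.
function mapFetchError(err: unknown): ExtractError {
  const name = (err as { name?: unknown } | null)?.name;
  return name === 'AbortError' ? { kind: 'timeout' } : { kind: 'network' };
}

async function extractFromHtmlSketch(
  url: string,
  html: string,
  timeoutMs = 25_000,
): Promise<ExtractResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch('/api/v1/articles/extract/html', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      credentials: 'same-origin', // send the user's auth cookie
      body: JSON.stringify({ url, html }),
      signal: controller.signal,
    });
    if (!res.ok) return { ok: false, error: { kind: 'http', status: res.status } };
    return { ok: true, data: await res.json() };
  } catch (err) {
    return { ok: false, error: mapFetchError(err) };
  } finally {
    clearTimeout(timer); // never leave the timer running
  }
}
```

Returning a tagged result instead of throwing lets the form show a
different message for a timeout than for an HTTP error.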

AddUrlForm gains a postMessage handshake: when loaded with
?source=bookmarklet, it posts `mana-ready` to window.opener and
listens one-shot for `mana-html` with { url, html, title } from the
opener's tab. The HTML goes straight to our own /extract/html
endpoint — same-origin, carries the user's auth cookie. No CORS, no
form-submission CSP tango, no cross-origin token smuggling. If
nothing arrives within 30s we surface a clear error instead of
hanging.
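
The receiving side of the handshake might look roughly like this. The
wire format is an assumption: the commit only names the messages
(mana-ready, mana-html) and the fields { url, html, title }, so the
`type` tag, isManaHtmlMessage and waitForBookmarkletHtml are
hypothetical.

```typescript
// Assumed wire format (not confirmed by the source): objects tagged with `type`.
interface ManaHtmlMessage {
  type: 'mana-html';
  url: string;
  html: string;
  title: string;
}

// Runtime check, since message data from another window is untrusted.
function isManaHtmlMessage(data: unknown): data is ManaHtmlMessage {
  if (typeof data !== 'object' || data === null) return false;
  const d = data as Record<string, unknown>;
  return (
    d.type === 'mana-html' &&
    typeof d.url === 'string' &&
    typeof d.html === 'string' &&
    typeof d.title === 'string'
  );
}

function waitForBookmarkletHtml(win: Window, timeoutMs = 30_000): Promise<ManaHtmlMessage> {
  return new Promise((resolve, reject) => {
    const opener = win.opener as Window | null;
    if (!opener) {
      reject(new Error('No opener window to receive HTML from'));
      return;
    }
    const onMessage = (ev: MessageEvent): void => {
      // Accept only a well-formed message from the tab that opened us.
      if (ev.source !== opener || !isManaHtmlMessage(ev.data)) return;
      cleanup();
      resolve(ev.data);
    };
    const timer = win.setTimeout(() => {
      cleanup();
      reject(new Error('No HTML received from the bookmarklet in time'));
    }, timeoutMs);
    function cleanup(): void {
      win.clearTimeout(timer);
      win.removeEventListener('message', onMessage);
    }
    win.addEventListener('message', onMessage);
    // The opener is an arbitrary article site, so its origin is unknown
    // here; the ready signal carries no sensitive data.
    opener.postMessage({ type: 'mana-ready' }, '*');
  });
}
```

The listener removes itself on the first valid message or on timeout,
which is what makes it one-shot.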

Settings page adds a second "browser-HTML" bookmarklet (marked as
"Empfohlen") alongside the legacy URL bookmarklet. New snippet opens
/articles/add?source=bookmarklet in a new tab, waits for mana-ready,
then postMessages the tab's documentElement.outerHTML over. 15s
safety timeout.
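
The sending side, written out as a readable function rather than a
minified javascript: URL, could be sketched as follows. PageLike,
buildPayload, sendPageToMana and the manaOrigin parameter are
hypothetical, and the `type`-tagged message objects are an assumed
wire format.

```typescript
// Hypothetical minimal view of the page, so the payload builder is easy to test.
interface PageLike {
  href: string;
  title: string;
  outerHTML: string;
}

// Assumed wire format: { type: 'mana-html', url, html, title }.
function buildPayload(page: PageLike) {
  return { type: 'mana-html' as const, url: page.href, html: page.outerHTML, title: page.title };
}

function sendPageToMana(win: Window, manaOrigin: string, timeoutMs = 15_000): void {
  const tab = win.open(`${manaOrigin}/articles/add?source=bookmarklet`, '_blank');
  if (!tab) return; // popup blocked, nothing to send to

  const onReady = (ev: MessageEvent): void => {
    const isReady = (ev.data as { type?: unknown } | null)?.type === 'mana-ready';
    // Only react to the tab we opened, on the origin we expect.
    if (ev.source !== tab || ev.origin !== manaOrigin || !isReady) return;
    stop();
    tab.postMessage(
      buildPayload({
        href: win.location.href,
        title: win.document.title,
        outerHTML: win.document.documentElement.outerHTML,
      }),
      manaOrigin, // deliver the page HTML to our own origin only
    );
  };
  // Safety timeout: stop listening if the tab never reports ready.
  const timer = win.setTimeout(stop, timeoutMs);
  function stop(): void {
    win.clearTimeout(timer);
    win.removeEventListener('message', onReady);
  }
  win.addEventListener('message', onReady);
}
```

Pinning the target origin on the outgoing postMessage keeps the page
HTML from being delivered to any origin other than the one expected.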

This bypasses cookie-consent walls and soft paywalls because the
HTML already comes from the user's own authenticated, consented
browser tab.

=== Auto-save after successful extract ===

Previously every save path had a two-click UX: preview → confirm.
Now on clean extract the preview skips straight to persist + navigate
to the reader. Consent-wall warning is the only fallback that pauses
the flow — the user gets a "Trotzdem speichern" button to opt into
saving a teaser anyway.
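
The branching described above reduces to a small decision function.
This is a sketch only: ExtractOutcome, NextStep and nextStep are
hypothetical names, and the real component logic may be structured
differently.

```typescript
// Only the field relevant to the decision; the real response has more.
interface ExtractOutcome {
  warning?: 'probable_consent_wall';
}

type NextStep = 'save_and_open_reader' | 'ask_save_anyway';

// Clean extracts are persisted immediately. A consent-wall warning
// pauses the flow until the user explicitly opts in.
function nextStep(outcome: ExtractOutcome, userConfirmed = false): NextStep {
  if (outcome.warning === 'probable_consent_wall' && !userConfirmed) {
    return 'ask_save_anyway';
  }
  return 'save_and_open_reader';
}
```

Clicking "Trotzdem speichern" corresponds to calling nextStep again with
userConfirmed set to true.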

Button in the manual input row is renamed "Vorschau abrufen" → "Speichern"
since it's now the commit action, not the inspect action. Loading-block
messaging distinguishes "Server extrahiert…" vs "Speichere in deine
Leseliste… Gleich weiter zum Reader."

Net click count:
  Bookmarklet v1/v2 on working site:  2 clicks → 1 click
  Manual paste:                       2 clicks → 1 click
  Consent-wall fallback:              2 clicks (explicit "Trotzdem")
  Duplicate:                          2 clicks ("Zum gespeicherten
                                      Artikel")

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:29:53 +02:00

/**
 * Articles module: server-side URL extraction.
 *
 * Two endpoints, both thin wrappers around `@mana/shared-rss`:
 *
 *   POST /extract       ← server fetches the URL itself, then runs
 *                         Readability on the HTML it got back. Works
 *                         for simple sites but fails on anything behind
 *                         a cookie-consent wall or a paywall, because
 *                         the server has no user session.
 *   POST /extract/html  ← client already has the rendered HTML (from a
 *                         browser bookmarklet running in the user's
 *                         own tab with all their cookies applied).
 *                         Server just runs Readability on that. This
 *                         is how we bypass Golem / Spiegel / Zeit /
 *                         Heise-style consent dialogs: use the user's
 *                         already-consented session, not the server's
 *                         anonymous fetch.
 *
 * Consent-wall heuristic: when /extract returns a suspiciously short
 * payload that contains consent-dialog vocabulary we still hand the
 * extracted text back but flag it with `warning: 'probable_consent_wall'`
 * so the client can offer the bookmarklet-v2 path instead of pretending
 * a 4-line "Cookies zustimmen" blob is the article.
 */
import { Hono } from 'hono';
import { extractFromUrl, extractFromHtml } from '@mana/shared-rss';

const routes = new Hono();

// Lowercase phrases that typically appear in cookie/consent dialogs.
const CONSENT_KEYWORDS = [
	'cookies zustimmen',
	'cookie consent',
	'zustimmung',
	'accept all cookies',
	'consent to the use',
	'enable javascript',
	'javascript is disabled',
	'please enable',
	'privacy center',
	'datenschutz\u00ADeinstellungen', // variant with a soft hyphen (U+00AD)
	'datenschutzeinstellungen',
];

const CONSENT_WORDCOUNT_THRESHOLD = 300;

// True when the text is both short and contains consent-dialog vocabulary.
function looksLikeConsentWall(content: string, wordCount: number): boolean {
	if (wordCount >= CONSENT_WORDCOUNT_THRESHOLD) return false;
	const haystack = content.toLowerCase();
	return CONSENT_KEYWORDS.some((needle) => haystack.includes(needle));
}

function isValidHttpUrl(url: string): boolean {
	try {
		const u = new URL(url);
		return u.protocol === 'http:' || u.protocol === 'https:';
	} catch {
		return false;
	}
}

// POST /extract: server fetches the URL + extracts. Legacy path.
routes.post('/extract', async (c) => {
	const body = await c.req.json<{ url?: string }>().catch(() => ({}) as { url?: string });
	const url = body.url;
	if (!url || typeof url !== 'string') {
		return c.json({ error: 'URL is required' }, 400);
	}
	if (!isValidHttpUrl(url)) {
		return c.json({ error: 'Invalid URL' }, 400);
	}

	const extracted = await extractFromUrl(url);
	if (!extracted) {
		return c.json({ error: 'Extraction failed' }, 502);
	}

	const warning = looksLikeConsentWall(extracted.content, extracted.wordCount)
		? 'probable_consent_wall'
		: undefined;

	return c.json({
		originalUrl: url,
		title: extracted.title,
		excerpt: extracted.excerpt,
		content: extracted.content,
		htmlContent: extracted.htmlContent,
		author: extracted.byline,
		siteName: extracted.siteName,
		wordCount: extracted.wordCount,
		readingTimeMinutes: extracted.readingTimeMinutes,
		...(warning && { warning }),
	});
});

// POST /extract/html: client supplies HTML (from the user's browser
// tab, where cookies + JS rendering already happened). We only run
// Readability on it. Cap payload to 10 MiB so a pathological site
// can't exhaust server memory via the bookmarklet; typical rendered
// article HTML is 200-800 KB.
const MAX_HTML_BYTES = 10 * 1024 * 1024;

routes.post('/extract/html', async (c) => {
	const body = await c.req
		.json<{ url?: string; html?: string }>()
		.catch(() => ({}) as { url?: string; html?: string });
	const url = body.url;
	const html = body.html;
	if (!url || typeof url !== 'string') {
		return c.json({ error: 'URL is required' }, 400);
	}
	if (!html || typeof html !== 'string') {
		return c.json({ error: 'HTML is required' }, 400);
	}
	if (!isValidHttpUrl(url)) {
		return c.json({ error: 'Invalid URL' }, 400);
	}
	// `html.length` counts UTF-16 code units, not bytes, so measure the
	// UTF-8 size explicitly. The cheap length check short-circuits first,
	// because the byte size is never smaller than the code-unit count.
	if (
		html.length > MAX_HTML_BYTES ||
		new TextEncoder().encode(html).length > MAX_HTML_BYTES
	) {
		return c.json({ error: 'HTML payload too large' }, 413);
	}

	const extracted = await extractFromHtml(html, url);
	if (!extracted) {
		return c.json({ error: 'Extraction failed' }, 502);
	}

	// The consent-wall heuristic still applies here: in rare cases the
	// user fires the bookmarklet BEFORE the consent dialog is
	// dismissed. Flag it so the client doesn't silently persist garbage.
	const warning = looksLikeConsentWall(extracted.content, extracted.wordCount)
		? 'probable_consent_wall'
		: undefined;

	return c.json({
		originalUrl: url,
		title: extracted.title,
		excerpt: extracted.excerpt,
		content: extracted.content,
		htmlContent: extracted.htmlContent,
		author: extracted.byline,
		siteName: extracted.siteName,
		wordCount: extracted.wordCount,
		readingTimeMinutes: extracted.readingTimeMinutes,
		...(warning && { warning }),
	});
});

export { routes as articlesRoutes };