mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-14 18:41:08 +02:00
feat(articles): browser-HTML bookmarklet + consent-wall detection + auto-save
Three intertwined improvements so the "save an article" flow actually
works on real-world sites, not just bloggy happy-path URLs.
=== Consent-wall detection ===
apps/api/src/modules/articles/routes.ts: the /extract response now
includes `warning: 'probable_consent_wall'` when the extracted text
is both short (<300 words) AND contains cookie-dialog vocabulary
(Cookies zustimmen / cookie consent / Zustimmung / accept all cookies
/ enable javascript / privacy center / Datenschutzeinstellungen). The
server still returns whatever it got so the client can decide; it just
flags it as probably-not-the-article.
Frontend surfaces that warning prominently instead of silently
persisting a "Cookies zustimmen…" blob as the article body.
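Condensed, the server-side check from routes.ts (keyword list abbreviated here) amounts to:

```typescript
// Abbreviated version of the routes.ts heuristic; the full keyword list
// is longer and includes more German/English consent-dialog vocabulary.
const CONSENT_KEYWORDS = ['cookies zustimmen', 'cookie consent', 'accept all cookies'];
const CONSENT_WORDCOUNT_THRESHOLD = 300;

function looksLikeConsentWall(content: string, wordCount: number): boolean {
  // A long text is assumed to be a real article even if it mentions cookies.
  if (wordCount >= CONSENT_WORDCOUNT_THRESHOLD) return false;
  const haystack = content.toLowerCase();
  return CONSENT_KEYWORDS.some((needle) => haystack.includes(needle));
}
```

Both conditions must hold, so a 2000-word article that merely discusses cookie consent is never flagged; only a short blob containing the vocabulary is.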
=== Browser-HTML extract path ===
Server-side: new POST /api/v1/articles/extract/html endpoint accepting
{ url, html }, running @mana/shared-rss's extractFromHtml on the
caller-supplied HTML. 10 MiB payload cap. Same response shape as
/extract, including the consent-wall warning (in case the bookmarklet
fires before the user dismisses the dialog).
Client-side: new extractFromHtml() in api.ts with the same 25s
timeout + typed network-error mapping as extractArticle.
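A sketch of what that client helper plausibly looks like; the NetworkError shape and the mapFetchError helper are assumptions for illustration, not the actual api.ts code:

```typescript
// Hypothetical NetworkError shape; the real typed mapping lives in api.ts.
type NetworkError = { kind: 'timeout' | 'network' | 'http'; status?: number };

function mapFetchError(err: unknown): NetworkError {
  // Our own AbortController firing surfaces as an AbortError from fetch().
  if (err instanceof Error && err.name === 'AbortError') return { kind: 'timeout' };
  return { kind: 'network' };
}

// POSTs caller-supplied HTML to our own endpoint; same-origin, so the
// auth cookie rides along automatically.
async function extractFromHtml(url: string, html: string): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 25_000); // same 25 s budget as extractArticle
  try {
    const res = await fetch('/api/v1/articles/extract/html', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ url, html }),
      signal: controller.signal,
    });
    // Non-2xx: throw a plain NetworkError object (not an Error instance).
    if (!res.ok) throw { kind: 'http', status: res.status } as NetworkError;
    return await res.json();
  } catch (err) {
    // Map real fetch failures; pass through the NetworkError we threw ourselves.
    throw err instanceof Error ? mapFetchError(err) : err;
  } finally {
    clearTimeout(timer);
  }
}
```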
AddUrlForm gains a postMessage handshake: when loaded with
?source=bookmarklet, it posts `mana-ready` to window.opener and
listens one-shot for `mana-html` with { url, html, title } from the
opener's tab. The HTML goes straight to our own /extract/html
endpoint — same-origin, carries the user's auth cookie. No CORS, no
form-submission CSP tango, no cross-origin token smuggling. If
nothing arrives within 30s we surface a clear error instead of
hanging.
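The receiving (AddUrlForm) side of that handshake, sketched; the guard treats any message carrying string url and html fields as the payload, which is an assumption, and startBookmarkletHandshake / isManaHtmlMessage are illustrative names:

```typescript
// Contract from the commit: the opener posts { url, html, title } after
// seeing 'mana-ready'.
type ManaHtmlMessage = { url: string; html: string; title?: string };

function isManaHtmlMessage(data: unknown): data is ManaHtmlMessage {
  const d = data as Partial<ManaHtmlMessage> | null;
  return !!d && typeof d.url === 'string' && typeof d.html === 'string';
}

// Illustrative wiring for the ?source=bookmarklet case.
function startBookmarkletHandshake(
  onHtml: (msg: ManaHtmlMessage) => void,
  onTimeout: () => void,
): void {
  const win = globalThis as any; // the browser window in the real component
  win.opener?.postMessage('mana-ready', '*'); // production code should pin the opener's origin
  const timer = setTimeout(onTimeout, 30_000); // surface a clear error instead of hanging
  const onMessage = (ev: { data: unknown }) => {
    if (!isManaHtmlMessage(ev.data)) return; // ignore unrelated postMessages
    clearTimeout(timer);
    win.removeEventListener('message', onMessage); // one-shot listener
    onHtml(ev.data); // → POST to /api/v1/articles/extract/html
  };
  win.addEventListener('message', onMessage);
}
```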
Settings page adds a second "browser-HTML" bookmarklet (marked as
"Empfohlen", i.e. recommended) alongside the legacy URL bookmarklet.
The new snippet opens
"Empfohlen") alongside the legacy URL bookmarklet. The new snippet opens
/articles/add?source=bookmarklet in a new tab, waits for mana-ready,
then postMessages the tab's documentElement.outerHTML over. 15s
safety timeout.
This bypasses cookie-consent walls and soft paywalls because the
HTML already comes from the user's own authenticated, consented
browser tab.
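The generated snippet could look roughly like this; buildBookmarklet is an illustrative name, and the real Settings-page generator may differ in detail:

```typescript
// Illustrative generator for the browser-HTML bookmarklet. appOrigin
// would be the Mana app's origin (an assumption, not the real code).
function buildBookmarklet(appOrigin: string): string {
  const origin = JSON.stringify(appOrigin); // quoted string literal for the snippet
  const src = `(function () {
  var w = window.open(${origin} + '/articles/add?source=bookmarklet');
  var timer = setTimeout(function () { if (w) w.close(); }, 15000); // 15 s safety timeout
  window.addEventListener('message', function onReady(ev) {
    if (ev.source !== w || ev.data !== 'mana-ready') return; // wait for the app tab
    clearTimeout(timer);
    window.removeEventListener('message', onReady);
    w.postMessage({
      url: location.href,
      html: document.documentElement.outerHTML,
      title: document.title
    }, ${origin});
  });
})()`;
  return 'javascript:' + encodeURIComponent(src);
}
```

Because the snippet runs in the user's own tab, the outerHTML it ships already reflects the consented, logged-in page state.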
=== Auto-save after successful extract ===
Previously every save path had a two-click UX: preview → confirm.
Now on a clean extract the preview skips straight to persist + navigate
to the reader. The consent-wall warning is the only fallback that pauses
the flow — the user gets a "Trotzdem speichern" (save anyway) button to
opt into saving the teaser regardless.
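Reduced to one decision (function and state names hypothetical, per the /extract contract above):

```typescript
// Response shape per the /extract contract; the NextStep values are
// hypothetical UI states, not actual identifiers.
type ExtractResponse = { title: string; warning?: 'probable_consent_wall' };
type NextStep = 'auto_save' | 'pause_for_trotzdem_speichern';

function nextStepAfterExtract(resp: ExtractResponse): NextStep {
  // Clean extract → persist + navigate immediately; only the consent-wall
  // warning pauses for an explicit "Trotzdem speichern" click.
  return resp.warning === 'probable_consent_wall'
    ? 'pause_for_trotzdem_speichern'
    : 'auto_save';
}
```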
The button in the manual input row is renamed from "Vorschau abrufen"
(fetch preview) to "Speichern" (save), since it is now the commit
action, not the inspect action. Loading-state messaging distinguishes
"Server extrahiert…" (server extracting) from "Speichere in deine
Leseliste… Gleich weiter zum Reader." (saving to your reading list,
then straight on to the reader).
Net click count:
  Bookmarklet v1/v2 on a working site: 2 clicks → 1 click
  Manual paste: 2 clicks → 1 click
  Consent-wall fallback: 2 clicks (explicit "Trotzdem", i.e. anyway)
  Duplicate: 2 clicks ("Zum gespeicherten Artikel", i.e. go to the saved article)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 86c205ffc5
commit efe1810b04
4 changed files with 590 additions and 92 deletions
apps/api/src/modules/articles/routes.ts
@@ -1,35 +1,72 @@
 /**
  * Articles module — server-side URL extraction.
  *
- * Thin wrapper around `@mana/shared-rss`'s Readability pipeline. The
- * extracted payload is returned to the client which then encrypts +
- * stores it locally (and syncs via mana-sync). The server keeps no
- * per-user article state — all reading-list data lives in the unified
- * Mana app's IndexedDB.
- *
- * One endpoint (`POST /extract`), not two. News has a `preview` + `save`
- * split for legacy reasons; here both UI paths (AddUrlForm preview + the
- * direct saveFromUrl path) use the same payload. The client caches the
- * response when the user confirms, avoiding a double server fetch.
+ * Two endpoints, both thin wrappers around `@mana/shared-rss`:
+ *
+ * POST /extract      ← server fetches the URL itself, then runs
+ *                      Readability on the HTML it got back. Works
+ *                      for simple sites but fails on anything behind
+ *                      a cookie-consent wall or a paywall — the
+ *                      server has no user session.
+ * POST /extract/html ← client already has the rendered HTML (from a
+ *                      browser bookmarklet running in the user's
+ *                      own tab with all their cookies applied).
+ *                      Server just runs Readability on that. This
+ *                      is how we bypass Golem / Spiegel / Zeit /
+ *                      Heise-style consent dialogs: use the user's
+ *                      already-consented session, not the server's
+ *                      anonymous fetch.
+ *
+ * Consent-wall heuristic: when /extract returns a suspiciously short
+ * payload that contains consent-dialog vocabulary we still hand the
+ * extracted text back but flag it with `warning: 'probable_consent_wall'`
+ * so the client can offer the bookmarklet-v2 path instead of pretending
+ * a 4-line "Cookies zustimmen" blob is the article.
  */

 import { Hono } from 'hono';
-import { extractFromUrl } from '@mana/shared-rss';
+import { extractFromUrl, extractFromHtml } from '@mana/shared-rss';

 const routes = new Hono();

+const CONSENT_KEYWORDS = [
+  'cookies zustimmen',
+  'cookie consent',
+  'zustimmung',
+  'accept all cookies',
+  'consent to the use',
+  'enable javascript',
+  'javascript is disabled',
+  'please enable',
+  'privacy center',
+  'datenschutzeinstellungen',
+  'datenschutzeinstellungen',
+];
+const CONSENT_WORDCOUNT_THRESHOLD = 300;
+
+function looksLikeConsentWall(content: string, wordCount: number): boolean {
+  if (wordCount >= CONSENT_WORDCOUNT_THRESHOLD) return false;
+  const haystack = content.toLowerCase();
+  return CONSENT_KEYWORDS.some((needle) => haystack.includes(needle));
+}
+
+function isValidHttpUrl(url: string): boolean {
+  try {
+    const u = new URL(url);
+    return u.protocol === 'http:' || u.protocol === 'https:';
+  } catch {
+    return false;
+  }
+}
+
+// POST /extract — server fetches the URL + extracts. Legacy path.
 routes.post('/extract', async (c) => {
   const body = await c.req.json<{ url?: string }>().catch(() => ({}) as { url?: string });
   const url = body.url;
   if (!url || typeof url !== 'string') {
     return c.json({ error: 'URL is required' }, 400);
   }

   // Minimal URL shape check — extractFromUrl will no-op on a bad URL but
   // the caller deserves a clear 400 vs a generic 502.
-  try {
-    new URL(url);
-  } catch {
+  if (!isValidHttpUrl(url)) {
     return c.json({ error: 'Invalid URL' }, 400);
   }

@@ -38,6 +75,10 @@ routes.post('/extract', async (c) => {
     return c.json({ error: 'Extraction failed' }, 502);
   }

+  const warning = looksLikeConsentWall(extracted.content, extracted.wordCount)
+    ? 'probable_consent_wall'
+    : undefined;
+
   return c.json({
     originalUrl: url,
     title: extracted.title,

@@ -48,6 +89,59 @@ routes.post('/extract', async (c) => {
     siteName: extracted.siteName,
     wordCount: extracted.wordCount,
     readingTimeMinutes: extracted.readingTimeMinutes,
+    ...(warning && { warning }),
   });
 });
+
+// POST /extract/html — client supplies HTML (from the user's browser
+// tab, where cookies + JS rendering already happened). We only run
+// Readability on it. Cap payload to 10 MiB so a pathological site
+// can't exhaust server memory via the bookmarklet — typical rendered
+// article HTML is 200-800 KB.
+const MAX_HTML_BYTES = 10 * 1024 * 1024;
+
+routes.post('/extract/html', async (c) => {
+  const body = await c.req
+    .json<{ url?: string; html?: string }>()
+    .catch(() => ({}) as { url?: string; html?: string });
+  const url = body.url;
+  const html = body.html;
+  if (!url || typeof url !== 'string') {
+    return c.json({ error: 'URL is required' }, 400);
+  }
+  if (!html || typeof html !== 'string') {
+    return c.json({ error: 'HTML is required' }, 400);
+  }
+  if (!isValidHttpUrl(url)) {
+    return c.json({ error: 'Invalid URL' }, 400);
+  }
+  if (html.length > MAX_HTML_BYTES) {
+    return c.json({ error: 'HTML payload too large' }, 413);
+  }
+
+  const extracted = await extractFromHtml(html, url);
+  if (!extracted) {
+    return c.json({ error: 'Extraction failed' }, 502);
+  }
+
+  // The consent-wall heuristic still applies here — a rare case is
+  // that the user bookmarklet-fires BEFORE the consent dialog is
+  // dismissed. Flag it so the client doesn't silently persist garbage.
+  const warning = looksLikeConsentWall(extracted.content, extracted.wordCount)
+    ? 'probable_consent_wall'
+    : undefined;
+
+  return c.json({
+    originalUrl: url,
+    title: extracted.title,
+    excerpt: extracted.excerpt,
+    content: extracted.content,
+    htmlContent: extracted.htmlContent,
+    author: extracted.byline,
+    siteName: extracted.siteName,
+    wordCount: extracted.wordCount,
+    readingTimeMinutes: extracted.readingTimeMinutes,
+    ...(warning && { warning }),
+  });
+});