managarten/apps/mana/apps/web/src/lib/modules/articles/api.ts
Till JS efe1810b04 feat(articles): browser-HTML bookmarklet + consent-wall detection + auto-save
Three intertwined improvements so the "save an article" flow actually
works on real-world sites, not just bloggy happy-path URLs.

=== Consent-wall detection ===

apps/api/src/modules/articles/routes.ts: the /extract response now
includes `warning: 'probable_consent_wall'` when the extracted text
is both short (<300 words) AND contains cookie-dialog vocabulary
(Cookies zustimmen / cookie consent / Zustimmung / accept all cookies
/ enable javascript / privacy center / Datenschutzeinstellungen). The
server still returns whatever it got so the client can decide; it just
flags it as probably-not-the-article.

Frontend surfaces that warning prominently instead of silently
persisting a "Cookies zustimmen…" blob as the article body.
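The heuristic is simple enough to sketch. This is a hypothetical reconstruction of what routes.ts does; the real word list and the exact 300-word threshold live server-side and may differ in detail:

```typescript
// Hypothetical sketch of the consent-wall heuristic described above.
// A page is flagged only when it is BOTH short and full of
// cookie-dialog vocabulary — either signal alone is too noisy.
const CONSENT_VOCAB = [
  'cookies zustimmen',
  'cookie consent',
  'zustimmung',
  'accept all cookies',
  'enable javascript',
  'privacy center',
  'datenschutzeinstellungen',
];
const MIN_ARTICLE_WORDS = 300;

function probableConsentWall(text: string): boolean {
  const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount >= MIN_ARTICLE_WORDS) return false; // long enough to be a real article
  const lower = text.toLowerCase();
  return CONSENT_VOCAB.some((phrase) => lower.includes(phrase));
}
```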

=== Browser-HTML extract path ===

Server-side: new POST /api/v1/articles/extract/html endpoint accepting
{ url, html }, running @mana/shared-rss's extractFromHtml on the
caller-supplied HTML. 10 MiB payload cap. Same response shape as
/extract, including the consent-wall warning (in case the bookmarklet
fires before the user dismisses the dialog).
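One detail worth spelling out is the 10 MiB cap: it has to be a byte cap, not a string-length cap, or multi-byte UTF-8 pages slip past it. A minimal sketch of such a guard — the helper name and wiring are illustrative, not the actual routes.ts code:

```typescript
// Hypothetical size guard for the /extract/html payload. The real
// routes.ts enforces a 10 MiB cap at the framework level; this shows
// the byte-vs-character distinction that matters either way.
const HTML_BYTE_CAP = 10 * 1024 * 1024; // 10 MiB

function htmlWithinCap(html: string, cap: number = HTML_BYTE_CAP): boolean {
  // TextEncoder gives UTF-8 byte length; html.length would undercount
  // pages heavy on umlauts, CJK text, or emoji.
  return new TextEncoder().encode(html).length <= cap;
}
```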

Client-side: new extractFromHtml() in api.ts with the same 25s
timeout + typed network-error mapping as extractArticle.

AddUrlForm gains a postMessage handshake: when loaded with
?source=bookmarklet, it posts `mana-ready` to window.opener and
listens one-shot for `mana-html` with { url, html, title } from the
opener's tab. The HTML goes straight to our own /extract/html
endpoint — same-origin, carries the user's auth cookie. No CORS, no
form-submission CSP tango, no cross-origin token smuggling. If
nothing arrives within 30s we surface a clear error instead of
hanging.
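Since the message arrives from an arbitrary opener tab, AddUrlForm must validate the payload shape before trusting it. A sketch of that check — the interface name is made up here, and the real component additionally verifies the event origin, enforces the 30s timeout, and removes the listener after the first valid payload:

```typescript
// Hypothetical shape check for the `mana-html` message AddUrlForm
// waits for. postMessage data is untyped on arrival, so a type guard
// narrows it before anything touches url/html.
interface ManaHtmlMessage {
  type: 'mana-html';
  url: string;
  html: string;
  title?: string;
}

function isManaHtmlMessage(data: unknown): data is ManaHtmlMessage {
  if (typeof data !== 'object' || data === null) return false;
  const d = data as Record<string, unknown>;
  return (
    d.type === 'mana-html' &&
    typeof d.url === 'string' &&
    typeof d.html === 'string' &&
    (d.title === undefined || typeof d.title === 'string')
  );
}
```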

Settings page adds a second "browser-HTML" bookmarklet (marked
"Empfohlen", i.e. recommended) alongside the legacy URL bookmarklet.
The new snippet opens /articles/add?source=bookmarklet in a new tab,
waits for mana-ready, then postMessages the tab's
documentElement.outerHTML over. 15s safety timeout.
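Un-minified, the opener side of the handshake looks roughly like this. `appOrigin` is a placeholder — the real snippet on the settings page bakes in the deployment origin and ships as one minified `javascript:` URL:

```typescript
// Hypothetical, readable version of the browser-HTML bookmarklet.
// Builds the javascript: URL the settings page would offer for
// drag-to-bookmarks-bar; the generated code runs in the article tab.
function buildHtmlBookmarklet(appOrigin: string): string {
  const src = `(() => {
  const tab = window.open('${appOrigin}/articles/add?source=bookmarklet');
  // 15s safety timeout: if the mana tab never says hello, tell the user.
  const timer = setTimeout(() => alert('mana tab did not answer'), 15000);
  window.addEventListener('message', function onReady(e) {
    if (e.origin !== '${appOrigin}' || e.data !== 'mana-ready') return;
    clearTimeout(timer);
    window.removeEventListener('message', onReady);
    tab.postMessage({
      type: 'mana-html',
      url: location.href,
      html: document.documentElement.outerHTML,
      title: document.title,
    }, '${appOrigin}');
  });
})()`;
  return 'javascript:' + encodeURIComponent(src);
}
```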

This bypasses cookie-consent walls and soft paywalls because the
HTML already comes from the user's own authenticated, consented
browser tab.

=== Auto-save after successful extract ===

Previously every save path had a two-click UX: preview → confirm.
Now a clean extract skips the preview and goes straight to persist +
navigate to the reader. The consent-wall warning is the only case
that pauses the flow; the user gets a "Trotzdem speichern" ("save
anyway") button to opt into saving a teaser.
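The branch itself is a one-liner; sketched here with illustrative names (the real logic lives inline in AddUrlForm):

```typescript
// Sketch of the decision AddUrlForm takes after /extract returns.
// A clean result persists immediately; only the consent-wall warning
// pauses the flow behind an explicit opt-in.
type ExtractOutcome = { warning?: 'probable_consent_wall' };
type NextStep = 'auto_save' | 'show_consent_warning';

function nextStepAfterExtract(result: ExtractOutcome): NextStep {
  return result.warning === 'probable_consent_wall'
    ? 'show_consent_warning'
    : 'auto_save';
}
```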

The button in the manual input row is renamed from "Vorschau abrufen"
("fetch preview") to "Speichern" ("save"), since it is now the commit
action, not the inspect action. Loading-block messaging distinguishes
"Server extrahiert…" from "Speichere in deine Leseliste… Gleich
weiter zum Reader." (saving, then straight on to the reader).

Net click count:
  Bookmarklet v1/v2 on working site:  2 clicks → 1 click
  Manual paste:                       2 clicks → 1 click
  Consent-wall fallback:              2 clicks (explicit "Trotzdem")
  Duplicate:                          2 clicks ("Zum gespeicherten Artikel")

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 15:29:53 +02:00


/**
 * Articles API client — talks to apps/api `/api/v1/articles/*`.
 *
 * Two endpoints: `POST /extract` (server-side fetch + Readability)
 * and `POST /extract/html` (Readability over caller-supplied HTML).
 * Both the preview (AddUrlForm) and the direct save paths share the
 * same calls; the client chooses whether to show the result or
 * immediately persist it.
 *
 * Auth + base-URL handling mirrors news/api.ts — see that file for the
 * full rationale on why we read `getManaApiUrl()` and
 * `authStore.getValidToken()` instead of the cookie/env shortcuts.
 */
import { authStore } from '$lib/stores/auth.svelte';
import { getManaApiUrl } from '$lib/api/config';
async function authHeader(): Promise<Record<string, string>> {
  const token = await authStore.getValidToken();
  return token ? { Authorization: `Bearer ${token}` } : {};
}
export interface ExtractedArticle {
  originalUrl: string;
  title: string;
  excerpt: string | null;
  content: string;
  htmlContent: string;
  author: string | null;
  siteName: string | null;
  wordCount: number;
  readingTimeMinutes: number;
  /**
   * Server-side quality flag. Today only `'probable_consent_wall'` is
   * emitted: the extracted text was suspiciously short AND contained
   * consent-dialog vocabulary, which typically means the server's
   * anonymous fetch hit a GDPR interstitial instead of the article.
   * The client uses this to offer the bookmarklet-v2 (browser-HTML)
   * path without silently persisting garbage.
   */
  warning?: 'probable_consent_wall';
}
/**
* Hard client-side timeout for the extract roundtrip. The server's
* own Readability fetch has a 15s timeout + a few seconds of JSDOM
* parse overhead; anything past 25s on the wire is almost certainly a
* dead server or a stuck network path, not a slow article. Without
* this, AddUrlForm's loader just sat there forever when the API was
* unreachable — hence the bookmarklet-lands-on-loader bug.
*/
const EXTRACT_TIMEOUT_MS = 25_000;
export async function extractArticle(
  url: string,
  fetchImpl: typeof fetch = fetch
): Promise<ExtractedArticle> {
  let response: Response;
  try {
    response = await fetchImpl(`${getManaApiUrl()}/api/v1/articles/extract`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        ...(await authHeader()),
      },
      body: JSON.stringify({ url }),
      signal: AbortSignal.timeout(EXTRACT_TIMEOUT_MS),
    });
  } catch (err) {
    if (err instanceof DOMException && err.name === 'TimeoutError') {
      throw new Error(
        `Server antwortet nicht (nach ${EXTRACT_TIMEOUT_MS / 1000}s). Läuft apps/api?`
      );
    }
    if (err instanceof TypeError) {
      // Network-layer failure (connection refused, DNS, offline).
      throw new Error(
        `Server nicht erreichbar. Prüf dass apps/api läuft — pnpm run mana:dev startet beides.`
      );
    }
    throw err;
  }
  if (!response.ok) {
    const text = await response.text();
    throw new Error(`extractArticle failed: ${response.status} ${text}`);
  }
  return (await response.json()) as ExtractedArticle;
}
/**
* Extract from a HTML payload the browser already has. Used by the
* bookmarklet-v2 flow — the user's browser already dealt with the
* cookie-consent wall, so we skip the server-side fetch entirely.
*
* The HTML cap is 10 MiB on the server; the browser sends
* `document.documentElement.outerHTML` which for typical article
* pages is 200-800 KB, well under the limit.
*/
export async function extractFromHtml(
  url: string,
  html: string,
  fetchImpl: typeof fetch = fetch
): Promise<ExtractedArticle> {
  let response: Response;
  try {
    response = await fetchImpl(`${getManaApiUrl()}/api/v1/articles/extract/html`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        ...(await authHeader()),
      },
      body: JSON.stringify({ url, html }),
      signal: AbortSignal.timeout(EXTRACT_TIMEOUT_MS),
    });
  } catch (err) {
    if (err instanceof DOMException && err.name === 'TimeoutError') {
      throw new Error(
        `Server antwortet nicht (nach ${EXTRACT_TIMEOUT_MS / 1000}s). Läuft apps/api?`
      );
    }
    if (err instanceof TypeError) {
      // Network-layer failure (connection refused, DNS, offline).
      throw new Error(
        `Server nicht erreichbar. Prüf dass apps/api läuft — pnpm run mana:dev startet beides.`
      );
    }
    throw err;
  }
  if (!response.ok) {
    const text = await response.text();
    throw new Error(`extractFromHtml failed: ${response.status} ${text}`);
  }
  return (await response.json()) as ExtractedArticle;
}