readex — main-content extraction for arbitrary HTML.
readex takes a &str of HTML plus an optional base URL and returns the
page’s main textual content together with a little metadata. It performs
no network I/O, no JavaScript rendering, and no encoding
detection — the caller owns all of that (parent brief
2026.05.16 - BRIEF - Rust Content Extraction Library.md, “What is
explicitly OUT of scope”). The crate is pure, synchronous, std-only
string/DOM work; a caller that needs it off the async hot path wraps it in
spawn_blocking.
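As a rough illustration of keeping the synchronous call off a latency-sensitive path, the sketch below uses a plain `std::thread` so that it runs without an async runtime; in an async program the runtime's own `spawn_blocking` would take the place of `thread::spawn`. The `Extracted` struct, the `String` error type, the body of `extract`, and the helper `extract_off_thread` are all hypothetical stand-ins, not the crate's real definitions.

```rust
use std::thread;

/// Hypothetical stand-in for the crate's result type; the real `Extracted`
/// carries more fields than this.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Extracted {
    pub text: String,
}

/// Hypothetical stand-in for the crate's `extract`: pure and synchronous.
/// It only trims the input so that the sketch runs on its own.
pub fn extract(html: &str, _base_url: Option<&str>) -> Result<Extracted, String> {
    Ok(Extracted { text: html.trim().to_string() })
}

/// Run the CPU-bound extraction on a separate OS thread and wait for it.
/// Owned `String`s are moved in because the spawned closure must be `'static`.
pub fn extract_off_thread(
    html: String,
    base_url: Option<String>,
) -> Result<Extracted, String> {
    let handle = thread::spawn(move || extract(&html, base_url.as_deref()));
    // A panic inside the worker surfaces as an `Err` from `join`.
    handle
        .join()
        .map_err(|_| "extraction thread panicked".to_string())?
}
```

Because the function performs no I/O, the only thing the wrapper has to manage is ownership of the input strings.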
§Milestone status
M3 Stage 9 (HLD 2026.05.19 - HLD - mdrcel Trafilatura Port (M3) §7.6,
THE M3 FINALE): the public extract / extract_with functions now
drive the full Trafilatura cascade (core.bare_extraction,
core.py:130-358) — parse + tree_cleaning + convert_tags +
bare_extraction_with_cascade (own → readability_fork → jusText, with the
7-branch arbiter + dedup gate + sanitize post-pass) +
metadata::extract_metadata (OG / meta-name / itemprop / JSON-LD / URL /
date) + extract_comments. The M2 Readability port is preserved verbatim
under extract_via_readability for callers who want the older path.
Every public type and signature is byte-unchanged from M2 except for ONE
additive field on Extracted (comments: String, defaulting to "")
— additive only, exhaustive struct-literal callers upgrade via
..Extracted::default() (the M2 Stage 4 pattern).
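The upgrade pattern for exhaustive struct-literal callers can be sketched as follows. Only `comments` and `word_count` are named on this page; the `title` and `text` fields, the `usize` type of `word_count`, and the `make_fixture` helper are assumptions made for illustration.

```rust
/// Hypothetical mirror of `Extracted`. Only `comments` and `word_count` are
/// named in the docs; `title`, `text`, and the field types are assumed.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Extracted {
    pub title: String,
    pub text: String,
    pub word_count: usize,
    /// The additive M3 field; `Default` yields the empty string.
    pub comments: String,
}

/// A caller written against the older shape: it names the fields it knows
/// and lets `Default` fill in the rest, so a later additive field does not
/// break the struct literal.
pub fn make_fixture() -> Extracted {
    Extracted {
        title: "Example".to_string(),
        text: "hello world".to_string(),
        word_count: 2,
        ..Extracted::default()
    }
}
```

Struct-update syntax is what keeps an added field source-compatible: a literal that ends in `..Extracted::default()` keeps compiling when `comments` appears.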
M2 Stage 1a/1b/1c/2 (HLD 2026.05.18 - HLD - mdrcel Readability Port (M2) §7.1–§7.4): the public API is unchanged but extract /
extract_with now run an idiomatic Rust port of Mozilla Readability
v0.6.0 — the parse spine (_removeScripts / _prepDocument), title
resolution, scoring, single top-candidate selection, sibling-append, the
FLAG_* retry / flag-sieve / longest-text fallback, the readability-page-1 page-wrap, AND (Stage 2) the full faithful _prepArticle:
_markDataTables (with the JS-faithful parse_int_js rowspan/colspan
coercion), _cleanConditionally (the complete shadiness checklist incl.
the data-table KEEP, ancestor-table KEEP, ancestor-code KEEP, and
image-gallery exception), _cleanHeaders, _cleanStyles,
_cleanMatchedNodes (share-strip), single-cell-<table> unwrap,
<h1>→<h2> retag, <br>-before-<p> removal. A page yielding an
article returns a populated Ok; a genuinely-empty extraction is a valid
empty Ok (the Bug-E2 doctrine — “found little” is success, never an
error and never ExtractError::NotImplemented). Full non-body metadata
is the last stage (HLD §7.6) and is deliberately not yet ported. The
ExtractError::NotImplemented variant is retained but is no longer
returned on the happy path.
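The rowspan/colspan coercion mentioned above follows JavaScript's `parseInt(value, 10)` rather than Rust's strict `str::parse`. The following is a minimal sketch of that behaviour, not the crate's actual `parse_int_js`: the name `parse_int_js_sketch`, the `Option<i64>` return type (with `None` standing in for `NaN`), and the saturating overflow handling are all assumptions.

```rust
/// Sketch of JavaScript `parseInt(s, 10)` semantics (hypothetical helper):
/// skip leading whitespace, accept one optional sign, read the longest run
/// of ASCII digits, and ignore whatever follows. `None` stands in for NaN.
pub fn parse_int_js_sketch(input: &str) -> Option<i64> {
    // Approximation: Rust's Unicode whitespace set is close to, but not
    // identical with, the set that JavaScript strips.
    let s = input.trim_start();
    let (negative, rest) = match s.strip_prefix('-') {
        Some(r) => (true, r),
        None => (false, s.strip_prefix('+').unwrap_or(s)),
    };
    // Longest prefix of ASCII digits; digits are one byte each, so the
    // byte index returned by `find` is a valid slice boundary.
    let end = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    let digits = &rest[..end];
    if digits.is_empty() {
        return None;
    }
    // Assumption: saturate on overflow instead of switching to floats.
    let mut value: i64 = 0;
    for b in digits.bytes() {
        value = value.saturating_mul(10).saturating_add(i64::from(b - b'0'));
    }
    Some(if negative { -value } else { value })
}
```

The difference from `"2px".parse::<i64>()`, which fails outright, is that trailing junk is tolerated and only a missing digit prefix produces the NaN case.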
HLD §4 anti-inversion (Stage 2 anchor). _cleanConditionally
deliberately KEEPS marked data tables (Readability.js:2461-2463 and the
ancestor-data-table check :2466-2468); the port faithfully preserves
EDGAR/HMRC financial tables exactly as Readability-JS does. The faithful
port converges TO Readability-JS — it does NOT out-clean it. Word-count
gaps versus a “narrative-only” human gold on table-heavy pages are
therefore the documented diagnostic residual, never a tuning signal.
There is intentionally no trait / strategy / plugin scaffolding here. The parent brief explicitly warns against premature abstraction (the “M8 Glasgow ring road” antipattern — on-ramps built to nowhere). The dispatcher between extraction strategies is a later-milestone concern and is added when the strategies actually exist, not speculatively now.
§The extract / extract_with invariant
The parent brief mandates: “Keep the default-Options path the same as
extract().” That invariant is guaranteed by construction rather than
by parallel maintenance: extract is literally
extract_with(html, base_url, &Options::default()). The two entry points
therefore cannot diverge — there is only one code path. A unit test pins
the equivalence so a future refactor that breaks it fails loudly.
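A minimal sketch of the by-construction invariant is shown below. The argument order `(html, base_url, &Options)` comes from the text above; the `Option<&str>` base URL, the `Result` return type, the `Options` field, and the placeholder extraction body are assumptions made so the example compiles on its own.

```rust
/// Hypothetical options type; the real `Options` fields are not shown here.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Options {
    pub include_comments: bool, // assumed knob, for illustration only
}

/// Hypothetical, reduced result type.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Extracted {
    pub text: String,
}

#[derive(Debug, PartialEq)]
pub enum ExtractError {
    NotImplemented,
}

/// The single real code path. The body is a placeholder that collapses
/// whitespace; an empty result is still a valid `Ok`.
pub fn extract_with(
    html: &str,
    _base_url: Option<&str>,
    _options: &Options,
) -> Result<Extracted, ExtractError> {
    let text = html.split_whitespace().collect::<Vec<_>>().join(" ");
    Ok(Extracted { text })
}

/// `extract` has no logic of its own: it only delegates with the defaults,
/// so the two entry points cannot drift apart.
pub fn extract(html: &str, base_url: Option<&str>) -> Result<Extracted, ExtractError> {
    extract_with(html, base_url, &Options::default())
}
```

A pinning test then only needs to assert that both calls return equal results for the same input, which would fail if a refactor ever gave `extract` a body of its own.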
§Word count
Extracted::word_count is the library’s own count over its own
extracted text. The differential test harness deliberately does not
trust it: the harness recomputes word count with its single canonical
tokenizer (harness HLD §8 — “The harness never trusts an external word
count”), exactly as it ignores each oracle’s self-reported count. The field
is provided for direct library consumers as a convenience;
it is informational, not the harness’s comparability surface.
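To illustrate the separation, a harness-side recount might look like the sketch below. The harness's canonical tokenizer is not specified on this page, so splitting on Unicode whitespace is an assumption, as are the reduced `Extracted` struct and the helper names `canonical_word_count` and `word_count_delta`.

```rust
/// Hypothetical, reduced view of the library's result type.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Extracted {
    pub text: String,
    pub word_count: usize, // self-reported by the library
}

/// Assumed stand-in for the harness tokenizer: count whitespace-separated
/// tokens. The real canonical tokenizer may differ.
pub fn canonical_word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Recount from the text and report how far the self-reported figure is
/// from the recount (positive means the library reported fewer words).
pub fn word_count_delta(extracted: &Extracted) -> i64 {
    let recomputed = canonical_word_count(&extracted.text) as i64;
    recomputed - extracted.word_count as i64
}
```

Applying the same recount to every extractor's text means any comparison depends on one tokenizer rather than on each tool's own definition of a word.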
Structs§
- Extracted
- The extracted main content of an HTML document, plus light metadata.
- Options
- Tuning knobs for extract_with.
Enums§
- ExtractError
- Errors returned by extract / extract_with.
Functions§
- extract
- Extract the main content of html.
- extract_to_csv
- Extract html and render the main content as CSV (or delimiter-separated values). Equivalent to Python’s extract(html, output_format="csv") per core.py:63-64 (returnstring = xmltocsv(document, options.formatting)).
- extract_to_json
- Extract html and render the main content as a JSON object.
- extract_to_markdown
- Extract html and render the main content as markdown.
- extract_to_tei
- Extract html and render the result as TEI-conformant XML (Text Encoding Initiative). Stricter than Trafilatura’s own XML format: the output runs through check_tei to fix invalid structures (move/wrap/relabel).
- extract_to_txt
- Extract html and render the main content as plain TXT.
- extract_to_xml
- Extract html and render the main content as Trafilatura-flavoured XML.
- extract_via_readability
- Extract via the M2 Mozilla Readability port (the previous default).
- extract_with
- Extract the main content of html with explicit Options.