Crate readex

readex — main-content extraction for arbitrary HTML.

readex takes a &str of HTML plus an optional base URL and returns the page’s main textual content together with a little metadata. It performs no network I/O, no JavaScript rendering, and no encoding detection — the caller owns all of that (parent brief 2026.05.16 - BRIEF - Rust Content Extraction Library.md, “What is explicitly OUT of scope”). The crate is pure, synchronous, std-only string/DOM work; a caller that needs it off the async hot path wraps it in spawn_blocking.

§Milestone status

M3 Stage 9 (HLD 2026.05.19 - HLD - mdrcel Trafilatura Port (M3) §7.6, THE M3 FINALE): the public extract / extract_with functions now drive the full Trafilatura cascade (core.bare_extraction, core.py:130-358) — parse + tree_cleaning + convert_tags + bare_extraction_with_cascade (own → readability_fork → jusText, with the 7-branch arbiter + dedup gate + sanitize post-pass) + metadata::extract_metadata (OG / meta-name / itemprop / JSON-LD / URL / date) + extract_comments. The M2 Readability port is preserved verbatim under extract_via_readability for callers who want the older path. Every public type and signature is byte-unchanged from M2 except for ONE additive field on Extracted (comments: String, defaulting to "") — additive only, exhaustive struct-literal callers upgrade via ..Extracted::default() (the M2 Stage 4 pattern).
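The additive-field upgrade path described above can be illustrated with a small stand-in. This is a sketch, not the crate's real definition: only `comments` and `word_count` are named in these docs, so the `title` and `text` fields, and the `usize` type of `word_count`, are assumptions made for illustration.

```rust
// Stand-in for the crate's `Extracted`. Only `comments` and `word_count`
// are named in the docs; `title`, `text`, and the `usize` type are assumed.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Extracted {
    pub title: String,     // hypothetical field
    pub text: String,      // hypothetical field
    pub word_count: usize, // type assumed
    pub comments: String,  // the additive M3 field; `Default` yields ""
}

/// A caller written against the M2 shape: it names only the fields it knows
/// about and lets struct-update syntax fill in anything added later.
pub fn m2_style_literal() -> Extracted {
    Extracted {
        title: "Example".to_string(),
        text: "Body text".to_string(),
        word_count: 2,
        // Without this line, adding `comments` would break the literal.
        ..Extracted::default()
    }
}
```

Because the literal ends in `..Extracted::default()`, a later additive field compiles without any change at the call site and takes its default value.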

M2 Stage 1a/1b/1c/2 (HLD 2026.05.18 - HLD - mdrcel Readability Port (M2) §7.1–§7.4): the public API is unchanged but extract / extract_with now run an idiomatic Rust port of Mozilla Readability v0.6.0: the parse spine (_removeScripts / _prepDocument), title resolution, scoring, single top-candidate selection, sibling-append, the FLAG_* retry / flag-sieve / longest-text fallback, the readability-page-1 page-wrap, and (Stage 2) the full faithful _prepArticle: _markDataTables (with the JS-faithful parse_int_js rowspan/colspan coercion), _cleanConditionally (the complete shadiness checklist, including the data-table KEEP, ancestor-table KEEP, ancestor-code KEEP, and image-gallery exception), _cleanHeaders, _cleanStyles, _cleanMatchedNodes (share-strip), single-cell-<table> unwrap, <h1>-to-<h2> retag, and <br>-before-<p> removal. A page yielding an article returns a populated Ok; a genuinely empty extraction is a valid empty Ok (the Bug-E2 doctrine: “found little” is success, never an error and never ExtractError::NotImplemented). Full non-body metadata is the last stage (HLD §7.6) and is deliberately not yet ported at this milestone. The ExtractError::NotImplemented variant is retained but is no longer returned on the happy path.
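To make the "JS-faithful" rowspan/colspan coercion concrete, here is a minimal sketch of `parseInt(s, 10)`-style parsing. It is not the crate's actual `parse_int_js`: the signature, the use of `Option<i64>` for NaN, the saturating overflow behaviour, and the `span_or_one` helper (modelling a JavaScript `value || 1` fallback) are all assumptions. Rust's `trim_start` also only approximates JavaScript's whitespace set.

```rust
/// Sketch of `parseInt(s, 10)` semantics: skip leading whitespace, accept one
/// optional sign, consume leading ASCII digits, and ignore any trailing text.
/// `None` stands in for JavaScript's `NaN`.
pub fn parse_int_js(s: &str) -> Option<i64> {
    let s = s.trim_start();
    let (negative, rest) = match s.as_bytes().first() {
        Some(b'-') => (true, &s[1..]),
        Some(b'+') => (false, &s[1..]),
        _ => (false, s),
    };
    let mut value: i64 = 0;
    let mut seen_digit = false;
    for b in rest.bytes() {
        if !b.is_ascii_digit() {
            break; // parsing stops at the first non-digit, as in JS
        }
        seen_digit = true;
        // JS would continue in floating point; this sketch saturates instead.
        value = value.saturating_mul(10).saturating_add(i64::from(b - b'0'));
    }
    if !seen_digit {
        return None; // "", "abc", "+" all parse to NaN in JS
    }
    Some(if negative { -value } else { value })
}

/// Hypothetical helper modelling `parseInt(attr, 10) || 1`: NaN and 0 are
/// both falsy in JavaScript, so either one falls back to a span of 1.
pub fn span_or_one(attr: &str) -> i64 {
    parse_int_js(attr).filter(|n| *n != 0).unwrap_or(1)
}
```

The point of matching JavaScript here, rather than using `str::parse`, is that attribute values such as `"2px"` or `" 3"` fail a strict Rust parse but still yield a number under `parseInt`, which would change row and column counts.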

HLD §4 anti-inversion (Stage 2 anchor). _cleanConditionally deliberately KEEPS marked data tables (Readability.js:2461-2463 and the ancestor-data-table check :2466-2468); the port faithfully preserves EDGAR/HMRC financial tables exactly as Readability-JS does. The faithful port converges TO Readability-JS — it does NOT out-clean it. Word-count gaps versus a “narrative-only” human gold on table-heavy pages are therefore the documented diagnostic residual, never a tuning signal.

There is intentionally no trait / strategy / plugin scaffolding here. The parent brief explicitly warns against premature abstraction (the “M8 Glasgow ring road” antipattern — on-ramps built to nowhere). The dispatcher between extraction strategies is a later-milestone concern and is added when the strategies actually exist, not speculatively now.

§The extract / extract_with invariant

The parent brief mandates: “Keep the default-Options path the same as extract().” That invariant is guaranteed by construction rather than by parallel maintenance: extract is literally extract_with(html, base_url, &Options::default()). The two entry points therefore cannot diverge — there is only one code path. A unit test pins the equivalence so a future refactor that breaks it fails loudly.
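The by-construction guarantee can be sketched as follows. The delegation line matches the call quoted above; everything else is a stand-in: the `Option<&str>` base-URL type, the `Result` return type, the `favor_precision` field on `Options`, the fields of `Extracted`, and the placeholder body of `extract_with` are assumptions, not the crate's real definitions.

```rust
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Options {
    pub favor_precision: bool, // hypothetical tuning knob
}

#[derive(Debug, Clone, Default, PartialEq)]
pub struct Extracted {
    pub text: String,      // hypothetical field
    pub word_count: usize, // type assumed
}

#[derive(Debug, PartialEq)]
pub enum ExtractError {
    NotImplemented,
}

/// Placeholder body only: the real function drives the extraction cascade.
pub fn extract_with(
    html: &str,
    _base_url: Option<&str>,
    _options: &Options,
) -> Result<Extracted, ExtractError> {
    let text = html.trim().to_string();
    let word_count = text.split_whitespace().count();
    Ok(Extracted { text, word_count })
}

/// The default path is a one-line delegation, so there is no second code
/// path that could drift away from `extract_with`.
pub fn extract(html: &str, base_url: Option<&str>) -> Result<Extracted, ExtractError> {
    extract_with(html, base_url, &Options::default())
}

/// A pinning test in the spirit of the one described above.
#[test]
fn extract_matches_extract_with_defaults() {
    let html = "<p>hello world</p>";
    assert_eq!(
        extract(html, None),
        extract_with(html, None, &Options::default())
    );
}
```

The test adds little while `extract` stays a one-liner; its value is that it fails as soon as someone gives `extract` a body of its own.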

§Word count

Extracted::word_count is the library’s own count over its own extracted text. The differential test harness deliberately does not trust it: the harness recomputes word count with its single canonical tokenizer (harness HLD §8: “The harness never trusts an external word count”), exactly as it ignores each oracle’s self-reported count. The field is provided as a convenience for direct library consumers; it is informational, not the harness’s comparability surface.
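A rough sketch of the "recompute, never trust" rule follows. The harness's real tokenizer is not described here, so whitespace splitting is used purely as a stand-in, and the names `canonical_word_count`, `Measured`, and `measure` are hypothetical.

```rust
/// Hypothetical stand-in for the harness's canonical tokenizer: counts
/// maximal runs of non-whitespace characters.
pub fn canonical_word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// What a harness might record for one extractor output.
#[derive(Debug, PartialEq)]
pub struct Measured {
    pub reported: usize,   // the extractor's self-reported count, kept for diagnostics
    pub recomputed: usize, // the only figure used for comparison
}

/// Recompute the count from the text itself, ignoring the reported value
/// for comparison purposes.
pub fn measure(text: &str, reported: usize) -> Measured {
    Measured {
        reported,
        recomputed: canonical_word_count(text),
    }
}
```

Comparing only `recomputed` values means every extractor's output goes through the same tokenizer, so differing definitions of a "word" cannot skew the comparison.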

Structs§

Extracted
The extracted main content of an HTML document, plus light metadata.
Options
Tuning knobs for extract_with.

Enums§

ExtractError
Errors returned by extract / extract_with.

Functions§

extract
Extract the main content of html.
extract_to_csv
Extract html and render the main content as CSV (or delimiter-separated values). Equivalent to Python’s extract(html, output_format="csv") per core.py:63-64 (returnstring = xmltocsv(document, options.formatting)).
extract_to_json
Extract html and render the main content as a JSON object.
extract_to_markdown
Extract html and render the main content as markdown.
extract_to_tei
Extract html and render the result as TEI-conformant XML (Text Encoding Initiative). Stricter than Trafilatura’s own XML format: the output runs through check_tei, which fixes invalid structures (move/wrap/relabel).
extract_to_txt
Extract html and render the main content as plain TXT.
extract_to_xml
Extract html and render the main content as Trafilatura-flavoured XML.
extract_via_readability
Extract via the M2 Mozilla Readability port (the previous default).
extract_with
Extract the main content of html with explicit Options.