readex
HTML main-content extraction for Rust. Give readex a &str of HTML and
it returns the article body, title, byline, publish date, language, and
~15 other metadata fields — no network I/O, no JavaScript rendering, no
encoding detection. Pure synchronous string and DOM work, suitable for
embedding anywhere from a desktop tool to a server pipeline.
Quick start
[dependencies]
readex = "0.19"
use readex::{extract, Extracted};
let html = r#"
<html>
<head><title>Hello readex</title></head>
<body>
<article>
<h1>Hello readex</h1>
<p>This is the body of an article. It contains enough words
that the extractor will consider it substantive content.</p>
<p>A second paragraph adds more text so the scorer has signal.</p>
</article>
</body>
</html>
"#;
let Extracted = extract.expect;
assert_eq!;
assert!;
A more representative example
Real web pages come wrapped in navigation, cookie banners, share widgets, and
comment sections. readex strips the chrome and returns just the body and
the metadata it can recover:
use readex::extract;
let html = r#"
<html lang="en">
<head>
<title>Why the bridge collapsed — The Daily Example</title>
<meta property="og:site_name" content="The Daily Example">
<meta name="author" content="Jane Reporter">
<meta property="article:published_time" content="2026-05-24T09:30:00Z">
</head>
<body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<aside class="cookie-banner">We use cookies. <button>OK</button></aside>
<article>
<h1>Why the bridge collapsed</h1>
<p class="byline">By Jane Reporter, 24 May 2026</p>
<p>Investigators arrived on site shortly after dawn and began
sampling the steelwork for fatigue cracks.</p>
<p>The bridge, opened in 1972, had been scheduled for inspection
next month. Engineers say the failure mode is consistent with
corrosion at the western anchorage.</p>
</article>
<section class="comments">
<h3>Comments (412)</h3>
<p>"Knew this would happen" — anonymous</p>
</section>
<footer>© 2026 The Daily Example</footer>
</body>
</html>
"#;
let result = extract.unwrap;
assert_eq!;
assert_eq!;
assert_eq!;
assert_eq!;
assert!;
assert!;
assert!; // banner stripped
assert!; // nav stripped
assert!; // comments stripped
readex carries the lineage of three well-validated extractors:
| Origin | Role inside readex |
|---|---|
| Mozilla Readability (JS) | Article-scoring core — the M2 port preserves the full _grabArticle / _prepArticle / flag-sieve pipeline. |
| Trafilatura (Python) | The M3 cascade — own → readability fork → jusText — with the 7-branch arbiter, dedup gate, and sanitize post-pass. |
| htmldate (Python) | Publication-date extraction with the same precedence rules as upstream. |
Each is a clean-room reimplementation in Rust; the upstream Python and JavaScript projects are the differential-test oracles, not vendored code.
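To make the cascade idea concrete, here is a deliberately simplified sketch of a fallback chain: try each extractor in order and accept the first result that clears a word-count bar. Everything in it (the `Strategy` alias, `run_cascade`, the toy `primary` and `fallback` extractors, and the word-count acceptance rule) is hypothetical and is not readex's internal API. The real 7-branch arbiter, dedup gate, and sanitize post-pass are not modelled.

```rust
/// Hypothetical stand-in for one extraction strategy:
/// takes HTML and returns body text if it found any.
type Strategy = fn(&str) -> Option<String>;

/// Try each named strategy in order and return the first result that
/// has at least `min_words` whitespace-separated words.
fn run_cascade(
    html: &str,
    strategies: &[(&'static str, Strategy)],
    min_words: usize,
) -> Option<(&'static str, String)> {
    strategies.iter().find_map(|(name, strategy)| {
        // A strategy that finds nothing yields None; move on to the next.
        let text = strategy(html)?;
        if text.split_whitespace().count() >= min_words {
            Some((*name, text))
        } else {
            None
        }
    })
}

/// Toy "primary" extractor: only succeeds when a bare <article> tag exists.
fn primary(html: &str) -> Option<String> {
    let start = html.find("<article>")? + "<article>".len();
    let end = html[start..].find("</article>")? + start;
    Some(html[start..end].trim().to_string())
}

/// Toy "fallback" extractor: drops everything between '<' and '>'.
fn fallback(html: &str) -> Option<String> {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' '); // keep words on either side of a tag apart
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    let text = out.split_whitespace().collect::<Vec<_>>().join(" ");
    if text.is_empty() { None } else { Some(text) }
}
```

The return value carries the name of the strategy that won, which is the kind of information a differential harness would want when tracing why two extractors disagree.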
API reference (cheat sheet)
| Function | Purpose |
|---|---|
| extract | Default extraction. Returns an Extracted with title, body text, canonical URL, language, byline, excerpt, site name, published time, categories, tags, image, license, hostname, and (optionally) sanitised HTML. |
| extract_with | extract(html, base_url) plus a third &Options parameter (so extract_with(html, base_url, &Options::default()) is exactly equivalent to extract(html, base_url)). Lets you opt into sanitised HTML output, set a minimum word-count threshold, or request a YAML metadata header. |
| extract_to_markdown | Body as Markdown (Trafilatura's output_format="markdown"). |
| extract_to_txt | Plain-text body (Trafilatura's output_format="txt"). |
| extract_to_json / extract_to_csv / extract_to_xml / extract_to_tei | Structured output formats. |
| extract_via_readability | Forces the M2 Mozilla-Readability path (older, simpler, no Trafilatura cascade). Useful when you specifically need that algorithm's output shape. |
extract and extract_with(.., .., &Options::default()) are byte-identical by
construction — extract is literally a one-line delegate, so the two cannot
drift apart.
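The delegate pattern described above can be sketched in isolation. The type names Extracted, Options, and ExtractError come from this README, but the fields, the error variant, the `&str` type of `base_url`, and the function bodies below are illustrative assumptions, not readex's actual definitions.

```rust
/// Hypothetical options struct; the real Options has different fields.
#[derive(Debug, Clone, PartialEq, Default)]
struct Options {
    /// Assumed knob: reject results with fewer words than this.
    min_words: usize,
}

/// Hypothetical result struct with a single assumed field.
#[derive(Debug, Clone, PartialEq)]
struct Extracted {
    text: String,
}

/// Hypothetical error type with one assumed variant.
#[derive(Debug, PartialEq)]
enum ExtractError {
    TooShort,
}

/// The configurable entry point holds all of the logic.
/// The body here is a placeholder that only normalises whitespace.
fn extract_with(
    html: &str,
    _base_url: &str,
    options: &Options,
) -> Result<Extracted, ExtractError> {
    let text = html.split_whitespace().collect::<Vec<_>>().join(" ");
    if text.split_whitespace().count() < options.min_words {
        return Err(ExtractError::TooShort);
    }
    Ok(Extracted { text })
}

/// The default entry point is a one-line delegate, so it cannot
/// diverge from `extract_with` called with default options.
fn extract(html: &str, base_url: &str) -> Result<Extracted, ExtractError> {
    extract_with(html, base_url, &Options::default())
}
```

Because `extract` contains no logic of its own, any behavioural change has to go through `extract_with`, and a single equality assertion over a corpus is enough to guard the equivalence.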
Why readex?
There are already a handful of HTML-extraction crates on crates.io. Honest positioning vs. the obvious alternatives:
| | readex | readability | dom_content_extraction |
|---|---|---|---|
| Algorithms | Readability + Trafilatura cascade + htmldate | Readability only | DOM-centric (different family) |
| Metadata fields | ~15 (title, byline, language, dates, OG/Twitter/JSON-LD, categories, tags, image, license, hostname …) | Title + summary | Body text only |
| Output formats | text, sanitised HTML, Markdown, TXT, JSON, CSV, XML, TEI | text only | text only |
| Differential parity testing | Yes: 51-URL corpus + 50K broad sweep, every release | No | No |
| Hard pin on parser versions | Yes (html5ever 0.39.0, plus a documented "parser-equivalence fence") | No | No |
| Edition / MSRV | 2024 / 1.85 | 2018 / older | 2021 / older |
| Comments extraction (Reddit/vBulletin/etc.) | Yes (via Trafilatura) | No | No |
| Date extraction | Yes (via htmldate) | No | No |
If your input is well-structured English-language articles and you want one
algorithm with no extra moving parts, readability may be all you need.
readex exists because real-world corpora (SEC filings, regulator
publications, multilingual news, low-template blogs, hub/index pages) defeat
single-algorithm extractors — the Trafilatura cascade was designed
specifically for that long tail.
Quality & differential testing
readex is developed against a differential-test harness that runs every
benchmark URL through three extractors in parallel — readex, Mozilla
Readability (via Node), and Trafilatura (via Python) — and scores agreement
across token sequences and metadata fields. The harness lives in
benchmark/ in the repo (not published as part of the crate) and is
re-run on every release.
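As an illustration of what scoring agreement across token sequences can look like, the sketch below computes a similarity ratio from the longest common subsequence (LCS) of whitespace tokens. The metric and the function names (`tokens`, `lcs_len`, `agreement`) are assumptions chosen for the example; the harness's actual scoring rules are not described in this README.

```rust
/// Split extractor output into whitespace-separated tokens.
fn tokens(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Length of the longest common subsequence of two token slices,
/// computed with the classic dynamic programme using two rows.
fn lcs_len(a: &[&str], b: &[&str]) -> usize {
    let mut prev = vec![0usize; b.len() + 1];
    for x in a {
        let mut curr = vec![0usize; b.len() + 1];
        for (j, y) in b.iter().enumerate() {
            curr[j + 1] = if x == y {
                prev[j] + 1
            } else {
                prev[j + 1].max(curr[j])
            };
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Agreement in [0, 1]: 2 * LCS / (len_a + len_b).
/// Two empty outputs are treated as fully agreeing.
fn agreement(a: &str, b: &str) -> f64 {
    let (ta, tb) = (tokens(a), tokens(b));
    if ta.is_empty() && tb.is_empty() {
        return 1.0;
    }
    2.0 * lcs_len(&ta, &tb) as f64 / (ta.len() + tb.len()) as f64
}
```

An order-sensitive measure like this penalises an extractor that emits the right paragraphs in the wrong order, which a bag-of-words overlap would miss.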
Latest verdicts (as of 0.19.0):
| Gate | Corpus | Result |
|---|---|---|
| Trafilatura extract_content (Markdown path) | 51 URLs | 48 / 51 byte-equivalent (41 substantive + 7 documented allowlist) |
| Trafilatura plain-text (TXT) path | 51 URLs | 45 / 51 substantive + 5 allowlist + 1 deferred |
| Trafilatura TEI structured output | 51 URLs | 51 / 51 (39 substantive + 12 allowlist) |
| Mozilla Readability textContent | 51 URLs | 50 / 51 byte-equivalent vs. jsdom |
| Parser equivalence (rcdom vs. jsdom) | 51 URLs | 51 / 51 byte-equivalent DOM |
| Broad-sweep confidence (Common Crawl) | 50,000 pages | Tail-distribution scan vs. Python Trafilatura |
The "allowlist" entries are documented per-page divergences where readex
and upstream genuinely disagree for traced reasons (e.g. upstream emits a
cookie banner the page lacks chrome-class hints for; or upstream skips a
table the data-table heuristic rescues). They live under
wrk_docs/m{5,7}-allowlist/ in the repo with one Markdown file per
fixture.
If you find a page where readex disagrees with both Readability and
Trafilatura in a way that matters, please file an issue with the URL or
HTML — the harness will pick it up.
What is out of scope
- Network fetching. readex takes a &str. The caller owns HTTP, redirects, SSRF guarding, and encoding detection.
- JavaScript rendering. readex parses the bytes as given. Pages that need JS to render their body need a headless browser upstream.
- PDF extraction. HTML only.
- Streaming. The whole document is parsed at once.
These boundaries keep the crate sync, dependency-light, and easy to embed.
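Since the caller owns decoding, fetched bytes have to become a &str before they reach readex. The sketch below is one minimal way to do that with only the standard library. The `decode_html` helper is hypothetical, and the Latin-1 fallback is a simplifying assumption; a production pipeline would normally honour the HTTP Content-Type charset or a meta charset declaration instead.

```rust
use std::borrow::Cow;

/// Turn raw response bytes into text suitable for a `&str`-based API.
///
/// Strategy (an assumption for this sketch, not readex behaviour):
/// 1. Drop a leading UTF-8 byte-order mark if present.
/// 2. Accept the bytes as-is when they are valid UTF-8 (no copy).
/// 3. Otherwise decode as ISO-8859-1, where every byte maps directly
///    to the Unicode code point with the same value.
fn decode_html(bytes: &[u8]) -> Cow<'_, str> {
    let bytes = bytes.strip_prefix(b"\xEF\xBB\xBF").unwrap_or(bytes);
    match std::str::from_utf8(bytes) {
        Ok(text) => Cow::Borrowed(text),
        Err(_) => Cow::Owned(bytes.iter().map(|&b| b as char).collect()),
    }
}
```

Returning a `Cow` means the common valid-UTF-8 case borrows the input buffer, and only the fallback path allocates. The result dereferences to &str, so it can be passed straight to a function that takes one.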
Status
0.19.0 is the first public crates.io release. The API surface is:
- Stable: extract, extract_with, extract_to_markdown, extract_to_txt, extract_to_json, extract_to_csv, extract_to_xml, extract_to_tei, extract_via_readability, plus the Extracted, Options, and ExtractError types they use.
- #[doc(hidden)] internals: readability::*, trafilatura::*, htmldate::*. These are reachable but explicitly not part of the semver contract; they exist for the in-workspace differential test harness. Treat them as private; they can change at any time.
readex is at 0.x — additive minor bumps may add fields to Extracted or
Options; breaking changes (renames, signature changes) will only land on
a 0.X.0 boundary with a clear changelog entry.
Minimum supported Rust version (MSRV)
readex targets Rust 1.85+ (Rust 2024 edition).
Contributing
Issues and PRs welcome at https://github.com/0x4D44/readex. For
non-trivial changes, please open an issue first so we can discuss the
approach — readex is gated by a parity-test harness against Readability
and Trafilatura, and the cheapest path through that gate is usually a
quick sketch of intent before code.
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally
submitted for inclusion in readex by you, as defined in the Apache-2.0
license, shall be dual-licensed as above, without any additional terms or
conditions.
See NOTICE for attribution to the upstream Readability,
Trafilatura, and htmldate projects whose algorithms readex ports.