readex 0.19.1

HTML main-content extraction (article body, title, metadata) — Rust ports of Mozilla Readability, Trafilatura, and htmldate.
Documentation

readex

Crates.io docs.rs License: MIT OR Apache-2.0

HTML main-content extraction for Rust. Give readex a &str of HTML and it returns the article body, title, byline, publish date, language, and ~15 other metadata fields — no network I/O, no JavaScript rendering, no encoding detection. Pure synchronous string and DOM work, suitable for embedding anywhere from a desktop tool to a server pipeline.


Quick start

[dependencies]
readex = "0.19"
use readex::{extract, Extracted};

let html = r#"
    <html>
      <head><title>Hello readex</title></head>
      <body>
        <article>
          <h1>Hello readex</h1>
          <p>This is the body of an article. It contains enough words
             that the extractor will consider it substantive content.</p>
          <p>A second paragraph adds more text so the scorer has signal.</p>
        </article>
      </body>
    </html>
"#;

let Extracted { title, text, .. } = extract(html, None).expect("extraction failed");

assert_eq!(title.as_deref(), Some("Hello readex"));
assert!(text.contains("body of an article"));

A more representative example

Real web pages come wrapped in navigation, cookie banners, share widgets, and comment sections. readex strips the chrome and returns just the body and the metadata it can recover:

use readex::extract;

let html = r#"
    <html lang="en">
      <head>
        <title>Why the bridge collapsed — The Daily Example</title>
        <meta property="og:site_name" content="The Daily Example">
        <meta name="author" content="Jane Reporter">
        <meta property="article:published_time" content="2026-05-24T09:30:00Z">
      </head>
      <body>
        <nav><a href="/">Home</a> <a href="/news">News</a></nav>
        <aside class="cookie-banner">We use cookies. <button>OK</button></aside>
        <article>
          <h1>Why the bridge collapsed</h1>
          <p class="byline">By Jane Reporter, 24 May 2026</p>
          <p>Investigators arrived on site shortly after dawn and began
             sampling the steelwork for fatigue cracks.</p>
          <p>The bridge, opened in 1972, had been scheduled for inspection
             next month. Engineers say the failure mode is consistent with
             corrosion at the western anchorage.</p>
        </article>
        <section class="comments">
          <h3>Comments (412)</h3>
          <p>"Knew this would happen" — anonymous</p>
        </section>
        <footer>© 2026 The Daily Example</footer>
      </body>
    </html>
"#;

let result = extract(html, Some("https://example.com/news/bridge")).unwrap();

assert_eq!(result.title.as_deref(), Some("Why the bridge collapsed"));
assert_eq!(result.byline.as_deref(), Some("Jane Reporter"));
assert_eq!(result.site_name.as_deref(), Some("The Daily Example"));
assert_eq!(result.language.as_deref(), Some("en"));
assert!(result.published_time.is_some());
assert!(result.text.contains("Investigators arrived on site"));
assert!(!result.text.contains("cookie"));          // banner stripped
assert!(!result.text.contains("Home"));            // nav stripped
assert!(!result.text.contains("Knew this would")); // comments stripped

readex carries the lineage of three well-validated extractors:

Origin Role inside readex
Mozilla Readability (JS) Article-scoring core — the M2 port preserves the full _grabArticle / _prepArticle / flag-sieve pipeline.
Trafilatura (Python) The M3 cascade — own → readability fork → jusText — with the 7-branch arbiter, dedup gate, and sanitize post-pass.
htmldate (Python) Publication-date extraction with the same precedence rules as upstream.

Each is a clean-room reimplementation in Rust; the upstream Python and JavaScript projects are the differential-test oracles, not vendored code.


API reference (cheat sheet)

Function Purpose
extract Default extraction. Returns an Extracted with title, body text, canonical URL, language, byline, excerpt, site name, published time, categories, tags, image, license, hostname, and (optionally) sanitised HTML.
extract_with extract(html, base_url) plus a third &Options parameter (so extract_with(html, base_url, &Options::default()) is exactly equivalent to extract(html, base_url)). Lets you opt into sanitised HTML output, set a minimum word-count threshold, or request a YAML metadata header.
extract_to_markdown Body as Markdown — Trafilatura's output_format="markdown".
extract_to_txt Plain-text body — Trafilatura's output_format="txt".
extract_to_json / extract_to_csv / extract_to_xml / extract_to_tei Structured output formats.
extract_via_readability Forces the M2 Mozilla-Readability path (older, simpler, no Trafilatura cascade). Useful when you specifically need that algorithm's output shape.

extract and extract_with(.., .., &Options::default()) are byte-identical by construction — extract is literally a one-line delegate, so the two cannot drift apart.


Why readex?

There are already a handful of HTML-extraction crates on crates.io. Honest positioning vs. the obvious alternatives:

readex readability dom_content_extraction
Algorithms Readability + Trafilatura cascade + htmldate Readability only DOM-centric (different family)
Metadata fields ~15 (title, byline, language, dates, OG/Twitter/JSON-LD, categories, tags, image, license, hostname …) Title + summary Body text only
Output formats text, sanitised HTML, Markdown, TXT, JSON, CSV, XML, TEI text only text only
Differential parity testing Yes — 51-URL corpus + 50K broad sweep, every release No No
Hard pin on parser versions Yes (html5ever 0.39.0, plus a documented "parser-equivalence fence") No No
Edition / MSRV 2024 / 1.85 2018 / older 2021 / older
Comments extraction (Reddit/vBulletin/etc.) Yes (via Trafilatura) No No
Date extraction Yes (via htmldate) No No

If your input is well-structured English-language articles and you want one algorithm with no extra moving parts, readability may be all you need. readex exists because real-world corpora (SEC filings, regulator publications, multilingual news, low-template blogs, hub/index pages) defeat single-algorithm extractors — the Trafilatura cascade was designed specifically for that long tail.


Quality & differential testing

readex is developed against a differential-test harness that runs every benchmark URL through three extractors in parallel — readex, Mozilla Readability (via Node), and Trafilatura (via Python) — and scores agreement across token sequences and metadata fields. The harness lives in benchmark/ in the repo (not published as part of the crate) and is re-run on every release.

Latest verdicts (as of 0.19.0):

Gate Corpus Result
Trafilatura extract_content (Markdown path) 51 URLs 48 / 51 byte-equivalent (41 substantive + 7 documented allowlist)
Trafilatura plain-text (TXT) path 51 URLs 45 / 51 substantive + 5 allowlist + 1 deferred
Trafilatura TEI structured output 51 URLs 51 / 51 (39 substantive + 12 allowlist)
Mozilla Readability textContent 51 URLs 50 / 51 byte-equivalent vs. jsdom
Parser equivalence (rcdom vs. jsdom) 51 URLs 51 / 51 byte-equivalent DOM
Broad-sweep confidence (Common Crawl) 50,000 pages Tail-distribution scan vs. Python Trafilatura

The "allowlist" entries are documented per-page divergences where readex and upstream genuinely disagree for traced reasons (e.g. upstream emits a cookie banner the page lacks chrome-class hints for; or upstream skips a table the data-table heuristic rescues). They live under wrk_docs/m{5,7}-allowlist/ in the repo with one Markdown file per fixture.

If you find a page where readex disagrees with both Readability and Trafilatura in a way that matters, please file an issue with the URL or HTML — the harness will pick it up.


What is out of scope

  • Network fetching. readex takes a &str. The caller owns HTTP, redirects, SSRF guarding, and encoding detection.
  • JavaScript rendering. readex parses the bytes as given. Pages that need JS to render their body need a headless browser upstream.
  • PDF extraction. HTML only.
  • Streaming. The whole document is parsed at once.

These boundaries keep the crate sync, dependency-light, and easy to embed.


Status

0.19.0 is the first public crates.io release. The API surface is:

  • Stable: extract, extract_with, extract_to_markdown, extract_to_txt, extract_to_json, extract_to_csv, extract_to_xml, extract_to_tei, extract_via_readability, plus the Extracted, Options, and ExtractError types they use.
  • #[doc(hidden)] internals: readability::*, trafilatura::*, htmldate::*. These are reachable but explicitly not part of the semver contract — they exist for the in-workspace differential test harness. Treat them as private; they can change at any time.

readex is at 0.x — additive minor bumps may add fields to Extracted or Options; breaking changes (renames, signature changes) will only land on a 0.X.0 boundary with a clear changelog entry.

Minimum supported Rust version (MSRV)

readex targets Rust 1.85+ (Rust 2024 edition).


Contributing

Issues and PRs welcome at https://github.com/0x4D44/readex. For non-trivial changes, please open an issue first so we can discuss the approach — readex is gated by a parity-test harness against Readability and Trafilatura, and the cheapest path through that gate is usually a quick sketch of intent before code.


License

Licensed under either of:

at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in readex by you, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.

See NOTICE for attribution to the upstream Readability, Trafilatura, and htmldate projects whose algorithms readex ports.