A Rust port of Mozilla’s Readability algorithm for extracting the main article content from an HTML page.
§Quick start
use readable_rs::{extract, ExtractOptions};
let html = "<html><body><article><p>The actual article text goes here.</p></article></body></html>";
let product = extract(html, "https://example.com/article", ExtractOptions::default());
// product.content holds the extracted DOM (or None if nothing was found)
// product.title, product.by_line, product.sitename, etc. hold metadata
§Module layout
- Top level – extract is the single entry point. Product and ExtractOptions are the main public types.
- parser – thin wrappers around the underlying HTML parser (parser::NodeRef, parser::parse_html).
- shared_utils – a curated set of DOM helpers useful when post-processing the extracted content (URL resolution, text normalisation, etc.).
- NodeExt / NodeScoreStore – the trait and store that the scorer uses to attach readability metadata to DOM nodes without modifying the nodes themselves.
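The last item describes a side-table design: readability scores live in a separate store keyed by node, so the DOM itself is never modified. The following is a minimal, standard-library-only sketch of that general pattern. It is not the crate's implementation: Node, ScoreStore, add_score, score and best_candidate_tag are hypothetical names invented for illustration, and keying by pointer identity is an assumption about how such a store could work.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for a DOM node (the real crate exposes `parser::NodeRef`).
struct Node {
    tag: &'static str,
}

/// Hypothetical side table mapping a node's pointer identity to a score.
/// Entries are only meaningful while the corresponding `Rc<Node>` is alive.
#[derive(Default)]
struct ScoreStore {
    scores: HashMap<*const Node, f64>,
}

impl ScoreStore {
    /// Add `delta` to the node's score, starting from 0.0 if none is recorded.
    fn add_score(&mut self, node: &Rc<Node>, delta: f64) {
        *self.scores.entry(Rc::as_ptr(node)).or_insert(0.0) += delta;
    }

    /// Return the score recorded for this node, if any.
    fn score(&self, node: &Rc<Node>) -> Option<f64> {
        self.scores.get(&Rc::as_ptr(node)).copied()
    }
}

/// Score two nodes through the store and return the tag of the higher-scoring one.
/// The nodes themselves are never mutated.
fn best_candidate_tag() -> &'static str {
    let article = Rc::new(Node { tag: "article" });
    let nav = Rc::new(Node { tag: "nav" });

    let mut store = ScoreStore::default();
    store.add_score(&article, 25.0);
    store.add_score(&article, 5.0);
    store.add_score(&nav, -10.0);

    let candidates = [article, nav];
    candidates
        .iter()
        .max_by(|a, b| {
            let sa = store.score(a).unwrap_or(0.0);
            let sb = store.score(b).unwrap_or(0.0);
            sa.total_cmp(&sb)
        })
        .map(|n| n.tag)
        .unwrap_or("")
}
```

Keeping scores outside the nodes means the tree needs no interior mutability and the metadata can be dropped in one step once extraction finishes. The trade-off is that the store has to be passed around alongside the tree.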
Modules§
- parser
- Thin wrappers around the underlying HTML parser.
- shared_utils
- Convenience re-exports of DOM helpers for post-processing extracted content.
Structs§
- ExtractOptions
- Knobs that control the behaviour of the extraction algorithm.
- NodeScoreStore
- An external store that maps DOM nodes to readability metadata without mutating the nodes themselves.
- Product
- The output of crate::extract. Contains the extracted article content as a DOM subtree together with any metadata that was found.
Traits§
- NodeExt
- The trait that the scorer uses to attach readability metadata to DOM nodes.
Functions§
- extract
- Extract the main article content from an HTML page.
- new_html_element
- Create a new, detached HTML element node with the given tag name and no attributes or children.