Crate libreadability

Readability article extraction library.

libreadability extracts the main article content from web pages by analyzing DOM structure, scoring content density, and removing boilerplate. It is a Rust port of readability by readeck, itself a Go port of Mozilla’s Readability.js.
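To make "scoring content density" concrete, here is a minimal, self-contained sketch of the general idea. It is not libreadability's actual algorithm: the `Block` struct, the `link_density`, `score`, and `best_block` functions, and the weights are all hypothetical and chosen only for illustration. Text-heavy blocks score high, and blocks whose text is mostly links (typical of navigation) are penalized.

```rust
/// Hypothetical, simplified view of one DOM block: its tag name,
/// its visible text, and how many of those characters sit inside links.
struct Block {
    tag: &'static str,
    text: &'static str,
    link_chars: usize,
}

/// Fraction of the block's text that is link text (0.0 for empty blocks).
fn link_density(block: &Block) -> f64 {
    let total = block.text.chars().count();
    if total == 0 {
        return 0.0;
    }
    block.link_chars as f64 / total as f64
}

/// Toy score (assumed weights): longer text and more commas raise the
/// score, and a high link density lowers it.
fn score(block: &Block) -> f64 {
    let chars = block.text.chars().count() as f64;
    let commas = block.text.matches(',').count() as f64;
    let base = 1.0 + commas + (chars / 100.0).min(3.0);
    base * (1.0 - link_density(block))
}

/// Return the tag of the highest-scoring block, if there is one.
fn best_block(blocks: &[Block]) -> Option<&'static str> {
    blocks
        .iter()
        .max_by(|a, b| score(a).total_cmp(&score(b)))
        .map(|b| b.tag)
}

fn main() {
    let blocks = [
        Block { tag: "nav", text: "Home About Contact", link_chars: 16 },
        Block {
            tag: "article",
            text: "The readability algorithm scores content density, identifies \
                   the primary article content, and strips boilerplate.",
            link_chars: 0,
        },
        Block { tag: "aside", text: "Sidebar content", link_chars: 0 },
    ];
    println!("{:?}", best_block(&blocks));
}
```

A real extractor works on a parsed DOM tree and combines many more signals, so treat this only as an intuition for why navigation and sidebars lose out to the article body.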

§Quick start

use libreadability::Parser;

let html = r#"<html><body>
  <nav>Navigation links</nav>
  <article><p>This is the main article body with enough text to be extracted.</p>
  <p>The readability algorithm scores content density and identifies the
  primary article content, stripping navigation, ads, and other boilerplate.</p></article>
  <aside>Sidebar content</aside>
</body></html>"#;

let mut parser = Parser::new();
let article = parser.parse(html, None).expect("valid HTML");
assert!(!article.content.is_empty());
assert!(!article.text_content.is_empty());

§Output

Article contains both cleaned HTML (content) and plain text (text_content), plus metadata like title, byline, excerpt, published time, and text direction.
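The relationship between the two representations can be pictured with a small sketch: the plain text is, roughly, the cleaned HTML with its markup removed. The `strip_tags` function below is a hypothetical, deliberately naive illustration and is not how libreadability derives text_content.

```rust
/// Naive tag stripper: drops everything between '<' and '>' and collapses
/// runs of whitespace. Illustration only: it does not decode entities, skip
/// comments or <script>/<style> bodies, and it treats every tag (including
/// inline ones) as a word boundary.
fn strip_tags(html: &str) -> String {
    let mut text = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                // A closed tag separates words, e.g. between two paragraphs.
                text.push(' ');
            }
            _ if !in_tag => text.push(ch),
            _ => {}
        }
    }
    // Collapse whitespace so tag boundaries become single spaces.
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let content = "<div><p>First paragraph.</p><p>Second paragraph.</p></div>";
    println!("{}", strip_tags(content));
}
```

In practice, prefer the text_content field that the library already provides over re-deriving plain text from content.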

Structs§

Article
The extracted article content and metadata.
Parser
The core readability extraction engine, ported from the Parser of the Go library.

Enums§

Error