§Legible
A Rust port of Mozilla’s Readability.js for extracting readable content from web pages.
This library provides functionality to extract the main content from HTML documents, stripping away navigation, ads, and other non-content elements to produce clean, readable article content.
§Quick Start
use legible::parse;
let html = r#"
<html>
<head><title>My Article</title></head>
<body>
<nav>Navigation</nav>
<article>
<h1>Article Title</h1>
<p>This is the main content of the article. It contains several
paragraphs of text that make up the body of the article.</p>
<p>More content here to ensure we have enough text for the
readability algorithm to work with properly.</p>
</article>
<footer>Footer</footer>
</body>
</html>
"#;
match parse(html, Some("https://example.com"), None) {
Ok(article) => {
println!("Title: {}", article.title);
println!("Byline: {:?}", article.byline);
println!("Content: {}", article.content);
println!("Text: {}", article.text_content);
}
Err(e) => eprintln!("Error: {}", e),
}
The returned Article contains:
- title - The article title
- content - The article content as HTML
- text_content - The article content as plain text
- byline - The author byline
- excerpt - A short excerpt from the article
- site_name - The site name
- published_time - The published time
- dir - Text direction (ltr or rtl)
- lang - Document language
- length - Length of the text content
§Checking Readability
You can quickly check if a document is likely to be parseable without running the full algorithm:
use legible::is_probably_readerable;
let html = "<html><body><article>Long article content...</article></body></html>";
if is_probably_readerable(html, None) {
println!("Document appears to be readerable");
}
§Pre-parsed Document
If you want to check readability before parsing, use Document to avoid
parsing the HTML twice:
use legible::Document;
let html = r#"
<html>
<head><title>My Article</title></head>
<body>
<article>
<h1>Article Title</h1>
<p>This is the main content of the article. It contains several
paragraphs of text that make up the body of the article.</p>
<p>More content here to ensure we have enough text for the
readability algorithm to work with properly.</p>
</article>
</body>
</html>
"#;
let doc = Document::new(html);
if doc.is_probably_readerable(None) {
match doc.parse(Some("https://example.com"), None) {
Ok(article) => println!("Title: {}", article.title),
Err(e) => eprintln!("Error: {}", e),
}
}
§Configuration
Use the Options builder to customize parsing behavior:
use legible::{parse, Options};
let html = "<html><body><article>Content...</article></body></html>";
let options = Options::new()
.char_threshold(250) // Minimum article length (default: 500)
.keep_classes(true) // Preserve CSS classes in output
.disable_json_ld(true); // Skip JSON-LD metadata extraction
let article = parse(html, Some("https://example.com"), Some(options));
See Options for all available configuration options.
§Security
The extracted HTML content is unsanitized and may contain malicious scripts or
other dangerous content from the source document. Before rendering this HTML in a
browser or other context where scripts could execute, you should sanitize it using
a library like ammonia:
let article = parse(html, Some(url), None)?;
let safe_html = ammonia::clean(&article.content);
§How It Works
Legible implements the same algorithm as Readability.js:
- Document Preparation - Removes scripts, normalizes markup, fixes lazy-loaded images
- Metadata Extraction - Extracts title, byline, and other metadata from JSON-LD, OpenGraph tags, and meta elements
- Content Scoring - Scores DOM nodes based on tag type, text density, and class/id patterns
- Candidate Selection - Identifies the highest-scoring content container
- Content Cleaning - Removes low-scoring elements, empty containers, and non-content markup
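The scoring and candidate-selection steps above can be sketched in a few lines of plain Rust. This is a toy illustration only, not Legible's implementation: the `Block` type, the helper names, the keyword lists, and every weight are assumptions made up for this example, loosely modeled on the kinds of heuristics Readability.js is known for (tag type, class/id hints, and punctuation-rich text).

```rust
/// A simplified stand-in for a DOM block. Hypothetical; not a legible type.
struct Block<'a> {
    tag: &'a str,
    class_and_id: &'a str,
    text: &'a str,
}

/// Base score from the tag type (illustrative weights).
fn tag_score(tag: &str) -> i32 {
    match tag {
        "div" | "article" => 5,
        "pre" | "td" | "blockquote" => 3,
        "form" | "address" | "ol" | "ul" | "dl" | "dd" | "dt" | "li" => -3,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "th" => -5,
        _ => 0,
    }
}

/// Bonus or penalty from class/id naming patterns (assumed keyword lists).
fn class_weight(class_and_id: &str) -> i32 {
    const POSITIVE: [&str; 4] = ["article", "content", "main", "post"];
    const NEGATIVE: [&str; 5] = ["comment", "footer", "nav", "sidebar", "sponsor"];
    let name = class_and_id.to_ascii_lowercase();
    let mut weight = 0;
    if POSITIVE.iter().any(|p| name.contains(p)) {
        weight += 25;
    }
    if NEGATIVE.iter().any(|n| name.contains(n)) {
        weight -= 25;
    }
    weight
}

/// Text-based score: one base point, one per comma, and up to three
/// more points for length (one per 100 characters).
fn text_score(text: &str) -> i32 {
    let commas = text.matches(',').count() as i32;
    let length_bonus = (text.chars().count() / 100).min(3) as i32;
    1 + commas + length_bonus
}

/// Combined score for a single block.
fn score(block: &Block) -> i32 {
    tag_score(block.tag) + class_weight(block.class_and_id) + text_score(block.text)
}

/// Candidate selection: pick the highest-scoring block, if any.
fn best_candidate<'a, 'b>(blocks: &'b [Block<'a>]) -> Option<&'b Block<'a>> {
    blocks.iter().max_by_key(|b| score(b))
}

fn main() {
    let blocks = [
        Block { tag: "nav", class_and_id: "site-nav", text: "Home About Contact" },
        Block {
            tag: "div",
            class_and_id: "post-content",
            text: "This is the main content of the article, with several clauses, commas, and enough text to matter.",
        },
        Block { tag: "footer", class_and_id: "footer", text: "Copyright notice" },
    ];
    if let Some(best) = best_candidate(&blocks) {
        println!("best candidate: <{}> with class/id \"{}\"", best.tag, best.class_and_id);
    }
}
```

In this sketch the `div` wins because its tag, its class name, and its comma-rich text all push the score up, while the navigation and footer blocks are penalized by their names. A real implementation works on a parsed DOM and also propagates scores to ancestor nodes, which this example leaves out.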
§Structs
- Article - The extracted article content.
- Document - A pre-parsed HTML document.
- Options - Configuration options for the parse() function.
- ReaderableOptions - Options for the is_probably_readerable function.
§Enums
- Error - Errors that can occur during article parsing.
§Functions
- is_probably_readerable - Check if a document is probably readerable without parsing the whole thing.
- parse - Parse an HTML document and extract the article content.
§Type Aliases
- Result - Result type alias for readability operations.