Expand description
§ReadabilityRS
A Rust port of Mozilla’s Readability library for extracting article content from web pages.
This library is a faithful port of the Mozilla Readability JavaScript library, used in Firefox Reader View.
§Overview
ReadabilityRS provides intelligent extraction of main article content from HTML documents, removing clutter such as advertisements, navigation elements, and other non-essential content. It also extracts metadata like article title, author (byline), publish date, and more.
§Key Features
- Content Extraction: Intelligently identifies and extracts main article content
- Metadata Extraction: Extracts title, author, description, site name, language, and publish date
- JSON-LD Support: Parses structured data from JSON-LD markup
- Multiple Retry Strategies: Uses adaptive algorithms to handle various page layouts
- Customizable Options: Configure thresholds, scoring, and behavior
- Pre-flight Check: Quick check to determine if a page is likely readable
§Basic Usage
use readabilityrs::{Readability, ReadabilityOptions};
let html = r#"<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>"#;
let url = "https://example.com/article";
let options = ReadabilityOptions::default();
let readability = Readability::new(html, Some(url), Some(options)).unwrap();
if let Some(article) = readability.parse() {
println!("Title: {:?}", article.title);
println!("Content: {:?}", article.content);
println!("Author: {:?}", article.byline);
}§Advanced Usage
§Custom Options
use readabilityrs::{Readability, ReadabilityOptions};
let html = "<html>...</html>";
let options = ReadabilityOptions::builder()
.char_threshold(300)
.nb_top_candidates(10)
.keep_classes(true)
.build();
let readability = Readability::new(html, None, Some(options)).unwrap();
let article = readability.parse();§Pre-flight Check
Use is_probably_readerable to quickly check if a document is likely to be parseable
before doing the full parse:
use readabilityrs::is_probably_readerable;
let html = "<html>...</html>";
if is_probably_readerable(html, None) {
// Proceed with full parsing
} else {
// Skip parsing or use alternative strategy
}§Error Handling
use readabilityrs::{Readability, ReadabilityError};
let html = "<html>...</html>";
let url = "not a valid url";
match Readability::new(html, Some(url), None) {
Ok(readability) => {
if let Some(article) = readability.parse() {
println!("Success!");
}
}
Err(ReadabilityError::InvalidUrl(url)) => {
eprintln!("Invalid URL: {}", url);
}
Err(e) => {
eprintln!("Error: {}", e);
}
}§Algorithm
The extraction algorithm works in several phases. First, scripts and styles are removed to prepare the document. Then potential content containers are identified throughout the page. These candidates are scored based on various content signals like paragraph count, text length, and link density. The best candidate is selected using adaptive strategies with multiple fallback approaches. Nearby high-quality content is aggregated by examining sibling elements. Finally, the extracted content goes through post-processing to clean and finalize the output.
§Compatibility
This implementation strives to match the behavior of Mozilla’s Readability.js as closely as possible while leveraging Rust’s type system and safety guarantees.
Structs§
- Article
- Represents a successfully parsed article with extracted content and metadata.
- Readability
- The main Readability parser.
- Readability
Options - Configuration options for the Readability parser.
- Readerable
Options - Options for the readability pre-flight check.
Enums§
- Readability
Error - Errors that can occur during readability parsing.
Functions§
- is_
probably_ readerable - Quick check to determine if a document is likely to be readerable.
Type Aliases§
- Result
- Result type alias for readability operations.