§Readability
A Rust port of Mozilla’s Readability.js library for extracting readable content from web pages.
It parses HTML documents and extracts the main article content, stripping navigation, ads, and other clutter to leave clean, readable text.
§Example
use readability_rust::{Readability, ReadabilityOptions};

let html = r#"
    <html>
    <body>
        <article>
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </article>
    </body>
    </html>
"#;

// Build a parser for the document; `unwrap` panics if construction fails.
let mut parser = Readability::new(html, None).unwrap();

// `parse` returns `Some(article)` when extraction yields an article.
if let Some(article) = parser.parse() {
    println!("Title: {:?}", article.title);
    println!("Content: {:?}", article.content);
}

Structs§
- Article - Represents an extracted article
- Readability - The main Readability parser
- ReadabilityFlags - Feature flags for controlling readability behavior
- ReadabilityOptions - Configuration options for the Readability parser
Enums§
- ReadabilityError - Errors that can occur during readability parsing
Functions§
- clean_text - Remove extra whitespace and normalize text
- contains_ad_words - Check if a string contains ad-related words
- contains_loading_words - Check if a string contains loading words
- count_commas - Count commas in text
- extract_text_content - Extract text content and handle encoding
- get_char_count - Get the character count of text
- get_inner_text - Get the inner text content of an element
- get_link_density - Get link density for an element
- get_node_ancestors - Get node ancestors up to maxDepth
- has_ancestor_tag - Check if element has ancestor with specific tag
- has_child_block_element - Check if an element has child block elements
- has_content - Check if text has content (non-whitespace)
- has_negative_indicators - Check if a string has negative content indicators
- has_positive_indicators - Check if a string has positive content indicators
- has_single_tag_inside_element - Check if an element has a single tag inside
- is_b64_data_url - Check if a URL is a base64 data URL
- is_byline - Check if a string contains byline indicators
- is_element_without_content - Check if an element is without content
- is_extraneous_content - Check if a string matches extraneous content patterns
- is_hash_url - Check if a URL is a hash URL
- is_json_ld_article_type - Check if text matches JSON-LD article types
- is_next_link - Check if a string is a next link
- is_node_visible - Check if an element is probably visible
- is_phrasing_content - Check if an element is phrasing content
- is_prev_link - Check if a string is a previous link
- is_probably_readerable - Check if a document is likely to be readable/parseable
- is_share_element - Check if a string matches share element patterns
- is_single_image - Check if an element is a single image
- is_title_candidate - Check if text looks like a title
- is_unlikely_candidate - Check if a string matches the unlikely candidates pattern
- is_url - Check if a string is a valid URL
- is_video_url - Check if a URL is a video URL
- is_whitespace - Check if text is only whitespace
- normalize_whitespace - Normalize whitespace in text
- replace_font_tags - Replace font tags in HTML
- should_clean_attribute - Clean attributes from an element (conceptual; an actual implementation would modify the DOM)
- to_absolute_uri - Convert relative URLs to absolute URLs
- tokenize_text - Tokenize text
- unescape_html_entities - Unescape HTML entities
- word_count - Word count for text