Crate readability_rust


§Readability

A Rust port of Mozilla’s Readability.js library for extracting readable content from web pages.

This library provides functionality to parse HTML documents and extract the main article content, removing navigation, ads, and other clutter to present clean, readable text.

§Example

use readability_rust::Readability;

let html = r#"
    <html>
    <body>
        <article>
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </article>
    </body>
    </html>
"#;

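// Build a parser for the document; `None` supplies no custom options.
let mut parser = Readability::new(html, None).expect("failed to create parser");
// `parse` yields `Some(article)` only when an article could be extracted.
if let Some(article) = parser.parse() {
    println!("Title: {:?}", article.title);
    println!("Content: {:?}", article.content);
}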

Structs§

Article
Represents an extracted article
Readability
The main Readability parser
ReadabilityFlags
Feature flags for controlling readability behavior
ReadabilityOptions
Configuration options for the Readability parser

Enums§

ReadabilityError
Errors that can occur during readability parsing

Functions§

clean_text
Remove extra whitespace and normalize text
contains_ad_words
Check if a string contains ad-related words
contains_loading_words
Check if a string contains loading words
count_commas
Count commas in text
extract_text_content
Extract text content and handle encoding
get_char_count
Get the character count of text
get_inner_text
Get the inner text content of an element
get_link_density
Get link density for an element
get_node_ancestors
Get a node's ancestors, up to a maximum depth
has_ancestor_tag
Check if element has ancestor with specific tag
has_child_block_element
Check if an element has child block elements
has_content
Check if text has content (non-whitespace)
has_negative_indicators
Check if a string has negative content indicators
has_positive_indicators
Check if a string has positive content indicators
has_single_tag_inside_element
Check if an element has a single tag inside
is_b64_data_url
Check if a URL is a base64 data URL
is_byline
Check if a string contains byline indicators
is_element_without_content
Check if an element is without content
is_extraneous_content
Check if a string matches extraneous content patterns
is_hash_url
Check if a URL is a hash URL
is_json_ld_article_type
Check if text matches JSON-LD article types
is_next_link
Check if a string is a next link
is_node_visible
Check if an element is probably visible
is_phrasing_content
Check if an element is phrasing content
is_prev_link
Check if a string is a previous link
is_probably_readerable
Check if a document is likely to be readable/parseable
is_share_element
Check if a string matches share element patterns
is_single_image
Check if an element is a single image
is_title_candidate
Check if text looks like a title
is_unlikely_candidate
Check if a string matches the unlikely candidates pattern
is_url
Check if a string is a valid URL
is_video_url
Check if a URL is a video URL
is_whitespace
Check if text is only whitespace
normalize_whitespace
Normalize whitespace in text
replace_font_tags
Replace font tags in HTML
should_clean_attribute
Clean attributes from an element (conceptual; an actual implementation would modify the DOM)
to_absolute_uri
Convert relative URLs to absolute URLs
tokenize_text
Tokenize text
unescape_html_entities
Unescape HTML entities
word_count
Count the words in text
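
To make a few of the text-level helpers listed above more concrete, here is a minimal std-only sketch of whitespace normalization, word and comma counting, link density, and hash-URL detection. The function names and signatures (`normalize_ws`, `count_words`, `count_commas_in`, `link_density`, `looks_like_hash_url`) are hypothetical and chosen for illustration; they are not the crate's API, and the crate's own functions may take different arguments (for example DOM elements rather than strings) and behave differently.

```rust
// Illustrative sketches only. Every name and signature here is hypothetical
// and is NOT taken from the readability_rust API.

/// Collapse every run of whitespace into a single space and trim both ends.
fn normalize_ws(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Count whitespace-separated words.
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Count ASCII commas in the text.
fn count_commas_in(text: &str) -> usize {
    text.matches(',').count()
}

/// Ratio of link text length to total text length, measured in characters.
/// Returns 0.0 for empty text so the division is never by zero.
fn link_density(total_text: &str, link_texts: &[&str]) -> f64 {
    let total = total_text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let linked: usize = link_texts.iter().map(|t| t.chars().count()).sum();
    linked as f64 / total as f64
}

/// True when the URL is only a fragment reference such as "#section".
fn looks_like_hash_url(url: &str) -> bool {
    url.starts_with('#')
}
```

In this sketch, `link_density("read more here", &["more"])` is 4 linked characters out of 14. A value close to 1.0 would suggest a block that is mostly links, such as a navigation menu, rather than article text.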