Crate readability_rust

§Readability

A Rust port of Mozilla’s Readability.js library for extracting readable content from web pages.

This library provides functionality to parse HTML documents and extract the main article content, removing navigation, ads, and other clutter to present clean, readable text.
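One signal that Readability-style extractors use to tell navigation from article text is link density: the share of an element's text that sits inside links. The following is a minimal, std-only sketch of that idea, not this crate's implementation. The `Fragment` type and the `link_density_sketch` function are hypothetical names introduced for illustration, and the crate's own `get_link_density` operates on parsed elements rather than on a flattened list like this one.

```rust
/// Hypothetical flattened view of an element's text: each fragment records
/// its text and whether it sits inside an `<a>` element.
struct Fragment<'a> {
    text: &'a str,
    in_link: bool,
}

/// Ratio of linked characters to all characters, in the range 0.0..=1.0.
/// Returns 0.0 for an element with no text at all.
fn link_density_sketch(fragments: &[Fragment<'_>]) -> f64 {
    let total: usize = fragments.iter().map(|f| f.text.chars().count()).sum();
    if total == 0 {
        return 0.0;
    }
    let linked: usize = fragments
        .iter()
        .filter(|f| f.in_link)
        .map(|f| f.text.chars().count())
        .sum();
    linked as f64 / total as f64
}
```

Under this sketch a navigation bar made almost entirely of links scores close to 1.0, while a paragraph of plain prose scores 0.0, which is why a high value is a reasonable hint that a block is clutter.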

§Example

use readability_rust::Readability;

let html = r#"
    <html>
    <body>
        <article>
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </article>
    </body>
    </html>
"#;

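// No custom options are passed (`None`); `unwrap` panics if the parser cannot be created.
let mut parser = Readability::new(html, None).unwrap();
// `parse` returns an `Option`, so nothing is printed when no article is extracted.
if let Some(article) = parser.parse() {
    println!("Title: {:?}", article.title);
    println!("Content: {:?}", article.content);
}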

Structs§

Article: Represents an extracted article
Readability: The main Readability parser
ReadabilityFlags: Feature flags for controlling readability behavior
ReadabilityOptions: Configuration options for the Readability parser

Enums§

ReadabilityError: Errors that can occur during readability parsing

Functions§

clean_text: Remove extra whitespace and normalize text
contains_ad_words: Check if a string contains ad-related words
contains_loading_words: Check if a string contains loading words
count_commas: Count commas in text
extract_text_content: Extract text content and handle encoding
get_char_count: Get the character count of text
get_inner_text: Get the inner text content of an element
get_link_density: Get link density for an element
get_node_ancestors: Get node ancestors up to maxDepth
has_ancestor_tag: Check if element has ancestor with specific tag
has_child_block_element: Check if an element has child block elements
has_content: Check if text has content (non-whitespace)
has_negative_indicators: Check if a string has negative content indicators
has_positive_indicators: Check if a string has positive content indicators
has_single_tag_inside_element: Check if an element has a single tag inside
is_b64_data_url: Check if a URL is a base64 data URL
is_byline: Check if a string contains byline indicators
is_element_without_content: Check if an element is without content
is_extraneous_content: Check if a string matches extraneous content patterns
is_hash_url: Check if a URL is a hash URL
is_json_ld_article_type: Check if text matches JSON-LD article types
is_next_link: Check if a string is a next link
is_node_visible: Check if an element is probably visible
is_phrasing_content: Check if an element is phrasing content
is_prev_link: Check if a string is a previous link
is_probably_readerable: Check if a document is likely to be readable/parseable
is_share_element: Check if a string matches share element patterns
is_single_image: Check if an element is a single image
is_title_candidate: Check if text looks like a title
is_unlikely_candidate: Check if a string matches the unlikely candidates pattern
is_url: Check if a string is a valid URL
is_video_url: Check if a URL is a video URL
is_whitespace: Check if text is only whitespace
normalize_whitespace: Normalize whitespace in text
replace_font_tags: Replace font tags in HTML
should_clean_attribute: Clean attributes from an element (conceptual; the actual implementation would modify the DOM)
to_absolute_uri: Convert relative URLs to absolute URLs
tokenize_text: Tokenize text
unescape_html_entities: Unescape HTML entities
word_count: Count the words in text
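Several of the helpers listed above are plain string checks. The sketch below shows how a few of them could be written with the standard library alone. It is an illustration under stated assumptions, not the crate's code: the `_sketch` function names and their signatures are hypothetical, and the crate's real functions may take different arguments or apply different rules.

```rust
/// Collapse every run of whitespace into one space and trim both ends.
fn normalize_whitespace_sketch(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Count whitespace-separated words.
fn word_count_sketch(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Count ASCII commas in the text.
fn count_commas_sketch(text: &str) -> usize {
    text.matches(',').count()
}

/// True when the URL is only a fragment reference such as "#top".
fn is_hash_url_sketch(url: &str) -> bool {
    url.trim_start().starts_with('#')
}

/// True for URLs of the form "data:<media type>;base64,<payload>".
fn is_b64_data_url_sketch(url: &str) -> bool {
    let lower = url.trim_start().to_ascii_lowercase();
    match lower.strip_prefix("data:") {
        // The metadata before the first comma must end with ";base64".
        Some(rest) => match rest.split_once(',') {
            Some((meta, _payload)) => meta.ends_with(";base64"),
            None => false,
        },
        None => false,
    }
}
```

Each function takes a `&str` and has no side effects, so the checks can be combined freely, for example by normalizing whitespace before counting words or commas.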
