Crate legible

§Legible

A Rust port of Mozilla’s Readability.js for extracting readable content from web pages.

This library extracts the main content from HTML documents, stripping away navigation, ads, and other non-content elements to produce clean, readable article content.

§Quick Start

use legible::parse;

let html = r#"
    <html>
    <head><title>My Article</title></head>
    <body>
        <nav>Navigation</nav>
        <article>
            <h1>Article Title</h1>
            <p>This is the main content of the article. It contains several
            paragraphs of text that make up the body of the article.</p>
            <p>More content here to ensure we have enough text for the
            readability algorithm to work with properly.</p>
        </article>
        <footer>Footer</footer>
    </body>
    </html>
"#;

match parse(html, Some("https://example.com"), None) {
    Ok(article) => {
        println!("Title: {}", article.title);
        println!("Byline: {:?}", article.byline);
        println!("Content: {}", article.content);
        println!("Text: {}", article.text_content);
    }
    Err(e) => eprintln!("Error: {}", e),
}

The returned Article contains:

title - The article title
content - The article content as HTML
text_content - The article content as plain text
byline - The author byline
excerpt - A short excerpt from the article
site_name - The site name
published_time - The published time
dir - Text direction (ltr or rtl)
lang - Document language
length - Length of the text content

§Checking Readability

You can quickly check if a document is likely to be parseable without running the full algorithm:

use legible::is_probably_readerable;

let html = "<html><body><article>Long article content...</article></body></html>";
if is_probably_readerable(html, None) {
    println!("Document appears to be readerable");
}
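To give a feel for why such a check can be cheap, here is a minimal, std-only sketch of a length-based heuristic in the spirit of the upstream Readability.js check. It is not legible's implementation: the function `looks_readerable`, the thresholds `MIN_CONTENT_LENGTH` and `MIN_SCORE`, and the idea of passing pre-extracted text blocks are all assumptions made for illustration.

```rust
/// Hypothetical thresholds, assumed for illustration only.
const MIN_CONTENT_LENGTH: usize = 140;
const MIN_SCORE: f64 = 20.0;

/// Returns true once the accumulated score of candidate text blocks
/// (for example the text of `<p>`, `<pre>`, or `<article>` nodes) exceeds MIN_SCORE.
fn looks_readerable(block_texts: &[&str]) -> bool {
    let mut score = 0.0;
    for text in block_texts {
        let len = text.trim().chars().count();
        if len < MIN_CONTENT_LENGTH {
            continue; // too short to count as article text
        }
        // Longer blocks contribute more, with diminishing returns.
        score += ((len - MIN_CONTENT_LENGTH) as f64).sqrt();
        if score > MIN_SCORE {
            return true; // stop early, no need to inspect the remaining blocks
        }
    }
    false
}

fn main() {
    let menu = ["Home", "About us", "Contact"];
    let paragraph = "word ".repeat(200);
    let article = [paragraph.as_str()];

    println!("menu readerable: {}", looks_readerable(&menu));
    println!("article readerable: {}", looks_readerable(&article));
}
```

The early return is what keeps a check like this inexpensive: it only needs enough long text blocks to cross the threshold, not a full scoring pass over the document.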

§Pre-parsed Document

If you want to check readability before parsing, use Document to avoid parsing the HTML twice:

use legible::Document;

let html = r#"
    <html>
    <head><title>My Article</title></head>
    <body>
        <article>
            <h1>Article Title</h1>
            <p>This is the main content of the article. It contains several
            paragraphs of text that make up the body of the article.</p>
            <p>More content here to ensure we have enough text for the
            readability algorithm to work with properly.</p>
        </article>
    </body>
    </html>
"#;

let doc = Document::new(html);

if doc.is_probably_readerable(None) {
    match doc.parse(Some("https://example.com"), None) {
        Ok(article) => println!("Title: {}", article.title),
        Err(e) => eprintln!("Error: {}", e),
    }
}

§Configuration

Use the Options builder to customize parsing behavior:

use legible::{parse, Options};

let html = "<html><body><article>Content...</article></body></html>";

let options = Options::new()
    .char_threshold(250)        // Minimum article length (default: 500)
    .keep_classes(true)         // Preserve CSS classes in output
    .disable_json_ld(true);     // Skip JSON-LD metadata extraction

// `parse` returns a Result; handle Ok/Err as in Quick Start.
let result = parse(html, Some("https://example.com"), Some(options));

See Options for all available configuration options.
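For readers unfamiliar with the builder style used above, the following std-only sketch shows one way such a chainable type can be written. `SketchOptions` is a hypothetical stand-in, not legible's `Options`: only the three method names and the default of 500 for `char_threshold` come from the example above, while the field types and the `false` defaults are assumptions.

```rust
/// Hypothetical stand-in for a builder-style options type; not legible's `Options`.
#[derive(Debug, Clone, PartialEq)]
struct SketchOptions {
    char_threshold: usize,
    keep_classes: bool,
    disable_json_ld: bool,
}

impl SketchOptions {
    /// Start from defaults (500 is taken from the example; the booleans are assumed).
    fn new() -> Self {
        Self {
            char_threshold: 500,
            keep_classes: false,
            disable_json_ld: false,
        }
    }

    /// Each setter takes `self` by value and returns it, which allows chaining.
    fn char_threshold(mut self, value: usize) -> Self {
        self.char_threshold = value;
        self
    }

    fn keep_classes(mut self, value: bool) -> Self {
        self.keep_classes = value;
        self
    }

    fn disable_json_ld(mut self, value: bool) -> Self {
        self.disable_json_ld = value;
        self
    }
}

fn main() {
    // Only the settings that differ from the defaults need to be named.
    let options = SketchOptions::new().char_threshold(250).keep_classes(true);
    println!("{:?}", options);
}
```

Because every setter consumes and returns the value, unset fields keep their defaults and the final expression can be passed straight to a function that expects the options.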

§Security

The extracted HTML content is unsanitized and may contain malicious scripts or other dangerous content from the source document. Before rendering this HTML in a browser or other context where scripts could execute, you should sanitize it using a library like ammonia:

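use legible::parse;
// `html` and `url` come from the caller; `?` assumes an enclosing function that returns a compatible Result.
let article = parse(html, Some(url), None)?;
let safe_html = ammonia::clean(&article.content);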

§How It Works

Legible implements the same algorithm as Readability.js:

Document Preparation - Removes scripts, normalizes markup, fixes lazy-loaded images
Metadata Extraction - Extracts title, byline, and other metadata from JSON-LD, OpenGraph tags, and meta elements
Content Scoring - Scores DOM nodes based on tag type, text density, and class/id patterns
Candidate Selection - Identifies the highest-scoring content container
Content Cleaning - Removes low-scoring elements, empty containers, and non-content markup
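The scoring and candidate-selection steps above can be illustrated with a deliberately simplified, std-only sketch. It is not legible's code: the `Node` struct, the helper functions, the word lists, and every numeric weight are assumptions chosen for illustration, loosely following the three signals named above (tag type, text density, class/id patterns).

```rust
/// Hypothetical, simplified view of a DOM node; not a type from legible.
struct Node {
    tag: &'static str,
    class_and_id: &'static str,
    text: &'static str,
}

/// Base score from the tag type (illustrative weights).
fn tag_score(tag: &str) -> f64 {
    match tag {
        "div" => 5.0,
        "pre" | "td" | "blockquote" => 3.0,
        "ul" | "ol" | "li" | "form" => -3.0,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "th" => -5.0,
        _ => 0.0,
    }
}

/// Bonus or penalty from class/id naming patterns (illustrative word lists).
fn class_weight(class_and_id: &str) -> f64 {
    const POSITIVE: [&str; 5] = ["article", "content", "main", "post", "story"];
    const NEGATIVE: [&str; 5] = ["comment", "footer", "nav", "sidebar", "sponsor"];
    let lower = class_and_id.to_lowercase();
    let mut weight = 0.0;
    if POSITIVE.iter().any(|p| lower.contains(p)) {
        weight += 25.0;
    }
    if NEGATIVE.iter().any(|n| lower.contains(n)) {
        weight -= 25.0;
    }
    weight
}

/// Text signal: one base point, one point per comma, up to three points for length.
fn text_score(text: &str) -> f64 {
    let commas = text.matches(',').count() as f64;
    let length_bonus = (text.chars().count() / 100).min(3) as f64;
    1.0 + commas + length_bonus
}

/// Combined score for a single node.
fn score(node: &Node) -> f64 {
    tag_score(node.tag) + class_weight(node.class_and_id) + text_score(node.text)
}

/// Candidate selection: the highest-scoring node, or None for an empty slice.
fn best_candidate(nodes: &[Node]) -> Option<&Node> {
    nodes.iter().max_by(|a, b| score(a).total_cmp(&score(b)))
}

fn main() {
    let nodes = [
        Node { tag: "nav", class_and_id: "nav", text: "Home, About, Contact" },
        Node {
            tag: "div",
            class_and_id: "post-content",
            text: "This is the main content of the article, with several clauses, and more text.",
        },
        Node { tag: "div", class_and_id: "footer", text: "Footer" },
    ];
    if let Some(best) = best_candidate(&nodes) {
        println!("best candidate: <{}> class/id \"{}\"", best.tag, best.class_and_id);
    }
}
```

A real implementation also propagates scores to ancestor nodes and adjusts for link density; this sketch omits both to keep the three signals visible.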

Structs§

Article: The extracted article content.
Document: A pre-parsed HTML document.
Options: Configuration options for the parse() function.
ReaderableOptions: Options for the is_probably_readerable function.

Enums§

Error: Errors that can occur during article parsing.

Functions§

is_probably_readerable: Check if a document is probably readerable without parsing the whole thing.
parse: Parse an HTML document and extract the article content.

Type Aliases§

Result: Result type alias for readability operations.
