Crate legible

§Legible

A Rust port of Mozilla’s Readability.js for extracting readable content from web pages.

It extracts the main content from an HTML document and strips navigation, ads, and other non-content elements, leaving clean, readable article content.

§Quick Start

use legible::parse;

let html = r#"
    <html>
    <head><title>My Article</title></head>
    <body>
        <nav>Navigation</nav>
        <article>
            <h1>Article Title</h1>
            <p>This is the main content of the article. It contains several
            paragraphs of text that make up the body of the article.</p>
            <p>More content here to ensure we have enough text for the
            readability algorithm to work with properly.</p>
        </article>
        <footer>Footer</footer>
    </body>
    </html>
"#;

match parse(html, Some("https://example.com"), None) {
    Ok(article) => {
        println!("Title: {}", article.title);
        println!("Byline: {:?}", article.byline);
        println!("Content: {}", article.content);
        println!("Text: {}", article.text_content);
    }
    Err(e) => eprintln!("Error: {}", e),
}

The returned Article contains:

  • title - The article title
  • content - The article content as HTML
  • text_content - The article content as plain text
  • byline - The author byline
  • excerpt - A short excerpt from the article
  • site_name - The site name
  • published_time - The published time
  • dir - Text direction (ltr or rtl)
  • lang - Document language
  • length - Length of the text content

§Checking Readability

To check quickly whether a document is likely to contain extractable article content, without running the full extraction algorithm, use is_probably_readerable:

use legible::is_probably_readerable;

let html = "<html><body><article>Long article content...</article></body></html>";
if is_probably_readerable(html, None) {
    println!("Document appears to be readerable");
}
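
For intuition about why this check is cheap, here is a minimal sketch of the heuristic used by Readability.js's isProbablyReaderable, which sums a score over candidate text blocks and stops as soon as a threshold is crossed. It is an illustration only: looks_readerable is a hypothetical name and not part of legible's API. The thresholds of 140 characters and a score of 20 are the Readability.js defaults, and it is an assumption that legible uses the same values. The real check also filters nodes by tag, visibility, and class/id patterns, which the sketch omits.

```rust
/// Hypothetical helper, not part of legible's API: a simplified version of
/// the scoring loop in Readability.js's `isProbablyReaderable`.
///
/// `block_lengths` holds the trimmed text length of each candidate node
/// (for example each <p>, <pre>, or <article> element).
fn looks_readerable(block_lengths: &[usize], min_content_length: usize, min_score: f64) -> bool {
    let mut score = 0.0_f64;
    for &len in block_lengths {
        // Blocks shorter than the minimum contribute nothing.
        if len < min_content_length {
            continue;
        }
        // Longer blocks add the square root of their excess length.
        score += ((len - min_content_length) as f64).sqrt();
        // Return early once the accumulated score exceeds the threshold.
        if score > min_score {
            return true;
        }
    }
    false
}
```

With the assumed defaults, a call such as looks_readerable(&[600], 140, 20.0) succeeds on a single long block, while a page made only of short blocks never accumulates any score.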

§Pre-parsed Document

If you want to check readability before parsing, use Document to avoid parsing the HTML twice:

use legible::Document;

let html = r#"
    <html>
    <head><title>My Article</title></head>
    <body>
        <article>
            <h1>Article Title</h1>
            <p>This is the main content of the article. It contains several
            paragraphs of text that make up the body of the article.</p>
            <p>More content here to ensure we have enough text for the
            readability algorithm to work with properly.</p>
        </article>
    </body>
    </html>
"#;

let doc = Document::new(html);

if doc.is_probably_readerable(None) {
    match doc.parse(Some("https://example.com"), None) {
        Ok(article) => println!("Title: {}", article.title),
        Err(e) => eprintln!("Error: {}", e),
    }
}

§Configuration

Use the Options builder to customize parsing behavior:

use legible::{parse, Options};

let html = "<html><body><article>Content...</article></body></html>";

let options = Options::new()
    .char_threshold(250)        // Minimum article length (default: 500)
    .keep_classes(true)         // Preserve CSS classes in output
    .disable_json_ld(true);     // Skip JSON-LD metadata extraction

// `parse` returns a Result, so handle the error case before using the article.
let result = parse(html, Some("https://example.com"), Some(options));

See Options for all available configuration options.

§Security

The extracted HTML content is unsanitized and may contain malicious scripts or other dangerous content from the source document. Before rendering it in a browser or any other context where scripts can execute, sanitize it with a library such as ammonia:

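use legible::parse;
// `html` and `url` come from the caller; `?` needs an enclosing function that returns a Result.
let article = parse(html, Some(url), None)?;
let safe_html = ammonia::clean(&article.content);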

§How It Works

Legible implements the same algorithm as Readability.js:

  1. Document Preparation - Removes scripts, normalizes markup, fixes lazy-loaded images
  2. Metadata Extraction - Extracts title, byline, and other metadata from JSON-LD, OpenGraph tags, and meta elements
  3. Content Scoring - Scores DOM nodes based on tag type, text density, and class/id patterns
  4. Candidate Selection - Identifies the highest-scoring content container
  5. Content Cleaning - Removes low-scoring elements, empty containers, and non-content markup
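
Steps 3 and 4 above can be sketched with a toy scorer. This is a simplified illustration under stated assumptions, not legible's implementation: the Node type, the function names, the hint word lists, and the exact weights are all hypothetical, loosely modeled on the kinds of signals Readability.js combines (a per-tag base score, a class/id bonus or penalty, and a text score that rewards commas and length).

```rust
/// A flattened stand-in for a DOM node. Hypothetical type for illustration only.
struct Node<'a> {
    tag: &'a str,
    class_and_id: &'a str,
    text: &'a str,
}

/// Base score by tag type (illustrative weights).
fn tag_score(tag: &str) -> f64 {
    match tag {
        "div" => 5.0,
        "pre" | "td" | "blockquote" => 3.0,
        "ul" | "ol" | "li" | "form" | "address" => -3.0,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "th" => -5.0,
        _ => 0.0,
    }
}

/// Bonus or penalty from class/id naming patterns (illustrative word lists).
fn class_weight(class_and_id: &str) -> f64 {
    const POSITIVE: [&str; 4] = ["article", "content", "main", "post"];
    const NEGATIVE: [&str; 4] = ["comment", "footer", "sidebar", "promo"];
    let lowered = class_and_id.to_lowercase();
    let mut weight = 0.0;
    if NEGATIVE.iter().any(|hint| lowered.contains(hint)) {
        weight -= 25.0;
    }
    if POSITIVE.iter().any(|hint| lowered.contains(hint)) {
        weight += 25.0;
    }
    weight
}

/// Text score: one base point, one point per comma, and up to three
/// points for length (one per 100 characters).
fn text_score(text: &str) -> f64 {
    let commas = text.matches(',').count() as f64;
    let length_bonus = (text.chars().count() / 100).min(3) as f64;
    1.0 + commas + length_bonus
}

/// Combined score for one node (step 3).
fn score(node: &Node<'_>) -> f64 {
    tag_score(node.tag) + class_weight(node.class_and_id) + text_score(node.text)
}

/// Picks the highest-scoring node (step 4). Returns None for an empty slice.
fn best_candidate<'a>(nodes: &'a [Node<'a>]) -> Option<&'a Node<'a>> {
    nodes.iter().max_by(|a, b| score(a).total_cmp(&score(b)))
}
```

A real implementation works on the DOM tree and propagates scores from paragraphs up to their ancestors; the flat slice here only shows how the separate signals combine into a single ranking.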

§Structs

  • Article - The extracted article content.
  • Document - A pre-parsed HTML document.
  • Options - Configuration options for the parse() function.
  • ReaderableOptions - Options for the is_probably_readerable function.

§Enums

  • Error - Errors that can occur during article parsing.

§Functions

  • is_probably_readerable - Check if a document is probably readerable without parsing the whole thing.
  • parse - Parse an HTML document and extract the article content.

§Type Aliases

  • Result - Result type alias for readability operations.