Crate readability_js

Expand description

Extract clean, readable content from web pages using Mozilla’s Readability.js algorithm.

This crate provides both a Rust library and CLI tool for extracting the main content from HTML documents, removing navigation, ads, and other clutter. It uses the same algorithm as Firefox Reader Mode.

For background on why this crate exists and design decisions, see the blog post.

§Algorithm

This crate embeds Mozilla’s Readability.js library using a JavaScript engine. It uses the same algorithm that processes articles in Firefox Reader Mode, providing high accuracy on modern web content including single-page applications and complex layouts.

Unlike pure Rust port crates available, this approach sacrifices some performance for extraction accuracy and ongoing improvements from Mozilla’s team.

§Quick Start

use readability_js::Readability;

let html = r#"<html><body><h1>Article Title</h1><p>Main content...</p></body></html>"#;
let reader = Readability::new()?;
let article = reader.parse(&html)?;

println!("Title: {}", article.title);
println!("Content: {}", article.content);

§Parsing with URL Context

Providing a URL improves link resolution and metadata extraction:

use readability_js::Readability;

let reader = Readability::new()?;
let article = reader.parse_with_url(&html, "https://example.com/article")?;

§Custom Options

Configure the parsing behavior with ReadabilityOptions:

use readability_js::{Readability, ReadabilityOptions};

let options = ReadabilityOptions::new()
    .char_threshold(500)
    .keep_classes(true);

let reader = Readability::new()?;
let article = reader.parse_with_options(&html, Some("https://example.com"), Some(options))?;

§Performance Considerations

Creating a Readability instance is expensive (~30ms) as it initializes a JavaScript engine. Once created, parsing individual documents is fast (~10ms). Reuse the same instance when processing multiple documents:

use readability_js::Readability;

let reader = Readability::new()?;
for html in documents {
    let article = reader.parse(&html)?;
    process_article(article);
}

§Error Handling

The most common error is ReadabilityError::ReadabilityCheckFailed, which occurs when the algorithm cannot extract sufficient readable content:

use readability_js::{Readability, ReadabilityError, ReadabilityOptions};

let reader = Readability::new()?;
match reader.parse(&html) {
    Ok(article) => println!("Extracted: {}", article.title),
    Err(ReadabilityError::ReadabilityCheckFailed) => {
        // Try with lower threshold
        let options = ReadabilityOptions::new().char_threshold(100);
        let article = reader.parse_with_options(&html, None, Some(options))?;
        println!("Extracted with relaxed settings: {}", article.title);
    }
    Err(e) => return Err(e),
}

§CLI Usage

The CLI tool extracts content and converts it to clean Markdown:

# Install the CLI tool
cargo install readability-js-cli

# Process local files
readable article.html > article.md

# Fetch and process URLs
readable https://example.com/news > news.md

# Process from stdin (great for pipelines)
curl -s https://site.com/article | readable > clean.md

# View directly in terminal
readable https://news.site/story | less

The CLI automatically:

Detects whether input is a file path or URL
Fetches web content with proper headers
Converts the clean HTML to Markdown
Handles errors gracefully

§Troubleshooting

§“Content failed readability check”

This happens when the page doesn’t contain enough readable content or the algorithm can’t distinguish content from navigation. Try:

use readability_js::{Readability, ReadabilityOptions};

let options = ReadabilityOptions::new()
    .char_threshold(100)         // Lower threshold (default: ~140)
    .nb_top_candidates(10)       // Consider more candidates
    .link_density_modifier(2.0); // More permissive with links

let reader = Readability::new()?;
let article = reader.parse_with_options(&html, None, Some(options))?;

§Poor extraction quality

If the extracted content is incomplete or includes unwanted elements:

use readability_js::{Readability, ReadabilityOptions};

// Better link resolution and metadata extraction
let reader = Readability::new()?;
let article = reader.parse_with_url(&html, "https://example.com/article")?;

// Or preserve important CSS classes
let options = ReadabilityOptions::new()
    .keep_classes(true)
    .classes_to_preserve(vec!["highlight".into(), "code".into(), "caption".into()]);
let article = reader.parse_with_options(&html, None, Some(options))?;

§Memory or performance issues

For very large documents or resource-constrained environments:

use readability_js::{Readability, ReadabilityOptions};

let options = ReadabilityOptions::new()
    .max_elems_to_parse(1000)   // Limit processing
    .nb_top_candidates(3);      // Fewer candidates = faster

let reader = Readability::new()?;
let article = reader.parse_with_options(&html, None, Some(options))?;

Structs§

Article: Parsed article content and metadata extracted by Readability.
Readability: The main readability parser that extracts clean content from HTML.
ReadabilityOptions: Configuration options for content extraction.

Enums§

Direction
ReadabilityError: Errors that can occur during content extraction.