readability-js

Extract clean, readable content from web pages using Mozilla's Readability.js algorithm.

This crate provides both a Rust library and a CLI tool for extracting the main content from HTML documents, removing navigation, ads, and other clutter. It uses the same algorithm that powers Firefox Reader Mode.

Installation

CLI Tool

cargo install readability-js-cli

Library

Add to your Cargo.toml:

[dependencies]
readability-js = "0.1"

Quick Start

CLI Usage

# Extract from URL
readable https://example.com/article > article.md

# Process local file
readable article.html > clean.md

# Use in pipelines
curl -s https://news.site/story | readable | less

Library Usage

use readability_js::Readability;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create parser (reuse for multiple documents)
    let reader = Readability::new()?;

    // Read the HTML and extract content; the URL is used as the document's base URL
    let html = std::fs::read_to_string("article.html")?;
    let article = reader.parse_with_url(&html, "https://example.com")?;

    println!("Title: {}", article.title);
    // byline is optional; print an empty string when no author was found
    println!("Author: {}", article.byline.unwrap_or_default());
    println!("Content: {}", article.content);

    Ok(())
}

Features

Production Algorithm: Uses Mozilla's Readability.js from Firefox
Rich Metadata: Extracts titles, authors, publication dates, and content
Multiple Formats: HTML and plain text output
CLI Tool: Converts to clean Markdown
High Performance: Reusable parser instances for batch processing
Error Recovery: Handles malformed HTML and edge cases

Why `readability-js`?

This crate uses Mozilla's actual Readability.js implementation, the same code that powers Firefox Reader Mode. Creating a Readability instance takes about 30 ms, and processing a document takes about 10 ms. That is fast enough for most applications, and the overhead is negligible compared to the accuracy benefits.
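To make the cost of instance creation concrete, here is a rough back-of-the-envelope sketch using the approximate figures quoted above (about 30 ms to create an instance, about 10 ms per document). The function name `total_ms` and its parameters are hypothetical and not part of this crate; real timings will vary by machine and document.

```rust
/// Estimated total time in milliseconds to process `docs` documents.
/// `init_ms` is the one-time cost of creating a parser and `parse_ms`
/// is the per-document cost (hypothetical helper, for illustration only).
fn total_ms(docs: u64, init_ms: u64, parse_ms: u64, reuse_parser: bool) -> u64 {
    if docs == 0 {
        // Nothing to process, so no parser needs to be created at all.
        return 0;
    }
    if reuse_parser {
        // Pay the setup cost once, then only the per-document cost.
        init_ms + docs * parse_ms
    } else {
        // Pay the setup cost again for every document.
        docs * (init_ms + parse_ms)
    }
}
```

Under these assumed figures, `total_ms(100, 30, 10, true)` comes to 1030 ms, while `total_ms(100, 30, 10, false)` comes to 4000 ms, which is why reusing one parser for batch work is worthwhile.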

Documentation

How It Works

This crate embeds Mozilla's Readability.js library using a JavaScript engine. The algorithm:

Analyzes page structure and content patterns
Identifies the main content container
Removes navigation, ads, and sidebar elements
Extracts metadata from HTML meta tags and content
Returns clean HTML suitable for reading
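The idea of identifying the main content container can be illustrated with a toy scoring heuristic. This is not the actual Readability.js logic, which is considerably more involved; it is a simplified sketch that favors long, comma-rich text and discounts link-heavy blocks. The `Candidate` type and the `score` and `pick_main` functions are hypothetical names introduced only for this example.

```rust
/// A candidate block of page content (hypothetical type, for illustration only).
struct Candidate<'a> {
    name: &'a str,
    text: &'a str,
    /// Number of characters of `text` that sit inside links.
    link_chars: usize,
}

/// Toy score: longer, comma-rich text scores higher,
/// and the result is scaled down by the share of text inside links.
fn score(c: &Candidate) -> f64 {
    let total = c.text.chars().count();
    if total == 0 {
        return 0.0;
    }
    let commas = c.text.matches(',').count() as f64;
    // Up to 3 bonus points, one per 100 characters of text.
    let length_bonus = (total / 100).min(3) as f64;
    let link_density = c.link_chars.min(total) as f64 / total as f64;
    (1.0 + commas + length_bonus) * (1.0 - link_density)
}

/// Returns the highest-scoring candidate, or `None` for an empty slice.
fn pick_main<'a, 'b>(candidates: &'b [Candidate<'a>]) -> Option<&'b Candidate<'a>> {
    candidates
        .iter()
        .max_by(|a, b| score(a).total_cmp(&score(b)))
}
```

With a navigation block whose text is entirely links and an article block with no links, `pick_main` selects the article block, because the navigation block's link density drives its score to zero.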

License

Licensed under the Universal Permissive License v1.0 (UPL-1.0)

readability-js 0.1.2