readability-js
Extract clean, readable content from web pages using Mozilla's Readability.js algorithm.
This crate provides both a Rust library and a CLI tool for extracting the main content from HTML documents, removing navigation, ads, and other clutter. It uses the same algorithm that powers Firefox Reader Mode.
Installation
CLI Tool
Library
Add to your Cargo.toml:
[dependencies]
readability-js = "0.1"
Quick Start
CLI Usage
# Extract from URL
# Process local file
# Use in pipelines
| |
Library Usage
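use readability_js::Readability;
// Create parser (reuse for multiple documents)
let reader = Readability::new()?;
// Extract content (the file path and URL below are placeholders)
let html = std::fs::read_to_string("article.html")?;
let article = reader.parse_with_url(&html, "https://example.com/article")?;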
println!;
println!;
println!;
Features
- Production Algorithm: Uses Mozilla's Readability.js from Firefox
- Rich Metadata: Extracts titles, authors, publication dates, and content
- Multiple Formats: HTML and plain text output
- CLI Tool: Converts to clean Markdown
- High Performance: Reusable parser instances for batch processing
- Error Recovery: Handles malformed HTML and edge cases
Why readability-js?
This crate runs Mozilla's actual Readability.js implementation, the same code that powers Firefox Reader Mode. Creating a Readability instance takes ~30ms and processing a document takes ~10ms. That is fast enough for most applications, and the overhead is negligible compared to the accuracy benefits.
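To make the timing figures concrete, here is a back-of-the-envelope sketch of what reusing one instance saves in a batch job. It only does arithmetic on the approximate ~30ms and ~10ms figures quoted above; the function names `reused_ms` and `fresh_ms` are made up for this illustration and are not part of the crate.

```rust
/// Approximate timings quoted above, in milliseconds.
const INIT_MS: u64 = 30; // creating one Readability instance
const PARSE_MS: u64 = 10; // processing one document

/// Estimated total when a single instance is created once and reused.
fn reused_ms(docs: u64) -> u64 {
    INIT_MS + PARSE_MS * docs
}

/// Estimated total when a fresh instance is created for every document.
fn fresh_ms(docs: u64) -> u64 {
    (INIT_MS + PARSE_MS) * docs
}

fn main() {
    for docs in [1u64, 10, 100] {
        println!(
            "{} docs: reused ~{} ms, fresh ~{} ms",
            docs,
            reused_ms(docs),
            fresh_ms(docs)
        );
    }
}
```

Under these assumed figures, the one-time setup cost stops mattering after a handful of documents, which is why reusing the parser is the recommended pattern for batch processing.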
Documentation
How It Works
This crate embeds Mozilla's Readability.js library and runs it in a JavaScript engine. The algorithm:
- Analyzes page structure and content patterns
- Identifies the main content container
- Removes navigation, ads, and sidebar elements
- Extracts metadata from HTML meta tags and content
- Returns clean HTML suitable for reading
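The "identify the main content container" step can be pictured with a deliberately simplified sketch. This is not the crate's code and not Readability.js itself, which uses many more signals. It is a toy scorer that assumes each candidate element has already been reduced to two numbers, its text length and the length of its link text, and prefers elements with a lot of text and few links. The `Candidate` type, its fields, and the sample values are all hypothetical.

```rust
/// A candidate container, reduced to the two numbers this sketch needs.
struct Candidate {
    name: &'static str,
    text_len: usize,      // characters of text inside the element
    link_text_len: usize, // characters of that text that sit inside links
}

/// Share of the text that is link text (0.0 for an empty element).
fn link_density(c: &Candidate) -> f64 {
    if c.text_len == 0 {
        0.0
    } else {
        c.link_text_len as f64 / c.text_len as f64
    }
}

/// Toy score: text volume discounted by link density.
fn score(c: &Candidate) -> f64 {
    c.text_len as f64 * (1.0 - link_density(c))
}

/// Pick the highest-scoring candidate, if there is one.
fn main_content(cands: &[Candidate]) -> Option<&Candidate> {
    cands.iter().max_by(|a, b| score(a).total_cmp(&score(b)))
}

/// Made-up numbers for a page with a nav bar, an article body, and a sidebar.
fn sample_page() -> Vec<Candidate> {
    vec![
        Candidate { name: "nav", text_len: 400, link_text_len: 380 },
        Candidate { name: "article", text_len: 3000, link_text_len: 150 },
        Candidate { name: "sidebar", text_len: 600, link_text_len: 300 },
    ]
}

fn main() {
    let page = sample_page();
    if let Some(best) = main_content(&page) {
        println!("main content: {}", best.name);
    }
}
```

Link-heavy blocks such as navigation score low even when they contain a fair amount of text, while the long, mostly link-free article body wins.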
License
Licensed under the Universal Permissive License v1.0 (UPL-1.0)