readability-rust
A Rust port of Mozilla's Readability.js library for extracting readable content from web pages.
It parses HTML documents and extracts the main article content, stripping navigation, ads, and other clutter to leave clean, readable text.
Features
- Content Extraction: Identifies and extracts the main article content from web pages
- Metadata Parsing: Extracts titles, authors, publication dates, and other metadata
- Content Scoring: Uses Mozilla's proven algorithms to score and rank content elements
- Readability Assessment: Determines if a page is likely to contain readable content
- CLI Tool: Command-line interface for processing HTML files and URLs
- Multiple Output Formats: JSON, plain text, and cleaned HTML output
- Unicode Support: Handles international text, emojis, and special characters
- Error Handling: Graceful handling of malformed HTML and edge cases
Installation
Add this to your Cargo.toml:
[dependencies]
readability-rust = "0.1.0"
Library Usage
Basic Article Extraction
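use readability_rust::Readability;

let html = r#"
<!DOCTYPE html>
<html>
<head>
    <title>Sample Article</title>
    <meta name="author" content="John Doe">
</head>
<body>
    <article>
        <h1>Article Title</h1>
        <p>This is the main content of the article...</p>
        <p>More substantial content here...</p>
    </article>
    <aside>Sidebar content to be removed</aside>
</body>
</html>
"#;

// Assumed call shape: the HTML string plus optional ReadabilityOptions (None for defaults).
let mut parser = Readability::new(html, None).unwrap();
if let Some(article) = parser.parse() {
    // Field names are those listed under Article in the API Reference.
    println!("Title: {:?}", article.title);
    println!("Author: {:?}", article.byline);
    println!("Content: {:?}", article.content);
}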
Custom Configuration
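use readability_rust::{Readability, ReadabilityOptions};
// Illustrative values; assumes ReadabilityOptions implements Default.
let options = ReadabilityOptions { char_threshold: 250, keep_classes: true, ..Default::default() };
let mut parser = Readability::new(html, Some(options)).unwrap();
let article = parser.parse();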
Readability Assessment
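use readability_rust::is_probably_readerable;
let html = "<html><body><p>Short content</p></body></html>";
if is_probably_readerable(html, None) {
    println!("Content is likely readable");
} else {
    println!("Content may not be suitable for reader mode");
}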
CLI Usage
The library includes a command-line tool for processing HTML files:
Installation
Basic Usage
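# Process a local HTML file (article.html is a placeholder name)
readability-rust -i article.html
# Process from stdin
cat article.html | readability-rust -i -
# Output as JSON
readability-rust -i article.html -f json
# Output as plain text
readability-rust -i article.html -f text
# Check if content is readable
readability-rust -i article.html --check
# Debug mode with verbose output
readability-rust -i article.html --debug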
CLI Options
Usage: readability-rust [OPTIONS]

Options:
  -i, --input <FILE>        Input HTML file (use '-' for stdin)
  -o, --output <FILE>       Output file (default: stdout)
  -f, --format <FORMAT>     Output format [default: json] [possible values: json, text, html]
      --base-uri <URI>      Base URI for resolving relative URLs
      --debug               Enable debug output
      --check               Only check if content is readable
      --char-threshold <N>  Minimum character threshold [default: 500]
      --keep-classes        Keep CSS classes in output
      --disable-json-ld     Disable JSON-LD parsing
  -h, --help                Print help
  -V, --version             Print version
API Reference
Core Types
Readability
The main parser struct for extracting content from HTML documents.
ReadabilityOptions
Configuration options for customizing parsing behavior:
- debug: Enable debug logging
- char_threshold: Minimum character count for content
- keep_classes: Preserve CSS classes in output
- disable_json_ld: Skip JSON-LD metadata parsing
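Struct update syntax is a convenient way to override only a few options and keep the rest at their defaults. The sketch below uses a local stand-in type, `SketchOptions` (hypothetical, not the crate's `ReadabilityOptions`), with the field names listed above. The field types and default values are assumptions; the `char_threshold` default of 500 is taken from the CLI help text.

```rust
/// Hypothetical stand-in mirroring the documented option names.
/// Field types and defaults are assumptions, not the crate's definition.
#[derive(Debug, Clone, PartialEq)]
struct SketchOptions {
    debug: bool,
    char_threshold: usize,
    keep_classes: bool,
    disable_json_ld: bool,
}

impl Default for SketchOptions {
    fn default() -> Self {
        Self {
            debug: false,
            char_threshold: 500, // matches the CLI's documented default
            keep_classes: false,
            disable_json_ld: false,
        }
    }
}

/// Overrides two fields and inherits the rest from `Default`.
fn lenient_options() -> SketchOptions {
    SketchOptions {
        char_threshold: 250,
        keep_classes: true,
        ..Default::default()
    }
}
```

If the real `ReadabilityOptions` implements `Default`, the same pattern should apply to it directly.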
Article
Represents extracted article content:
- title: Article title
- content: Cleaned HTML content
- text_content: Plain text content
- length: Content length in characters
- byline: Author information
- excerpt: Article excerpt/description
- site_name: Site name
- lang: Content language
- published_time: Publication date
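Because `length` is described as a character count, it can differ from the byte length of the same text once non-ASCII content is involved. The sketch below shows the distinction using only the standard library. Both helpers are hypothetical, and whether the crate counts Unicode scalar values in exactly this way is an assumption.

```rust
/// Hypothetical helper: counts Unicode scalar values rather than bytes.
fn char_length(text: &str) -> usize {
    text.chars().count()
}

/// Hypothetical helper: true when the text reaches a minimum character count,
/// in the spirit of the `char_threshold` option.
fn meets_threshold(text: &str, char_threshold: usize) -> bool {
    char_length(text) >= char_threshold
}
```

For the string `"h\u{e9}llo \u{1F30D}"` (an accented letter and an emoji), `char_length` returns 7 while `str::len` reports 11 bytes, so thresholds based on bytes would treat international text more generously than intended.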
Functions
is_probably_readerable(html: &str, options: Option<ReadabilityOptions>) -> bool
Determines if an HTML document likely contains readable content.
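To give a feel for how such a check can work, the sketch below follows the general idea of the upstream Readability.js heuristic: each sufficiently long text block contributes the square root of its excess length, and the page counts as readable once the running total passes a minimum score. `looks_readerable` and its parameters are hypothetical, it operates on pre-computed paragraph lengths instead of a parsed document, and this crate's actual implementation may differ.

```rust
/// Hypothetical sketch of a readability pre-check over paragraph lengths.
/// Blocks shorter than `min_content_length` are ignored; longer ones add
/// sqrt(len - min_content_length) to a running score.
fn looks_readerable(paragraph_lengths: &[usize], min_content_length: usize, min_score: f64) -> bool {
    let mut score = 0.0;
    for &len in paragraph_lengths {
        if len < min_content_length {
            continue;
        }
        score += ((len - min_content_length) as f64).sqrt();
        // Stop early once there is enough evidence of article-like text.
        if score > min_score {
            return true;
        }
    }
    false
}
```

With a minimum content length of 140 and a minimum score of 20 (values assumed here to mirror the upstream defaults), a single 13-character paragraph such as "Short content" fails the check, while one 600-character paragraph passes. The upstream check also filters out hidden and unlikely nodes, which this sketch leaves out.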
Algorithm
This implementation follows Mozilla's Readability.js algorithm:
- Preprocessing: Remove script tags and prepare the document
- Content Discovery: Identify potential content-bearing elements
- Scoring: Score elements based on various factors:
  - Element types (article, p, div, etc.)
  - Class names and IDs
  - Text length and density
  - Link density
- Candidate Selection: Choose the best content candidates
- Content Extraction: Extract and clean the selected content
- Post-processing: Final cleanup and formatting
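The scoring step above can be sketched in a few lines. This is a simplified illustration of how the listed factors might combine, not the crate's code: the `Candidate` type, the keyword lists, and every numeric weight are hypothetical choices made for the example.

```rust
/// Hypothetical summary of one candidate element; not a type from this crate.
struct Candidate<'a> {
    tag: &'a str,
    class_and_id: &'a str,
    text_len: usize,      // characters of text inside the element
    link_text_len: usize, // characters of that text that sit inside <a> tags
    comma_count: usize,
}

/// Fraction of the element's text that is link text (0.0 for empty elements).
fn link_density(c: &Candidate) -> f64 {
    if c.text_len == 0 {
        return 0.0;
    }
    c.link_text_len as f64 / c.text_len as f64
}

/// Bonus or penalty from class names and IDs. Keywords and weights are illustrative.
fn class_weight(class_and_id: &str) -> f64 {
    const POSITIVE: [&str; 5] = ["article", "body", "content", "main", "post"];
    const NEGATIVE: [&str; 5] = ["comment", "footer", "sidebar", "sponsor", "widget"];
    let lower = class_and_id.to_lowercase();
    let mut weight = 0.0;
    if NEGATIVE.iter().any(|k| lower.contains(*k)) {
        weight -= 25.0;
    }
    if POSITIVE.iter().any(|k| lower.contains(*k)) {
        weight += 25.0;
    }
    weight
}

/// Base score from the element type. Values are illustrative.
fn tag_score(tag: &str) -> f64 {
    match tag {
        "article" | "div" => 5.0,
        "pre" | "td" | "blockquote" => 3.0,
        "form" | "ul" | "ol" | "li" => -3.0,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" | "th" => -5.0,
        _ => 0.0,
    }
}

/// Combines element type, class/ID hints, and text length, then scales
/// the result down by link density so link-heavy blocks rank lower.
fn score(c: &Candidate) -> f64 {
    let base = tag_score(c.tag) + class_weight(c.class_and_id);
    let text_bonus = 1.0 + c.comma_count as f64 + (c.text_len / 100).min(3) as f64;
    (base + text_bonus) * (1.0 - link_density(c))
}
```

Under these example weights, an `article` element with class `post-content`, 400 characters of text, and few links scores well above a `div` with class `sidebar` whose text is mostly links, which is the ordering candidate selection relies on.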
Testing
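The library includes a comprehensive test suite, run through Cargo:
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test categories (replace <filter> with part of a test name)
cargo test <filter>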
Mozilla Readability Reference
This project includes the original Mozilla Readability.js library as a submodule for reference:
# Initialize the submodule
git submodule update --init --recursive
# View the original JavaScript implementation
The original implementation can be found at: https://github.com/mozilla/readability
Performance
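The Rust implementation inherits several benefits from the language:
- Memory Safety: Safe Rust prevents whole classes of memory errors, such as use-after-free and buffer overflows
- Zero-cost Abstractions: High-level code compiles down without added runtime overhead
- Concurrent Processing: The type system rules out data races, which makes parallel processing safe
- Small Binary Size: Minimal runtime dependencies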
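As a rough illustration of the concurrent-processing point, the sketch below fans a batch of documents out over scoped threads from the standard library. `visible_text_len` is a hypothetical stand-in for a real extraction call (it only counts non-whitespace characters outside of tags), and whether the crate's parser types can be moved across threads is not something this sketch assumes.

```rust
use std::thread;

/// Hypothetical stand-in for an extraction call: counts non-whitespace
/// characters that appear outside of HTML tags.
fn visible_text_len(html: &str) -> usize {
    let mut in_tag = false;
    let mut count = 0;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag && !c.is_whitespace() => count += 1,
            _ => {}
        }
    }
    count
}

/// Processes each document on its own scoped thread and returns the
/// results in input order. Scoped threads may borrow `docs` directly.
fn process_all(docs: &[&str]) -> Vec<usize> {
    thread::scope(|s| {
        let handles: Vec<_> = docs
            .iter()
            .map(|doc| s.spawn(move || visible_text_len(doc)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<usize>>()
    })
}
```

One thread per document keeps the example short; for large batches a bounded worker pool would be the more practical choice.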
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Development Setup
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
The original Mozilla Readability.js library is also licensed under Apache License 2.0.
Acknowledgments
- Mozilla Readability.js - The original JavaScript implementation
- Arc90's Readability - The original inspiration
- The Rust community for excellent crates and tooling