readabilityrs/
lib.rs

//! # ReadabilityRS
//!
//! A Rust port of Mozilla's Readability library for extracting article content from web pages.
//!
//! This library is a faithful port of the [Mozilla Readability](https://github.com/mozilla/readability)
//! JavaScript library, used in Firefox Reader View.
//!
//! ## Overview
//!
//! ReadabilityRS provides intelligent extraction of main article content from HTML documents,
//! removing clutter such as advertisements, navigation elements, and other non-essential content.
//! It also extracts metadata like article title, author (byline), publish date, and more.
//!
//! ## Key Features
//!
//! - **Content Extraction**: Intelligently identifies and extracts main article content
//! - **Metadata Extraction**: Extracts title, author, description, site name, language, and publish date
//! - **JSON-LD Support**: Parses structured data from JSON-LD markup
//! - **Multiple Retry Strategies**: Uses adaptive algorithms to handle various page layouts
//! - **Customizable Options**: Configure thresholds, scoring, and behavior
//! - **Pre-flight Check**: Quick check to determine if a page is likely readable
//!
//! ## Basic Usage
//!
//! ```rust,no_run
//! use readabilityrs::{Readability, ReadabilityOptions};
//!
//! let html = r#"<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>"#;
//! let url = "https://example.com/article";
//!
//! let options = ReadabilityOptions::default();
//! let readability = Readability::new(html, Some(url), Some(options)).unwrap();
//!
//! // `parse` returns an `Option`, so handle the case where no article is produced.
//! if let Some(article) = readability.parse() {
//!     println!("Title: {:?}", article.title);
//!     println!("Content: {:?}", article.content);
//!     println!("Author: {:?}", article.byline);
//! }
//! ```
//!
//! ## Advanced Usage
//!
//! ### Custom Options
//!
//! ```rust,no_run
//! use readabilityrs::{Readability, ReadabilityOptions};
//!
//! let html = "<html>...</html>";
//!
//! let options = ReadabilityOptions::builder()
//!     .char_threshold(300)
//!     .nb_top_candidates(10)
//!     .keep_classes(true)
//!     .build();
//!
//! let readability = Readability::new(html, None, Some(options)).unwrap();
//! let article = readability.parse();
//! ```
//!
//! ### Pre-flight Check
//!
//! Use [`is_probably_readerable`] to quickly check if a document is likely to be parseable
//! before doing the full parse:
//!
//! ```rust,no_run
//! use readabilityrs::is_probably_readerable;
//!
//! let html = "<html>...</html>";
//!
//! if is_probably_readerable(html, None) {
//!     // Proceed with full parsing
//! } else {
//!     // Skip parsing or use alternative strategy
//! }
//! ```
//!
//! ## Error Handling
//!
//! [`Readability::new`] returns a [`Result`]. Match on [`ReadabilityError`] to handle specific
//! failures, such as an invalid URL:
//!
//! ```rust,no_run
//! use readabilityrs::{Readability, ReadabilityError};
//!
//! let html = "<html>...</html>";
//! // Deliberately malformed to exercise the error path.
//! let url = "not a valid url";
//!
//! match Readability::new(html, Some(url), None) {
//!     Ok(readability) => {
//!         if readability.parse().is_some() {
//!             println!("Success!");
//!         }
//!     }
//!     Err(ReadabilityError::InvalidUrl(bad_url)) => {
//!         eprintln!("Invalid URL: {}", bad_url);
//!     }
//!     Err(e) => {
//!         eprintln!("Error: {}", e);
//!     }
//! }
//! ```
//!
//! ## Algorithm
//!
//! The extraction algorithm works in several phases. First, scripts and styles are removed
//! to prepare the document. Then potential content containers are identified throughout the page.
//! These candidates are scored based on various content signals like paragraph count, text length,
//! and link density. The best candidate is selected using adaptive strategies with multiple fallback
//! approaches. Nearby high-quality content is aggregated by examining sibling elements. Finally,
//! the extracted content goes through post-processing to clean and finalize the output.
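//!
//! The scoring phase described above can be pictured with a minimal, self-contained sketch.
//! It is loosely modelled on the heuristics used by Readability.js and is not this crate's
//! actual implementation: the function names (`paragraph_score`, `link_density`,
//! `candidate_score`), the weights, and the inputs are assumptions made for illustration only.
//!
//! ```rust
//! /// Score a paragraph: one base point, one point per comma-separated segment,
//! /// and up to three points for length (one per 100 characters).
//! fn paragraph_score(text: &str) -> f64 {
//!     let segments = text.split(',').count();
//!     let length_bonus = (text.chars().count() / 100).min(3);
//!     (1 + segments + length_bonus) as f64
//! }
//!
//! /// Fraction of a candidate's text that sits inside links (0.0 when it has no text).
//! fn link_density(link_chars: usize, total_chars: usize) -> f64 {
//!     if total_chars == 0 {
//!         0.0
//!     } else {
//!         link_chars as f64 / total_chars as f64
//!     }
//! }
//!
//! /// Sum the paragraph scores, then scale the total down by the link density so
//! /// link-heavy containers such as navigation menus rank low.
//! fn candidate_score(paragraphs: &[&str], link_chars: usize) -> f64 {
//!     let total_chars: usize = paragraphs.iter().map(|p| p.chars().count()).sum();
//!     let raw: f64 = paragraphs.iter().map(|p| paragraph_score(p)).sum();
//!     raw * (1.0 - link_density(link_chars, total_chars))
//! }
//!
//! let body = ["A long sentence, with clauses, reads like article text."];
//! let menu = ["Home", "About"];
//! // All nine characters of the menu are link text, so its score collapses to zero.
//! assert!(candidate_score(&body, 0) > candidate_score(&menu, 9));
//! ```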
//!
//! ## Compatibility
//!
//! This implementation strives to match the behavior of Mozilla's Readability.js as closely
//! as possible while leveraging Rust's type system and safety guarantees.

mod article;
mod cleaner;
mod constants;
mod content_extractor;
mod dom_utils;
mod error;
mod metadata;
mod options;
mod post_processor;
mod readability;
mod readerable;
mod scoring;
mod utils;

// Public exports
pub use article::Article;
pub use error::{ReadabilityError, Result};
pub use options::ReadabilityOptions;
pub use readability::Readability;
pub use readerable::{is_probably_readerable, ReaderableOptions};