//! # ReadabilityRS
//!
//! A Rust port of Mozilla's Readability library for extracting article content from web pages.
//!
//! It aims to be a faithful port of the [Mozilla Readability](https://github.com/mozilla/readability)
//! JavaScript library, which is used in Firefox Reader View.
//!
//! ## Overview
//!
//! ReadabilityRS provides intelligent extraction of main article content from HTML documents,
//! removing clutter such as advertisements, navigation elements, and other non-essential content.
//! It also extracts metadata like article title, author (byline), publish date, and more.
//!
//! ## Key Features
//!
//! - **Content Extraction**: Intelligently identifies and extracts main article content
//! - **Markdown Output**: Optional HTML-to-Markdown conversion with content standardization
//! - **Metadata Extraction**: Extracts title, author, description, site name, language, and publish date
//! - **JSON-LD Support**: Parses structured data from JSON-LD markup
//! - **Multiple Retry Strategies**: Uses adaptive algorithms to handle various page layouts
//! - **Customizable Options**: Configure thresholds, scoring, and behavior
//! - **Pre-flight Check**: Quick check to determine if a page is likely readable
//!
//! ## Basic Usage
//!
//! ```rust,no_run
//! use readabilityrs::{Readability, ReadabilityOptions};
//!
//! let html = r#"<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>"#;
//! let url = "https://example.com/article";
//!
//! let options = ReadabilityOptions::default();
//! let readability = Readability::new(html, Some(url), Some(options)).unwrap();
//!
//! if let Some(article) = readability.parse() {
//!     println!("Title: {:?}", article.title);
//!     println!("Content: {:?}", article.content);
//!     println!("Author: {:?}", article.byline);
//! }
//! ```
//!
//! ## Advanced Usage
//!
//! ### Custom Options
//!
//! ```rust,no_run
//! use readabilityrs::{Readability, ReadabilityOptions};
//!
//! let html = "<html>...</html>";
//!
//! let options = ReadabilityOptions::builder()
//!     .char_threshold(300)
//!     .nb_top_candidates(10)
//!     .keep_classes(true)
//!     .build();
//!
//! let readability = Readability::new(html, None, Some(options)).unwrap();
//! let article = readability.parse();
//! ```
//!
//! ### Pre-flight Check
//!
//! Use [`is_probably_readerable`] to quickly check if a document is likely to be parseable
//! before doing the full parse:
//!
//! ```rust,no_run
//! use readabilityrs::is_probably_readerable;
//!
//! let html = "<html>...</html>";
//!
//! if is_probably_readerable(html, None) {
//!     // Proceed with full parsing
//! } else {
//!     // Skip parsing or use alternative strategy
//! }
//! ```
//!
//! ## Error Handling
//!
//! ```rust,no_run
//! use readabilityrs::{Readability, ReadabilityError};
//!
//! let html = "<html>...</html>";
//! let url = "not a valid url";
//!
//! match Readability::new(html, Some(url), None) {
//!     Ok(readability) => {
//!         if let Some(article) = readability.parse() {
//!             println!("Success!");
//!         }
//!     }
//!     Err(ReadabilityError::InvalidUrl(url)) => {
//!         eprintln!("Invalid URL: {}", url);
//!     }
//!     Err(e) => {
//!         eprintln!("Error: {}", e);
//!     }
//! }
//! ```
//!
//! ## Algorithm
//!
//! The extraction algorithm works in several phases. First, scripts and styles are removed
//! to prepare the document. Then potential content containers are identified throughout the page.
//! These candidates are scored based on various content signals like paragraph count, text length,
//! and link density. The best candidate is selected using adaptive strategies with multiple fallback
//! approaches. Nearby high-quality content is aggregated by examining sibling elements. Finally,
//! the extracted content goes through post-processing to clean and finalize the output.
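//!
//! The scoring and selection phases can be pictured with a small standalone sketch. It is
//! not this crate's implementation: the `Candidate` struct, the functions, and the weights
//! below are hypothetical and chosen only to show how paragraph count, text length, and
//! link density might combine into one score.
//!
//! ```rust
//! // Hypothetical summary of one candidate container (not a type from this crate).
//! #[derive(Debug, Clone, PartialEq)]
//! pub struct Candidate {
//!     pub paragraphs: usize,
//!     pub text_len: usize,
//!     pub link_text_len: usize,
//! }
//!
//! // Fraction of the candidate's text that sits inside links, clamped to 0.0..=1.0.
//! pub fn link_density(c: &Candidate) -> f64 {
//!     if c.text_len == 0 {
//!         return 0.0;
//!     }
//!     (c.link_text_len as f64 / c.text_len as f64).min(1.0)
//! }
//!
//! // Illustrative score: reward paragraphs and text length, then scale by link density.
//! pub fn score(c: &Candidate) -> f64 {
//!     // One point per paragraph plus up to three points for length (assumed weights).
//!     let base = c.paragraphs as f64 + (c.text_len as f64 / 100.0).min(3.0);
//!     base * (1.0 - link_density(c))
//! }
//!
//! // Returns the highest-scoring candidate, or `None` for an empty slice.
//! pub fn best_candidate(candidates: &[Candidate]) -> Option<&Candidate> {
//!     candidates
//!         .iter()
//!         .max_by(|a, b| score(a).total_cmp(&score(b)))
//! }
//! ```
//!
//! Under these assumed weights, a text-heavy container with few links outranks a
//! navigation block whose text is mostly links, which is the intuition behind
//! penalizing link density.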
//!
//! ## Compatibility
//!
//! This implementation strives to match the behavior of Mozilla's Readability.js as closely
//! as possible while leveraging Rust's type system and safety guarantees.
// Public exports
pub use Article;
pub use ;
pub use MarkdownOptions;
pub use ReadabilityOptions;
pub use Readability;
pub use ;