//! # Legible
//!
//! A Rust port of Mozilla's [Readability.js](https://github.com/mozilla/readability)
//! for extracting readable content from web pages.
//!
//! This library provides functionality to extract the main content from HTML documents,
//! stripping away navigation, ads, and other non-content elements to produce clean,
//! readable article content.
//!
//! ## Quick Start
//!
//! ```rust
//! use legible::parse;
//!
//! let html = r#"
//! <html>
//! <head><title>My Article</title></head>
//! <body>
//! <nav>Navigation</nav>
//! <article>
//! <h1>Article Title</h1>
//! <p>This is the main content of the article. It contains several
//! paragraphs of text that make up the body of the article.</p>
//! <p>More content here to ensure we have enough text for the
//! readability algorithm to work with properly.</p>
//! </article>
//! <footer>Footer</footer>
//! </body>
//! </html>
//! "#;
//!
//! match parse(html, Some("https://example.com"), None) {
//!     Ok(article) => {
//!         println!("Title: {}", article.title);
//!         println!("Byline: {:?}", article.byline);
//!         println!("Content: {}", article.content);
//!         println!("Text: {}", article.text_content);
//!     }
//!     Err(e) => eprintln!("Error: {}", e),
//! }
//! ```
//!
//! The returned [`Article`] contains:
//! - `title` - The article title
//! - `content` - The article content as HTML
//! - `text_content` - The article content as plain text
//! - `byline` - The author byline
//! - `excerpt` - A short excerpt from the article
//! - `site_name` - The site name
//! - `published_time` - The published time
//! - `dir` - Text direction (ltr or rtl)
//! - `lang` - Document language
//! - `length` - Length of the text content
//!
//! ## Checking Readability
//!
//! You can quickly check if a document is likely to be parseable without running
//! the full algorithm:
//!
//! ```rust
//! use legible::is_probably_readerable;
//!
//! let html = "<html><body><article>Long article content...</article></body></html>";
//! if is_probably_readerable(html, None) {
//! println!("Document appears to be readerable");
//! }
//! ```
//!
//! ## Pre-parsed Document
//!
//! If you want to check readability before parsing, use [`Document`] to avoid
//! parsing the HTML twice:
//!
//! ```rust
//! use legible::Document;
//!
//! let html = r#"
//! <html>
//! <head><title>My Article</title></head>
//! <body>
//! <article>
//! <h1>Article Title</h1>
//! <p>This is the main content of the article. It contains several
//! paragraphs of text that make up the body of the article.</p>
//! <p>More content here to ensure we have enough text for the
//! readability algorithm to work with properly.</p>
//! </article>
//! </body>
//! </html>
//! "#;
//!
//! let doc = Document::new(html);
//!
//! if doc.is_probably_readerable(None) {
//!     match doc.parse(Some("https://example.com"), None) {
//!         Ok(article) => println!("Title: {}", article.title),
//!         Err(e) => eprintln!("Error: {}", e),
//!     }
//! }
//! ```
//!
//! ## Configuration
//!
//! Use the [`Options`] builder to customize parsing behavior:
//!
//! ```rust
//! use legible::{parse, Options};
//!
//! let html = "<html><body><article>Content...</article></body></html>";
//!
//! let options = Options::new()
//!     .char_threshold(250)    // Minimum article length (default: 500)
//!     .keep_classes(true)     // Preserve CSS classes in output
//!     .disable_json_ld(true); // Skip JSON-LD metadata extraction
//!
//! let article = parse(html, Some("https://example.com"), Some(options));
//! ```
//!
//! See [`Options`] for all available configuration options.
//!
//! ## Security
//!
//! The extracted HTML content is **unsanitized** and may contain malicious scripts or
//! other dangerous content from the source document. Before rendering this HTML in a
//! browser or other context where scripts could execute, you should sanitize it using
//! a library like [`ammonia`](https://docs.rs/ammonia):
//!
//! ```rust,ignore
//! let article = parse(html, Some(url), None)?;
//! let safe_html = ammonia::clean(&article.content);
//! ```
//!
//! ## How It Works
//!
//! Legible implements the same algorithm as Readability.js:
//!
//! 1. **Document Preparation** - Removes scripts, normalizes markup, fixes lazy-loaded images
//! 2. **Metadata Extraction** - Extracts title, byline, and other metadata from JSON-LD,
//!    OpenGraph tags, and meta elements
//! 3. **Content Scoring** - Scores DOM nodes based on tag type, text density, and class/id patterns
//! 4. **Candidate Selection** - Identifies the highest-scoring content container
//! 5. **Content Cleaning** - Removes low-scoring elements, empty containers, and non-content markup
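//!
//! As a rough illustration of the class/id part of step 3, the sketch below derives a
//! weight from an element's `class` and `id` attributes. It is a simplified, hypothetical
//! example and not Legible's actual implementation: the `class_weight` function, the two
//! word lists, and the 25-point step are all assumptions chosen for illustration.
//!
//! ```rust
//! // Hypothetical hint lists; a real implementation would likely use larger patterns.
//! const POSITIVE: &[&str] = &["article", "body", "content", "entry", "main", "post", "text"];
//! const NEGATIVE: &[&str] = &["comment", "footer", "sidebar", "sponsor", "widget", "promo"];
//!
//! /// Returns a score adjustment based on the `class` and `id` attribute values.
//! /// Each attribute can contribute at most one bonus and one penalty.
//! fn class_weight(class: &str, id: &str) -> i32 {
//!     let mut weight = 0;
//!     for attr in [class, id] {
//!         if attr.is_empty() {
//!             continue;
//!         }
//!         // Compare case-insensitively so "Main-Content" matches "main".
//!         let attr = attr.to_ascii_lowercase();
//!         if NEGATIVE.iter().any(|p| attr.contains(*p)) {
//!             weight -= 25;
//!         }
//!         if POSITIVE.iter().any(|p| attr.contains(*p)) {
//!             weight += 25;
//!         }
//!     }
//!     weight
//! }
//! ```
//!
//! Under these assumed lists, `class_weight("article-body", "")` yields `25` and
//! `class_weight("sidebar", "comments")` yields `-50`, so likely content containers rise
//! above navigation and comment blocks before text density is taken into account.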
pub use Document;
pub use ;
pub use ;
pub use Article;
pub use is_probably_readerable;
/// Parse an HTML document and extract the article content.
///
/// This is the main entry point for content extraction. It parses the HTML, identifies
/// the main article content, and returns an [`Article`] with the extracted content
/// and metadata.
///
/// # Arguments
///
/// * `html` - The HTML content to parse
/// * `url` - Optional base URL for resolving relative links. If provided, relative URLs
///   in the extracted content will be converted to absolute URLs.
/// * `options` - Optional [`Options`] to customize parsing behavior
///
/// # Errors
///
/// Returns an error if:
/// - The provided URL is invalid ([`Error::InvalidUrl`])
/// - The document has no `<body>` element ([`Error::NoBody`])
/// - No article content could be extracted ([`Error::NoContent`])
/// - The document exceeds `max_elems_to_parse` ([`Error::TooManyElements`])
///
/// # Example
///
/// ```rust
/// use legible::{parse, Options};
///
/// let html = "<html><body><article>Content...</article></body></html>";
///
/// // Basic usage
/// let article = parse(html, None, None);
///
/// // With URL for resolving relative links
/// let article = parse(html, Some("https://example.com/article"), None);
///
/// // With custom options
/// let options = Options::new().char_threshold(250);
/// let article = parse(html, Some("https://example.com"), Some(options));
/// ```