//! A Rust port of Mozilla's [Readability](https://github.com/nicolo-ribaudo/readability) algorithm
//! for extracting the main article content from an HTML page.
//!
//! ## Quick start
//!
//! ```rust
//! use readable_rs::{extract, ExtractOptions};
//!
//! let html = "<html><body><article><p>The actual article text goes here.</p></article></body></html>";
//! let product = extract(html, "https://example.com/article", ExtractOptions::default());
//!
//! // product.content holds the extracted DOM (or None if nothing was found)
//! // product.title, product.by_line, product.sitename, etc. hold metadata
//! ```
//!
//! ## Module layout
//!
//! * **Top level** – [`extract`] is the single entry-point. [`Product`] and
//! [`ExtractOptions`] are the main public types.
//! * [`parser`] – thin wrappers around the underlying HTML parser ([`parser::NodeRef`],
//! [`parser::parse_html`]).
//! * [`shared_utils`] – a curated set of DOM helpers useful when post-processing
//! the extracted content (URL resolution, text normalisation, etc.).
//! * [`NodeExt`] / [`NodeScoreStore`] – the trait and store that the scorer uses
//! to attach readability metadata to DOM nodes without modifying the nodes themselves.
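//!
//! The side-table idea behind [`NodeScoreStore`] can be sketched roughly as follows:
//! scores live in a map keyed by node identity, so the nodes themselves stay
//! untouched. This is an illustration only, not this crate's implementation.
//! `SketchNode` and `SketchScoreStore` are hypothetical names, and the real
//! [`NodeScoreStore`] API may well differ.
//!
//! ```rust
//! use std::collections::HashMap;
//! use std::rc::Rc;
//!
//! /// Hypothetical stand-in for a DOM node; note that it has no score field.
//! struct SketchNode {
//!     tag: &'static str,
//! }
//!
//! /// Hypothetical side table mapping node identity (pointer address) to a score.
//! #[derive(Default)]
//! struct SketchScoreStore {
//!     scores: HashMap<*const SketchNode, f64>,
//! }
//!
//! impl SketchScoreStore {
//!     /// Adds `delta` to the node's score, starting from 0.0 on first use.
//!     fn add(&mut self, node: &Rc<SketchNode>, delta: f64) {
//!         *self.scores.entry(Rc::as_ptr(node)).or_insert(0.0) += delta;
//!     }
//!
//!     /// Returns the score recorded for this node, if any.
//!     fn get(&self, node: &Rc<SketchNode>) -> Option<f64> {
//!         self.scores.get(&Rc::as_ptr(node)).copied()
//!     }
//! }
//! ```
//!
//! In this sketch the pointer keys are only meaningful while the nodes are alive,
//! so such a store would need to be dropped together with the tree it describes.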
pub use ;
pub use NodeScoreStore;
pub use ;
/// Convenience re-exports of DOM helpers for post-processing extracted content.
///
/// These are a stable, curated subset of the internal utility library.
/// Thin wrappers around the underlying HTML parser.
///
/// [`NodeRef`] is the reference-counted DOM node type used throughout the crate.
/// [`parse_html`] parses a complete HTML document into a [`NodeRef`] tree.
/// Extract the main article content from an HTML page.
///
/// This is the primary entry-point of the crate. It implements the Readability
/// algorithm: scoring candidate nodes by content density, pruning navigation /
/// boilerplate, and returning the best content subtree along with any metadata
/// (title, byline, etc.) that could be extracted.
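///
/// To make "content density" more concrete, here is a minimal sketch of the
/// per-paragraph heuristic used by Mozilla's original Readability.js: one base
/// point, one point per comma-separated segment, and up to three points for
/// length. Whether this port uses the same constants is an assumption, and
/// `paragraph_score` is a hypothetical helper, not part of this crate's API.
///
/// ```rust
/// /// Hypothetical helper: scores one paragraph's text, Readability.js style.
/// fn paragraph_score(text: &str) -> f64 {
///     let text = text.trim();
///     // One base point for being a paragraph at all.
///     let mut score = 1.0;
///     // One point per comma-separated segment (number of commas + 1).
///     score += text.split(',').count() as f64;
///     // One point per 100 characters, capped at three points.
///     score += (text.chars().count() / 100).min(3) as f64;
///     score
/// }
/// ```
///
/// Under this sketch, long, comma-rich prose scores higher than short labels or
/// navigation text, which is what lets a scorer favour article bodies.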
///
/// # Arguments
///
/// * `html_str` – the raw HTML source of the page.
/// * `doc_uri` – the URL the page was fetched from. Used to resolve relative
/// URLs in `<a href>`, `<img src>`, `srcset`, etc.
/// * `options` – tuning knobs for the extraction algorithm. [`ExtractOptions::default()`]
/// is a sensible starting point.
///
/// # Returns
///
/// A [`Product`] whose `content` field is `Some` if article content was found,
/// or `None` if the page did not contain extractable content.
///
/// # Examples
///
/// ```rust
/// use readable_rs::{extract, ExtractOptions};
///
/// let html = "<html><body><p>Short.</p></body></html>";
/// let product = extract(html, "https://example.com", ExtractOptions::default());
/// // product.content may be None — the paragraph is below the default char_threshold.
/// ```