//! HTML cleaning, sanitization, and text processing utilities.
//!
//! This crate provides generic HTML cleaning operations useful for web scraping,
//! content extraction, and HTML sanitization.
//!
//! # Quick Start
//!
//! ```
//! use html_cleaning::{HtmlCleaner, CleaningOptions};
//! use dom_query::Document;
//!
//! // Create a cleaner with custom options
//! let options = CleaningOptions::builder()
//!     .remove_tags(&["script", "style"])
//!     .build();
//! let cleaner = HtmlCleaner::with_options(options);
//!
//! let html = "<html><body><script>bad</script><p>Hello!</p></body></html>";
//! let doc = Document::from(html);
//!
//! cleaner.clean(&doc);
//! assert!(doc.select("script").is_empty());
//! assert!(doc.select("p").exists());
//! ```
//!
//! # Features
//!
//! - **HTML Cleaning**: Remove unwanted elements (scripts, styles, forms)
//! - **Tag Stripping**: Remove tags while preserving text content
//! - **Text Normalization**: Collapse whitespace, trim text
//! - **Link Processing**: Make URLs absolute, filter links
//! - **Content Deduplication**: LRU-based duplicate detection
//! - **Presets**: Ready-to-use configurations for common scenarios
//!
//! # Feature Flags
//!
//! | Feature | Default | Description |
//! |---------|---------|-------------|
//! | `presets` | Yes | Include prebuilt cleaning configurations |
//! | `regex` | No | Enable regex-based selectors |
//! | `url` | No | Enable URL processing with the `url` crate |
//! | `full` | No | Enable all features |
//!
//! # Modules
//!
//! - [`cleaner`] - Core `HtmlCleaner` and cleaning operations
//! - [`text`] - Text processing utilities
//! - [`tree`] - lxml-style text/tail tree manipulation
//! - [`dom`] - DOM helper utilities
//! - [`dedup`] - Content deduplication
//! - [`presets`] - Ready-to-use cleaning configurations (feature: `presets`)
//! - [`links`] - URL and link processing (always available; the `url` feature enables more robust parsing)

// Core modules - always available
pub mod cleaner;
pub mod dedup;
pub mod dom;
pub mod text;
pub mod tree;

// Feature-gated modules
#[cfg(feature = "presets")]
pub mod presets;

// The links module is always available: it provides basic URL utilities without
// extra dependencies. With the `url` feature enabled, it uses the `url` crate for
// more robust parsing; otherwise it falls back to simple string-based handling.
pub mod links;

// Re-export core types
pub use cleaner::HtmlCleaner;
pub use ;
pub use ;
// Re-export dom_query types for convenience
pub use ;