HTML cleaning, sanitization, and text processing utilities.
This crate provides generic HTML cleaning operations useful for web scraping, content extraction, and HTML sanitization.
§Quick Start
use html_cleaning::{HtmlCleaner, CleaningOptions};
use dom_query::Document;
// Create a cleaner with custom options
let options = CleaningOptions::builder()
    .remove_tags(&["script", "style"])
    .build();
let cleaner = HtmlCleaner::with_options(options);
let html = "<html><body><script>bad</script><p>Hello!</p></body></html>";
let doc = Document::from(html);
cleaner.clean(&doc);
assert!(doc.select("script").is_empty());
assert!(doc.select("p").exists());
§Features
- HTML Cleaning: Remove unwanted elements (scripts, styles, forms)
- Tag Stripping: Remove tags while preserving text content
- Text Normalization: Collapse whitespace, trim text
- Link Processing: Make URLs absolute, filter links
- Content Deduplication: LRU-based duplicate detection
- Presets: Ready-to-use configurations for common scenarios
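To make the text normalization and deduplication features above more concrete, here is a minimal standard-library sketch of the two ideas. It does not use this crate's API: `collapse_whitespace` and `RecentTexts` are hypothetical names introduced only for illustration, and the crate's own `text` and `dedup` modules may work differently.

```rust
use std::collections::{HashSet, VecDeque};

/// Collapse every run of whitespace into a single space and trim both ends.
/// (Hypothetical helper, not part of html_cleaning.)
fn collapse_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical bounded duplicate detector that remembers only the most
/// recently seen texts, evicting the least recently used one when full.
struct RecentTexts {
    capacity: usize,
    order: VecDeque<String>, // front = least recently used
    seen: HashSet<String>,
}

impl RecentTexts {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            order: VecDeque::new(),
            seen: HashSet::new(),
        }
    }

    /// Returns `true` if the normalized text is still remembered;
    /// otherwise records it and returns `false`.
    fn is_duplicate(&mut self, text: &str) -> bool {
        let key = collapse_whitespace(text);
        if self.seen.contains(&key) {
            // Refresh recency by moving the key to the back of the queue.
            if let Some(pos) = self.order.iter().position(|k| k == &key) {
                if let Some(k) = self.order.remove(pos) {
                    self.order.push_back(k);
                }
            }
            return true;
        }
        if self.capacity == 0 {
            // Nothing can be remembered, so nothing is ever a duplicate.
            return false;
        }
        if self.order.len() == self.capacity {
            // Evict the least recently used entry to make room.
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(key.clone());
        self.order.push_back(key);
        false
    }
}
```

The linear scan used to refresh recency keeps the sketch short; a production LRU would typically pair a hash map with a linked list so that lookups and refreshes are both constant time.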
§Feature Flags
| Feature | Default | Description |
|---|---|---|
| presets | Yes | Include prebuilt cleaning configurations |
| regex | No | Enable regex-based selectors |
| url | No | Enable URL processing with the url crate |
| full | No | Enable all features |
Re-exports§
pub use cleaner::HtmlCleaner;
pub use error::Error;
pub use error::Result;
pub use options::CleaningOptions;
pub use options::CleaningOptionsBuilder;
Modules§
- cleaner
- Core HTML cleaning functionality.
- dedup
- Content deduplication utilities.
- dom
- DOM helper utilities.
- error
- Error types for html-cleaning.
- links
- URL and link processing utilities.
- options
- Configuration options for HTML cleaning.
- presets
- Prebuilt cleaning configurations.
- text
- Text processing utilities.
- tree
- Tree manipulation with lxml-style text/tail model.
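For readers unfamiliar with the lxml-style text/tail model mentioned for the tree module, the sketch below shows the general idea: each element stores the text that precedes its first child (`text`) and the text that follows its closing tag (`tail`), so no separate text nodes are needed. The `Element` type and its methods are hypothetical and written only for illustration; they are not the `tree` module's actual API.

```rust
/// Hypothetical element type illustrating the text/tail model.
#[derive(Debug, Default)]
struct Element {
    tag: String,
    /// Text between the opening tag and the first child (or closing tag).
    text: Option<String>,
    children: Vec<Element>,
    /// Text after this element's closing tag, before the next sibling.
    tail: Option<String>,
}

impl Element {
    fn new(tag: &str) -> Self {
        Element {
            tag: tag.to_string(),
            ..Default::default()
        }
    }

    /// Serialize the element, its descendants, and its own tail.
    /// No escaping is performed; this is a sketch only.
    fn to_html(&self) -> String {
        let mut out = format!("<{}>", self.tag);
        if let Some(text) = &self.text {
            out.push_str(text);
        }
        for child in &self.children {
            out.push_str(&child.to_html());
        }
        out.push_str(&format!("</{}>", self.tag));
        if let Some(tail) = &self.tail {
            out.push_str(tail);
        }
        out
    }
}

/// Build `<p>Hello <b>bold</b> world</p>`: " world" is the tail of `<b>`,
/// not a child of `<p>`.
fn example_paragraph() -> Element {
    let mut bold = Element::new("b");
    bold.text = Some("bold".to_string());
    bold.tail = Some(" world".to_string());

    let mut paragraph = Element::new("p");
    paragraph.text = Some("Hello ".to_string());
    paragraph.children.push(bold);
    paragraph
}
```

One consequence of this model is that removing an element naively also removes its tail, so code that strips tags usually has to reattach the tail to the previous sibling's tail or to the parent's text.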
Structs§
- Document
- Document represents an HTML document to be manipulated.
- Selection
- Selection represents a collection of nodes matching some criteria. The
initial Selection object can be created with Document::select and then
manipulated through its own methods.