html-cleaning
HTML cleaning, sanitization, and text processing utilities for Rust.
Features
- HTML Cleaning: Remove unwanted elements (scripts, styles, forms)
- Tag Stripping: Remove tags while preserving text content
- Text Normalization: Collapse whitespace, trim text
- Link Processing: Make URLs absolute, filter links
- Content Deduplication: LRU-based duplicate detection
- Markdown Output: Convert HTML to Markdown with structure preservation
- Presets: Ready-to-use configurations for common scenarios
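To make the "LRU-based duplicate detection" idea concrete, here is a minimal standard-library sketch. The `LruDeduper` type and its `is_duplicate` method are hypothetical names chosen for illustration; they are not the API of the crate's `dedup` module, which may work differently.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical bounded duplicate detector (illustration only, not the crate's API).
struct LruDeduper {
    capacity: usize,
    order: VecDeque<String>, // front = least recently seen
    seen: HashSet<String>,
}

impl LruDeduper {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, order: VecDeque::new(), seen: HashSet::new() }
    }

    /// Returns true if `text` is still in the cache; records it either way.
    fn is_duplicate(&mut self, text: &str) -> bool {
        if self.seen.contains(text) {
            // Refresh recency: move the entry to the back of the queue.
            if let Some(pos) = self.order.iter().position(|t| t == text) {
                self.order.remove(pos);
            }
            self.order.push_back(text.to_string());
            return true;
        }
        // Cache is full: evict the least recently seen entry.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(text.to_string());
        self.seen.insert(text.to_string());
        false
    }
}

fn main() {
    let mut dedup = LruDeduper::new(2);
    for text in ["a", "b", "a", "c", "b"] {
        println!("{text}: duplicate = {}", dedup.is_duplicate(text));
    }
}
```

The bounded cache keeps memory use fixed at the cost of forgetting old entries, so content seen long ago may be reported as new again.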
Quick Start
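use html_cleaning::{HtmlCleaner, presets};
use dom_query::Document;
// Use a preset for quick setup
let cleaner = HtmlCleaner::with_options(presets::standard());
let html = "<html><body><script>bad</script><p>Hello!</p></body></html>";
let doc = Document::from(html);
cleaner.clean(&doc);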
// Scripts removed, paragraph content preserved
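As a rough picture of what "scripts removed, paragraph content preserved" means, the following standard-library toy drops `<script>` blocks together with their contents and then strips the remaining tags. It works on plain strings and is only an assumption-laden sketch: the function name `strip_to_text` is hypothetical, the matching is naive (any tag name starting with `script` is treated as a script), and the crate itself operates on a parsed document rather than on raw text.

```rust
/// Removes `<script>…</script>` blocks (ASCII case-insensitive), then drops all
/// remaining tags while keeping the text between them.
fn strip_to_text(html: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to `html`.
    let lower = html.to_ascii_lowercase();
    let mut out = String::new();
    let mut i = 0;
    while i < html.len() {
        if lower[i..].starts_with("<script") {
            // Skip through the closing tag, or to the end if it is missing.
            i = match lower[i..].find("</script>") {
                Some(end) => i + end + "</script>".len(),
                None => html.len(),
            };
        } else if html[i..].starts_with('<') {
            // Skip an ordinary tag up to and including '>'.
            i = match html[i..].find('>') {
                Some(end) => i + end + 1,
                None => html.len(),
            };
        } else if let Some(ch) = html[i..].chars().next() {
            // Copy one character of text content.
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    out
}

fn main() {
    let html = "<html><body><script>bad</script><p>Hello!</p></body></html>";
    println!("{}", strip_to_text(html));
}
```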
Installation
Add to your Cargo.toml:
[dependencies]
html-cleaning = "0.3"
With all features:
[dependencies]
html-cleaning = { version = "0.3", features = ["full"] }
Usage Examples
Basic Cleaning
use html_cleaning::{CleaningOptions, HtmlCleaner};
let options = CleaningOptions { /* … */ };
let cleaner = HtmlCleaner::with_options(options);
Using the Builder Pattern
use html_cleaning::CleaningOptions;
let options = CleaningOptions::builder()
    .remove_tags(/* … */)
    .remove_selectors(/* … */)
    .prune_empty(/* … */)
    .normalize_whitespace(/* … */)
    .build();
Using Presets
use html_cleaning::presets;
// Minimal: Just scripts and styles
let minimal = presets::minimal();
// Standard: + forms, iframes, objects
let standard = presets::standard();
// Aggressive: + nav, header, footer, aside
let aggressive = presets::aggressive();
// Article extraction: Optimized for content extraction
let article = presets::article_extraction();
Text Processing
use html_cleaning::text;
let has_content = text::has_content(/* … */); // true
let normalized = text::normalize(/* … */); // "multiple spaces"
let words = text::word_count(/* … */); // 2
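For readers who want to see the behavior these helpers describe, here is a standard-library sketch of whitespace normalization, content detection, and word counting. The function names mirror the ones above for readability, but the bodies and the sample inputs are assumptions and may not match how the crate's `text` module behaves in edge cases.

```rust
/// Collapses runs of whitespace into single spaces and trims both ends.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Returns true if the text contains at least one non-whitespace character.
fn has_content(text: &str) -> bool {
    text.chars().any(|c| !c.is_whitespace())
}

/// Counts whitespace-separated words.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

fn main() {
    println!("{}", has_content("  Hello  "));
    println!("{:?}", normalize("  multiple   spaces  "));
    println!("{}", word_count("hello world"));
}
```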
HTML to Markdown
use html_cleaning::markdown::html_to_markdown;
let html = "<h1>Title</h1><p>Content with <strong>bold</strong></p>";
let md = html_to_markdown(html);
// Output: "# Title\n\nContent with **bold**\n"
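The mapping in the example above can be illustrated with a deliberately tiny converter. This is a toy under stated assumptions, not the crate's implementation: `toy_html_to_markdown` is a hypothetical name, it only knows `h1`, `p`, and `strong`, and it uses plain string replacement instead of walking a DOM, so attributes and nesting rules are ignored.

```rust
/// Toy converter: rewrites a handful of attribute-free tags into Markdown.
fn toy_html_to_markdown(html: &str) -> String {
    // (HTML tag, Markdown replacement) pairs, applied in order.
    const RULES: [(&str, &str); 6] = [
        ("<h1>", "# "),
        ("</h1>", "\n\n"),
        ("<p>", ""),
        ("</p>", "\n"),
        ("<strong>", "**"),
        ("</strong>", "**"),
    ];
    let mut out = html.to_string();
    for &(tag, md) in RULES.iter() {
        out = out.replace(tag, md);
    }
    out
}

fn main() {
    let html = "<h1>Title</h1><p>Content with <strong>bold</strong></p>";
    print!("{}", toy_html_to_markdown(html));
}
```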
Feature Flags
| Feature | Default | Description |
|---|---|---|
| presets | Yes | Include prebuilt cleaning configurations |
| regex | No | Enable regex-based selectors |
| url | No | Enable URL processing with the url crate |
| markdown | No | Enable HTML to Markdown conversion |
| full | No | Enable all features |
Modules
| Module | Description |
|---|---|
| cleaner | Core HtmlCleaner and cleaning operations |
| text | Text processing utilities |
| tree | lxml-style text/tail tree manipulation |
| dom | DOM helper utilities |
| dedup | Content deduplication |
| presets | Ready-to-use cleaning configurations |
| links | URL and link processing (feature: url) |
| markdown | HTML to Markdown conversion (feature: markdown) |
Presets Reference
minimal()
- Removes: script, style, noscript
- Best for: Quick sanitization
standard()
- Removes: script, style, noscript, form, iframe, object, embed, svg, canvas, video, audio
- Enables: prune_empty, normalize_whitespace
- Best for: General web scraping
aggressive()
- Includes all of standard() plus:
- Removes: nav, header, footer, aside, figure, figcaption
- Enables: strip_attributes (preserves href, src, alt)
- Best for: Maximum content extraction
article_extraction()
- Optimized for article content extraction
- Removes navigation and layout elements
- Strips wrapper tags (div, span) while preserving content
- Best for: News articles, blog posts
trafilatura()
- Full web content extraction cleaning (50 tags removed, 18 stripped)
- Comment removal and attribute whitelist cleaning
- Empty element pruning across 22 tag types
- Best for: Use with rs-trafilatura extraction pipeline
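The way the first three presets build on one another can be sketched with a stand-in configuration type. Everything here is hypothetical: the `Options` struct and its field names are invented for illustration and are not the crate's `CleaningOptions`. Only the tag lists and enabled flags are taken from the reference above.

```rust
/// Hypothetical stand-in for a cleaning configuration (illustration only).
#[derive(Debug, Clone, Default)]
struct Options {
    remove_tags: Vec<&'static str>,
    prune_empty: bool,
    normalize_whitespace: bool,
    strip_attributes: bool,
    preserved_attributes: Vec<&'static str>,
}

fn minimal() -> Options {
    Options {
        remove_tags: vec!["script", "style", "noscript"],
        ..Options::default()
    }
}

fn standard() -> Options {
    // Starts from minimal() and widens the removal list.
    let mut o = minimal();
    o.remove_tags.extend([
        "form", "iframe", "object", "embed", "svg", "canvas", "video", "audio",
    ]);
    o.prune_empty = true;
    o.normalize_whitespace = true;
    o
}

fn aggressive() -> Options {
    // Starts from standard() and also removes layout elements.
    let mut o = standard();
    o.remove_tags
        .extend(["nav", "header", "footer", "aside", "figure", "figcaption"]);
    o.strip_attributes = true;
    o.preserved_attributes = vec!["href", "src", "alt"];
    o
}

fn main() {
    for (name, opts) in [
        ("minimal", minimal()),
        ("standard", standard()),
        ("aggressive", aggressive()),
    ] {
        println!("{name}: removes {} tag types", opts.remove_tags.len());
    }
}
```

Layering the presets this way makes each one a strict superset of the previous, which matches how the reference describes them.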
Related Projects
- rs-trafilatura - Web content extraction library (uses html-cleaning)
- dom_query - DOM manipulation library
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.