Crate html_cleaning


HTML cleaning, sanitization, and text processing utilities.

This crate provides generic HTML cleaning operations useful for web scraping, content extraction, and HTML sanitization.

§Quick Start

use html_cleaning::{HtmlCleaner, CleaningOptions};
use dom_query::Document;

// Create a cleaner with custom options
let options = CleaningOptions::builder()
    .remove_tags(&["script", "style"])
    .build();
let cleaner = HtmlCleaner::with_options(options);

let html = "<html><body><script>bad</script><p>Hello!</p></body></html>";
let doc = Document::from(html);

cleaner.clean(&doc);
assert!(doc.select("script").is_empty());
assert!(doc.select("p").exists());

§Features

  • HTML Cleaning: Remove unwanted elements (scripts, styles, forms)
  • Tag Stripping: Remove tags while preserving text content
  • Text Normalization: Collapse whitespace, trim text
  • Link Processing: Make URLs absolute, filter links
  • Content Deduplication: LRU-based duplicate detection
  • Presets: Ready-to-use configurations for common scenarios
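To make the "Text Normalization" item concrete, here is a minimal standard-library sketch of collapsing whitespace and trimming. The function name `normalize_whitespace` is hypothetical and is not taken from this crate's `text` module; it only illustrates the operation described above.

```rust
/// Hypothetical helper (not part of html_cleaning): collapse every run of
/// whitespace into a single space and trim both ends.
fn normalize_whitespace(input: &str) -> String {
    // `split_whitespace` skips leading and trailing whitespace and treats
    // any run of Unicode whitespace as one separator.
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let raw = "  Hello,\n\t  world!  ";
    let clean = normalize_whitespace(raw);
    assert_eq!(clean, "Hello, world!");
    println!("{clean}");
}
```

Text extracted from HTML usually carries indentation and line breaks from the markup, so this kind of normalization is typically applied after tags have been stripped.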

§Feature Flags

Feature   Default   Description
presets   Yes       Include prebuilt cleaning configurations
regex     No        Enable regex-based selectors
url       No        Enable URL processing with the url crate
full      No        Enable all features

§Modules

  • cleaner - Core HtmlCleaner and cleaning operations
  • text - Text processing utilities
  • tree - lxml-style text/tail tree manipulation
  • dom - DOM helper utilities
  • dedup - Content deduplication
  • presets - Ready-to-use cleaning configurations (feature: presets)
  • links - URL and link processing (feature: url)
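The `dedup` module is described above only as LRU-based duplicate detection. As a rough illustration of that idea, the sketch below keeps fingerprints of the most recently seen texts and forgets the least recently seen one once a fixed capacity is reached. The type `LruDedup` and its methods are hypothetical, use only the standard library, and should not be read as this crate's actual API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Hypothetical bounded deduplicator (not the crate's API): remembers the
/// fingerprints of up to `capacity` recently seen texts.
struct LruDedup {
    capacity: usize,
    order: VecDeque<u64>, // front = least recently seen
    seen: HashSet<u64>,
}

impl LruDedup {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, order: VecDeque::new(), seen: HashSet::new() }
    }

    /// 64-bit fingerprint of the text; collisions are possible but rare.
    fn fingerprint(text: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        text.hash(&mut hasher);
        hasher.finish()
    }

    /// Returns true if `text` was seen recently. The text is recorded, or its
    /// recency refreshed, either way.
    fn is_duplicate(&mut self, text: &str) -> bool {
        let key = Self::fingerprint(text);
        if self.seen.contains(&key) {
            // Refresh recency by moving the key to the back of the queue.
            if let Some(pos) = self.order.iter().position(|k| *k == key) {
                self.order.remove(pos);
            }
            self.order.push_back(key);
            return true;
        }
        // Evict the least recently seen fingerprint when full.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(key);
        self.seen.insert(key);
        false
    }
}

fn main() {
    let mut dedup = LruDedup::new(2);
    assert!(!dedup.is_duplicate("a"));
    assert!(!dedup.is_duplicate("b"));
    assert!(dedup.is_duplicate("a")); // "a" is now the most recent entry
    assert!(!dedup.is_duplicate("c")); // evicts "b", the least recent
    assert!(!dedup.is_duplicate("b")); // "b" was forgotten
}
```

The linear scan used to refresh recency keeps the sketch short; a production cache would normally pair the map with a linked list so that refreshes are constant time.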

Re-exports§

pub use cleaner::HtmlCleaner;
pub use error::Error;
pub use error::Result;
pub use options::CleaningOptions;
pub use options::CleaningOptionsBuilder;

Modules§

cleaner
Core HTML cleaning functionality.
dedup
Content deduplication utilities.
dom
DOM helper utilities.
error
Error types for html-cleaning.
links
URL and link processing utilities.
options
Configuration options for HTML cleaning.
presets
Prebuilt cleaning configurations.
text
Text processing utilities.
tree
Tree manipulation with lxml-style text/tail model.
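For readers unfamiliar with the lxml-style text/tail model that the `tree` module refers to: each element stores the text that precedes its first child (`text`) and the text that follows its own closing tag up to the next sibling (`tail`). The sketch below models this with a hypothetical `Element` struct, which is an illustration only and not the type used by this crate or by dom_query. It omits attributes and escaping.

```rust
/// Hypothetical element in an lxml-style tree (illustration only).
struct Element {
    tag: String,
    /// Text between the opening tag and the first child.
    text: Option<String>,
    /// Text after this element's closing tag, before the next sibling.
    tail: Option<String>,
    children: Vec<Element>,
}

impl Element {
    fn new(tag: &str) -> Self {
        Self { tag: tag.to_string(), text: None, tail: None, children: Vec::new() }
    }

    /// Serialize the element, its subtree, and its tail (no escaping).
    fn to_html(&self) -> String {
        let mut out = format!("<{}>", self.tag);
        if let Some(text) = &self.text {
            out.push_str(text);
        }
        for child in &self.children {
            out.push_str(&child.to_html());
        }
        out.push_str(&format!("</{}>", self.tag));
        if let Some(tail) = &self.tail {
            out.push_str(tail);
        }
        out
    }
}

fn main() {
    // Models: <p>Hello <b>bold</b> world</p>
    let mut b = Element::new("b");
    b.text = Some("bold".to_string());
    b.tail = Some(" world".to_string()); // belongs to <b>, not to <p>

    let mut p = Element::new("p");
    p.text = Some("Hello ".to_string());
    p.children.push(b);

    assert_eq!(p.to_html(), "<p>Hello <b>bold</b> world</p>");
}
```

One practical consequence of this model is that removing an element also removes its tail unless the tail is first moved to the previous sibling or the parent, which is why text/tail-aware helpers are useful when stripping tags.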

Structs§

Document
Document represents an HTML document to be manipulated.
Selection
Selection represents a collection of nodes matching some criteria. The initial Selection object can be created with Document::select and then manipulated using its own methods.