Module processing

DOM Content Processing Utilities

This module provides high-performance utilities for processing raw HTML content, extracting clean text, and normalizing web page content for downstream consumption.

Features

  • HTML Cleaning: Remove scripts, styles, and other non-content elements
  • Text Extraction: Convert HTML to clean, readable text
  • Entity Decoding: Properly decode HTML entities
  • Whitespace Normalization: Clean up excessive whitespace while preserving structure
  • Truncation: Truncate overlong content and mark the cut with an ellipsis
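
To make the last two features concrete, here is a minimal std-only sketch of whitespace normalization and ellipsis truncation. The helper names `normalize_whitespace` and `truncate_with_ellipsis`, the one-blank-line rule, and the word-boundary cut are assumptions made for illustration; they are not taken from this module, whose actual behaviour may differ.

```rust
/// Hypothetical helper (not part of this module's documented API):
/// collapses runs of whitespace inside each line to a single space and
/// squeezes consecutive blank lines down to one, so paragraph breaks survive.
fn normalize_whitespace(input: &str) -> String {
    let mut out = String::new();
    let mut pending_blank = false;
    for line in input.lines() {
        // split_whitespace drops leading, trailing, and repeated whitespace.
        let collapsed = line.split_whitespace().collect::<Vec<_>>().join(" ");
        if collapsed.is_empty() {
            // Remember a paragraph break only once some text has been emitted.
            pending_blank = !out.is_empty();
            continue;
        }
        if !out.is_empty() {
            out.push('\n');
            if pending_blank {
                out.push('\n');
            }
        }
        pending_blank = false;
        out.push_str(&collapsed);
    }
    out
}

/// Hypothetical helper: limits `text` to at most `max_chars` characters,
/// counting the trailing ellipsis, and backs up to the last whitespace in
/// the kept prefix so that a word is not split when that can be avoided.
fn truncate_with_ellipsis(text: &str, max_chars: usize) -> String {
    if text.chars().count() <= max_chars {
        return text.to_string();
    }
    if max_chars == 0 {
        return String::new();
    }
    // Reserve one character for the ellipsis itself.
    let head: String = text.chars().take(max_chars - 1).collect();
    let cut = match head.rfind(char::is_whitespace) {
        Some(idx) if idx > 0 => &head[..idx],
        _ => head.as_str(),
    };
    format!("{}…", cut.trim_end())
}
```

Counting characters rather than bytes keeps the cut from landing inside a multi-byte UTF-8 sequence, which matters for decoded web content.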

Example

use reasonkit_web::processing::{ContentProcessor, ContentProcessorConfig};

let config = ContentProcessorConfig::default();
let processor = ContentProcessor::new(config);

let html = r#"<html><head><script>evil();</script></head>
              <body><p>Hello &amp; welcome!</p></body></html>"#;

let result = processor.process(html);
assert!(result.text.contains("Hello & welcome!"));
assert!(!result.text.contains("evil"));
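
The two assertions above depend on script removal and entity decoding. The following std-only sketch illustrates one way such steps could work. `strip_element` and `decode_basic_entities` are hypothetical names introduced here, and the string-scanning approach is an assumption; the real `ContentProcessor` may well use a proper HTML parser instead.

```rust
/// Hypothetical helper: removes every `<tag ...>...</tag>` block for one
/// lowercase tag name, matching case-insensitively. This is a plain prefix
/// scan, so "script" would also match a tag such as `<scripture>`; an
/// unterminated block drops the remainder of the input.
fn strip_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing preserves byte offsets, so indices stay aligned.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{}", tag);
    let close = format!("</{}>", tag);
    let mut out = String::new();
    let mut pos = 0;
    while let Some(found) = lower[pos..].find(&open) {
        let start = pos + found;
        out.push_str(&html[pos..start]);
        match lower[start..].find(&close) {
            Some(end) => pos = start + end + close.len(),
            None => {
                pos = html.len();
                break;
            }
        }
    }
    out.push_str(&html[pos..]);
    out
}

/// Hypothetical helper: decodes a handful of named entities plus decimal
/// and hexadecimal numeric references; anything else is left untouched.
fn decode_basic_entities(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        rest = &rest[amp..];
        // Try to read `&name;` starting at the ampersand.
        let decoded = rest.find(';').and_then(|semi| {
            let name = &rest[1..semi];
            let ch = match name {
                "amp" => Some('&'),
                "lt" => Some('<'),
                "gt" => Some('>'),
                "quot" => Some('"'),
                "apos" => Some('\''),
                "nbsp" => Some('\u{a0}'),
                _ => name.strip_prefix('#').and_then(|num| {
                    let hex = num.strip_prefix('x').or_else(|| num.strip_prefix('X'));
                    let code = match hex {
                        Some(digits) => u32::from_str_radix(digits, 16).ok()?,
                        None => num.parse::<u32>().ok()?,
                    };
                    char::from_u32(code)
                }),
            };
            ch.map(|c| (c, semi + 1))
        });
        match decoded {
            Some((c, consumed)) => {
                out.push(c);
                rest = &rest[consumed..];
            }
            None => {
                // Not a recognised entity: keep the ampersand literally.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    out
}
```

In this sketch, stripping runs before decoding, so an encoded `&lt;script&gt;` in the page text is never mistaken for a real tag.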

Structs

ContentProcessor
Content processor for HTML documents
ContentProcessorConfig
Configuration for the content processor
ProcessedContent
Result of content processing