Crate spider_agent_html

§Spider Agent HTML

HTML processing utilities for spider_agent — cleaning, content analysis integration, and diffing.

This crate provides the HTML cleaning functions extracted from spider_agent. It uses lol_html for fast, streaming HTML rewriting.

§Dependencies

  • lol_html — streaming HTML rewriter
  • aho-corasick — pattern matching (via spider_agent_types)
  • spider_agent_types — type definitions

§Functions

clean_html
Default cleaner (base level).
clean_html_base
Clean the HTML by removing CSS and JS (base level).
clean_html_full
Full/aggressive HTML cleaning.
clean_html_raw
Raw passthrough: performs no cleaning.
clean_html_slim
Slim HTML cleaning: removes heavy elements.
clean_html_with_profile
Clean HTML using a specific profile.
clean_html_with_profile_and_intent
Clean HTML with a specific profile and intent.
smart_clean_html
Smart HTML cleaner that automatically determines the best cleaning level.
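
The function names above suggest a ladder of cleaning levels (raw, base, slim, full). This page does not show the functions' signatures, so the sketch below does not call the crate. It is a standalone, std-only illustration of how such levels could nest. The `Level` enum, `strip_element`, `clean`, and the tag lists for the slim and full levels are all hypothetical. Only "base removes CSS and JS" comes from the descriptions above. The naive string scan stands in for the streaming lol_html rewriter the crate actually uses.

```rust
/// Hypothetical cleaning levels mirroring the function names listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Raw,
    Base,
    Slim,
    Full,
}

/// Remove every `<tag ...>...</tag>` block with a naive, case-insensitive scan.
/// An unterminated block is dropped through to the end of the input.
fn strip_element(html: &str, tag: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = format!("<{tag}");
    let close = format!("</{tag}>");
    let mut out = String::with_capacity(html.len());
    let mut pos = 0;
    while let Some(rel) = lower[pos..].find(&open) {
        let start = pos + rel;
        let after = start + open.len();
        // Require a tag boundary so `<style` does not match `<styles>`.
        let is_tag = matches!(
            lower.as_bytes().get(after),
            Some(b'>' | b' ' | b'\t' | b'\n' | b'/')
        );
        if !is_tag {
            out.push_str(&html[pos..after]);
            pos = after;
            continue;
        }
        out.push_str(&html[pos..start]);
        match lower[start..].find(&close) {
            Some(end_rel) => pos = start + end_rel + close.len(),
            None => {
                pos = html.len();
                break;
            }
        }
    }
    out.push_str(&html[pos..]);
    out
}

/// Apply the (assumed) tag list for a level; each level extends the previous one.
fn clean(html: &str, level: Level) -> String {
    let tags: &[&str] = match level {
        Level::Raw => &[],
        Level::Base => &["script", "style"],
        // Assumption: "heavy elements" means embedded graphics and frames.
        Level::Slim => &["script", "style", "svg", "iframe"],
        // Assumption: "aggressive" also drops page chrome.
        Level::Full => &["script", "style", "svg", "iframe", "nav", "footer"],
    };
    tags.iter()
        .fold(html.to_string(), |acc, tag| strip_element(&acc, tag))
}

fn main() {
    let page = "<nav>menu</nav><p>Hi</p><script>alert(1)</script><svg><path/></svg>";
    for level in [Level::Raw, Level::Base, Level::Slim, Level::Full] {
        println!("{:?}: {}", level, clean(page, level));
    }
}
```

A "smart" cleaner in this picture would inspect the input first and then pick one of the levels. How smart_clean_html actually makes that choice is not described on this page.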