Module html

HTML text extractor.

Strips HTML tags with a simple state-machine parser and preserves visible text content. Block-level elements (p, div, h1 through h6, li, td, th, br) produce paragraph boundaries. Headings (<h1> through <h6>) populate heading_path.

Security: no JavaScript execution, no external resource loading, no DOM construction. Pure text extraction only (RFC-015 §15).
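
The tag-stripping behavior described above could be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the names `strip_tags` and `flush` are hypothetical, `heading_path` is not modeled, and the sketch assumes `>` never appears inside attribute values, comments, or script bodies.

```rust
/// Hypothetical helper (not part of this crate's API): returns the visible
/// text of `html` as paragraphs, splitting at block-level tags.
fn strip_tags(html: &str) -> Vec<String> {
    // Block-level tag names taken from the module description.
    const BLOCK: &[&str] = &[
        "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "th", "br",
    ];

    // The two states of the parser: outside a tag or inside one.
    enum State {
        Text,
        Tag,
    }

    let mut state = State::Text;
    let mut tag = String::new();
    let mut current = String::new();
    let mut paragraphs = Vec::new();

    for ch in html.chars() {
        match state {
            State::Text => {
                if ch == '<' {
                    tag.clear();
                    state = State::Tag;
                } else {
                    current.push(ch);
                }
            }
            State::Tag => {
                if ch == '>' {
                    // Tag name: drop a leading '/', keep the alphanumeric prefix.
                    let name = tag
                        .trim_start_matches('/')
                        .chars()
                        .take_while(|c| c.is_ascii_alphanumeric())
                        .collect::<String>()
                        .to_ascii_lowercase();
                    // Opening or closing block tags end the current paragraph.
                    if BLOCK.contains(&name.as_str()) {
                        flush(&mut current, &mut paragraphs);
                    }
                    state = State::Text;
                } else {
                    tag.push(ch);
                }
            }
        }
    }
    flush(&mut current, &mut paragraphs);
    paragraphs
}

/// Normalizes whitespace in `current`, stores it if non-empty, then resets it.
fn flush(current: &mut String, paragraphs: &mut Vec<String>) {
    let text = current.split_whitespace().collect::<Vec<_>>().join(" ");
    if !text.is_empty() {
        paragraphs.push(text);
    }
    current.clear();
}
```

Under these assumptions, `strip_tags("<h1>Title</h1><p>Hello <b>world</b></p>")` yields two paragraphs, `"Title"` and `"Hello world"`: the inline `<b>` tag is dropped without splitting the text, while the block-level tags mark boundaries. Nothing is executed or fetched, and no tree is built; the input is scanned once, character by character.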

Structs

HtmlExtractor