Module html

HTML-to-text extraction.

Two extraction strategies:

strip_to_text (always available) – fast tag stripping with entity decoding, semantic element filtering, and Wikipedia boilerplate removal. Uses memchr for SIMD-accelerated scanning.
extract_with_readability (requires the readability feature) – Mozilla Readability algorithm that extracts the main article content and strips navigation, sidebars, and boilerplate.
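
To make the tag-stripping idea concrete, here is a minimal sketch using only the standard library. It is not this module's implementation: naive_strip_tags is a hypothetical name, and the sketch omits everything that distinguishes strip_to_text (entity decoding, semantic element filtering, Wikipedia boilerplate removal, and memchr-based scanning). It only illustrates the core idea of dropping markup and normalizing whitespace.

```rust
/// Hypothetical helper for illustration only; not part of this module.
/// Drops everything between '<' and '>' and collapses runs of whitespace
/// into a single space. Tags themselves are not treated as word boundaries,
/// so adjacent block elements with no whitespace between them run together.
fn naive_strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    let mut pending_space = false;

    for ch in html.chars() {
        match ch {
            // Entering a tag: ignore everything until the closing '>'.
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if in_tag => {}
            // Remember that whitespace was seen, but emit it lazily so that
            // leading and trailing whitespace is never written.
            c if c.is_whitespace() => pending_space = true,
            c => {
                if pending_space && !out.is_empty() {
                    out.push(' ');
                }
                pending_space = false;
                out.push(c);
            }
        }
    }
    out
}
```

A character-by-character loop like this is easy to read but slow on large inputs, which is presumably why the real function scans for delimiters with memchr instead.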

Structs§

StripOptions: Options for HTML-to-text stripping.

Functions§

decode_entities: Decode all HTML entities in a string.
strip_to_text: Strip HTML tags and decode entities, returning clean plain text.
strip_to_text_with_options: Strip HTML tags with explicit options.
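
As a rough illustration of what entity decoding involves, the sketch below handles the five predefined XML entities plus decimal and hexadecimal numeric references. Both function names are hypothetical and the code is an assumption-laden stand-in, not the decode_entities documented above, which is described as decoding all HTML entities. Unrecognized or unterminated references are passed through unchanged.

```rust
/// Hypothetical helper for illustration only; not part of this module.
/// Decodes &amp; &lt; &gt; &quot; &apos; and numeric references such as
/// &#62; or &#x41;. Anything else is copied through verbatim.
fn decode_basic_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(amp) = rest.find('&') {
        // Copy the text before the ampersand as-is.
        out.push_str(&rest[..amp]);
        let tail = &rest[amp..];

        // Try to read "name;" after the '&' and map it to a character.
        let decoded = tail[1..].find(';').and_then(|semi| {
            let name = &tail[1..1 + semi];
            let ch = match name {
                "amp" => Some('&'),
                "lt" => Some('<'),
                "gt" => Some('>'),
                "quot" => Some('"'),
                "apos" => Some('\''),
                _ => decode_numeric(name),
            }?;
            // Bytes consumed: '&' + name + ';'.
            Some((ch, semi + 2))
        });

        match decoded {
            Some((ch, consumed)) => {
                out.push(ch);
                rest = &tail[consumed..];
            }
            // Not a reference we know: keep the '&' and move on.
            None => {
                out.push('&');
                rest = &tail[1..];
            }
        }
    }
    out.push_str(rest);
    out
}

/// Parses "#NNN" (decimal) or "#xHHH" (hexadecimal) into a char, if valid.
fn decode_numeric(name: &str) -> Option<char> {
    let digits = name.strip_prefix('#')?;
    let code = match digits.strip_prefix('x').or_else(|| digits.strip_prefix('X')) {
        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
        None => digits.parse::<u32>().ok()?,
    };
    char::from_u32(code)
}
```

A full decoder would also need the complete table of HTML named character references and the spec's handling of references that lack a trailing semicolon, both of which this sketch leaves out.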