Module html


HTML-to-text extraction.

Two extraction strategies:

  1. strip_to_text (always available) – fast tag stripping with entity decoding, semantic element filtering, and Wikipedia boilerplate removal. Uses memchr for SIMD-accelerated scanning.

  2. extract_with_readability (requires the readability feature) – applies the Mozilla Readability algorithm to extract the main article content, stripping navigation, sidebars, and boilerplate.
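
To make the first strategy concrete, here is a minimal sketch of tag stripping with element filtering. It is not this module's implementation: the names strip_tags_sketch and opens_element are hypothetical, the scan uses the standard library's str::find rather than memchr, and entity decoding and Wikipedia boilerplate removal are left out. It also assumes that every tag should be replaced by a space and that only script and style contents are filtered.

```rust
/// Returns true when `tag` (lowercased, without angle brackets) opens `name`.
/// Hypothetical helper for this sketch only.
fn opens_element(tag: &str, name: &str) -> bool {
    match tag.strip_prefix(name) {
        Some(tail) => tail.is_empty() || tail.starts_with(char::is_whitespace),
        None => false,
    }
}

/// Hypothetical sketch: removes tags, drops the contents of <script> and
/// <style> elements, and collapses runs of whitespace.
fn strip_tags_sketch(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;

    while let Some(lt) = rest.find('<') {
        // Text before the tag is kept verbatim.
        out.push_str(&rest[..lt]);
        let after_lt = &rest[lt + 1..];
        let Some(gt) = after_lt.find('>') else {
            // Unterminated tag: treat the remainder as markup and stop.
            rest = "";
            break;
        };
        let tag = after_lt[..gt].trim().to_ascii_lowercase();
        rest = &after_lt[gt + 1..];
        // Simplification: every tag becomes a space so adjacent blocks
        // do not run together.
        out.push(' ');

        // Element filtering: jump straight to the matching closing tag.
        if let Some(name) = ["script", "style"].iter().find(|n| opens_element(&tag, n)) {
            let close = format!("</{name}");
            // ASCII lowercasing keeps byte offsets identical to `rest`.
            let lower = rest.to_ascii_lowercase();
            rest = match lower.find(&close) {
                Some(pos) => &rest[pos..],
                None => "",
            };
        }
    }
    out.push_str(rest);

    // Collapse whitespace introduced by markup and source formatting.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With this sketch, strip_tags_sketch("<p>Hello <b>world</b></p>") yields "Hello world". A '>' inside a comment or attribute value would end a tag early, which a production stripper has to handle.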

Structs

StripOptions
Options for HTML-to-text stripping.

Functions

decode_entities
Decode all HTML entities in a string.
strip_to_text
Strip HTML tags and decode entities, returning clean plain text.
strip_to_text_with_options
Strip HTML tags with explicit options.
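
The signatures of these functions are not shown on this page, so the following is only a sketch of what entity decoding involves, not the behavior of decode_entities itself. The names decode_entities_sketch and decode_one are hypothetical, and the sketch assumes a small subset of named entities plus decimal and hexadecimal numeric references. Anything it does not recognize is left untouched.

```rust
/// Hypothetical helper: decodes the text between '&' and ';'.
fn decode_one(name: &str) -> Option<char> {
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        "nbsp" => Some('\u{a0}'),
        _ => {
            // Numeric references: "#65" (decimal) or "#x41" (hexadecimal).
            let digits = name.strip_prefix('#')?;
            let (radix, body) = match digits.strip_prefix(|c: char| c == 'x' || c == 'X') {
                Some(hex) => (16, hex),
                None => (10, digits),
            };
            if !body.chars().all(|c| c.is_digit(radix)) {
                return None;
            }
            char::from_u32(u32::from_str_radix(body, radix).ok()?)
        }
    }
}

/// Hypothetical sketch: replaces recognized entities and copies everything
/// else, including unknown or malformed references, unchanged.
fn decode_entities_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let tail = &rest[amp..];
        // Try to decode "&name;"; on success report how many bytes it spans.
        let decoded = tail[1..].find(';').and_then(|semi| {
            decode_one(&tail[1..1 + semi]).map(|c| (c, semi + 2))
        });
        match decoded {
            Some((c, consumed)) => {
                out.push(c);
                rest = &tail[consumed..];
            }
            None => {
                // Not an entity: keep the ampersand and move on.
                out.push('&');
                rest = &tail[1..];
            }
        }
    }
    out.push_str(rest);
    out
}
```

For example, decode_entities_sketch("a &lt; b") returns "a < b", while a bare ampersand as in "AT&T" passes through unchanged.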