HTML-to-text extraction.
Two extraction strategies:
- strip_to_text (always available) – fast tag stripping with entity decoding, semantic element filtering, and Wikipedia boilerplate removal. Uses memchr for SIMD-accelerated scanning.
- extract_with_readability (feature readability) – Mozilla Readability algorithm that extracts the main article content, stripping navigation, sidebars, and boilerplate.
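To illustrate the general idea behind fast tag stripping, here is a minimal sketch using only the standard library. It is not this module's implementation: `naive_strip_tags` is a hypothetical helper, it uses `str::find` where the real code is described as using memchr, and it does no entity decoding, semantic element filtering, or boilerplate removal.

```rust
/// Hypothetical helper for illustration only; not the crate's `strip_to_text`.
/// Copies everything outside `<...>` pairs and drops the tags themselves.
fn naive_strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;

    // Jump from one '<' to the next instead of inspecting every character.
    while let Some(open) = rest.find('<') {
        // Text before the tag is kept verbatim.
        out.push_str(&rest[..open]);

        match rest[open..].find('>') {
            // Skip the whole tag, including its closing '>'.
            Some(close) => rest = &rest[open + close + 1..],
            // Unterminated tag: this sketch simply drops the remainder.
            None => rest = "",
        }
    }

    // Whatever follows the last tag is plain text.
    out.push_str(rest);
    out
}
```

With this sketch, `naive_strip_tags("<p>Hello <b>world</b></p>")` yields `Hello world`. A bare `<` in running text is treated as the start of a tag, which is one of the cases a production stripper has to handle more carefully.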
Structs
- StripOptions - Options for HTML-to-text stripping.
Functions
- decode_entities - Decode all HTML entities in a string.
- strip_to_text - Strip HTML tags and decode entities, returning clean plain text.
- strip_to_text_with_options - Strip HTML tags with explicit options.
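As a rough picture of what entity decoding involves, the following standard-library sketch handles the five predefined named entities plus decimal and hexadecimal numeric references. The names `decode_basic_entities` and `decode_numeric` are hypothetical and are not part of this module; the real decode_entities is documented as decoding all HTML entities, which this sketch does not attempt.

```rust
/// Hypothetical helper for illustration only; not the crate's `decode_entities`.
/// Decodes `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;` and numeric references.
fn decode_basic_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let tail = &rest[amp..];

        // Try to read `&name;` and map it to a character.
        let decoded = tail.find(';').and_then(|semi| {
            let name = &tail[1..semi];
            let ch = match name {
                "amp" => Some('&'),
                "lt" => Some('<'),
                "gt" => Some('>'),
                "quot" => Some('"'),
                "apos" => Some('\''),
                _ => decode_numeric(name),
            };
            // Return the character and how many bytes the entity occupied.
            ch.map(|c| (c, semi + 1))
        });

        match decoded {
            Some((c, consumed)) => {
                out.push(c);
                rest = &tail[consumed..];
            }
            // Unknown or unterminated entity: keep the '&' literally.
            None => {
                out.push('&');
                rest = &tail[1..];
            }
        }
    }

    out.push_str(rest);
    out
}

/// Parses `#169` or `#xA9` style references into a character.
fn decode_numeric(name: &str) -> Option<char> {
    let digits = name.strip_prefix('#')?;
    let code = match digits.strip_prefix('x').or_else(|| digits.strip_prefix('X')) {
        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
        None => digits.parse::<u32>().ok()?,
    };
    char::from_u32(code)
}
```

Unrecognized references such as `&bogus;` are passed through unchanged in this sketch, so text that merely contains an ampersand is not corrupted.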