Module extract

Token extraction: stage 2 of the two-stage tokenizer pipeline, implemented as a scalar state machine.

Uses the structural index from stage 1 to locate < and > boundaries, then scans the actual input bytes between them to extract tag names, attributes, comments, and text content. This hybrid approach combines SIMD-accelerated delimiter finding with scalar content parsing.
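
To make the two-stage split concrete, here is a minimal sketch of how an index of `<` and `>` offsets might drive scalar extraction. It is not this module's implementation: the `Token` enum, `build_index`, and `extract_tokens_sketch` are hypothetical names, the index is built with a plain scalar scan standing in for the SIMD stage, and quoted attribute values containing `>` are deliberately not handled.

```rust
/// Hypothetical token type for this sketch; the module's real token type may differ.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Tag { name: &'a str, attrs: &'a str },
    Comment(&'a str),
    Text(&'a str),
}

/// Stage 1 stand-in (assumption): byte offsets of every `<` and `>`.
/// A SIMD implementation would produce the same offsets, only faster.
fn build_index(input: &str) -> Vec<usize> {
    input
        .bytes()
        .enumerate()
        .filter(|&(_, b)| b == b'<' || b == b'>')
        .map(|(i, _)| i)
        .collect()
}

/// Stage 2 sketch: jump between indexed delimiters, scan the bytes in between.
fn extract_tokens_sketch<'a>(input: &'a str, index: &[usize]) -> Vec<Token<'a>> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut pos = 0; // first byte not yet covered by an emitted token
    let mut i = 0;
    while i < index.len() {
        let open = index[i];
        i += 1;
        if bytes[open] != b'<' {
            continue; // a stray `>` stays part of the surrounding text
        }
        let is_comment = input[open..].starts_with("<!--");
        // Find the matching `>`; a comment only ends at a `>` preceded by `--`.
        let mut close = None;
        while i < index.len() {
            let c = index[i];
            i += 1;
            let ends_here = bytes[c] == b'>'
                && (!is_comment || (c >= open + 6 && &bytes[c - 2..c] == b"--"));
            if ends_here {
                close = Some(c);
                break;
            }
        }
        // Unterminated tag or comment: leave the remainder to be emitted as text.
        let Some(close) = close else { break };
        if open > pos {
            tokens.push(Token::Text(&input[pos..open]));
        }
        if is_comment {
            tokens.push(Token::Comment(&input[open + 4..close - 2]));
        } else {
            // Scalar scan of the tag body: name up to the first whitespace, rest is attributes.
            let inner = input[open + 1..close].trim();
            let (name, attrs) = match inner.find(|c: char| c.is_ascii_whitespace()) {
                Some(k) => (&inner[..k], inner[k..].trim_start()),
                None => (inner, ""),
            };
            tokens.push(Token::Tag { name, attrs });
        }
        pos = close + 1;
    }
    if pos < input.len() {
        tokens.push(Token::Text(&input[pos..]));
    }
    tokens
}
```

In this sketch the index lets the loop jump straight from one delimiter to the next, so the per-byte work is limited to the short spans inside tags. Closing tags are reported with their leading `/` in the name, which a fuller implementation would likely split into a separate token kind.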

Functions

extract_tokens
Extract tokens from pre-indexed UTF-8 input.
extract_tokens_bytes
Extract tokens from pre-indexed raw bytes after UTF-8 validation.
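
The difference between the two entry points is where UTF-8 validation happens. As a rough illustration of the validate-then-extract pattern (not the actual signature or behavior of `extract_tokens_bytes`), the sketch below validates the buffer once with `std::str::from_utf8` and then slices out the spans between consecutive indexed delimiters. The function name `spans_from_bytes` and its return type are assumptions made for this example.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical sketch: validate raw bytes once, then return the non-empty
/// spans that lie between consecutive delimiter offsets in `index`.
fn spans_from_bytes<'a>(input: &'a [u8], index: &[usize]) -> Result<Vec<&'a str>, Utf8Error> {
    // Validation happens once, up front; every slice below borrows from `text`.
    let text = str::from_utf8(input)?;
    let mut spans = Vec::new();
    for pair in index.windows(2) {
        let (start, end) = (pair[0] + 1, pair[1]);
        // `get` returns None for out-of-range or non-boundary offsets instead of panicking.
        if let Some(span) = text.get(start..end) {
            if !span.is_empty() {
                spans.push(span);
            }
        }
    }
    Ok(spans)
}
```

Validating before extraction means the scalar stage can hand out `&str` slices without re-checking each one, at the cost of one full pass over the input.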