SIMD-accelerated HTML tokenizer.
Uses a two-stage pipeline inspired by simdjson:
- Structural indexing (SIMD): scan input in 64-byte blocks, produce per-delimiter bitmasks, then apply quote-aware masking.
- Token extraction (scalar): walk the structural index to emit tokens via a branchless state machine.
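The two stages above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the crate's implementation: the byte-wise loop stands in for the SIMD compares, the names `BlockMasks`, `index_block`, `prefix_xor`, and `structural_positions` are hypothetical, and every double quote is treated as a toggle (a real HTML tokenizer only treats quotes as significant inside tags).

```rust
/// Per-block bitmasks: bit i is set when byte i of the block matches.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct BlockMasks {
    pub lt: u64,
    pub gt: u64,
}

/// Prefix XOR: bit i of the result is the XOR of bits 0..=i of `x`.
fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Stage 1 for one block of at most 64 bytes. `in_quote` carries the
/// quote state across block boundaries.
pub fn index_block(block: &[u8], in_quote: &mut bool) -> BlockMasks {
    assert!(block.len() <= 64);
    let (mut lt, mut gt, mut quote) = (0u64, 0u64, 0u64);
    // Scalar stand-in for the SIMD byte compares.
    for (i, &b) in block.iter().enumerate() {
        let bit = 1u64 << i;
        match b {
            b'<' => lt |= bit,
            b'>' => gt |= bit,
            b'"' => quote |= bit,
            _ => {}
        }
    }
    // Bits set from each opening quote up to (not including) its closing quote.
    let mut inside = prefix_xor(quote);
    if *in_quote {
        inside = !inside;
    }
    // The top bit holds the quote parity at the end of the block.
    *in_quote = (inside >> 63) & 1 == 1;
    // Quote-aware masking: drop delimiters that fall inside quotes.
    BlockMasks { lt: lt & !inside, gt: gt & !inside }
}

/// Stage 2 (simplified): walk the set bits and report each structural
/// byte with its absolute offset.
pub fn structural_positions(input: &[u8]) -> Vec<(usize, u8)> {
    let mut out = Vec::new();
    let mut in_quote = false;
    for (block_idx, block) in input.chunks(64).enumerate() {
        let m = index_block(block, &mut in_quote);
        let mut bits = m.lt | m.gt;
        while bits != 0 {
            let pos = block_idx * 64 + bits.trailing_zeros() as usize;
            out.push((pos, input[pos]));
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}
```

For the input `<a href="x>y">t</a>`, the `>` inside the attribute value is masked out, so only the four tag delimiters are reported.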
§Quick Start
use fhp_tokenizer::tokenize;
let tokens = tokenize("<div>hello</div>");
assert!(tokens.len() >= 3);
Modules§
- entity: Entity decoding with SIMD fast-path.
- extract: Token extraction, stage 2 of the two-stage tokenizer pipeline (scalar state machine).
- state_machine: Branchless state machine for HTML token extraction.
- streaming: Streaming (chunk-based) tokenizer; see crate::streaming::StreamTokenizer.
- structural: Structural character indexer for HTML input; SIMD-powered bitmask generation (stage 1).
- token: Token types emitted by the HTML tokenizer.
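The "fast-path" idea in the entity module can be illustrated with a small sketch. The assumptions are loud ones: `decode_entities` is a hypothetical function, not the crate's API; the check for `&` uses a plain string search rather than SIMD; and only four named entities are handled. The point is the shape: if the input contains no `&`, return it borrowed and untouched, and only allocate when decoding is actually needed.

```rust
use std::borrow::Cow;

/// Hypothetical entity decoder with a no-allocation fast path.
pub fn decode_entities(input: &str) -> Cow<'_, str> {
    // Fast path: no '&' means nothing to decode, so borrow the input as-is.
    let Some(first) = input.find('&') else {
        return Cow::Borrowed(input);
    };

    // Slow path: copy the clean prefix, then decode entity by entity.
    let table = [("&amp;", '&'), ("&lt;", '<'), ("&gt;", '>'), ("&quot;", '"')];
    let mut out = String::with_capacity(input.len());
    out.push_str(&input[..first]);
    let mut rest = &input[first..];

    while let Some(pos) = rest.find('&') {
        out.push_str(&rest[..pos]);
        rest = &rest[pos..];
        match table.iter().find(|(name, _)| rest.starts_with(*name)) {
            // Known entity: emit the decoded character and skip the name.
            Some((name, ch)) => {
                out.push(*ch);
                rest = &rest[name.len()..];
            }
            // Unknown or bare '&': keep it literally and move on.
            None => {
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    Cow::Owned(out)
}
```

Returning `Cow` lets callers treat both paths uniformly while text without entities, typically the common case, costs only the scan.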
Traits§
- TreeSink: Trait for receiving parsed HTML events directly, bypassing Token allocation.
Functions§
- tokenize: Tokenize an HTML string into a sequence of tokens.
- tokenize_into: Tokenize HTML directly into a TreeSink, bypassing all intermediate allocations.
- tokenize_with: Tokenize HTML and feed each token to a callback, with zero intermediate allocation.
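The difference between collecting tokens and feeding them to a callback can be sketched as follows. Everything here is hypothetical and greatly simplified: `SketchToken`, `sketch_tokenize_with`, and `sketch_tokenize` are illustration-only names, the scanner ignores attributes, comments, and quoting, and the crate's real signatures are not shown on this page. The sketch only demonstrates why a callback-driven entry point needs no intermediate `Vec`, and how a collecting entry point can be layered on top of it.

```rust
/// Hypothetical borrowed token type; slices point into the input.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SketchToken<'a> {
    Open(&'a str),
    Close(&'a str),
    Text(&'a str),
}

/// Callback-driven tokenizer: each token is handed to `f` as soon as it
/// is recognized, so nothing is buffered.
pub fn sketch_tokenize_with<'a>(input: &'a str, mut f: impl FnMut(SketchToken<'a>)) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // A tag runs up to the next '>'; an unterminated tag is
            // reported as text in this sketch.
            let Some(end) = after_lt.find('>') else {
                f(SketchToken::Text(rest));
                return;
            };
            let body = &after_lt[..end];
            match body.strip_prefix('/') {
                Some(name) => f(SketchToken::Close(name)),
                None => f(SketchToken::Open(body)),
            }
            rest = &after_lt[end + 1..];
        } else {
            // Text runs up to the next '<' or the end of input.
            let end = rest.find('<').unwrap_or(rest.len());
            f(SketchToken::Text(&rest[..end]));
            rest = &rest[end..];
        }
    }
}

/// Collecting variant, built on the callback variant.
pub fn sketch_tokenize(input: &str) -> Vec<SketchToken<'_>> {
    let mut out = Vec::new();
    sketch_tokenize_with(input, |t| out.push(t));
    out
}
```

A caller that only needs an aggregate, such as the number of opening tags, can pass a counting closure to the callback variant and never materialize the token list.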