Crate fhp_tokenizer

SIMD-accelerated HTML tokenizer.

Uses a two-stage pipeline inspired by simdjson:

  1. Structural indexing (SIMD): scan input in 64-byte blocks, produce per-delimiter bitmasks, then apply quote-aware masking.
  2. Token extraction (scalar): walk the structural index to emit tokens via a branchless state machine.
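
The two stages above can be pictured with a scalar sketch. This is not the crate's implementation: every name here (BlockMasks, scan_block, prefix_xor, structural_positions, ToyToken, extract) is hypothetical, the per-byte loop stands in for SIMD compares, and stage 2 uses an ordinary match rather than a branchless state machine. It also assumes only double quotes matter and that quotes are balanced, which real HTML does not guarantee.

```rust
/// Hypothetical per-block bitmasks: bit i is set when byte i of the block matches.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct BlockMasks {
    lt: u64,
    gt: u64,
    quote: u64,
}

/// Scalar stand-in for the SIMD compare step over one block of at most 64 bytes.
fn scan_block(block: &[u8]) -> BlockMasks {
    debug_assert!(block.len() <= 64);
    let mut m = BlockMasks::default();
    for (i, &b) in block.iter().enumerate() {
        let bit = 1u64 << i;
        match b {
            b'<' => m.lt |= bit,
            b'>' => m.gt |= bit,
            b'"' => m.quote |= bit,
            _ => {}
        }
    }
    m
}

/// Prefix XOR: bit i of the result is the parity of input bits 0..=i, so the
/// bits from an opening quote up to (not including) its closing quote are set.
fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Stage 1 (sketch): byte offsets of '<' and '>' that are not inside quotes.
fn structural_positions(input: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut carry_in_quote = false; // did the previous block end inside quotes?
    for (block_idx, block) in input.chunks(64).enumerate() {
        let m = scan_block(block);
        let mut inside = prefix_xor(m.quote);
        if carry_in_quote {
            inside = !inside;
        }
        carry_in_quote = (inside >> 63) & 1 == 1;
        let mut structural = (m.lt | m.gt) & !inside;
        while structural != 0 {
            out.push(block_idx * 64 + structural.trailing_zeros() as usize);
            structural &= structural - 1; // clear the lowest set bit
        }
    }
    out
}

/// Hypothetical token type for this sketch only.
#[derive(Debug, PartialEq)]
enum ToyToken<'a> {
    Tag(&'a str),
    Text(&'a str),
}

/// Stage 2 (sketch): walk the structural positions and slice out tokens.
fn extract<'a>(input: &'a str, positions: &[usize]) -> Vec<ToyToken<'a>> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut cursor = 0; // start of input not yet emitted
    let mut open: Option<usize> = None; // offset of an unclosed '<'
    for &pos in positions {
        match (bytes[pos], open) {
            (b'<', None) => {
                if pos > cursor {
                    tokens.push(ToyToken::Text(&input[cursor..pos]));
                }
                cursor = pos;
                open = Some(pos);
            }
            (b'>', Some(start)) => {
                tokens.push(ToyToken::Tag(&input[start..=pos]));
                cursor = pos + 1;
                open = None;
            }
            _ => {} // stray delimiter: stays part of the surrounding text or tag
        }
    }
    if cursor < input.len() {
        tokens.push(ToyToken::Text(&input[cursor..])); // trailing text or unterminated tag
    }
    tokens
}
```

Calling structural_positions on the bytes of a document and passing the result to extract yields tag and text slices that borrow from the input. The point of the split is that stage 2 never inspects bytes that stage 1 has already ruled out, such as a '>' inside a quoted attribute value.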

§Quick Start

use fhp_tokenizer::tokenize;

let tokens = tokenize("<div>hello</div>");
assert!(tokens.len() >= 3);

Modules§

entity
Entity decoding with SIMD fast-path.
extract
Token extraction: stage 2 of the two-stage tokenizer pipeline (scalar state machine).
state_machine
Branchless state machine for HTML token extraction.
streaming
Streaming (chunk-based) tokenizer; see crate::streaming::StreamTokenizer.
structural
Structural character indexer for HTML input: SIMD-powered bitmask generation (stage 1).
token
Token types emitted by the HTML tokenizer.
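
The entity module is described only as having a fast path. One common way to build such a path, shown here as an assumption and not as the crate's code, is to search for '&' first and return the input unchanged and borrowed when none is found. The function name decode_entities and the four-entry table are hypothetical, and str::find stands in for a SIMD byte search.

```rust
use std::borrow::Cow;

/// Hypothetical, deliberately tiny entity table for this sketch.
const KNOWN: [(&str, char); 4] = [
    ("&amp;", '&'),
    ("&lt;", '<'),
    ("&gt;", '>'),
    ("&quot;", '"'),
];

/// Decode a handful of named entities. Inputs without '&' are returned borrowed.
fn decode_entities(input: &str) -> Cow<'_, str> {
    // Fast path: nothing to decode, so no allocation and no copy.
    let first = match input.find('&') {
        Some(i) => i,
        None => return Cow::Borrowed(input),
    };

    // Slow path: copy text through, replacing recognized entities.
    let mut out = String::with_capacity(input.len());
    out.push_str(&input[..first]);
    let mut rest = &input[first..];
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        rest = &rest[amp..];
        match KNOWN.iter().find(|(name, _)| rest.starts_with(*name)) {
            Some((name, ch)) => {
                out.push(*ch);
                rest = &rest[name.len()..];
            }
            None => {
                // Unknown or malformed entity: keep the '&' literally.
                out.push('&');
                rest = &rest[1..];
            }
        }
    }
    out.push_str(rest);
    Cow::Owned(out)
}
```

Returning Cow lets callers treat both paths uniformly while paying for an allocation only when the text actually contains an ampersand.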

Traits§

TreeSink
Trait for receiving parsed HTML events directly, bypassing Token allocation.

Functions§

tokenize
Tokenize an HTML string into a sequence of tokens.
tokenize_into
Tokenize HTML directly into a TreeSink, bypassing all intermediate allocations.
tokenize_with
Tokenize HTML and feed each token to a callback — zero intermediate allocation.
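
The relationship between the three entry points can be illustrated with a toy tokenizer. The sketch below does not use the crate's real signatures, which are not shown on this page: ToyToken, ToySink, Counter, toy_tokenize_with, toy_tokenize, and toy_tokenize_into are all hypothetical. It only shows the design idea that a callback-driven core can back both a collecting function and a sink-driven one.

```rust
/// Hypothetical token type; borrows from the input, so it needs no heap allocation.
#[derive(Debug, Clone, PartialEq)]
enum ToyToken<'a> {
    Tag(&'a str),
    Text(&'a str),
}

/// Callback-driven core: each token is handed to `on_token` as soon as it is found.
fn toy_tokenize_with<'a, F: FnMut(ToyToken<'a>)>(input: &'a str, mut on_token: F) {
    let mut rest = input;
    while !rest.is_empty() {
        let (token, len) = if rest.starts_with('<') {
            // A tag runs to the next '>' or, if unterminated, to the end of input.
            let end = rest.find('>').map_or(rest.len(), |i| i + 1);
            (ToyToken::Tag(&rest[..end]), end)
        } else {
            // Text runs to the next '<' or to the end of input.
            let end = rest.find('<').unwrap_or(rest.len());
            (ToyToken::Text(&rest[..end]), end)
        };
        on_token(token);
        rest = &rest[len..];
    }
}

/// Collecting variant: the only allocation is the output Vec.
fn toy_tokenize(input: &str) -> Vec<ToyToken<'_>> {
    let mut out = Vec::new();
    toy_tokenize_with(input, |t| out.push(t));
    out
}

/// Hypothetical event receiver, loosely analogous to a tree-building sink.
trait ToySink {
    fn tag(&mut self, raw: &str);
    fn text(&mut self, raw: &str);
}

/// Sink-driven variant: events go straight to the receiver, and no Vec is built.
fn toy_tokenize_into<S: ToySink>(input: &str, sink: &mut S) {
    toy_tokenize_with(input, |t| match t {
        ToyToken::Tag(raw) => sink.tag(raw),
        ToyToken::Text(raw) => sink.text(raw),
    });
}

/// Example sink that only gathers statistics.
struct Counter {
    tags: usize,
    text_bytes: usize,
}

impl ToySink for Counter {
    fn tag(&mut self, _raw: &str) {
        self.tags += 1;
    }
    fn text(&mut self, raw: &str) {
        self.text_bytes += raw.len();
    }
}
```

A caller that only needs aggregate information, such as Counter here, never materializes a token list, which is the kind of saving the descriptions of tokenize_with and tokenize_into point at.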