fast-html-parser
SIMD-optimized HTML parser for Rust, built for web scraping workloads.
Uses SIMD instructions (SSE4.2, AVX2, NEON) for tokenization and builds a cache-line-aligned, arena-based DOM tree for fast traversal.
Installation
[dependencies]
fast-html-parser = "0.1"
To enable optional features:
[dependencies]
fast-html-parser = { version = "0.1", features = ["xpath", "encoding", "async-tokio"] }
Quick Start
use HtmlParser;
let doc = parse.unwrap;
assert_eq!;
CSS Selectors
use *;
let doc = parse.unwrap;
let items = doc.select.unwrap;
assert_eq!;
Compiled Selectors
Pre-compile a selector once and reuse it across many documents — ideal for scraping loops:
use *;
let selector = new.unwrap;
for html in &
Zero-Copy Parsing
When you already own a String (e.g. from an HTTP response), avoid the internal memcpy:
use HtmlParser;
let body = Stringfrom;
let doc = parse_owned.unwrap;
assert_eq!;
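To illustrate why handing over an owned `String` can skip a copy, here is a minimal, self-contained sketch. It does not use this crate: `OwnedSource`, `from_borrowed`, and `from_owned` are hypothetical names chosen for illustration, and the crate's real internals may differ.

```rust
/// Hypothetical document type that keeps its source text alive.
/// This is NOT the crate's type; it only illustrates the ownership idea.
pub struct OwnedSource {
    source: String,
}

impl OwnedSource {
    /// Borrowed input: to own the text, the bytes must be copied
    /// into a fresh allocation.
    pub fn from_borrowed(input: &str) -> Self {
        Self { source: input.to_owned() }
    }

    /// Owned input: the String is moved in, so the existing heap
    /// buffer is reused and no bytes are copied.
    pub fn from_owned(input: String) -> Self {
        Self { source: input }
    }

    /// Address of the underlying buffer, exposed for demonstration.
    pub fn source_ptr(&self) -> *const u8 {
        self.source.as_ptr()
    }

    /// Length of the stored source in bytes.
    pub fn len(&self) -> usize {
        self.source.len()
    }
}
```

Moving a `String` transfers only its pointer, length, and capacity, so the buffer address is unchanged after `from_owned`, while `from_borrowed` always ends up with a different buffer.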
XPath
use *;
let doc = parse.unwrap;
let result = doc.xpath.unwrap;
Builder Pattern
use HtmlParser;
let parser = builder
.max_input_size
.fragment_mode
.build;
let doc = parser.parse_str.unwrap;
Streaming
use parse_stream;
let html = b"<div><p>Hello</p></div>";
let doc = parse_stream.unwrap;
Encoding Detection
use HtmlParser;
// Automatically detects encoding from BOM or <meta charset>
let doc = parse_bytes.unwrap;
Feature Flags
| Feature | Default | Description |
|---|---|---|
| `css-selector` | Yes | CSS selector engine (type, class, ID, attribute, pseudo-class, combinators) |
| `entity-decode` | Yes | HTML entity decoding |
| `xpath` | No | XPath expression support |
| `encoding` | No | Auto-detect encoding from raw bytes (BOM, meta charset) |
| `async-tokio` | No | Async parsing via Tokio |
Architecture
The parser is organized as a workspace of focused crates:
| Crate | Purpose |
|---|---|
| `fhp-core` | Interned HTML tags (PHF), entity table, error types |
| `fhp-simd` | SIMD abstraction layer with runtime dispatch |
| `fhp-tokenizer` | Two-phase tokenizer (structural indexing + token extraction) |
| `fhp-tree` | Arena-based DOM tree with 64-byte aligned nodes |
| `fhp-selector` | CSS selector engine with bloom filter + XPath evaluator |
| `fhp-encoding` | Encoding detection and conversion via encoding_rs |
| `fast-html-parser` | Facade crate that re-exports everything |
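
As a rough illustration of what an arena-based tree with 64-byte aligned nodes can look like, the sketch below stores all nodes in one `Vec` and links them by index. The `Node`, `NodeId`, and `Arena` types and their fields are assumptions made for this example, not the layout used by `fhp-tree`.

```rust
/// Index of a node inside the arena (hypothetical; not the crate's type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(u32);

/// One node per 64-byte cache line, so touching a node
/// never straddles two lines.
#[repr(align(64))]
#[derive(Debug)]
pub struct Node {
    pub tag: &'static str,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

/// All nodes live in one contiguous allocation; links are indices,
/// not pointers, so the tree needs no per-node heap allocation.
pub struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    pub fn new() -> Self {
        Self { nodes: Vec::new() }
    }

    /// Append a node and link it as the last child of `parent`.
    pub fn alloc(&mut self, tag: &'static str, parent: Option<NodeId>) -> NodeId {
        let id = NodeId(self.nodes.len() as u32);
        self.nodes.push(Node {
            tag,
            parent,
            first_child: None,
            last_child: None,
            next_sibling: None,
        });
        if let Some(p) = parent {
            // Copy the old last child out before mutating the arena.
            let previous_last = self.nodes[p.0 as usize].last_child;
            match previous_last {
                Some(prev) => self.nodes[prev.0 as usize].next_sibling = Some(id),
                None => self.nodes[p.0 as usize].first_child = Some(id),
            }
            self.nodes[p.0 as usize].last_child = Some(id);
        }
        id
    }

    pub fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.0 as usize]
    }

    /// Collect the direct children of `id` in document order.
    pub fn children(&self, id: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut cursor = self.get(id).first_child;
        while let Some(child) = cursor {
            out.push(child);
            cursor = self.get(child).next_sibling;
        }
        out
    }
}
```

Index links keep the tree relocatable (the `Vec` can grow without invalidating links), and the `repr(align(64))` attribute makes the allocator place every node on a cache-line boundary.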
Performance
Benchmarked on ARM64 (Apple Silicon, NEON):
SIMD Throughput
| Operation | Throughput |
|---|---|
| skip_whitespace | 10.2 GiB/s |
| find_delimiters | 8.3 GiB/s |
| classify_bytes | 6.2 GiB/s |
NEON achieves a roughly 5x to 5.5x speedup over the scalar fallback.
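
The general pattern behind a SIMD kernel with a scalar fallback can be sketched as follows. This is an illustrative example, not the code in `fhp-simd`: it uses AVX2 on x86_64 (a NEON path would follow the same shape), the function names mirror the benchmark labels only for readability, and the whitespace set (space, tab, LF, CR) is an assumption.

```rust
/// Portable fallback: count leading whitespace bytes one at a time.
pub fn skip_whitespace_scalar(input: &[u8]) -> usize {
    input
        .iter()
        .take_while(|&&b| matches!(b, b' ' | b'\t' | b'\n' | b'\r'))
        .count()
}

/// AVX2 kernel: classify 32 bytes per iteration.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn skip_whitespace_avx2(input: &[u8]) -> usize {
    use std::arch::x86_64::*;
    let mut i = 0;
    while i + 32 <= input.len() {
        let chunk = _mm256_loadu_si256(input.as_ptr().add(i) as *const __m256i);
        // Each comparison yields 0xFF in lanes equal to the given byte.
        let is_ws = _mm256_or_si256(
            _mm256_or_si256(
                _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(b' ' as i8)),
                _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(b'\t' as i8)),
            ),
            _mm256_or_si256(
                _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(b'\n' as i8)),
                _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(b'\r' as i8)),
            ),
        );
        // One bit per byte; a zero bit marks the first non-whitespace byte.
        let mask = _mm256_movemask_epi8(is_ws) as u32;
        if mask != u32::MAX {
            return i + (!mask).trailing_zeros() as usize;
        }
        i += 32;
    }
    // Fewer than 32 bytes remain: finish with the scalar loop.
    i + skip_whitespace_scalar(&input[i..])
}

/// Runtime dispatch: pick the widest kernel the current CPU supports.
pub fn skip_whitespace(input: &[u8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on the line above.
            return unsafe { skip_whitespace_avx2(input) };
        }
    }
    skip_whitespace_scalar(input)
}
```

On targets without the AVX2 path the dispatcher compiles down to the scalar loop, so callers use one entry point regardless of platform.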
Real-World Parse Throughput
| Page | Size | Time | Throughput | vs tl | vs scraper |
|---|---|---|---|---|---|
| Hacker News | 34 KB | ~105 µs | 314 MiB/s | 1.2x slower | 7x faster |
| GitHub | 301 KB | ~323 µs | 893 MiB/s | 1.1x slower | 3x faster |
| Stack Overflow | 415 KB | ~640 µs | 620 MiB/s | 1.1x faster | 2x faster |
| Wikipedia | 590 KB | ~1.08 ms | 521 MiB/s | 1.4x faster | 4x faster |
CSS Selector (100KB HTML)
| Selector | Time |
|---|---|
| Tag (`p`) | ~10 µs |
| Class (`.highlight`) | ~15 µs |
| ID (`#main`) | ~13 µs |
| Descendant (`div p`) | ~70 µs |
| Complex (`div > ul li a`) | ~63 µs |
Per-node hash rejection uses a 64-bit bloom filter for classes and an FNV-1a hash for IDs. The selector's hashes are precomputed at selector parse time, so non-matching nodes exit early on a cheap hash comparison. `:nth-child` uses a cached element index, which makes the lookup O(1).
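
The hash-rejection idea described above can be sketched in a few lines. This is a simplified illustration, not the code in `fhp-selector`: the `NodeSummary` and `CompiledSimple` types are hypothetical, and the choice of one bit per class (the low 6 bits of the FNV-1a hash) is an assumption.

```rust
/// 64-bit FNV-1a hash of a byte string.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Map a class name to one bit of a 64-bit bloom filter.
pub fn class_bit(class: &str) -> u64 {
    1u64 << (fnv1a_64(class.as_bytes()) & 63)
}

/// Build the filter from a `class` attribute value.
pub fn class_bloom(class_attr: &str) -> u64 {
    class_attr
        .split_ascii_whitespace()
        .fold(0, |acc, c| acc | class_bit(c))
}

/// Hypothetical per-node summary, computed once while building the tree.
pub struct NodeSummary {
    pub class_bloom: u64,
    pub id_hash: Option<u64>,
}

impl NodeSummary {
    pub fn new(class_attr: &str, id: Option<&str>) -> Self {
        Self {
            class_bloom: class_bloom(class_attr),
            id_hash: id.map(|s| fnv1a_64(s.as_bytes())),
        }
    }
}

/// Hypothetical compiled selector part; hashes are computed once here.
pub struct CompiledSimple {
    required_class_bits: u64,
    id_hash: Option<u64>,
}

impl CompiledSimple {
    pub fn new(classes: &[&str], id: Option<&str>) -> Self {
        Self {
            required_class_bits: classes.iter().fold(0, |acc, c| acc | class_bit(c)),
            id_hash: id.map(|s| fnv1a_64(s.as_bytes())),
        }
    }

    /// `false` means the node definitely does not match.
    /// `true` means it might, and a full string comparison must follow.
    pub fn may_match(&self, node: &NodeSummary) -> bool {
        let classes_ok =
            (node.class_bloom & self.required_class_bits) == self.required_class_bits;
        let id_ok = match self.id_hash {
            Some(h) => node.id_hash == Some(h),
            None => true,
        };
        classes_ok && id_ok
    }
}
```

A bloom filter can report false positives but never false negatives, so `may_match` is only a pre-filter: it can skip most nodes with two integer operations, while the nodes that pass still need an exact check.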
Run benchmarks locally:
Examples
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.