scrape-core
High-performance HTML parsing library core. Pure Rust implementation with no FFI dependencies.
Installation
```toml
[dependencies]
scrape-core = "0.2"
```
Or with cargo:
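```sh
cargo add scrape-core
```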
> [!IMPORTANT]
> Requires Rust 1.88 or later.
Usage
```rust
use scrape_core::Soup;

let html = r#"
<html>
    <body>
        <div class="content">Hello, World!</div>
        <div class="content">Another div</div>
    </body>
</html>
"#;

let soup = Soup::new(html);

// Find the first element by tag
if let Some(el) = soup.find("div") {
    // ... work with the first <div>
}

// CSS selectors
for el in soup.select(".content") {
    // ... each element matching .content
}
```
Features
Enable optional features in Cargo.toml:
```toml
[dependencies]
scrape-core = { version = "0.2", features = ["simd", "parallel"] }
```
| Feature | Description | Default |
|---|---|---|
| `simd` | SIMD-accelerated byte scanning (SSE4.2, AVX2, NEON, WASM SIMD128) | No |
| `parallel` | Parallel batch processing via Rayon | No |
> [!TIP]
> Start with default features for fastest compile times. Add `simd` for production workloads.
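To give a sense of what SIMD-accelerated byte scanning means in practice, here is a minimal, self-contained sketch that finds the first `<` in a buffer 16 bytes at a time using baseline SSE2 intrinsics. It is illustrative only, not scrape-core's implementation (which also targets the instruction sets listed above), and the function name is made up for the example.

```rust
/// Find the first `<` in `haystack`, scanning 16 bytes per iteration.
/// Illustrative only; the real scanner is more elaborate.
#[cfg(target_arch = "x86_64")]
fn find_tag_open(haystack: &[u8]) -> Option<usize> {
    use std::arch::x86_64::*;

    let mut i = 0;
    while i + 16 <= haystack.len() {
        // SAFETY: the bounds check guarantees 16 readable bytes at offset `i`,
        // and SSE2 is always available on x86_64.
        let mask = unsafe {
            let needle = _mm_set1_epi8(b'<' as i8);
            let chunk = _mm_loadu_si128(haystack.as_ptr().add(i) as *const __m128i);
            _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)) as u32
        };
        if mask != 0 {
            // The lowest set bit marks the first match inside this 16-byte block.
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 16;
    }
    // Scalar fallback for the trailing partial block.
    haystack[i..].iter().position(|&b| b == b'<').map(|p| i + p)
}
```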
Performance
Benchmark results, with speedups measured relative to competing HTML parsers:
| Metric | Result | vs Competitors |
|---|---|---|
| Parse 1KB | 11 µs | 20-38x faster |
| Parse 100KB | 2.96 ms | 9.5-22x faster |
| Parse 1MB | 15.5 ms | 66-135x faster |
| Query (by class) | 20 ns | 40,000x faster |
| Memory (100MB doc) | 145 MB | 14-22x smaller |
Architecture optimizations:
- SIMD-accelerated class selector matching — 2-10x faster on large documents
- Selector fast-paths — Direct optimization for tag-only, class-only, and ID-only patterns
- Arena-based DOM allocation — Cache-friendly, zero per-node heap allocations (sketched below)
- 50-70% memory reduction — Zero-copy serialization via `Cow`
- Parallel batch processing — Rayon-powered when the `parallel` feature is enabled (see the Rayon sketch at the end of this section)
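As a rough illustration of the arena approach (these are not scrape-core's actual types), the sketch below keeps every node in one contiguous `Vec`, links nodes by index rather than by boxed pointers, and borrows tag names from the source HTML, so building the tree adds no per-node heap allocations:

```rust
/// Nodes are identified by their index into the arena, not by pointers.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct NodeId(u32);

struct Node<'input> {
    tag: &'input str,          // borrowed from the source HTML, no copy
    parent: Option<NodeId>,
    first_child: Option<NodeId>,
    next_sibling: Option<NodeId>,
}

struct Arena<'input> {
    nodes: Vec<Node<'input>>,  // one contiguous, cache-friendly allocation
}

impl<'input> Arena<'input> {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    /// Append a node and link it into its parent's child list.
    fn push(&mut self, tag: &'input str, parent: Option<NodeId>) -> NodeId {
        let id = NodeId(self.nodes.len() as u32);
        self.nodes.push(Node { tag, parent, first_child: None, next_sibling: None });
        if let Some(p) = parent {
            // The new node becomes the head of the parent's child list.
            let prev_first = self.nodes[p.0 as usize].first_child;
            self.nodes[id.0 as usize].next_sibling = prev_first;
            self.nodes[p.0 as usize].first_child = Some(id);
        }
        id
    }
}
```

Traversal then reduces to array indexing, which is what keeps the layout cache-friendly.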
See the main project README for full comparative benchmarks against BeautifulSoup4, lxml, Cheerio, and other Rust parsers.
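The parallel mode builds on Rayon's data-parallel iterators. A rough sketch of the batch pattern, using Rayon's real `par_iter` API but a stand-in `parse_one` function rather than scrape-core's own:

```rust
use rayon::prelude::*;

/// Stand-in for the per-document work; not scrape-core's API.
fn parse_one(html: &str) -> usize {
    html.matches('<').count()
}

fn parse_batch(documents: &[String]) -> Vec<usize> {
    documents
        .par_iter()                // Rayon splits the slice across worker threads
        .map(|doc| parse_one(doc)) // each document is processed independently
        .collect()                 // results are gathered back in input order
}
```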
Type Safety
Compile-time safety via the typestate pattern:
- Document lifecycle states — Building (construction) → Queryable (ready) → Sealed (immutable)
- Sealed traits — Prevent unintended implementations while allowing future extensions
- Zero runtime overhead — State encoding uses `PhantomData` with no allocation cost
- Trait abstractions — `HtmlSerializer` trait and `ElementFilter` iterators for consistent DOM access
All safety guarantees are verified at compile time with zero performance impact.
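For readers new to the pattern, here is a minimal sketch of typestate plus a sealed state trait. It is a simplified illustration with made-up method names and only two of the three lifecycle states, not scrape-core's actual API:

```rust
use std::marker::PhantomData;

// Lifecycle states, encoded as zero-sized types.
struct Building;
struct Queryable;

// Sealed trait: downstream crates cannot add their own states.
mod private {
    pub trait Sealed {}
    impl Sealed for super::Building {}
    impl Sealed for super::Queryable {}
}
trait State: private::Sealed {}
impl State for Building {}
impl State for Queryable {}

struct Document<S: State> {
    html: String,
    _state: PhantomData<S>, // zero-sized: no runtime or allocation cost
}

impl Document<Building> {
    fn new(html: &str) -> Self {
        Document { html: html.to_string(), _state: PhantomData }
    }

    /// Consume the builder; querying only becomes available after this transition.
    fn finish(self) -> Document<Queryable> {
        Document { html: self.html, _state: PhantomData }
    }
}

impl Document<Queryable> {
    fn select(&self, selector: &str) -> Vec<&str> {
        // Real matching elided; the point is that this method does not exist
        // on Document<Building>, so misuse is a compile error, not a panic.
        let _ = (&self.html, selector);
        Vec::new()
    }
}
```

Because `select` is only defined for `Document<Queryable>`, calling it on a document that is still being built is rejected by the compiler rather than failing at runtime.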
Architecture
```text
scrape-core/
├── dom/       # Arena-based DOM representation
├── parser/    # html5ever integration
├── query/     # CSS selector engine
├── simd/      # Platform-specific SIMD acceleration
└── parallel/  # Rayon-based parallelization
```
Built on Servo and Cloudflare components
Parsing & Selection (Servo browser engine):
- html5ever — Spec-compliant HTML5 parser
- selectors — CSS selector matching engine
- cssparser — CSS parser
- markup5ever — Common HTML/XML tree data structures
Streaming Parser (Cloudflare):
- lol_html — High-performance streaming HTML parser with a constant-memory, event-driven API
MSRV policy
Minimum Supported Rust Version: 1.88. MSRV increases are treated as minor version bumps.
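An MSRV like this is typically declared with Cargo's standard `rust-version` field, for example:

```toml
[package]
# Cargo refuses to build the crate with a toolchain older than this.
rust-version = "1.88"
```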
Related packages
This crate is part of fast-scrape:
| Platform | Package |
|---|---|
| Python | fast-scrape |
| Node.js | @fast-scrape/node |
| WASM | @fast-scrape/wasm |
License
Licensed under either the Apache License, Version 2.0 or the MIT License, at your option.