scrape-core

High-performance HTML parsing library core. Pure Rust implementation with no FFI dependencies.

Installation

[dependencies]
scrape-core = "0.2"

Or with cargo:

cargo add scrape-core

Important: Requires Rust 1.88 or later.

Usage

use scrape_core::Soup;

let html = r#"
    <html>
        <body>
            <div class="content">Hello, World!</div>
            <div class="content">Another div</div>
        </body>
    </html>
"#;

let soup = Soup::new(html);

// Find first element by tag
if let Some(div) = soup.find("div") {
    println!("Text: {}", div.text());
}

// CSS selectors
for el in soup.select("div.content") {
    println!("{}", el.inner_html());
}

Features

Enable optional features in Cargo.toml:

[dependencies]
scrape-core = { version = "0.2", features = ["simd", "parallel"] }

Feature	Description	Default
`simd`	SIMD-accelerated byte scanning (SSE4.2, AVX2, NEON, WASM SIMD128)	No
`parallel`	Parallel batch processing via Rayon	No

Tip: Start with the default features for the fastest compile times. Add the simd feature for production workloads.
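As a rough illustration of how an optional Cargo feature can select a code path at compile time, the sketch below gates a byte-counting helper on the simd feature. The function names (count_byte, count_byte_scalar, count_byte_chunked) are hypothetical and are not part of scrape-core's API, and the chunked routine only stands in for a real vectorized implementation.

```rust
/// Scalar fallback: check one byte at a time.
/// Hypothetical helper, not part of scrape-core.
pub fn count_byte_scalar(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

/// Stand-in for a vectorized routine: walks the input in 16-byte chunks,
/// then handles the leftover tail. A real SIMD path would compare a whole
/// chunk per instruction; this sketch only mirrors the control flow.
pub fn count_byte_chunked(haystack: &[u8], needle: u8) -> usize {
    let mut chunks = haystack.chunks_exact(16);
    let mut total = 0;
    for chunk in &mut chunks {
        total += chunk.iter().filter(|&&b| b == needle).count();
    }
    total + chunks.remainder().iter().filter(|&&b| b == needle).count()
}

/// Compiled only when the crate is built with `features = ["simd"]`.
#[cfg(feature = "simd")]
pub fn count_byte(haystack: &[u8], needle: u8) -> usize {
    count_byte_chunked(haystack, needle)
}

/// Compiled when the `simd` feature is off (the default).
#[cfg(not(feature = "simd"))]
pub fn count_byte(haystack: &[u8], needle: u8) -> usize {
    count_byte_scalar(haystack, needle)
}
```

Callers use count_byte in both configurations, so turning the feature on or off changes the implementation without changing the call sites.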

Performance

Benchmark results, with speedups relative to the competing parsers covered in the main project README:

Metric	Result	vs Competitors
Parse 1KB	11 µs	20-38x faster
Parse 100KB	2.96 ms	9.5-22x faster
Parse 1MB	15.5 ms	66-135x faster
Query (by class)	20 ns	40,000x faster
Memory (100MB doc)	145 MB	14-22x smaller

Architecture optimizations:

SIMD-accelerated class selector matching — 2-10x faster on large documents
Selector fast-paths — Direct optimization for tag-only, class-only, ID-only patterns
Arena-based DOM allocation — Cache-friendly, zero per-node heap allocations
50-70% memory reduction — Zero-copy serialization via Cow
Parallel batch processing — Rayon-powered when parallel feature is enabled
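To make the arena idea in the list above concrete, here is a minimal sketch of an index-based DOM arena. It is an illustration under assumptions, not scrape-core's actual data structure: the names Arena, Node, and NodeId are hypothetical, and tag names are borrowed from the input so that adding a node does not allocate beyond the growth of the single backing Vec.

```rust
/// Index into the arena. Copyable, so links between nodes need no pointers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(usize);

/// One DOM node. Links are stored as indices into the same Vec.
#[derive(Debug)]
pub struct Node<'a> {
    pub tag: &'a str,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

/// All nodes live contiguously in one Vec, which keeps traversal cache-friendly.
#[derive(Debug, Default)]
pub struct Arena<'a> {
    nodes: Vec<Node<'a>>,
}

impl<'a> Arena<'a> {
    pub fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    /// Append a node and link it as the last child of `parent`, if any.
    pub fn add(&mut self, tag: &'a str, parent: Option<NodeId>) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node {
            tag,
            parent,
            first_child: None,
            last_child: None,
            next_sibling: None,
        });
        if let Some(p) = parent {
            let previous_last = self.nodes[p.0].last_child;
            match previous_last {
                Some(prev) => self.nodes[prev.0].next_sibling = Some(id),
                None => self.nodes[p.0].first_child = Some(id),
            }
            self.nodes[p.0].last_child = Some(id);
        }
        id
    }

    /// Collect the direct children of `parent` in document order.
    pub fn children(&self, parent: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut cursor = self.nodes[parent.0].first_child;
        while let Some(id) = cursor {
            out.push(id);
            cursor = self.nodes[id.0].next_sibling;
        }
        out
    }

    pub fn tag(&self, id: NodeId) -> &'a str {
        self.nodes[id.0].tag
    }

    pub fn node_count(&self) -> usize {
        self.nodes.len()
    }
}
```

Because nodes refer to each other by index rather than by Box or Rc, the whole tree is freed in one step when the arena is dropped.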

See the main project README for full comparative benchmarks against BeautifulSoup4, lxml, Cheerio, and other Rust parsers.

Type Safety

Compile-time safety via the typestate pattern:

Document lifecycle states — Building (construction) → Queryable (ready) → Sealed (immutable)
Sealed traits — Prevent unintended implementations while allowing future extensions
Zero runtime overhead — State encoding uses PhantomData with no allocation cost
Trait abstractions — HtmlSerializer trait and ElementFilter iterators for consistent DOM access

All safety guarantees are verified at compile time with zero performance impact.
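The following sketch shows how a typestate lifecycle like the one described above can be encoded with PhantomData and a sealed marker trait. It is a simplified illustration, not the crate's real types: Document, its methods, and the private StateSeal trait are hypothetical names, and the "DOM" is reduced to a list of tag names.

```rust
use std::marker::PhantomData;

// Private supertrait: code outside this module cannot name it,
// so it cannot add new lifecycle states.
mod private {
    pub trait StateSeal {}
}

/// Marker trait for lifecycle states.
pub trait State: private::StateSeal {}

pub struct Building;
pub struct Queryable;
pub struct Sealed;

impl private::StateSeal for Building {}
impl private::StateSeal for Queryable {}
impl private::StateSeal for Sealed {}
impl State for Building {}
impl State for Queryable {}
impl State for Sealed {}

/// The state lives only in the type parameter; PhantomData is zero-sized.
pub struct Document<S: State> {
    nodes: Vec<String>,
    _state: PhantomData<S>,
}

impl Document<Building> {
    pub fn new() -> Self {
        Document { nodes: Vec::new(), _state: PhantomData }
    }

    /// Mutation is only available while building.
    pub fn push(mut self, tag: &str) -> Self {
        self.nodes.push(tag.to_string());
        self
    }

    /// Consumes the builder, so it cannot be used after the transition.
    pub fn finish(self) -> Document<Queryable> {
        Document { nodes: self.nodes, _state: PhantomData }
    }
}

impl Document<Queryable> {
    pub fn count(&self, tag: &str) -> usize {
        self.nodes.iter().filter(|t| t.as_str() == tag).count()
    }

    pub fn seal(self) -> Document<Sealed> {
        Document { nodes: self.nodes, _state: PhantomData }
    }
}

impl Document<Sealed> {
    pub fn node_count(&self) -> usize {
        self.nodes.len()
    }
}
```

In this sketch, calling push on a Document<Queryable> or count on a Document<Building> fails to compile, because those methods exist only on the matching state's impl block.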

Architecture

scrape-core/
├── dom/       # Arena-based DOM representation
├── parser/    # html5ever integration
├── query/     # CSS selector engine
├── simd/      # Platform-specific SIMD acceleration
└── parallel/  # Rayon-based parallelization

Built on Servo and Cloudflare

Parsing & Selection (Servo browser engine):

html5ever — Spec-compliant HTML5 parser
selectors — CSS selector matching engine
cssparser — CSS parser
markup5ever — Common HTML/XML tree data structures

Streaming Parser (Cloudflare):

lol_html — High-performance streaming HTML parser with constant-memory event-driven API

MSRV policy

Minimum Supported Rust Version: 1.88. MSRV increases are treated as minor version bumps.

Related packages

This crate is part of fast-scrape:

Platform	Package
Python	`fast-scrape`
Node.js	`@fast-scrape/node`
WASM	`@fast-scrape/wasm`

License

Licensed under either of Apache License, Version 2.0 or MIT License at your option.

scrape-core 0.2.4