scrape-core 0.2.2

High-performance HTML parsing library core

scrape-core

High-performance HTML parsing library core. Pure Rust implementation with no FFI dependencies.

Installation

[dependencies]
scrape-core = "0.2"

Or with cargo:

cargo add scrape-core

Important: Requires Rust 1.88 or later.

Usage

use scrape_core::Soup;

let html = r#"
    <html>
        <body>
            <div class="content">Hello, World!</div>
            <div class="content">Another div</div>
        </body>
    </html>
"#;

let soup = Soup::new(html);

// Find first element by tag
if let Some(div) = soup.find("div") {
    println!("Text: {}", div.text());
}

// CSS selectors
for el in soup.select("div.content") {
    println!("{}", el.inner_html());
}

Features

Enable optional features in Cargo.toml:

[dependencies]
scrape-core = { version = "0.2", features = ["simd", "parallel"] }

| Feature | Description | Default |
|---|---|---|
| simd | SIMD-accelerated byte scanning (SSE4.2, AVX2, NEON, WASM SIMD128) | No |
| parallel | Parallel batch processing via Rayon | No |

Tip: Start with default features for the fastest compile times. Add simd for production workloads.

Performance

Benchmark results, with speedups relative to competing parsers:

| Metric | Result | vs Competitors |
|---|---|---|
| Parse 1KB | 11 µs | 20-38x faster |
| Parse 100KB | 2.96 ms | 9.5-22x faster |
| Parse 1MB | 15.5 ms | 66-135x faster |
| Query (by class) | 20 ns | 40,000x faster |
| Memory (100MB doc) | 145 MB | 14-22x smaller |

Architecture optimizations:

  • SIMD-accelerated class selector matching — 2-10x faster on large documents
  • Selector fast-paths — Direct optimization for tag-only, class-only, ID-only patterns
  • Arena-based DOM allocation — Cache-friendly, zero per-node heap allocations
  • 50-70% memory reduction — Zero-copy serialization via Cow
  • Parallel batch processing — Rayon-powered when parallel feature is enabled
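To make the arena idea from the list above concrete, here is a minimal sketch of an index-based DOM arena. It is not scrape-core's actual implementation: the names `NodeId`, `Node`, `Arena`, and `Children` are hypothetical and chosen only for illustration. The sketch assumes nodes borrow their tag names from the input, so that adding a node only appends to one contiguous `Vec`.

```rust
/// Hypothetical handle: an index into the arena rather than a pointer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(usize);

/// Hypothetical node type. Relatives are referenced by index, and the tag
/// name is borrowed from the input, so a node owns no heap memory itself.
#[derive(Debug)]
pub struct Node<'a> {
    pub tag: &'a str,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

/// All nodes live in one contiguous vector, which keeps traversal
/// cache-friendly.
#[derive(Debug, Default)]
pub struct Arena<'a> {
    nodes: Vec<Node<'a>>,
}

impl<'a> Arena<'a> {
    pub fn new() -> Self {
        Self { nodes: Vec::new() }
    }

    /// Appends a node and links it as the last child of `parent`, if given.
    pub fn push(&mut self, tag: &'a str, parent: Option<NodeId>) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node {
            tag,
            parent,
            first_child: None,
            last_child: None,
            next_sibling: None,
        });
        if let Some(p) = parent {
            let previous_last = self.nodes[p.0].last_child;
            match previous_last {
                Some(prev) => self.nodes[prev.0].next_sibling = Some(id),
                None => self.nodes[p.0].first_child = Some(id),
            }
            self.nodes[p.0].last_child = Some(id);
        }
        id
    }

    pub fn get(&self, id: NodeId) -> &Node<'a> {
        &self.nodes[id.0]
    }

    /// Iterates over the direct children of `id` in insertion order.
    pub fn children<'r>(&'r self, id: NodeId) -> Children<'r, 'a> {
        Children { arena: self, next: self.nodes[id.0].first_child }
    }
}

/// Iterator that follows the `next_sibling` links.
pub struct Children<'r, 'a> {
    arena: &'r Arena<'a>,
    next: Option<NodeId>,
}

impl Iterator for Children<'_, '_> {
    type Item = NodeId;

    fn next(&mut self) -> Option<NodeId> {
        let current = self.next?;
        self.next = self.arena.nodes[current.0].next_sibling;
        Some(current)
    }
}
```

With this layout, `arena.push("div", Some(body))` returns a small copyable `NodeId`, and `arena.children(body)` walks siblings by index without touching the allocator.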

See full comparative benchmarks in the main project README comparing against BeautifulSoup4, lxml, Cheerio, and other Rust parsers.
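The "zero-copy serialization via Cow" item above can be illustrated with a small standard-library sketch. The function `escape_text` is a hypothetical helper, not part of scrape-core's API. It only shows the general technique: return the input slice unchanged when nothing needs escaping, and allocate a new string only when a special character is present.

```rust
use std::borrow::Cow;

/// Hypothetical helper: escapes `&`, `<`, and `>` for HTML text output.
/// Returns `Cow::Borrowed` (no allocation) when the input is already safe.
pub fn escape_text(input: &str) -> Cow<'_, str> {
    // Fast path: no special characters, so hand back the original slice.
    let Some(first) = input.find(|c: char| matches!(c, '&' | '<' | '>')) else {
        return Cow::Borrowed(input);
    };

    // Slow path: copy the clean prefix, then escape the remainder.
    let mut out = String::with_capacity(input.len() + 8);
    out.push_str(&input[..first]);
    for c in input[first..].chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            other => out.push(other),
        }
    }
    Cow::Owned(out)
}
```

Because most text nodes in a typical page contain no special characters, a design like this lets the common case avoid copying altogether.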

Type Safety

Compile-time safety via the typestate pattern:

  • Document lifecycle states — Building (construction) → Queryable (ready) → Sealed (immutable)
  • Sealed traits — Prevent unintended implementations while allowing future extensions
  • Zero runtime overhead — State encoding uses PhantomData with no allocation cost
  • Trait abstractions — HtmlSerializer trait and ElementFilter iterators for consistent DOM access

All safety guarantees are verified at compile time with zero performance impact.
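The typestate pattern described above can be sketched as follows. This is an illustration under stated assumptions, not scrape-core's real types: `Document`, its methods (`push_tag`, `finish`, `count`, `seal`, `tags`), and the private `Seal` supertrait are hypothetical. Only the state names Building, Queryable, and Sealed come from the list above.

```rust
use std::marker::PhantomData;

mod private {
    /// Private supertrait: code outside this crate cannot name it,
    /// so it cannot add new `State` implementations.
    pub trait Seal {}
}

/// Marker trait for lifecycle states (a sealed trait).
pub trait State: private::Seal {}

pub struct Building;
pub struct Queryable;
pub struct Sealed;

impl private::Seal for Building {}
impl private::Seal for Queryable {}
impl private::Seal for Sealed {}
impl State for Building {}
impl State for Queryable {}
impl State for Sealed {}

/// Hypothetical document type. The state exists only at the type level;
/// `PhantomData<S>` is zero-sized and adds nothing at runtime.
pub struct Document<S: State> {
    tags: Vec<String>,
    _state: PhantomData<S>,
}

impl Document<Building> {
    pub fn new() -> Self {
        Document { tags: Vec::new(), _state: PhantomData }
    }

    /// Mutation is only available while building.
    pub fn push_tag(mut self, tag: &str) -> Self {
        self.tags.push(tag.to_string());
        self
    }

    /// Consumes the builder and returns a queryable document.
    pub fn finish(self) -> Document<Queryable> {
        Document { tags: self.tags, _state: PhantomData }
    }
}

impl Document<Queryable> {
    /// Queries are only available once construction has finished.
    pub fn count(&self, tag: &str) -> usize {
        self.tags.iter().filter(|t| t.as_str() == tag).count()
    }

    /// Consumes the queryable document and returns an immutable one.
    pub fn seal(self) -> Document<Sealed> {
        Document { tags: self.tags, _state: PhantomData }
    }
}

impl Document<Sealed> {
    /// Read-only access; no method in this state takes `&mut self`.
    pub fn tags(&self) -> &[String] {
        &self.tags
    }
}
```

In this sketch, calling `count` on a `Document<Building>` or `push_tag` on a `Document<Queryable>` is a compile error, because those methods simply do not exist for that state. Each transition consumes `self`, so a stale handle to an earlier state cannot be reused.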

Architecture

scrape-core/
├── dom/       # Arena-based DOM representation
├── parser/    # html5ever integration
├── query/     # CSS selector engine
├── simd/      # Platform-specific SIMD acceleration
└── parallel/  # Rayon-based parallelization
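As an illustration of what a selector fast-path inside a query module might look like, the sketch below classifies a selector string as tag-only, class-only, ID-only, or complex before any full CSS parsing takes place. The names `SelectorKind` and `classify` are hypothetical, and the identifier check is deliberately simplified (ASCII letters, digits, `-`, and `_`). The real engine may work quite differently.

```rust
/// Hypothetical classification of a selector string.
#[derive(Debug, PartialEq, Eq)]
pub enum SelectorKind<'a> {
    /// A bare tag name such as `div`.
    Tag(&'a str),
    /// A single class such as `.content` (stored without the dot).
    Class(&'a str),
    /// A single ID such as `#main` (stored without the hash).
    Id(&'a str),
    /// Anything else; would be handed to a full selector engine.
    Complex(&'a str),
}

/// Simplified identifier check; real CSS identifiers have more rules.
fn is_simple_ident(s: &str) -> bool {
    !s.is_empty()
        && s.chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}

/// Detects the three simple shapes and falls back to `Complex` otherwise.
pub fn classify(selector: &str) -> SelectorKind<'_> {
    let s = selector.trim();
    if let Some(name) = s.strip_prefix('.') {
        if is_simple_ident(name) {
            return SelectorKind::Class(name);
        }
    } else if let Some(name) = s.strip_prefix('#') {
        if is_simple_ident(name) {
            return SelectorKind::Id(name);
        }
    } else if is_simple_ident(s) {
        return SelectorKind::Tag(s);
    }
    SelectorKind::Complex(s)
}
```

A caller could match on the result and answer `Tag`, `Class`, and `Id` queries through a direct scan or index lookup, reserving the general matcher for `Complex` selectors such as `div.content` or `ul > li`.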

Built on Servo and Cloudflare

Parsing & Selection (Servo browser engine):

Streaming Parser (Cloudflare):

  • lol_html — High-performance streaming HTML parser with constant-memory event-driven API

MSRV policy

Minimum Supported Rust Version: 1.88. An MSRV increase is released as a minor version bump.

Related packages

This crate is part of fast-scrape:

| Platform | Package |
|---|---|
| Python | fast-scrape |
| Node.js | @fast-scrape/node |
| WASM | @fast-scrape/wasm |

License

Licensed under either of Apache License, Version 2.0 or MIT License at your option.