scah 0.0.15

CSS selectors meet streaming XML/HTML parsing. Filter StAX events and build targeted DOMs without loading the entire document.

scah (scan HTML)

What is scah?

scah is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with CSS selectors; the library handles the streaming complexity and builds a targeted DOM containing only your selections.

  • Streaming core: Built on StAX; constant memory regardless of document size
  • Familiar API: CSS selectors, including the child combinator (>) and the descendant combinator (whitespace); + and ~ are coming soon
  • Multi-language: Rust core with Python and TypeScript/JavaScript bindings
  • Composable queries: Chain selections and nest them with closures. Nested queries are more efficient than flat filtering and preserve the parent-child relationships in the extracted data
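
To make the streaming idea concrete, here is a minimal sketch of matching a child selector such as main > section against a stream of start/end events, using only a stack of open tag names. This is not scah's implementation: the Event enum and the match_child function are hypothetical names invented for illustration, and the sketch ignores attributes, text, and every selector form other than "parent > child".

```rust
/// Hypothetical StAX-style event; real parsers also carry attributes and text.
#[derive(Debug, Clone, Copy)]
enum Event<'a> {
    Start(&'a str),
    End(&'a str),
}

/// Returns the indices of `Start` events whose tag is `child` and whose
/// immediate parent is `parent` (the semantics of `parent > child`).
/// Apart from the collected hits, working memory is bounded by nesting
/// depth, not by document size.
fn match_child(events: &[Event<'_>], parent: &str, child: &str) -> Vec<usize> {
    let mut open: Vec<&str> = Vec::new(); // stack of currently open tags
    let mut hits = Vec::new();
    for (i, ev) in events.iter().enumerate() {
        match *ev {
            Event::Start(tag) => {
                if tag == child && open.last() == Some(&parent) {
                    hits.push(i);
                }
                open.push(tag);
            }
            Event::End(tag) => {
                // Assumes well-formed input; a mismatched end tag is ignored.
                if open.last() == Some(&tag) {
                    open.pop();
                }
            }
        }
    }
    hits
}

fn main() {
    // <main><section><a></a></section><div><section></section></div></main>
    let events = [
        Event::Start("main"),
        Event::Start("section"),
        Event::Start("a"),
        Event::End("a"),
        Event::End("section"),
        Event::Start("div"),
        Event::Start("section"),
        Event::End("section"),
        Event::End("div"),
        Event::End("main"),
    ];
    // Only the first <section> is a direct child of <main>.
    let hits = match_child(&events, "main", "section");
    println!("{hits:?}"); // prints [1]
}
```

The second section is skipped because its immediate parent is div. A descendant combinator would instead check whether parent appears anywhere in the stack.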

Quick Start

Rust

# Cargo.toml
[dependencies]
scah = "0.0.15"

Basic usage

use scah::{Query, Save, parse};

let html = r#"<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>"#;

let queries = &[Query::all("a[href]", Save::all())
    .expect("valid selector")
    .build()];
let store = parse(html, queries);

for a in store.get("a[href]").unwrap() {
    let href = a.attribute(&store, "href").unwrap();
    let text = a.text_content(&store).unwrap_or_default();
    println!("{text}: {href}");
}
// Output:
//   One: /one
//   Two: /two

Structured querying with .then()

Instead of flat filtering, nest queries with closures. Child queries only run within the context of their parent match:

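use scah::{Query, Save, parse};

// Sample input; substitute your own document.
let html = r#"<main><section><a href="/direct">Direct</a><div><a href="/nested">Nested</a></div></section></main>"#;
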
let query = Query::all("main > section", Save::all())
    .expect("valid selector")
    .then(|section| {
        Ok([
            section.all("> a[href]", Save::all())?,
            section.all("div a", Save::all())?,
        ])
    })
    .expect("valid child selectors")
    .build();

let store = parse(html, &[query]);

// Access nested results through parent elements
for section in store.get("main > section").unwrap() {
    println!("Section: {}", section.inner_html.unwrap_or(""));

    if let Some(links) = section.get(&store, "> a[href]") {
        for link in links {
            println!("  Direct link: {}", link.attribute(&store, "href").unwrap());
        }
    }
}

Query::all(...) and Query::first(...) return a Result, so a malformed selector, such as one taken from user input, surfaces as a SelectorParseError that you can handle. For fixed selectors in examples or tests, call .expect(...) if you want a panic on an invalid selector.

Compile-time queries with query!

For selectors that are known at compile time, prefer the query! macro. It validates the selector tree during compilation and emits a StaticQuery backed by inline arrays instead of heap-allocated query storage.

use scah::{Save, parse, query};

let html = r#"
    <article>
        <h1>Title</h1>
        <a href="/one">One</a>
        <a href="/two">Two</a>
    </article>
"#;

let query = query! {
    all("article", Save::none()) => {
        first("h1", Save::only_text_content()),
        all("a[href]", Save::all()),
    }
};

let store = parse(html, &[query]);
let articles = store.get("article").unwrap();
assert_eq!(articles.len(), 1);
for article in articles {
    assert_eq!(article.get(&store, "a[href]").unwrap().count(), 2);
}

Use the runtime builder when selectors come from dynamic sources such as user input or configuration. Use query! when the selector tree is written in Rust code and should fail at compile time if it becomes invalid.

Save options

Control what data is captured per selector:

Constructor                inner_html  text_content  Use case
Save::all()                Yes         Yes           Full extraction
Save::only_inner_html()    Yes         No            Raw markup only
Save::only_text_content()  No          Yes           Lightweight text scraping
Save::none()               No          No            Structure only (attributes are still saved)
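
The table can be read as two independent capture flags, which is also the shape of the options object used by the TypeScript binding ({ innerHtml, textContent }). The sketch below models that reading with a hypothetical SaveSketch struct. The real Save type's fields and layout are not shown in this README and may differ; only the four constructor names come from the table.

```rust
/// Hypothetical stand-in for the capture options in the table above.
/// Field names mirror the table columns; this is not scah's `Save` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SaveSketch {
    inner_html: bool,
    text_content: bool,
}

impl SaveSketch {
    /// Capture both raw markup and text.
    const fn all() -> Self {
        Self { inner_html: true, text_content: true }
    }
    /// Capture raw markup only.
    const fn only_inner_html() -> Self {
        Self { inner_html: true, text_content: false }
    }
    /// Capture text only.
    const fn only_text_content() -> Self {
        Self { inner_html: false, text_content: true }
    }
    /// Capture neither; the element itself is still recorded.
    const fn none() -> Self {
        Self { inner_html: false, text_content: false }
    }
}

fn main() {
    let rows = [
        ("all", SaveSketch::all()),
        ("only_inner_html", SaveSketch::only_inner_html()),
        ("only_text_content", SaveSketch::only_text_content()),
        ("none", SaveSketch::none()),
    ];
    for (name, save) in rows {
        println!("{name}: inner_html={} text_content={}", save.inner_html, save.text_content);
    }
}
```

Choosing the narrowest option that covers your needs avoids capturing data you never read.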

Supported CSS selector syntax

Syntax               Example             Status
Tag name             a, div              Working
ID                   #my-id              Working
Class                .my-class           Working
Descendant           main section a      Working
Child                main > section      Working
Attribute presence   a[href]             Working
Attribute exact      a[href="url"]       Working
Attribute prefix     a[href^="https"]    Working
Attribute suffix     a[href$=".com"]     Working
Attribute substring  a[href*="example"]  Working
Adjacent sibling     h1 + p              Coming soon
General sibling      h1 ~ p              Coming soon
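
For readers less familiar with the attribute operators, the following sketch spells out their standard CSS meaning as plain string checks. AttrOp and attr_matches are hypothetical names for illustration, not part of scah's API. The empty-value rule follows the CSS Selectors specification; whether scah applies the same edge-case rule is an assumption.

```rust
/// The attribute operators listed in the table above.
#[derive(Debug, Clone, Copy)]
enum AttrOp {
    Present,   // [attr]
    Exact,     // [attr="v"]
    Prefix,    // [attr^="v"]
    Suffix,    // [attr$="v"]
    Substring, // [attr*="v"]
}

/// `actual` is the element's attribute value, or `None` if the attribute
/// is absent. `expected` is the value written in the selector (ignored
/// for `Present`).
fn attr_matches(op: AttrOp, actual: Option<&str>, expected: &str) -> bool {
    let Some(value) = actual else { return false };
    match op {
        AttrOp::Present => true,
        AttrOp::Exact => value == expected,
        // Per the CSS Selectors spec, ^=, $= and *= with an empty value
        // match nothing.
        AttrOp::Prefix => !expected.is_empty() && value.starts_with(expected),
        AttrOp::Suffix => !expected.is_empty() && value.ends_with(expected),
        AttrOp::Substring => !expected.is_empty() && value.contains(expected),
    }
}

fn main() {
    let href = Some("https://example.com");
    assert!(attr_matches(AttrOp::Present, href, ""));
    assert!(attr_matches(AttrOp::Prefix, href, "https"));
    assert!(attr_matches(AttrOp::Suffix, href, ".com"));
    assert!(attr_matches(AttrOp::Substring, href, "example"));
    assert!(!attr_matches(AttrOp::Exact, href, "url"));
    // An element without the attribute never matches.
    assert!(!attr_matches(AttrOp::Present, None, ""));
    println!("all attribute checks passed");
}
```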

Full API documentation: docs.rs/scah

Benchmarks

[Figure: Criterion benchmarks]

The repository includes two Rust benchmark tracks:

  • Cross-library comparisons for simple all and first selectors.
  • Runtime-builder vs query! macro comparisons to measure query-construction overhead separately from execution.

Run them with:

cargo bench -p scah-benches --bench speed_bench_simple_all
cargo bench -p scah-benches --bench speed_bench_simple_first
cargo bench -p scah-benches --bench speed_bench_macro_queries

Python

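pip install -U scah

from scah import Query, Save, parse

# Sample input; substitute your own document.
html = '<main><section><a href="/direct">Direct</a><div><a href="/nested">Nested</a></div></section></main>'

# Parenthesize the chain so the method calls can span multiple lines.
query = (
    Query.all("main > section", Save.all())
    .then(lambda section: [
        section.all("> a[href]", Save.all()),
        section.all("div a", Save.all()),
    ])
    .build()
)

store = parse(html, [query])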

Benchmarks

Real HTML benchmark (html.spec.whatwg.org), selecting all a tags:

[Figure: WHATWG HTML spec benchmark]

Synthetic HTML benchmark, selecting all a tags:

[Figure: synthetic HTML benchmark]

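TypeScript / JavaScript

npm install scah@npm:@zacharymm/scah

import { Query, parse } from 'scah';

// Sample input; substitute your own document.
const html = '<main><section><a href="/direct">Direct</a><div><a href="/nested">Nested</a></div></section></main>';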

const query = Query.all('main > section', { innerHtml: true, textContent: true })
  .then((p) => [
    p.all('> a[href]', { innerHtml: true, textContent: true }),
    p.all('div a', { innerHtml: true, textContent: true }),
  ])
  .build();

const store = parse(html, [query]);

Codebase layout

The workspace is split by responsibility:

  • crates/scah: public Rust API, parser entry points, store types, and re-exports.
  • crates/scah-query-ir: selector parsing, compiled transitions, query builders, and shared query traits.
  • crates/scah-macros: the query! proc macro for compile-time query construction.
  • crates/bindings/scah-python and crates/bindings/scah-node: thin language bindings over the Rust core.
  • benches/: Criterion and Gungraun benchmarks, including macro-query benchmarks.