scah (scan HTML)
World's fastest CSS Selector.
CSS selectors meet streaming XML/HTML parsing. Filter StAX events and build targeted DOMs without loading the entire document.
What is scah?
scah is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with CSS selectors; the library handles the streaming complexity and builds a targeted DOM containing only your selections.
- Streaming core: Built on StAX; constant memory regardless of document size
- Familiar API: CSS selectors (including combinators like >, the descendant space, + (coming soon), ~ (coming soon))
- Multi-language: Rust core with Python and TypeScript/JavaScript bindings
- Composable queries: Chain selections and nest them with closures for structured querying. Nested queries are more efficient than flat filtering and map more naturally onto hierarchical data.
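To make the streaming idea concrete, here is a minimal sketch that uses Python's standard-library html.parser rather than scah itself. It hand-codes the equivalent of the selector li a[href]: the only state held is a stack of open tags plus the matches collected so far, and everything else in the document is discarded as it streams past. The class name LinkCollector is hypothetical; this is the kind of manual state tracking that a declarative selector is meant to replace.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects text and href for every <a href> inside an <li>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.current = None   # the link being captured, if any
        self.links = []       # the targeted result, nothing else is kept

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        href = dict(attrs).get("href")
        # Hand-written match for "li a[href]": an <a href> with an <li> ancestor.
        if tag == "a" and href is not None and "li" in self.open_tags[:-1]:
            self.current = {"href": href, "text": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.links.append(self.current)
            self.current = None
        # Pop back to the matching open tag (tolerates unclosed void elements).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


html = '<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>'
collector = LinkCollector()
collector.feed(html)
collector.close()
for link in collector.links:
    print(f"{link['text']}: {link['href']}")
# Output:
# One: /one
# Two: /two
```

Every additional selector would need its own branch in handle_starttag, which is why declaring the selection up front scales better than writing the event handlers by hand.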
Quick Start
Rust
# Cargo.toml
[dependencies]
scah = "0.0.13"
Basic usage
use ;
let html = r#"<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>"#;
let queries = &;
let store = parse;
for a in store.get.unwrap
// Output:
// One: /one
// Two: /two
Structured querying with .then()
Instead of flat filtering, nest queries with closures. Child queries only run within the context of their parent match:
use ;
let query = all
.then
.build;
let store = parse;
// Access nested results through parent elements
for section in store.get.unwrap
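The scoping rule described above (child queries only run inside their parent match) can be illustrated with a small Python sketch. It uses the standard-library xml.etree.ElementTree on an in-memory tree, which is an assumption made purely for illustration: scah applies the same scoping while streaming, and none of the names below belong to its API. The three paths correspond to the selectors main > section, > a[href], and div a.

```python
import xml.etree.ElementTree as ET

# Small well-formed document; the root element is <main>.
doc = ET.fromstring(
    "<main>"
    "<section><a href='/a'>A</a><div><p><a href='/b'>B</a></p></div></section>"
    "<section><a>no href</a><a href='/c'>C</a></section>"
    "</main>"
)

results = []
for section in doc.findall("./section"):  # parent query: main > section
    # Child queries are evaluated relative to this one section only.
    results.append({
        "direct": [a.get("href") for a in section.findall("./a[@href]")],  # > a[href]
        "in_div": [a.get("href") for a in section.findall(".//div//a")],   # div a
    })

for i, group in enumerate(results):
    print(i, group)
# Output:
# 0 {'direct': ['/a'], 'in_div': ['/b']}
# 1 {'direct': ['/c'], 'in_div': []}
```

Because each child result is stored under its parent, the relationship between a section and its links is preserved instead of being reconstructed after a flat match.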
Save options
Control what data is captured per selector:
| Constructor | inner_html | text_content | Use case |
|---|---|---|---|
| Save::all() | Yes | Yes | Full extraction |
| Save::only_inner_html() | Yes | No | Raw markup only |
| Save::only_text_content() | No | Yes | Lightweight text scraping |
| Save::none() | No | No | Structure-only (attributes still saved) |
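As a rough model of the table, the Python sketch below mirrors the four constructors as a pair of boolean flags and shows what a captured record might contain under each. Both the Save dataclass and the capture function are hypothetical stand-ins written for this illustration; they are not the library's types, and the record layout is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Save:
    """Hypothetical mirror of the two capture flags from the table."""
    inner_html: bool
    text_content: bool

    @classmethod
    def all(cls):
        return cls(inner_html=True, text_content=True)

    @classmethod
    def only_inner_html(cls):
        return cls(inner_html=True, text_content=False)

    @classmethod
    def only_text_content(cls):
        return cls(inner_html=False, text_content=True)

    @classmethod
    def none(cls):
        return cls(inner_html=False, text_content=False)


def capture(save, attributes, inner_html, text_content):
    """Keep attributes always; keep markup and text only when requested."""
    record = {"attributes": attributes}
    if save.inner_html:
        record["inner_html"] = inner_html
    if save.text_content:
        record["text_content"] = text_content
    return record


record = capture(Save.only_text_content(), {"href": "/one"}, "<b>One</b>", "One")
print(record)  # {'attributes': {'href': '/one'}, 'text_content': 'One'}
```

Skipping a field means its buffer never has to be filled during parsing, which is where the lighter options would save work.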
Supported CSS selector syntax
| Syntax | Example | Status |
|---|---|---|
| Tag name | a, div | Working |
| ID | #my-id | Working |
| Class | .my-class | Working |
| Descendant | main section a | Working |
| Child | main > section | Working |
| Attribute presence | a[href] | Working |
| Attribute exact | a[href="url"] | Working |
| Attribute prefix | a[href^="https"] | Working |
| Attribute suffix | a[href$=".com"] | Working |
| Attribute substring | a[href*="example"] | Working |
| Adjacent sibling | h1 + p | Coming soon |
| General sibling | h1 ~ p | Coming soon |
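For readers less familiar with the attribute operators in the table, the following Python sketch spells out their standard CSS semantics. The function attr_matches is a hypothetical helper written for this explanation and says nothing about how scah implements matching internally.

```python
def attr_matches(attrs, name, op=None, value=""):
    """Return True if attrs satisfies [name], [name=value], [name^=value], and so on."""
    if name not in attrs:
        return False
    if op is None:  # [name]: presence only
        return True
    actual = attrs[name]
    if op == "=":   # exact match
        return actual == value
    # Per the CSS Selectors specification, ^=, $= and *= never match an empty value.
    if value == "":
        return False
    if op == "^=":  # prefix
        return actual.startswith(value)
    if op == "$=":  # suffix
        return actual.endswith(value)
    if op == "*=":  # substring
        return value in actual
    raise ValueError(f"unsupported operator: {op}")


link = {"href": "https://example.com"}
print(attr_matches(link, "href"))                   # True
print(attr_matches(link, "href", "^=", "https"))    # True
print(attr_matches(link, "href", "$=", ".com"))     # True
print(attr_matches(link, "href", "*=", "example"))  # True
print(attr_matches(link, "href", "=", "url"))       # False
```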
Full API documentation: docs.rs/scah
Benchmarks

Python
=
=
Benchmarks
Real HTML benchmark (html.spec.whatwg.org), selecting all a tags:

Synthetic HTML benchmark, selecting all a tags:

TypeScript / JavaScript
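import { Query, parse } from 'scah';

// Match every section directly under main, then run two child queries
// scoped to each matched section.
const query = Query.all('main > section', { innerHtml: true, textContent: true })
  .then((p) => [
    p.all('> a[href]', { innerHtml: true, textContent: true }),
    p.all('div a', { innerHtml: true, textContent: true }),
  ])
  .build();

// `html` is assumed to hold the document source as a string.
const store = parse(html, [query]);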