# scah (scan HTML)
> CSS selectors meet streaming XML/HTML parsing. Filter StAX events and build targeted DOMs without loading the entire document.
[](https://crates.io/crates/scah)
[](https://www.npmjs.com/package/@zacharymm/scah)
[](https://pypi.org/project/scah/)
## What is scah?
**scah** is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with **CSS selectors**; the library handles the streaming complexity and builds a targeted DOM containing only your selections.
- **Streaming core**: Built on StAX; constant memory regardless of document size
- **Familiar API**: CSS selectors (including combinators like `>`, ` `, `+` (coming soon), `~` (coming soon))
- **Multi-language**: Rust core with Python and TypeScript/JavaScript bindings
- **Composable queries**: Chain selections and nest them with closures for **structured querying**; not only more efficient than flat filtering, but a fundamentally better pattern for extracting hierarchical data relationships
## Quick Start
### Rust
```toml
# Cargo.toml
[dependencies]
scah = "0.0.15"
```
#### Basic usage
```rust
use scah::{Query, Save, parse};
let html = r#"<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>"#;
let queries = &[Query::all("a[href]", Save::all())
.expect("valid selector")
.build()];
let store = parse(html, queries);
for a in store.get("a[href]").unwrap() {
let href = a.attribute(&store, "href").unwrap();
let text = a.text_content(&store).unwrap_or_default();
println!("{text}: {href}");
}
// Output:
// One: /one
// Two: /two
```
#### Structured querying with `.then()`
Instead of flat filtering, nest queries with closures. Child queries only run within the context of their parent match:
```rust
use scah::{Query, Save, parse};
let query = Query::all("main > section", Save::all())
.expect("valid selector")
.then(|section| {
Ok([
section.all("> a[href]", Save::all())?,
section.all("div a", Save::all())?,
])
})
.expect("valid child selectors")
.build();
let store = parse(html, &[query]);
// Access nested results through parent elements
for section in store.get("main > section").unwrap() {
println!("Section: {}", section.inner_html.unwrap_or(""));
if let Some(links) = section.get(&store, "> a[href]") {
for link in links {
println!(" Direct link: {}", link.attribute(&store, "href").unwrap());
}
}
}
```
If selectors come from user input, `Query::all(...)` and `Query::first(...)` return `Result`, so malformed selectors surface as `SelectorParseError`. For fixed selectors in examples or tests, use `.expect(...)` explicitly if you want panic-on-invalid-selector behavior.
#### Compile-time queries with `query!`
For selectors that are known at compile time, prefer the `query!` macro. It validates the selector tree during compilation and emits a `StaticQuery` backed by inline arrays instead of heap-allocated query storage.
```rust
use scah::{Save, parse, query};
let html = r#"
<article>
<h1>Title</h1>
<a href="/one">One</a>
<a href="/two">Two</a>
</article>
"#;
let query = query! {
all("article", Save::none()) => {
first("h1", Save::only_text_content()),
all("a[href]", Save::all()),
}
};
let store = parse(html, &[query]);
let articles = store.get("article").unwrap();
assert_eq!(articles.len(), 1);
for article in articles {
assert_eq!(article.get("a[href]").unwrap().count(), 2);
}
```
Use the runtime builder when you have dynamic sources. Use `query!` when the selector tree is authored in Rust code and should fail at compile time if it becomes invalid.
#### `Save` options
Control what data is captured per selector:
| `Save::all()` | Yes | Yes | Full extraction |
| `Save::only_inner_html()` | Yes | No | Raw markup only |
| `Save::only_text_content()` | No | Yes | Lightweight text scraping |
| `Save::none()` | No | No | Structure-only (attributes still saved) |
#### Supported CSS selector syntax
| Tag name | `a`, `div` | Working |
| ID | `#my-id` | Working |
| Class | `.my-class` | Working |
| Descendant | `main section a` | Working |
| Child | `main > section` | Working |
| Attribute presence | `a[href]` | Working |
| Attribute exact | `a[href="url"]` | Working |
| Attribute prefix | `a[href^="https"]` | Working |
| Attribute suffix | `a[href$=".com"]` | Working |
| Attribute substring | `a[href*="example"]` | Working |
| Adjacent sibling | `h1 + p` | Coming soon |
| General sibling | `h1 ~ p` | Coming soon |
> Full API documentation: [docs.rs/scah](https://docs.rs/scah)
#### Benchmarks

The repository includes two Rust benchmark tracks:
- Cross-library comparisons for simple `all` and `first` selectors.
- Runtime-builder vs `query!` macro comparisons to measure query-construction overhead separately from execution.
Run them with:
```bash
cargo bench -p scah-benches --bench speed_bench_simple_all
cargo bench -p scah-benches --bench speed_bench_simple_first
cargo bench -p scah-benches --bench speed_bench_macro_queries
```
### Python
```bash
pip install -U scah
```
```python
from scah import Query, Save, parse
query = Query.all("main > section", Save.all())
.then(lambda section: [
section.all("> a[href]", Save.all()),
section.all("div a", Save.all()),
])
.build()
store = parse(html, [query])
```
#### Benchmark's
##### Real Html BenchMark ([html.spec.whatwg.org](https://html.spec.whatwg.org/)) (select all `a` tags):

##### Synthetic Html BenchMark (select all `a` tags):

### Typescript / Javascript
```bash
npm install scah@npm:@zacharymm/scah
```
```ts
import { Query, parse } from 'scah';
const query = Query.all('main > section', { innerHtml: true, textContent: true })
.then((p) => [
p.all('> a[href]', { innerHtml: true, textContent: true }),
p.all('div a', { innerHtml: true, textContent: true }),
])
.build();
const store = parse(html, [query]);
```
## Codebase layout
The workspace is split by responsibility:
- `crates/scah`: public Rust API, parser entry points, store types, and re-exports.
- `crates/scah-query-ir`: selector parsing, compiled transitions, query builders, and shared query traits.
- `crates/scah-macros`: the `query!` proc macro for compile-time query construction.
- `crates/bindings/scah-python` and `crates/bindings/scah-node`: thin language bindings over the Rust core.
- `benches/`: Criterion and Gungraun benchmarks, including macro-query benchmarks.