§scah - Streaming CSS-selector-driven HTML extraction
scah (scan HTML) is a high-performance parsing library that bridges the gap
between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an
entire document into memory or manually tracking parser state, you declare what
you want with CSS selectors; the library handles the streaming complexity and
builds a targeted Store containing only your selections.
§Highlights
| Feature | Detail |
|---|---|
| Streaming core | Built on StAX: constant memory regardless of document size |
| Familiar API | CSS selectors including > (child) and whitespace (descendant) combinators |
| Composable queries | Chain selections with QueryBuilder::then for hierarchical data extraction |
| Zero-copy | Element names, attributes, and inner HTML are &str slices into the source |
| Multi-language | Rust core with Python and TypeScript/JavaScript bindings |
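To make the zero-copy row concrete, here is a minimal standalone sketch of returning a tag name as a `&str` slice that borrows from the source buffer. It is not scah's code: the helper `tag_name` is hypothetical and only handles the simplest case of a leading opening tag.

```rust
/// Hypothetical helper (not part of scah): returns the name of the first
/// opening tag in `source` as a slice borrowed from `source` itself,
/// so no allocation or copy takes place.
fn tag_name(source: &str) -> Option<&str> {
    let start = source.find('<')? + 1;
    let rest = &source[start..];
    // The name ends at the first whitespace, '>' or '/'.
    let len = rest
        .find(|c: char| c.is_whitespace() || c == '>' || c == '/')
        .unwrap_or(rest.len());
    if len == 0 {
        None
    } else {
        Some(&rest[..len])
    }
}

fn main() {
    let html = String::from("<a href=\"link1\">Link 1</a>");
    let name = tag_name(&html).unwrap();
    // The slice points into the original buffer rather than at a fresh copy.
    let offset = name.as_ptr() as usize - html.as_ptr() as usize;
    println!("{name} starts at byte offset {offset}");
}
```

Because the returned slice borrows from `html`, the borrow checker keeps the source alive for as long as the slice is used. The same constraint would apply to any result set that holds slices into the parsed document.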
§Quick Start
use scah::{Query, Save, parse};
let html = r#"
<main>
<section>
<a href="link1">Link 1</a>
<a href="link2">Link 2</a>
</section>
</main>
"#;
// Build a query: find all <a> elements with an href attribute
// that are direct children of a <section> that is itself a direct child of <main>.
let queries = &[
    Query::all("main > section > a[href]", Save::all()).build()
];
let store = parse(html, queries);
// Iterate over matched elements
for element in store.get("main > section > a[href]").unwrap() {
    println!("{}: {}", element.name, element.attribute(&store, "href").unwrap());
}
§Structured Querying with .then()
Instead of flat filtering, you can nest queries using closures. Child queries only run within the context of their parent match, making extraction of hierarchical relationships both efficient and ergonomic:
use scah::{Query, Save, parse};
// `html` is the same document string as in the Quick Start example.
let queries = &[
    Query::all("main > section", Save::all())
        .then(|section| [
            section.all("> a[href]", Save::all()),
            section.all("div a", Save::all()),
        ])
        .build()
];
let store = parse(html, queries);
§Architecture
Internally, scah is composed of the following layers:
- Reader: A zero-copy byte-level cursor over the HTML source.
- CSS selector compiler: Parses selector strings into a compact automaton of Query transitions.
- XHtmlParser: A streaming StAX parser that emits open/close events.
- QueryMultiplexer: Drives one or more query executors against the token stream simultaneously.
- Store: An arena-based result set that collects matched Elements, their attributes, and (optionally) inner HTML / text content.
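The layering above can be illustrated with a toy, self-contained sketch of how a selector might be evaluated against open/close events without building a DOM. It is not scah's implementation: the `Event` enum and `count_matches` function are hypothetical, the selector is limited to a chain of child combinators, and the chain is assumed to be anchored at the document root for simplicity.

```rust
/// Toy open/close events, standing in for what a streaming parser emits.
enum Event<'a> {
    Open(&'a str),
    Close,
}

/// Counts elements matching a pure child-combinator chain such as
/// `["main", "section", "a"]` (read as `main > section > a`).
/// Simplifying assumption: the chain is anchored at the document root.
fn count_matches(events: &[Event<'_>], chain: &[&str]) -> usize {
    // One flag per currently open element: does the path from the root
    // down to that element match the same-length prefix of `chain`?
    let mut stack: Vec<bool> = Vec::new();
    let mut matches = 0;
    for event in events {
        match event {
            Event::Open(name) => {
                let depth = stack.len();
                let parent_ok = stack.last().copied().unwrap_or(true);
                let ok = parent_ok && depth < chain.len() && chain[depth] == *name;
                if ok && depth + 1 == chain.len() {
                    matches += 1;
                }
                stack.push(ok);
            }
            Event::Close => {
                stack.pop();
            }
        }
    }
    matches
}

fn main() {
    // Events for:
    // <main><section><a></a><a></a></section><div><a></a></div></main>
    let events = [
        Event::Open("main"),
        Event::Open("section"),
        Event::Open("a"),
        Event::Close,
        Event::Open("a"),
        Event::Close,
        Event::Close,
        Event::Open("div"),
        Event::Open("a"),
        Event::Close,
        Event::Close,
        Event::Close,
    ];
    println!("{}", count_matches(&events, &["main", "section", "a"]));
}
```

The matcher's working state is a stack whose size tracks nesting depth rather than document length, which is the property a streaming design relies on. A real executor would additionally need to handle descendant combinators, unanchored matches, and several queries at once.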
§Supported CSS Selector Syntax
| Syntax | Example | Status |
|---|---|---|
| Tag name | a, div | Working |
| ID | #my-id | Working |
| Class | .my-class | Working |
| Descendant combinator | main section a | Working |
| Child combinator | main > section | Working |
| Attribute presence | a[href] | Working |
| Attribute exact match | a[href="url"] | Working |
| Attribute prefix | a[href^="https"] | Working |
| Attribute suffix | a[href$=".com"] | Working |
| Attribute substring | a[href*="example"] | Working |
| Adjacent sibling | h1 + p | Coming soon |
| General sibling | h1 ~ p | Coming soon |
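As a reading aid for the attribute rows in the table, the sketch below maps each operator to the corresponding string test, following standard CSS selector semantics. The `AttrTest` enum and `attr_matches` function are hypothetical names introduced for illustration; whether scah treats edge cases (such as an empty operand) exactly this way is an assumption.

```rust
/// Hypothetical model of the attribute tests listed in the table.
enum AttrTest<'a> {
    Present,            // a[href]
    Exact(&'a str),     // a[href="url"]
    Prefix(&'a str),    // a[href^="https"]
    Suffix(&'a str),    // a[href$=".com"]
    Substring(&'a str), // a[href*="example"]
}

/// `value` is the attribute's value, or `None` when the attribute is absent.
fn attr_matches(test: &AttrTest<'_>, value: Option<&str>) -> bool {
    // Every attribute test requires the attribute to exist.
    let Some(v) = value else { return false };
    match test {
        AttrTest::Present => true,
        AttrTest::Exact(s) => v == *s,
        // In the CSS Selectors specification, ^=, $= and *= with an empty
        // operand match nothing, hence the `is_empty` guards.
        AttrTest::Prefix(s) => !s.is_empty() && v.starts_with(*s),
        AttrTest::Suffix(s) => !s.is_empty() && v.ends_with(*s),
        AttrTest::Substring(s) => !s.is_empty() && v.contains(*s),
    }
}

fn main() {
    let href = Some("https://example.com");
    println!("prefix:    {}", attr_matches(&AttrTest::Prefix("https"), href));
    println!("suffix:    {}", attr_matches(&AttrTest::Suffix(".com"), href));
    println!("substring: {}", attr_matches(&AttrTest::Substring("example"), href));
    println!("exact:     {}", attr_matches(&AttrTest::Exact("url"), href));
    println!("absent:    {}", attr_matches(&AttrTest::Present, None));
}
```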
Modules§
Macros§
Structs§
- Attribute: A key-value pair representing an HTML element attribute.
- Element: A matched HTML element stored in the Store.
- ElementId
- Query: A compiled CSS query ready to be executed against an HTML document.
- QueryBuilder: An in-progress query being assembled via a builder pattern.
- QueryFactory: A factory for creating child QueryBuilders inside a QueryBuilder::then closure.
- QueryMultiplexer
- QuerySection: A single segment of a compiled Query tree.
- Reader
- Save: Controls which pieces of content to capture for matched elements.
- Store: The result set returned by parse.
- XHtmlElement: An HTML element as parsed from the token stream.
- XHtmlParser
Enums§
- SelectionKind: Whether a query section should match all occurrences or only the first one.