
Crate scah


§scah - Streaming CSS-selector-driven HTML extraction

scah (scan HTML) is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with CSS selectors; the library handles the streaming complexity and builds a targeted Store containing only your selections.

§Highlights

Feature               Detail
Streaming core        Built on StAX: constant memory regardless of document size
Familiar API          CSS selectors, including the > (child) and whitespace (descendant) combinators
Composable queries    Chain selections with QueryBuilder::then for hierarchical data extraction
Zero-copy             Element names, attributes, and inner HTML are &str slices into the source
Multi-language        Rust core with Python and TypeScript/JavaScript bindings
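
To make the zero-copy claim concrete, here is a minimal standard-library sketch of what "&str slices into the source" means. It does not use scah; `first_tag_name` and `borrows_from` are hypothetical helpers written only for this illustration, and the tag scan is deliberately naive.

```rust
/// Returns the name of the first opening tag in `source`.
/// The result is a slice borrowed from `source`; no new String is allocated.
/// (Hypothetical helper for illustration; not part of scah.)
fn first_tag_name(source: &str) -> Option<&str> {
    // Skip past the first '<'.
    let start = source.find('<')? + 1;
    let rest = &source[start..];
    // The name ends at whitespace, '>' or '/'.
    let end = rest.find(|c: char| c.is_ascii_whitespace() || c == '>' || c == '/')?;
    let name = &rest[..end];
    if name.is_empty() { None } else { Some(name) }
}

/// True if `slice` lies entirely inside the memory occupied by `source`,
/// i.e. it is a view into the original buffer rather than a copy.
fn borrows_from(source: &str, slice: &str) -> bool {
    let src_start = source.as_ptr() as usize;
    let slice_start = slice.as_ptr() as usize;
    slice_start >= src_start && slice_start + slice.len() <= src_start + source.len()
}
```

Because the returned name borrows from the input, the compiler requires the source string to outlive every slice taken from it. That is the usual trade-off of a zero-copy design: no per-element allocation, at the cost of keeping the source buffer alive.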

§Quick Start

use scah::{Query, Save, parse};

let html = r#"
    <main>
        <section>
            <a href="link1">Link 1</a>
            <a href="link2">Link 2</a>
        </section>
    </main>
"#;

// Build a query: find all <a> tags with an href attribute that are
// direct children of a <section> that is itself a direct child of <main>.
let queries = &[
    Query::all("main > section > a[href]", Save::all()).build()
];

let store = parse(html, queries);

// Iterate over matched elements
for element in store.get("main > section > a[href]").unwrap() {
    println!("{}: {}", element.name, element.attribute(&store, "href").unwrap());
}

§Structured Querying with .then()

Instead of flat filtering, you can nest queries using closures. Child queries only run within the context of their parent match, making extraction of hierarchical relationships both efficient and ergonomic:

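use scah::{Query, Save, parse};

// A small document to query; any HTML source string works here.
let html = r#"
    <main>
        <section>
            <a href="link1">Link 1</a>
            <div><a href="link2">Link 2</a></div>
        </section>
    </main>
"#;

let queries = &[
    // Parent query: every <section> that is a direct child of <main>.
    Query::all("main > section", Save::all())
        .then(|section| [
            // Child queries are evaluated relative to each matched <section>.
            section.all("> a[href]", Save::all()), // direct <a href> children
            section.all("div a", Save::all()),     // <a> nested under a <div>
        ])
        .build()
];

let store = parse(html, queries);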
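
The idea that child queries only run inside a parent match can be sketched without scah. The following is a simplified, hypothetical model (the `Event` and `ScopedChild` types are assumptions made for this example, not scah's internals): it reports `child` elements that are direct children of a `parent` element, and does no child matching at all while no parent is open. Nested parents are ignored for brevity.

```rust
/// Open/close events from a streaming parser (hypothetical type for this sketch).
enum Event<'a> {
    Open(&'a str),
    Close,
}

/// Matches `parent > child`, doing child work only while a parent is open.
struct ScopedChild<'a> {
    parent: &'a str,
    child: &'a str,
    depth: usize,                // current nesting depth
    parent_depth: Option<usize>, // depth of the open parent match, if any
}

impl<'a> ScopedChild<'a> {
    fn new(parent: &'a str, child: &'a str) -> Self {
        ScopedChild { parent, child, depth: 0, parent_depth: None }
    }

    /// Feeds one event; returns true when a scoped child match is found.
    fn feed(&mut self, event: Event<'_>) -> bool {
        match event {
            Event::Open(name) => {
                self.depth += 1;
                match self.parent_depth {
                    // Outside any parent: only look for the parent itself.
                    None => {
                        if name == self.parent {
                            self.parent_depth = Some(self.depth);
                        }
                        false
                    }
                    // Inside a parent: a direct child sits exactly one level deeper.
                    Some(pd) => name == self.child && self.depth == pd + 1,
                }
            }
            Event::Close => {
                // Closing the parent element ends the scope.
                if self.parent_depth == Some(self.depth) {
                    self.parent_depth = None;
                }
                self.depth = self.depth.saturating_sub(1);
                false
            }
        }
    }
}
```

In this model an `<a>` outside any `<section>` is never compared against the child selector, which is the efficiency argument made above for nesting queries instead of filtering a flat result list.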

§Architecture

Internally, scah is composed of the following layers:

  1. Reader: A zero-copy byte-level cursor over the HTML source.
  2. CSS selector compiler: Parses selector strings into a compact automaton of Query transitions.
  3. XHtmlParser: A streaming StAX parser that emits open/close events.
  4. QueryMultiplexer: Drives one or more query executors against the token stream simultaneously.
  5. Store: An arena-based result set that collects matched Elements, their attributes, and (optionally) inner HTML / text content.
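
As a rough illustration of how a selector can be evaluated against open/close events without building a DOM, here is a standard-library sketch of a matcher for a chain of child combinators such as `main > section > a`. The `Event` and `ChildChainMatcher` types are hypothetical and much simpler than the compiled automaton described above; the sketch only shows the streaming principle.

```rust
/// Open/close events, as a streaming parser might emit them
/// (hypothetical type for this sketch).
enum Event<'a> {
    Open(&'a str),
    Close,
}

/// Matches a chain of tag names joined by child combinators,
/// e.g. ["main", "section", "a"] for `main > section > a`.
struct ChildChainMatcher<'a> {
    chain: Vec<&'a str>,
    open: Vec<&'a str>, // names of the currently open elements
}

impl<'a> ChildChainMatcher<'a> {
    fn new(chain: &[&'a str]) -> Self {
        ChildChainMatcher { chain: chain.to_vec(), open: Vec::new() }
    }

    /// Feeds one event; returns true if the element just opened matches.
    fn feed(&mut self, event: Event<'a>) -> bool {
        match event {
            Event::Open(name) => {
                self.open.push(name);
                // The chain matches when it equals the innermost open elements.
                self.open.ends_with(&self.chain)
            }
            Event::Close => {
                self.open.pop();
                false
            }
        }
    }
}
```

The only state kept here is the stack of open element names, so the matcher's memory grows with nesting depth rather than with document size. Results that are kept (the role of the Store in the list above) are the part that grows with the number of matches.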

§Supported CSS Selector Syntax

Syntax                   Example              Status
Tag name                 a, div               Working
ID                       #my-id               Working
Class                    .my-class            Working
Descendant combinator    main section a       Working
Child combinator         main > section       Working
Attribute presence       a[href]              Working
Attribute exact match    a[href="url"]        Working
Attribute prefix         a[href^="https"]     Working
Attribute suffix         a[href$=".com"]      Working
Attribute substring      a[href*="example"]   Working
Adjacent sibling         h1 + p               Coming soon
General sibling          h1 ~ p               Coming soon
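
For readers unfamiliar with the attribute operators in the table, the sketch below spells out their meaning as defined by the CSS Selectors specification, using only the standard library. `AttrOp` and `attr_matches` are hypothetical names for this illustration; scah's own selector types may look quite different.

```rust
/// The attribute operators from the table (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AttrOp<'a> {
    Present,            // a[href]
    Exact(&'a str),     // a[href="url"]
    Prefix(&'a str),    // a[href^="https"]
    Suffix(&'a str),    // a[href$=".com"]
    Substring(&'a str), // a[href*="example"]
}

/// Returns true if an attribute with the given value satisfies `op`.
/// `value` is None when the element does not carry the attribute at all.
fn attr_matches(op: AttrOp<'_>, value: Option<&str>) -> bool {
    // Every operator requires the attribute to be present.
    let Some(v) = value else { return false };
    match op {
        AttrOp::Present => true,
        AttrOp::Exact(s) => v == s,
        // Per the CSS specification, ^=, $= and *= with an empty
        // string never match anything.
        AttrOp::Prefix(s) => !s.is_empty() && v.starts_with(s),
        AttrOp::Suffix(s) => !s.is_empty() && v.ends_with(s),
        AttrOp::Substring(s) => !s.is_empty() && v.contains(s),
    }
}
```

Note that `a[href]` matches even when the attribute value is empty, whereas the prefix, suffix, and substring forms compare against the value itself.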

Modules§

lazy

Macros§

dbg_print
mut_prt_unchecked

Structs§

Attribute
A key-value pair representing an HTML element attribute.
Element
A matched HTML element stored in the Store.
ElementId
Query
A compiled CSS query ready to be executed against an HTML document.
QueryBuilder
An in-progress query being assembled via a builder pattern.
QueryFactory
A factory for creating child QueryBuilders inside a QueryBuilder::then closure.
QueryMultiplexer
QuerySection
A single segment of a compiled Query tree.
Reader
Save
Controls which pieces of content to capture for matched elements.
Store
The result set returned by parse.
XHtmlElement
An HTML element as parsed from the token stream.
XHtmlParser

Enums§

SelectionKind
Whether a query section should match all occurrences or only the first one.

Functions§

parse
Parse an HTML string against one or more pre-built Query objects and return a Store containing all matched elements.