Crate scah


§scah - Streaming CSS-selector-driven HTML extraction

scah (scan HTML) is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with CSS selectors; the library handles the streaming complexity and builds a targeted Store containing only your selections.

§Highlights

| Feature | Detail |
|---|---|
| Streaming core | Built on StAX: constant memory regardless of document size |
| Familiar API | CSS selectors, including the `>` (child) and whitespace (descendant) combinators |
| Composable queries | Chain selections with QueryBuilder::then for hierarchical data extraction |
| Zero-copy | Element names, attributes, and inner HTML are &str slices into the source |
| Multi-language | Rust core with Python and TypeScript/JavaScript bindings |
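
To make the zero-copy row concrete, here is a minimal sketch of what "a &str slice into the source" means in plain Rust. It does not use scah; `first_tag_name` and `borrows_from` are hypothetical helpers written only for this illustration, and the tag scanning is deliberately naive.

```rust
/// Hypothetical helper (not part of scah): returns the name of the first
/// opening tag in `source` as a slice borrowed from `source`, with no copy.
fn first_tag_name(source: &str) -> Option<&str> {
    // Skip past the first '<'.
    let start = source.find('<')? + 1;
    let rest = &source[start..];
    // The name ends at the first whitespace, '>' or '/'.
    let end = rest.find(|c: char| c.is_whitespace() || c == '>' || c == '/')?;
    if end == 0 {
        return None; // e.g. a closing tag such as "</a>"
    }
    Some(&rest[..end])
}

/// True when `slice` points into the memory occupied by `source`.
fn borrows_from(source: &str, slice: &str) -> bool {
    let src = source.as_ptr() as usize;
    let sub = slice.as_ptr() as usize;
    sub >= src && sub + slice.len() <= src + source.len()
}

fn main() {
    let html = String::from(r#"<a href="link1">Link 1</a>"#);
    let name = first_tag_name(&html).expect("an opening tag");
    println!("tag name: {name}");
    // The returned name is a view into `html`, not a new allocation.
    assert!(borrows_from(&html, name));
}
```

Because the slice borrows from the input, the borrow checker requires the source string to outlive anything extracted from it. The same constraint applies to any zero-copy result set.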

§Quick Start

use scah::{Query, Save, parse};

let html = r#"
    <main>
        <section>
            <a href="link1">Link 1</a>
            <a href="link2">Link 2</a>
        </section>
    </main>
"#;

// Build a query: find all <a> tags with an href attribute
// that are direct children of a <section>, which is itself
// a direct child of <main>.
let queries = &[
    Query::all("main > section > a[href]", Save::all())
        .expect("valid selector")
        .build()
];

let store = parse(html, queries);

// Iterate over matched elements
for element in store.get("main > section > a[href]").unwrap() {
    println!("{}: {}", element.name, element.attribute(&store, "href").unwrap());
}

§Structured Querying with .then()

Instead of flat filtering, you can nest queries using closures. Child queries only run within the context of their parent match, making extraction of hierarchical relationships both efficient and ergonomic:

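use scah::{Query, Save, parse};

// Sample input: one <a> that is a direct child of <section>
// and one <a> nested inside a <div>.
let html = r#"
    <main>
        <section>
            <a href="link1">Link 1</a>
            <div><a href="link2">Link 2</a></div>
        </section>
    </main>
"#;

let queries = &[Query::all("main > section", Save::all())
    .expect("valid selector")
    .then(|section| {
        Ok([
            // <a href> elements that are direct children of the matched <section>
            section.all("> a[href]", Save::all())?,
            // <a> elements nested in a <div> anywhere inside the matched <section>
            section.all("div a", Save::all())?,
        ])
    })
    .expect("valid child selectors")
    .build()];

let store = parse(html, queries);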
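
The scoping rule (child queries only run while a parent match is open) can be pictured with a toy event loop. This is a minimal sketch that assumes an already tokenized stream of open/close events. The `Event` enum and `count_children_per_parent` are hypothetical names for this illustration and say nothing about how scah implements `.then()` internally.

```rust
#[derive(Debug, Clone, Copy)]
enum Event<'a> {
    Open(&'a str),
    Close(&'a str),
}

/// For every `parent` element, counts the `child` elements opened while that
/// parent is still open. Children outside any parent are never examined.
/// A `parent` nested inside an open `parent` is treated as ordinary content.
fn count_children_per_parent(events: &[Event<'_>], parent: &str, child: &str) -> Vec<usize> {
    let mut results = Vec::new();
    let mut depth = 0usize;
    // Depth at which the currently active parent match was opened, if any.
    let mut parent_depth: Option<usize> = None;

    for event in events {
        match *event {
            Event::Open(name) => {
                depth += 1;
                if parent_depth.is_none() && name == parent {
                    parent_depth = Some(depth);
                    results.push(0);
                } else if parent_depth.is_some() && name == child {
                    if let Some(count) = results.last_mut() {
                        *count += 1;
                    }
                }
            }
            Event::Close(_) => {
                // Leaving the active parent deactivates the child query.
                if parent_depth == Some(depth) {
                    parent_depth = None;
                }
                depth = depth.saturating_sub(1);
            }
        }
    }
    results
}

fn main() {
    use Event::{Close, Open};
    let events = [
        Open("main"),
        Open("section"), Open("a"), Close("a"), Open("a"), Close("a"), Close("section"),
        Open("a"), Close("a"), // outside any <section>: ignored
        Open("section"), Open("a"), Close("a"), Close("section"),
        Close("main"),
    ];
    let counts = count_children_per_parent(&events, "section", "a");
    println!("{counts:?}");
    assert_eq!(counts, vec![2, 1]);
}
```

Each result is grouped under the parent that was open at the time. That grouping is what makes the nested form useful for hierarchical data.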

§Architecture

Internally, scah is composed of the following layers:

  1. Reader: A zero-copy byte-level cursor over the HTML source.
  2. CSS selector compiler: Parses selector strings into a compact automaton of Query transitions.
  3. XHtmlParser: A streaming StAX parser that emits open/close events.
  4. QueryMultiplexer: Drives one or more query executors against the token stream simultaneously.
  5. Store: An arena-based result set that collects matched Elements, their attributes, and (optionally) inner HTML / text content.
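
The layering above can be sketched end to end in a few dozen lines. The following is a toy model, not scah's code: `tokenize` stands in for the reader and streaming parser, `match_child_chain` stands in for a single query executor restricted to the `>` combinator, and the returned `Vec` plays the role of the store. All names are hypothetical, and the tokenizer ignores comments, doctypes, and `>` characters inside attribute values.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event<'a> {
    Open(&'a str),
    Close(&'a str),
}

/// Toy tokenizer: emits open/close events whose names borrow from `src`.
fn tokenize(src: &str) -> Vec<Event<'_>> {
    let mut events = Vec::new();
    let mut rest = src;
    while let Some(lt) = rest.find('<') {
        let after = &rest[lt + 1..];
        let Some(gt) = after.find('>') else { break };
        let tag = &after[..gt];
        if let Some(name) = tag.strip_prefix('/') {
            events.push(Event::Close(name.trim()));
        } else {
            // The tag name runs up to the first whitespace or '/'.
            let name = tag
                .split(|c: char| c.is_whitespace() || c == '/')
                .next()
                .unwrap_or("");
            if !name.is_empty() {
                events.push(Event::Open(name));
                if tag.ends_with('/') {
                    // Self-closing tag: emit the matching close immediately.
                    events.push(Event::Close(name));
                }
            }
        }
        rest = &after[gt + 1..];
    }
    events
}

/// Matches a chain of tag names joined by `>` (for example `main > section > a`)
/// by comparing the chain with the tail of the open-element stack.
/// Returns the index of each matching `Open` event.
fn match_child_chain(events: &[Event<'_>], chain: &[&str]) -> Vec<usize> {
    let mut stack: Vec<&str> = Vec::new();
    let mut hits = Vec::new();
    for (i, event) in events.iter().enumerate() {
        match *event {
            Event::Open(name) => {
                stack.push(name);
                if stack.ends_with(chain) {
                    hits.push(i);
                }
            }
            Event::Close(_) => {
                stack.pop();
            }
        }
    }
    hits
}

fn main() {
    let html = r#"<main><section><a href="link1">Link 1</a><a href="link2">Link 2</a></section></main>"#;
    let events = tokenize(html);
    let hits = match_child_chain(&events, &["main", "section", "a"]);
    println!("matched open events at indices {hits:?}");
    assert_eq!(hits, vec![2, 4]);
}
```

Note that the matcher only keeps the stack of currently open elements, so its working memory grows with nesting depth rather than document size. The toy tokenizer, by contrast, collects all events into a `Vec` for simplicity. A real streaming parser would hand each event to the matcher as it is produced.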

§Supported CSS Selector Syntax

| Syntax | Example | Status |
|---|---|---|
| Tag name | a, div | Working |
| ID | #my-id | Working |
| Class | .my-class | Working |
| Descendant combinator | main section a | Working |
| Child combinator | main > section | Working |
| Attribute presence | a[href] | Working |
| Attribute exact match | a[href="url"] | Working |
| Attribute prefix | a[href^="https"] | Working |
| Attribute suffix | a[href$=".com"] | Working |
| Attribute substring | a[href*="example"] | Working |
| Adjacent sibling | h1 + p | Coming soon |
| General sibling | h1 ~ p | Coming soon |
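
For readers less familiar with the attribute operators in the table, the sketch below spells out their matching rules as defined by the CSS Selectors specification. It is independent of scah: `AttrOp` and `attr_matches` are hypothetical names for this illustration and are not the crate's `AttributeSelectionKind` or any of its internals. Whether scah follows the specification's empty-string rule for `^=`, `$=`, and `*=` is not stated in this documentation, so treat that branch as an assumption.

```rust
/// Hypothetical model of the attribute operators listed above.
#[derive(Debug, Clone, Copy)]
enum AttrOp<'a> {
    Present,            // [href]
    Exact(&'a str),     // [href="url"]
    Prefix(&'a str),    // [href^="https"]
    Suffix(&'a str),    // [href$=".com"]
    Substring(&'a str), // [href*="example"]
}

/// `value` is `None` when the element does not carry the attribute at all.
fn attr_matches(value: Option<&str>, op: AttrOp<'_>) -> bool {
    match (value, op) {
        (None, _) => false,
        (Some(_), AttrOp::Present) => true,
        (Some(v), AttrOp::Exact(x)) => v == x,
        // Per the Selectors specification, an empty operand for ^=, $= and *=
        // matches nothing.
        (Some(v), AttrOp::Prefix(x)) => !x.is_empty() && v.starts_with(x),
        (Some(v), AttrOp::Suffix(x)) => !x.is_empty() && v.ends_with(x),
        (Some(v), AttrOp::Substring(x)) => !x.is_empty() && v.contains(x),
    }
}

fn main() {
    let href = Some("https://example.com");
    assert!(attr_matches(href, AttrOp::Present));
    assert!(attr_matches(href, AttrOp::Prefix("https")));
    assert!(attr_matches(href, AttrOp::Suffix(".com")));
    assert!(attr_matches(href, AttrOp::Substring("example")));
    assert!(!attr_matches(href, AttrOp::Exact("url")));
    // An element without the attribute never matches.
    assert!(!attr_matches(None, AttrOp::Present));
    println!("all attribute checks passed");
}
```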

Modules§

lazy

Macros§

dbg_print
mut_prt_unchecked
query

Structs§

Attribute
AttributeSelection
Element
A matched HTML element stored in the Store.
ElementId
ElementPredicate
Position
Query
QueryBuilder
An in-progress query being assembled via a builder pattern.
QueryFactory
A factory for creating child QueryBuilders inside a QueryBuilder::then closure.
QueryMultiplexer
QuerySection
Reader
Save
Controls which pieces of content to capture for matched elements.
SelectorParseError
StaticQuery
Store
The result set returned by parse.
Transition
XHtmlElement
A key-value pair representing an HTML element attribute.
XHtmlParser

Enums§

AttributeSelectionKind
AttributeSelections
ClassSelections
Combinator
SelectionKind
Whether a query section should match all occurrences or only the first one.

Traits§

IElement
Element Interface
QuerySpec

Functions§

parse
Parse an HTML string against one or more pre-built Query objects and return a Store containing all matched elements.