//! # scah - Streaming CSS-selector-driven HTML extraction
//!
//! **scah** (*scan HTML*) is a high-performance parsing library that bridges the gap
//! between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an
//! entire document into memory or manually tracking parser state, you declare what
//! you want with **CSS selectors**; the library handles the streaming complexity and
//! builds a targeted [`Store`] containing only your selections.
//!
//! ## Highlights
//!
//! | Feature | Detail |
//! |---------|--------|
//! | **Streaming core** | Built on StAX: constant memory regardless of document size |
//! | **Familiar API** | CSS selectors including `>` (child) and ` ` (descendant) combinators |
//! | **Composable queries** | Chain selections with [`QueryBuilder::then`] for hierarchical data extraction |
//! | **Zero-copy** | Element names, attributes, and inner HTML are `&str` slices into the source |
//! | **Multi-language** | Rust core with Python and TypeScript/JavaScript bindings |
//!
//! ## Quick Start
//!
//! ```rust
//! use scah::{Query, Save, parse};
//!
//! let html = r#"
//!     <main>
//!         <section>
//!             <a href="link1">Link 1</a>
//!             <a href="link2">Link 2</a>
//!         </section>
//!     </main>
//! "#;
//!
//! // Build a query: find every <a> with an href attribute that is a direct
//! // child of a <section>, which is itself a direct child of <main>.
//! let queries = &[
//!     Query::all("main > section > a[href]", Save::all()).build()
//! ];
//!
//! let store = parse(html, queries);
//!
//! // Iterate over the matched elements, looked up by the original selector string
//! for element in store.get("main > section > a[href]").unwrap() {
//!     println!("{}: {}", element.name, element.attribute(&store, "href").unwrap());
//! }
//! ```
//!
//! ## Structured Querying with `.then()`
//!
//! Instead of flat filtering, you can nest queries using closures.
//! Child queries only run within the context of their parent match,
//! making extraction of hierarchical relationships both efficient and ergonomic:
//!
//! ```rust
//! use scah::{Query, Save, parse};
//!
//! # let html = "<main><section><a href='x'>Link</a></section></main>";
//! let queries = &[
//!     Query::all("main > section", Save::all())
//!         .then(|section| [
//!             section.all("> a[href]", Save::all()),
//!             section.all("div a", Save::all()),
//!         ])
//!         .build()
//! ];
//!
//! let store = parse(html, queries);
//! ```
//!
//! ## Architecture
//!
//! Internally, scah is composed of the following layers:
//!
//! 1. **[`Reader`]**: A zero-copy byte-level cursor over the HTML source.
//! 2. **CSS selector compiler**: Parses selector strings into a compact
//!    automaton of [`Query`] transitions.
//! 3. **[`XHtmlParser`]**: A streaming StAX parser that emits open/close events.
//! 4. **[`QueryMultiplexer`]**: Drives one or more query executors against
//!    the token stream simultaneously.
//! 5. **[`Store`]**: An arena-based result set that collects matched
//!    [`Element`]s, their attributes, and (optionally) inner HTML / text content.
//!
//! ## Supported CSS Selector Syntax
//!
//! | Syntax | Example | Status |
//! |--------|---------|--------|
//! | **Tag name** | `a`, `div` | Working |
//! | **ID** | `#my-id` | Working |
//! | **Class** | `.my-class` | Working |
//! | **Descendant combinator** | `main section a` | Working |
//! | **Child combinator** | `main > section` | Working |
//! | **Attribute presence** | `a[href]` | Working |
//! | **Attribute exact match** | `a[href="url"]` | Working |
//! | **Attribute prefix** | `a[href^="https"]` | Working |
//! | **Attribute suffix** | `a[href$=".com"]` | Working |
//! | **Attribute substring** | `a[href*="example"]` | Working |
//! | **Adjacent sibling** | `h1 + p` | Coming soon |
//! | **General sibling** | `h1 ~ p` | Coming soon |
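//!
//! The attribute operators in the table carry their usual CSS meanings:
//! presence, exact match, prefix (`^=`), suffix (`$=`), and substring (`*=`).
//! As a minimal, self-contained sketch of those semantics (not a description
//! of scah's internals), the block below uses two hypothetical names,
//! `AttrOp` and `attr_matches`, that are assumptions made for illustration
//! and are not part of the scah API:
//!
//! ```rust
//! /// Hypothetical operator enum, for illustration only (not a scah type).
//! #[derive(Debug, Clone, Copy, PartialEq, Eq)]
//! enum AttrOp {
//!     Presence,  // [attr]
//!     Exact,     // [attr="v"]
//!     Prefix,    // [attr^="v"]
//!     Suffix,    // [attr$="v"]
//!     Substring, // [attr*="v"]
//! }
//!
//! /// Returns `true` when the attribute `value` (`None` if the attribute is
//! /// absent) satisfies `op` against `needle`.
//! fn attr_matches(op: AttrOp, value: Option<&str>, needle: &str) -> bool {
//!     match (op, value) {
//!         // An absent attribute never matches any operator.
//!         (_, None) => false,
//!         (AttrOp::Presence, Some(_)) => true,
//!         (AttrOp::Exact, Some(v)) => v == needle,
//!         // In CSS, `^=`, `$=`, and `*=` with an empty value match nothing.
//!         (AttrOp::Prefix, Some(v)) => !needle.is_empty() && v.starts_with(needle),
//!         (AttrOp::Suffix, Some(v)) => !needle.is_empty() && v.ends_with(needle),
//!         (AttrOp::Substring, Some(v)) => !needle.is_empty() && v.contains(needle),
//!     }
//! }
//! ```
//!
//! Under these assumptions,
//! `attr_matches(AttrOp::Prefix, Some("https://example.com"), "https")`
//! evaluates to `true`, mirroring the selector `a[href^="https"]`.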

mod css;
mod sax;
mod selection_engine;
mod store;
mod utils;

pub use css::selector::lazy;
pub use css::selector::{Query, QueryBuilder, QueryFactory, QuerySection, Save, SelectionKind};
pub use sax::element::builder::{Attribute, XHtmlElement};
pub use sax::parser::XHtmlParser;
pub use selection_engine::multiplexer::QueryMultiplexer;
pub use store::{Element, ElementId, Store};
pub use utils::Reader;

/// Parse an HTML string against one or more pre-built [`Query`] objects and
/// return a [`Store`] containing all matched elements.
///
/// This is the main entry point of scah. It wires together the streaming
/// [`XHtmlParser`], the [`QueryMultiplexer`], and the result [`Store`].
///
/// # Parameters
///
/// - `html`: The HTML source string. All returned string slices in the
///   resulting [`Store`] borrow directly from this string (zero-copy).
/// - `queries`: A slice of compiled [`Query`] objects. All queries are
///   evaluated together against the same token stream in a single pass.
///
/// # Returns
///
/// A [`Store`] containing all matched elements. Use [`Store::get`] with the
/// original selector string to retrieve results for a specific query.
///
/// # Example
///
/// ```rust
/// use scah::{Query, Save, parse};
///
/// let html = "<div><a href='link'>Hello</a></div>";
/// let queries = &[Query::all("a", Save::all()).build()];
/// let store = parse(html, queries);
///
/// let links: Vec<_> = store.get("a").unwrap().collect();
/// assert_eq!(links.len(), 1);
/// assert_eq!(links[0].name, "a");
/// ```
pub fn parse<'a: 'query, 'html: 'query, 'query: 'html>(
    html: &'html str,
    queries: &'a [Query<'query>],
) -> Store<'html, 'query> {
    let selectors = QueryMultiplexer::new(queries);

    let no_extra_allocations = queries.iter().all(|q| q.exit_at_section_end.is_some());
    let mut parser = if no_extra_allocations {
        XHtmlParser::new(selectors)
    } else {
        XHtmlParser::with_capacity(selectors, html.len())
    };

    let mut reader = Reader::new(html);
    while parser.next(&mut reader) {}

    parser.matches()
}