CSS-selector and regex based data extractor for raw HTML.
This module is the fallback extraction layer that fires when structured data (JSON-LD, OpenGraph, microdata) is absent or incomplete. It walks the DOM using CSS selectors and applies regex patterns against visible text to pull out commerce attributes (price, rating, availability), classify the page type, and discover interactive actions (forms, buttons, CTAs).
Selector patterns are loaded at compile time from css_selectors.json via
include_str!. All public entry points are synchronous because the
scraper crate’s types are !Send; callers should wrap each call in
tokio::task::spawn_blocking when integrating with the async runtime.
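The offloading pattern can be sketched with the standard library alone. This is a minimal illustration, not this module's code: `ParsedDoc`, `extract_title_sync`, and `extract_title_offloaded` are hypothetical names, `Rc` stands in for the `!Send` types of the scraper crate, and `std::thread::spawn` stands in for `tokio::task::spawn_blocking` so the example has no dependencies. The point is that only owned, `Send` data (a `String` in, an `Option<String>` out) crosses the thread boundary, while the `!Send` value is created and dropped inside the closure.

```rust
use std::rc::Rc;
use std::thread;

/// Hypothetical stand-in for a parsed document. Holding an `Rc` makes the
/// struct `!Send`, mirroring the constraint described for the scraper types.
struct ParsedDoc {
    text: Rc<str>,
}

/// Hypothetical synchronous extractor. It builds the `!Send` value, reads
/// from it, and returns only owned data that is safe to send across threads.
fn extract_title_sync(html: &str) -> Option<String> {
    let doc = ParsedDoc { text: Rc::from(html) };
    let start = doc.text.find("<title>")? + "<title>".len();
    let end = doc.text[start..].find("</title>")? + start;
    Some(doc.text[start..end].trim().to_string())
}

/// Runs the synchronous extractor on a separate thread. The closure owns the
/// input `String`; the `ParsedDoc` never leaves the worker thread.
fn extract_title_offloaded(html: String) -> Option<String> {
    thread::spawn(move || extract_title_sync(&html))
        .join()
        .expect("extraction thread panicked")
}
```

In an async context the same closure would presumably be handed to `tokio::task::spawn_blocking` and the returned handle awaited, which yields a `Result` that is `Err` if the blocking task panicked.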
§Confidence model
Every extracted value is paired with a confidence score in [0.0, 1.0].
Data attributes and itemprop selectors score highest (0.95) because they
carry explicit semantic intent. Generic CSS class selectors score lower
(0.85), and regex matches on free text are the least confident (0.70).
The caller can decide a threshold below which data is discarded.
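Caller-side thresholding might look like the sketch below. The `Scored` struct, the three constants, and `filter_by_confidence` are hypothetical names introduced for illustration; the actual fields of PatternResult are not shown on this page. Only the three score values come from the description above.

```rust
/// Hypothetical stand-in for one extracted value and its confidence score.
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub field: String,
    pub value: String,
    pub confidence: f32,
}

// Confidence tiers taken from the description above; the constant names
// themselves are assumptions.
pub const CONF_SEMANTIC: f32 = 0.95; // data attributes and itemprop selectors
pub const CONF_CSS_CLASS: f32 = 0.85; // generic CSS class selectors
pub const CONF_REGEX: f32 = 0.70; // regex matches on free text

/// Keeps only the values whose confidence is at or above `threshold`.
pub fn filter_by_confidence(values: Vec<Scored>, threshold: f32) -> Vec<Scored> {
    values
        .into_iter()
        .filter(|v| v.confidence >= threshold)
        .collect()
}
```

With a threshold of 0.80, for example, values found through data attributes, itemprop, or CSS classes are kept, while regex-only matches are dropped.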
Structs§
- DiscoveredAction - Action discovered from HTML patterns (forms, buttons, links).
- DiscoveredForm - Form discovered from HTML.
- PatternResult - Result of pattern-based extraction.
Functions§
- extract_from_patterns - Extract data from raw HTML using CSS selectors and regex patterns.