Module pattern_engine


CSS-selector- and regex-based data extractor for raw HTML.

This module is the fallback extraction layer that fires when structured data (JSON-LD, OpenGraph, microdata) is absent or incomplete. It walks the DOM using CSS selectors and applies regex patterns against visible text to pull out commerce attributes (price, rating, availability), classify the page type, and discover interactive actions (forms, buttons, CTAs).

Selector patterns are loaded at compile time from css_selectors.json via include_str!. All public entry points are synchronous because the scraper crate's types are !Send. Callers should wrap calls in tokio::task::spawn_blocking when integrating with the async runtime.

§Confidence model

Every extracted value is paired with a confidence score in [0.0, 1.0]. Data attributes and itemprop selectors score highest (0.95) because they carry explicit semantic intent. Generic CSS class selectors score lower (0.85), and regex matches on free text are the least confident (0.70). Callers can choose a threshold and discard any value that scores below it.
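
As a rough illustration of the confidence model, the sketch below assigns the three scores quoted above by extraction source and filters by a caller-chosen threshold. The names Source, Scored, scored and keep_confident are hypothetical and are not part of this module's API. The 0.80 threshold is likewise only an example value.

```rust
// Hypothetical types for illustration only; not the module's actual API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    SemanticAttribute, // data attribute or itemprop selector
    CssClass,          // generic CSS class selector
    TextRegex,         // regex match on free text
}

impl Source {
    // Scores taken from the confidence model described above.
    fn confidence(self) -> f32 {
        match self {
            Source::SemanticAttribute => 0.95,
            Source::CssClass => 0.85,
            Source::TextRegex => 0.70,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct Scored {
    field: &'static str,
    value: String,
    confidence: f32,
}

// Pair an extracted value with the confidence of the source that produced it.
fn scored(field: &'static str, value: &str, source: Source) -> Scored {
    Scored {
        field,
        value: value.to_string(),
        confidence: source.confidence(),
    }
}

// Keep only values whose confidence meets the caller's threshold.
fn keep_confident(values: Vec<Scored>, threshold: f32) -> Vec<Scored> {
    values
        .into_iter()
        .filter(|v| v.confidence >= threshold)
        .collect()
}

fn main() {
    let extracted = vec![
        scored("price", "19.99", Source::SemanticAttribute),
        scored("availability", "in stock", Source::CssClass),
        scored("rating", "4.5", Source::TextRegex),
    ];

    // With an example threshold of 0.80, the regex-derived rating is dropped.
    for v in keep_confident(extracted, 0.80) {
        println!("{} = {} ({:.2})", v.field, v.value, v.confidence);
    }
}
```

A caller that prefers recall over precision could lower the threshold to 0.70 so that regex matches on free text are kept as well.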

Structs§

DiscoveredAction: Action discovered from HTML patterns (forms, buttons, links).
DiscoveredForm: Form discovered from HTML.
PatternResult: Result of pattern-based extraction.

Functions§

extract_from_patterns: Extract data from raw HTML using CSS selectors and regex patterns.