Module pattern_engine

CSS-selector and regex based data extractor for raw HTML.

This module is the fallback extraction layer that fires when structured data (JSON-LD, OpenGraph, microdata) is absent or incomplete. It walks the DOM using CSS selectors and applies regex patterns against visible text to pull out commerce attributes (price, rating, availability), classify the page type, and discover interactive actions (forms, buttons, CTAs).

Selector patterns are embedded at compile time from css_selectors.json via include_str!. All public entry points are synchronous because the scraper crate’s types are !Send; callers on the async runtime should wrap each call in tokio::task::spawn_blocking.
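The shape of that wrapping can be sketched with the standard library alone. In this minimal sketch, `std::thread::spawn` stands in for `tokio::task::spawn_blocking`, and `ParsedDoc`, `count_price_markers`, and `extract_off_thread` are hypothetical names invented for illustration; none of them belong to this module. The point is only that the `!Send` value is created and dropped inside the closure, while an owned `String` goes in and a plain value comes out.

```rust
use std::rc::Rc;
use std::thread;

/// Hypothetical stand-in for a parsed DOM. Holding an `Rc` makes the
/// struct `!Send`, so it cannot be moved to another thread.
struct ParsedDoc {
    text: Rc<String>,
}

/// Hypothetical synchronous extractor: builds the `!Send` document and
/// returns a plain `usize`, which is `Send`.
fn count_price_markers(html: &str) -> usize {
    let doc = ParsedDoc {
        text: Rc::new(html.to_owned()),
    };
    // Count dollar signs as a toy substitute for real price extraction.
    doc.text.matches('$').count()
}

/// Runs the extractor on a separate thread. Only the owned `String`
/// crosses into the closure and only the count crosses back, so the
/// `!Send` document never leaves the thread that created it.
fn extract_off_thread(html: String) -> usize {
    thread::spawn(move || count_price_markers(&html))
        .join()
        .expect("extraction thread panicked")
}
```

With tokio, the same closure would be passed to `tokio::task::spawn_blocking` and the returned handle awaited instead of joined.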

§Confidence model

Every extracted value is paired with a confidence score in [0.0, 1.0]. Data attributes and itemprop selectors score highest (0.95) because they carry explicit semantic intent. Generic CSS class selectors score lower (0.85), and regex matches on free text are the least confident (0.70). The caller can decide a threshold below which data is discarded.
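The tiers and the caller-side threshold can be sketched as follows. The three scores come from the description above; the `Source` and `Scored` types and the `scored` and `keep_confident` helpers are assumptions made for illustration and are not this module's API.

```rust
/// Where an extracted value came from. Illustrative only; these are
/// not the module's actual types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Source {
    /// Data attributes and itemprop selectors.
    DataAttribute,
    /// Generic CSS class selectors.
    CssClass,
    /// Regex matches on free text.
    Regex,
}

impl Source {
    /// Confidence score for each tier, as described in the text above.
    pub fn confidence(self) -> f32 {
        match self {
            Source::DataAttribute => 0.95,
            Source::CssClass => 0.85,
            Source::Regex => 0.70,
        }
    }
}

/// An extracted value paired with its confidence score.
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub field: String,
    pub value: String,
    pub confidence: f32,
}

/// Builds a scored value from a field name, raw value, and source tier.
pub fn scored(field: &str, value: &str, source: Source) -> Scored {
    Scored {
        field: field.to_owned(),
        value: value.to_owned(),
        confidence: source.confidence(),
    }
}

/// Discards every value whose confidence is below `threshold`;
/// values exactly at the threshold are kept.
pub fn keep_confident(values: Vec<Scored>, threshold: f32) -> Vec<Scored> {
    values
        .into_iter()
        .filter(|v| v.confidence >= threshold)
        .collect()
}
```

A threshold of 0.80, for example, keeps values from the two selector tiers and drops regex matches.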

Structs§

DiscoveredAction
Action discovered from HTML patterns (forms, buttons, links).
DiscoveredForm
Form discovered from HTML.
PatternResult
Result of pattern-based extraction.

Functions§

extract_from_patterns
Extract data from raw HTML using CSS selectors and regex patterns.