Crate crw_extract

HTML content extraction and format conversion for the CRW web scraper.

Converts raw HTML into clean, structured output formats.

Modules

answer: Multi-source LLM answer synthesis for /v1/search.
antibot: Anti-bot detection — port of crawl4ai’s antibot_detector.py.
chunking
clean
dom_features: DOM-side features fed into the markdown quality scorer.
dom_util: Lightweight DOM walking helpers shared by the listing-detection gate.
filter
judge: LLM meaningful-change judge for change-tracking / monitors.
llm: LLM provider dispatch (Anthropic, OpenAI, OpenAI-compatible, Azure).
markdown
pdf: PDF → markdown adapter over the pure-Rust pdf_inspector crate.
plaintext
pricing: Best-effort LLM pricing table for cost estimation.
quality: Markdown quality scoring used to drive escalation logic.
readability
selector
structured
summary: Single-page summarization via LLM.
tables: Data-table vs. layout-table classifier.

Structs

DebugCollector: Per-request collector for extraction debug traces. Wired in through ExtractOptions::debug_sink; the extractor pushes one DebugAttempt per extract() invocation, capturing the candidate ladder and the chosen output. Wrapped in an Arc<Mutex<_>> so the renderer / multi-attempt loop in crw-crawl can share a single sink across the JS-escalation retry.
ExtractOptions: Options for the high-level extraction pipeline.
LlmFallbackParams: Parameters for the LLM-assisted extraction fallback. See LlmFallbackConfig.
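
The sharing pattern described for DebugCollector can be illustrated with a minimal, standard-library-only sketch. Everything named here (`Attempt`, `SharedSink`, `record_two_attempts`) is a hypothetical stand-in, not the crate's actual types or fields; the only detail taken from the description above is that the sink lives behind an `Arc<Mutex<_>>` so that several attempts can push into the same buffer.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for one recorded extraction attempt
/// (not the crate's real `DebugAttempt`).
#[derive(Debug, Clone, PartialEq)]
struct Attempt {
    label: String,
    candidates: Vec<String>,
    chosen: String,
}

/// Hypothetical shared sink. Cloning it clones the `Arc`, so every clone
/// pushes into the same underlying buffer.
#[derive(Debug, Clone, Default)]
struct SharedSink {
    inner: Arc<Mutex<Vec<Attempt>>>,
}

impl SharedSink {
    /// Append one attempt to the shared buffer.
    fn push(&self, attempt: Attempt) {
        self.inner.lock().expect("sink mutex poisoned").push(attempt);
    }

    /// Copy out everything recorded so far.
    fn snapshot(&self) -> Vec<Attempt> {
        self.inner.lock().expect("sink mutex poisoned").clone()
    }
}

/// Simulates a first attempt followed by a retry that runs elsewhere
/// (here, on another thread) but records into the same sink.
fn record_two_attempts() -> Vec<Attempt> {
    let sink = SharedSink::default();

    sink.push(Attempt {
        label: "static".to_string(),
        candidates: vec!["readability".to_string(), "full-body".to_string()],
        chosen: "full-body".to_string(),
    });

    // The retry gets its own handle to the same buffer.
    let retry_sink = sink.clone();
    thread::spawn(move || {
        retry_sink.push(Attempt {
            label: "js-rendered".to_string(),
            candidates: vec!["readability".to_string()],
            chosen: "readability".to_string(),
        });
    })
    .join()
    .expect("retry thread panicked");

    sink.snapshot()
}
```

Because both handles point at one buffer, the caller that created the sink sees the traces from every attempt without the attempts needing to return them.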

Functions

debug_candidate: Convenience: lift a single candidate description into a DebugCandidate.
extract: High-level extraction: given raw HTML + options, produce ScrapeData.
maybe_run_llm_fallback: Re-extract via the configured LLM provider when the current markdown scores below params.quality_threshold. If the LLM result has a higher quality score, it replaces data.markdown in place and a warning is appended noting the swap. On any failure (network, auth, parse) the original markdown is preserved and the error is logged.
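
The control flow described for maybe_run_llm_fallback can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: `PageData`, `toy_score`, and `fallback_sketch` are hypothetical names, the scorer is a deliberately crude placeholder for the real quality module, the LLM call is modeled as a closure, and `eprintln!` stands in for whatever logging the crate uses.

```rust
/// Hypothetical stand-in for the extracted page data.
#[derive(Debug, Clone, PartialEq)]
struct PageData {
    markdown: String,
    warnings: Vec<String>,
}

/// Placeholder quality score: the fraction of characters that are letters.
/// The real scorer is assumed to be far more involved.
fn toy_score(markdown: &str) -> f64 {
    let total = markdown.chars().count();
    if total == 0 {
        return 0.0;
    }
    let letters = markdown.chars().filter(|c| c.is_alphabetic()).count();
    letters as f64 / total as f64
}

/// Sketch of the fallback decision:
/// 1. do nothing if the current markdown already meets the threshold;
/// 2. otherwise ask the (hypothetical) LLM closure for a re-extraction;
/// 3. swap it in only if it scores higher, and record a warning;
/// 4. on any error, keep the original markdown and log the failure.
fn fallback_sketch<F>(data: &mut PageData, quality_threshold: f64, llm: F)
where
    F: FnOnce(&str) -> Result<String, String>,
{
    let current = toy_score(&data.markdown);
    if current >= quality_threshold {
        return;
    }

    match llm(&data.markdown) {
        Ok(candidate) => {
            let new_score = toy_score(&candidate);
            if new_score > current {
                data.markdown = candidate;
                data.warnings.push(format!(
                    "markdown replaced by LLM fallback (score {current:.2} to {new_score:.2})"
                ));
            }
        }
        Err(err) => {
            eprintln!("LLM fallback failed, keeping original markdown: {err}");
        }
    }
}
```

The swap is guarded twice, first by the threshold and then by the score comparison, so a fallback result that is no better than the original never overwrites it.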