Crate crw_extract

HTML content extraction and format conversion for the CRW web scraper.

Converts raw HTML into clean, structured output formats:

  • Markdown — via markdown::html_to_markdown (htmd)
  • Plain text — via plaintext::html_to_plaintext
  • Cleaned HTML — boilerplate removal with clean::clean_html
  • Readability — main-content extraction with text-density scoring
  • CSS/XPath selector — narrow content to a specific element
  • Chunking — split content into sentence/topic/regex chunks
  • Filtering — BM25 or cosine-similarity ranking of chunks
  • Structured JSON — LLM-based extraction with JSON Schema validation
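The filtering step above ranks chunks by relevance to a query. As a rough, standalone illustration of BM25 ranking (not this crate's implementation), the sketch below scores chunks with the standard Okapi BM25 formula. The names `tokenize` and `bm25_rank`, the naive tokenizer, and the constants `K1 = 1.2` and `B = 0.75` are assumptions chosen for the example, not part of the crw_extract API.

```rust
use std::collections::HashMap;

/// Naive tokenizer (an assumption for this sketch): lowercase,
/// split on any non-alphanumeric character.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every chunk against `query` with Okapi BM25 and return
/// (chunk index, score) pairs sorted from most to least relevant.
fn bm25_rank(chunks: &[&str], query: &str) -> Vec<(usize, f64)> {
    const K1: f64 = 1.2; // term-frequency saturation
    const B: f64 = 0.75; // length normalization strength

    if chunks.is_empty() {
        return Vec::new();
    }
    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    let n = docs.len() as f64;
    let total_len: usize = docs.iter().map(|d| d.len()).sum();
    // Guard against division by zero when every chunk is empty.
    let avg_len = (total_len as f64 / n).max(1.0);

    // Document frequency: in how many chunks does each term occur?
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &docs {
        let mut seen: Vec<&str> = doc.iter().map(String::as_str).collect();
        seen.sort_unstable();
        seen.dedup();
        for term in seen {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let query_terms = tokenize(query);
    let mut scored: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .map(|(i, doc)| {
            let len = doc.len() as f64;
            let score: f64 = query_terms
                .iter()
                .map(|q| {
                    let tf = doc.iter().filter(|t| *t == q).count() as f64;
                    if tf == 0.0 {
                        return 0.0;
                    }
                    let n_q = *df.get(q.as_str()).unwrap_or(&0) as f64;
                    let idf = ((n - n_q + 0.5) / (n_q + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum();
            (i, score)
        })
        .collect();

    // Highest score first; the stable sort keeps input order for ties.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Calling `bm25_rank(&chunks, "css selectors")` returns the index of each chunk paired with its score, so a caller can keep the top few chunks and drop the rest. Chunks that share no term with the query score exactly zero.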

Modules§

answer
Multi-source LLM answer synthesis for /v1/search.
antibot
Anti-bot detection — port of crawl4ai’s antibot_detector.py.
chunking
clean
dom_features
DOM-side features fed into the markdown quality scorer.
dom_util
Lightweight DOM walking helpers shared by the listing-detection gate.
filter
judge
LLM meaningful-change judge for change-tracking / monitors.
llm
LLM provider dispatch (Anthropic, OpenAI, OpenAI-compatible, Azure).
markdown
pdf
PDF → markdown adapter over the pure-Rust pdf_inspector crate.
plaintext
pricing
Best-effort LLM pricing table for cost estimation.
quality
Markdown quality scoring used to drive escalation logic.
readability
selector
structured
summary
Single-page summarization via LLM.
tables
Data-table vs. layout-table classifier.

Structs§

DebugCollector
Per-request collector for extraction debug traces. Wired in through ExtractOptions::debug_sink; the extractor pushes one DebugAttempt per extract() invocation, capturing the candidate ladder and the chosen output. Wrapped in an Arc<Mutex<_>> so the renderer / multi-attempt loop in crw-crawl can share a single sink across the JS-escalation retry.
ExtractOptions
Options for the high-level extraction pipeline.
LlmFallbackParams
Parameters for the LLM-assisted extraction fallback. See LlmFallbackConfig.
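The DebugCollector description above mentions an Arc<Mutex<_>> wrapper so that several attempts can share one sink. A minimal sketch of that sharing pattern follows. The types `Collector` and `Attempt`, their fields, and the helper `extract_once` are hypothetical stand-ins invented for this example; the real DebugCollector and DebugAttempt definitions are not shown on this page and may differ.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for one recorded extraction attempt.
#[derive(Debug, Clone, PartialEq)]
struct Attempt {
    renderer: String,
    candidates: Vec<String>,
    chosen: Option<String>,
}

/// Hypothetical shared sink. Cloning it clones the `Arc` handle,
/// so every clone pushes into the same underlying vector.
#[derive(Debug, Clone, Default)]
struct Collector {
    attempts: Arc<Mutex<Vec<Attempt>>>,
}

impl Collector {
    /// Record one attempt. A poisoned lock is recovered rather than
    /// propagated, since losing debug traces should not fail a request.
    fn push(&self, attempt: Attempt) {
        let mut guard = self.attempts.lock().unwrap_or_else(|e| e.into_inner());
        guard.push(attempt);
    }

    /// Copy out everything recorded so far.
    fn snapshot(&self) -> Vec<Attempt> {
        self.attempts
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .clone()
    }
}

/// Simulates one extraction call that records exactly one attempt.
fn extract_once(sink: &Collector, renderer: &str) {
    sink.push(Attempt {
        renderer: renderer.to_string(),
        candidates: vec!["readability".to_string(), "full-body".to_string()],
        chosen: Some("readability".to_string()),
    });
}
```

In this sketch a first pass and a retry can each hold their own clone of the collector, and a later `snapshot()` on any clone sees both attempts in the order they were pushed.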

Functions§

debug_candidate
Convenience: lift a single candidate description into a DebugCandidate.
extract
High-level extraction: given raw HTML + options, produce ScrapeData.
maybe_run_llm_fallback
Re-extract via the configured LLM provider when the current markdown scores below params.quality_threshold. If the LLM result has a higher quality score, it replaces data.markdown in place and a warning is appended noting the swap. On any failure (network, auth, parse) the original markdown is preserved and the error is logged.
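The decision logic described for maybe_run_llm_fallback can be sketched in a few lines. This is an illustration under stated assumptions, not the crate's code: `Data`, `maybe_fallback`, the synchronous closures standing in for the quality scorer and the LLM call, and the warning text are all hypothetical. The real function presumably takes the crate's own types and may be async.

```rust
/// Hypothetical stand-in for the scrape result; only the fields
/// this sketch needs.
#[derive(Debug, Default)]
struct Data {
    markdown: String,
    warnings: Vec<String>,
}

/// Returns true when the LLM output replaced the original markdown.
///
/// `score` stands in for the quality scorer and `llm` for the provider
/// call; both are assumptions made for the example.
fn maybe_fallback<S, L>(data: &mut Data, threshold: f64, score: S, llm: L) -> bool
where
    S: Fn(&str) -> f64,
    L: FnOnce() -> Result<String, String>,
{
    let current = score(&data.markdown);
    if current >= threshold {
        // Quality is already acceptable: skip the LLM call entirely.
        return false;
    }
    match llm() {
        Ok(candidate) => {
            let candidate_score = score(&candidate);
            if candidate_score > current {
                // Swap in place and leave a note for the caller.
                data.markdown = candidate;
                data.warnings.push(format!(
                    "markdown replaced by LLM fallback (quality {current:.2} -> {candidate_score:.2})"
                ));
                true
            } else {
                false
            }
        }
        Err(err) => {
            // Network, auth, or parse failure: keep the original and log.
            eprintln!("LLM fallback failed, keeping original markdown: {err}");
            false
        }
    }
}
```

Passing the scorer and the LLM call as closures keeps the sketch testable without a network: a toy scorer and a closure returning `Ok` or `Err` are enough to exercise all three paths (skip, swap, and preserve on failure).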