HTML content extraction and format conversion for the CRW web scraper.
Converts raw HTML into clean, structured output formats:
- Markdown: via markdown::html_to_markdown (htmd)
- Plain text: via plaintext::html_to_plaintext
- Cleaned HTML: boilerplate removal with clean::clean_html
- Readability: main-content extraction with text-density scoring
- CSS/XPath selector: narrow content to a specific element
- Chunking: split content into sentence/topic/regex chunks
- Filtering: BM25 or cosine-similarity ranking of chunks
- Structured JSON: LLM-based extraction with JSON Schema validation
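To make the filtering step concrete, here is a minimal std-only sketch of BM25 ranking over already-split chunks. It is not the crate's filter module: the names tokenize and bm25_rank, the tokenizer, and the constants k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical tokenizer: lowercase, split on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Scores each chunk against the query with Okapi BM25 and returns
/// (chunk index, score) pairs sorted by descending score.
fn bm25_rank(chunks: &[&str], query: &str) -> Vec<(usize, f64)> {
    const K1: f64 = 1.2; // term-frequency saturation (assumed value)
    const B: f64 = 0.75; // length normalization (assumed value)

    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    if docs.is_empty() {
        return Vec::new();
    }
    let n = docs.len() as f64;
    let total_len: usize = docs.iter().map(Vec::len).sum();
    // Clamp so that a set of empty chunks cannot cause a division by zero.
    let avg_len = (total_len as f64 / n).max(1.0);

    // Document frequency: the number of chunks that contain each term.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in &docs {
        let unique: HashSet<&str> = doc.iter().map(String::as_str).collect();
        for term in unique {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    let query_terms = tokenize(query);
    let mut scored: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .map(|(i, doc)| {
            let len = doc.len() as f64;
            let score: f64 = query_terms
                .iter()
                .map(|q| {
                    let tf = doc.iter().filter(|t| *t == q).count() as f64;
                    let n_q = df.get(q.as_str()).copied().unwrap_or(0.0);
                    let idf = ((n - n_q + 0.5) / (n_q + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum();
            (i, score)
        })
        .collect();

    // Stable sort: chunks with equal scores keep their original order.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

A caller would keep the top few indices and drop chunks whose score is zero, since those share no term with the query.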
Modules
- answer: Multi-source LLM answer synthesis for /v1/search.
- antibot: Anti-bot detection, a port of crawl4ai’s antibot_detector.py.
- chunking
- clean
- dom_features: DOM-side features fed into the markdown quality scorer.
- dom_util: Lightweight DOM walking helpers shared by the listing-detection gate.
- filter
- judge: LLM meaningful-change judge for change-tracking / monitors.
- llm: LLM provider dispatch (Anthropic, OpenAI, OpenAI-compatible, Azure).
- markdown: PDF → markdown adapter over the pure-Rust pdf_inspector crate.
- plaintext
- pricing: Best-effort LLM pricing table for cost estimation.
- quality: Markdown quality scoring used to drive escalation logic.
- readability
- selector
- structured
- summary: Single-page summarization via LLM.
- tables: Data-table vs. layout-table classifier.
Structs
- DebugCollector: Per-request collector for extraction debug traces. Wired in through ExtractOptions::debug_sink; the extractor pushes one DebugAttempt per extract() invocation, capturing the candidate ladder and the chosen output. Wrapped in an Arc<Mutex<_>> so the renderer / multi-attempt loop in crw-crawl can share a single sink across the JS-escalation retry.
- ExtractOptions: Options for the high-level extraction pipeline.
- LlmFallbackParams: Parameters for the LLM-assisted extraction fallback. See LlmFallbackConfig.
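The shared-sink pattern described for DebugCollector can be sketched with the standard library alone. The types below (AttemptSketch, CollectorSketch) and their fields are hypothetical stand-ins, not the crate's actual definitions. They only illustrate how an Arc<Mutex<_>> handle lets a first attempt and a later retry append to the same trace.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for one recorded extraction attempt.
#[derive(Debug, Clone, PartialEq)]
struct AttemptSketch {
    candidates: Vec<String>, // candidate ladder, in the order tried
    chosen: String,          // label of the output that was selected
}

/// Hypothetical stand-in for the collector: a cheap, cloneable handle
/// to one shared list of attempts.
#[derive(Clone, Default)]
struct CollectorSketch {
    attempts: Arc<Mutex<Vec<AttemptSketch>>>,
}

impl CollectorSketch {
    /// Appends one attempt; recovers the data even if the lock was poisoned.
    fn push(&self, attempt: AttemptSketch) {
        let mut guard = self.attempts.lock().unwrap_or_else(|e| e.into_inner());
        guard.push(attempt);
    }

    /// Returns a copy of everything recorded so far.
    fn snapshot(&self) -> Vec<AttemptSketch> {
        self.attempts
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .clone()
    }
}

/// Simulates one extraction call that records exactly one attempt.
fn extract_sketch(sink: &CollectorSketch, js_rendered: bool) {
    let chosen = if js_rendered { "rendered-dom" } else { "static-html" };
    sink.push(AttemptSketch {
        candidates: vec!["readability".to_string(), "full-body".to_string()],
        chosen: chosen.to_string(),
    });
}

/// Runs a first attempt, then a retry on another thread, both writing
/// to the same sink through cloned handles.
fn run_with_retry() -> Vec<AttemptSketch> {
    let sink = CollectorSketch::default();
    extract_sketch(&sink, false);

    let retry_sink = sink.clone();
    thread::spawn(move || extract_sketch(&retry_sink, true))
        .join()
        .expect("retry thread panicked");

    sink.snapshot()
}
```

Cloning the handle copies only the Arc pointer, so every clone observes the same underlying Vec.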
Functions
- debug_candidate: Convenience helper that lifts a single candidate description into a DebugCandidate.
- extract: High-level extraction. Given raw HTML and options, produces ScrapeData.
- maybe_run_llm_fallback: Re-extract via the configured LLM provider when the current markdown scores below params.quality_threshold. If the LLM result has a higher quality score, it replaces data.markdown in place and a warning is appended noting the swap. On any failure (network, auth, parse) the original markdown is preserved and the error is logged.
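The decision logic described for maybe_run_llm_fallback can be sketched as follows. This is a synchronous, std-only illustration under stated assumptions, not the crate's implementation: PageSketch, toy_score, and maybe_fallback_sketch are hypothetical names, the scorer is a placeholder, and the LLM call is modeled as a closure returning a Result.

```rust
/// Hypothetical stand-in for the scraped page data.
#[derive(Debug)]
struct PageSketch {
    markdown: String,
    warnings: Vec<String>,
}

/// Placeholder quality score in [0, 1] based only on word count.
/// The real scorer is not shown in the source; this is an assumption.
fn toy_score(markdown: &str) -> f64 {
    (markdown.split_whitespace().count() as f64 / 10.0).min(1.0)
}

/// Re-extracts only when the current markdown scores below the threshold,
/// swaps only when the candidate scores strictly higher, and leaves the
/// original untouched on any error.
fn maybe_fallback_sketch<S, L>(
    data: &mut PageSketch,
    quality_threshold: f64,
    score: S,
    llm_extract: L,
) where
    S: Fn(&str) -> f64,
    L: FnOnce() -> Result<String, String>,
{
    let current = score(&data.markdown);
    if current >= quality_threshold {
        return; // good enough, no LLM call is made
    }

    match llm_extract() {
        Ok(candidate) => {
            let candidate_score = score(&candidate);
            if candidate_score > current {
                data.markdown = candidate;
                data.warnings.push(format!(
                    "markdown replaced by LLM fallback (score {current:.2} -> {candidate_score:.2})"
                ));
            }
        }
        // Network, auth, or parse failure: keep the original and log.
        Err(err) => eprintln!("LLM fallback failed, keeping original markdown: {err}"),
    }
}
```

Passing the scorer and the LLM call as closures keeps the sketch testable without a network. The real function presumably reads both from its params and provider configuration.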