Crate crw_extract

HTML content extraction and format conversion for the CRW web scraper.

Converts raw HTML into clean, structured output formats:

  • Markdown — via markdown::html_to_markdown (htmd)
  • Plain text — via plaintext::html_to_plaintext
  • Cleaned HTML — boilerplate removal with clean::clean_html
  • Readability — main-content extraction with text-density scoring
  • CSS/XPath selector — narrow content to a specific element
  • Chunking — split content into sentence/topic/regex chunks
  • Filtering — BM25 or cosine-similarity ranking of chunks
  • Structured JSON — LLM-based extraction with JSON Schema validation
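
The chunking and filtering steps listed above can be illustrated with a small, standard-library-only sketch. It is not this crate's implementation or API: the function names `split_sentences`, `tokenize`, `bm25_scores`, and `rank_chunks` are hypothetical, the sentence splitter is a naive punctuation rule, and the BM25 parameters (k1 = 1.2, b = 0.75) are common defaults assumed here for illustration.

```rust
/// Hypothetical helper: split text into sentence chunks at '.', '!' or '?'
/// when the mark is followed by whitespace or the end of input.
fn split_sentences(text: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        current.push(c);
        let at_boundary = matches!(c, '.' | '!' | '?')
            && chars.peek().map_or(true, |next| next.is_whitespace());
        if at_boundary {
            let trimmed = current.trim();
            if !trimmed.is_empty() {
                chunks.push(trimmed.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that did not end with sentence punctuation.
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        chunks.push(trimmed.to_string());
    }
    chunks
}

/// Hypothetical helper: lowercase and split on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score each chunk against the query with Okapi BM25.
/// K1 and B are assumed defaults, not values taken from the crate.
fn bm25_scores(chunks: &[String], query: &str) -> Vec<f64> {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;
    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    if docs.is_empty() {
        return Vec::new();
    }
    let n = docs.len() as f64;
    let avg_len = docs.iter().map(|d| d.len()).sum::<usize>() as f64 / n;
    if avg_len == 0.0 {
        // No tokens anywhere: every chunk scores zero.
        return vec![0.0; docs.len()];
    }
    let query_terms = tokenize(query);
    // Inverse document frequency per query term (the "+ 1" keeps it non-negative).
    let idf: Vec<f64> = query_terms
        .iter()
        .map(|term| {
            let n_t = docs.iter().filter(|d| d.contains(term)).count() as f64;
            ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln()
        })
        .collect();
    docs.iter()
        .map(|doc| {
            let dl = doc.len() as f64;
            query_terms
                .iter()
                .zip(&idf)
                .map(|(term, idf_t)| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    idf_t * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * dl / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}

/// Chunk the text, score every chunk, and return them best-first.
fn rank_chunks(text: &str, query: &str) -> Vec<(String, f64)> {
    let chunks = split_sentences(text);
    let scores = bm25_scores(&chunks, query);
    let mut ranked: Vec<(String, f64)> = chunks.into_iter().zip(scores).collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}
```

Calling `rank_chunks(text, "markdown headings")` on a short passage returns every sentence paired with its score, with the sentences that mention the query terms first. A chunk that contains none of the query terms scores exactly zero, so a caller could drop those or keep only the top few.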

Modules

chunking
clean
filter
markdown
pdf
PDF content extraction via pdf-inspector (lopdf-based).
plaintext
readability
selector
structured

Structs

ExtractOptions
Options for the high-level extraction pipeline.

Functions

extract
High-level extraction: given raw HTML and options, produces ScrapeData.