HTML content extraction and format conversion for the CRW web scraper.
Converts raw HTML into clean, structured output formats:
- Markdown — via markdown::html_to_markdown (htmd)
- Plain text — via plaintext::html_to_plaintext
- Cleaned HTML — boilerplate removal with clean::clean_html
- Readability — main-content extraction with text-density scoring
- CSS/XPath selector — narrow content to a specific element
- Chunking — split content into sentence/topic/regex chunks
- Filtering — BM25 or cosine-similarity ranking of chunks
- Structured JSON — LLM-based extraction with JSON Schema validation
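The chunking and filtering steps in the list above can be illustrated with a small, standard-library-only sketch. It is not the crate's implementation: the function names `sentence_chunks`, `tokenize`, and `bm25_rank` are hypothetical, the sentence splitter is deliberately naive, and the BM25 parameters (`K1 = 1.2`, `B = 0.75`) are common defaults assumed for illustration.

```rust
use std::collections::HashMap;

/// Lowercase the text and split it into alphanumeric tokens.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Naive sentence chunking (hypothetical helper): cut after '.', '!' or '?'.
fn sentence_chunks(text: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            let trimmed = current.trim();
            if !trimmed.is_empty() {
                chunks.push(trimmed.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that did not end with punctuation.
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        chunks.push(trimmed.to_string());
    }
    chunks
}

/// Score each chunk against the query with Okapi BM25 and return
/// (chunk index, score) pairs sorted by descending score.
fn bm25_rank(chunks: &[String], query: &str) -> Vec<(usize, f64)> {
    const K1: f64 = 1.2; // term-frequency saturation (assumed default)
    const B: f64 = 0.75; // length normalization strength (assumed default)

    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    if docs.is_empty() {
        return Vec::new();
    }
    let n = docs.len() as f64;
    let avg_len = docs.iter().map(|d| d.len()).sum::<usize>() as f64 / n;

    // Document frequency: number of chunks containing each term.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &docs {
        let mut seen: Vec<&str> = doc.iter().map(|s| s.as_str()).collect();
        seen.sort_unstable();
        seen.dedup();
        for term in seen {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let query_terms = tokenize(query);
    let mut scored: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .map(|(i, doc)| {
            let len = doc.len() as f64;
            let score = query_terms
                .iter()
                .map(|q| {
                    let tf = doc.iter().filter(|t| *t == q).count() as f64;
                    if tf == 0.0 {
                        return 0.0;
                    }
                    let n_q = *df.get(q.as_str()).unwrap_or(&0) as f64;
                    // Smoothed IDF variant that stays non-negative.
                    let idf = ((n - n_q + 0.5) / (n_q + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f64>();
            (i, score)
        })
        .collect();

    // Stable sort: chunks with equal scores keep their original order.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Calling `sentence_chunks` on extracted text and passing the result to `bm25_rank` with a query yields chunk indices ordered by relevance; chunks that share no term with the query score zero and sort last. A cosine-similarity filter would replace only the scoring closure and leave the chunking step unchanged.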
Modules
- chunking
- clean
- filter
- markdown
- PDF content extraction via pdf-inspector (lopdf-based).
- plaintext
- readability
- selector
- structured
Structs
- ExtractOptions
- Options for the high-level extraction pipeline.
Functions
- extract
- High-level extraction: given raw HTML + options, produce ScrapeData.