HTML content extraction and format conversion for the CRW web scraper.
Converts raw HTML into clean, structured output formats:
- Markdown — via [markdown::html_to_markdown] (htmd)
- Plain text — via [plaintext::html_to_plaintext]
- Cleaned HTML — boilerplate removal with [clean::clean_html]
- Readability — main-content extraction with text-density scoring
- CSS/XPath selector — narrow content to a specific element
- Chunking — split content into sentence/topic/regex chunks
- Filtering — BM25 or cosine-similarity ranking of chunks
- Structured JSON — LLM-based extraction with JSON Schema validation
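
The chunking and filtering steps listed above can be illustrated with a small, dependency-free sketch. This is not CRW's implementation: the function names `tokenize`, `chunk_sentences`, `bm25_scores`, and `top_chunks` are hypothetical, the sentence splitter is deliberately naive, and the scoring assumes the common Okapi BM25 form with `k1 = 1.2`, `b = 0.75`, and a non-negative IDF variant.

```rust
/// Lowercase the text and split it on non-alphanumeric characters.
/// (Hypothetical helper, not part of the crate's documented API.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Naive sentence chunker: cuts after every '.', '!' or '?'.
fn chunk_sentences(text: &str) -> Vec<String> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect()
}

/// Okapi BM25 score of every chunk against the query.
/// K1 and B are assumed defaults, not values taken from CRW.
fn bm25_scores(chunks: &[String], query: &str) -> Vec<f64> {
    const K1: f64 = 1.2;
    const B: f64 = 0.75;

    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    if docs.is_empty() {
        return Vec::new();
    }
    let n = docs.len() as f64;
    let avg_len = docs.iter().map(|d| d.len()).sum::<usize>() as f64 / n;
    if avg_len == 0.0 {
        // No chunk contains any token, so nothing can match.
        return vec![0.0; docs.len()];
    }

    let query_terms = tokenize(query);
    // Inverse document frequency per query term: ln(1 + (N - n_t + 0.5) / (n_t + 0.5)).
    let idf: Vec<f64> = query_terms
        .iter()
        .map(|term| {
            let n_t = docs.iter().filter(|d| d.contains(term)).count() as f64;
            ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln()
        })
        .collect();

    docs.iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .zip(&idf)
                .map(|(term, idf)| {
                    // Term frequency of this query term in the chunk.
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}

/// Return the `k` best (chunk index, score) pairs, highest score first.
fn top_chunks(chunks: &[String], query: &str, k: usize) -> Vec<(usize, f64)> {
    let mut ranked: Vec<(usize, f64)> =
        bm25_scores(chunks, query).into_iter().enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);
    ranked
}
```

A caller would first split the extracted text with `chunk_sentences`, then pass the chunks and a query to `top_chunks` to keep only the most relevant ones. Chunks that share no term with the query score zero, so a threshold on the score could be used instead of a fixed `k`.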