crw-extract 0.0.3

HTML extraction and markdown conversion engine for the CRW web scraper

HTML content extraction and format conversion for the CRW web scraper.

Converts raw HTML into clean, structured output formats:

  • Markdown — via [markdown::html_to_markdown] (htmd)
  • Plain text — via [plaintext::html_to_plaintext]
  • Cleaned HTML — boilerplate removal with [clean::clean_html]
  • Readability — main-content extraction with text-density scoring
  • CSS/XPath selector — narrow content to a specific element
  • Chunking — split content into sentence/topic/regex chunks
  • Filtering — BM25 or cosine-similarity ranking of chunks
  • Structured JSON — LLM-based extraction with JSON Schema validation
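
To make the chunking and filtering steps above concrete, here is a minimal, standard-library-only sketch of sentence chunking followed by BM25 ranking. It is not taken from crw-extract: the function names (`tokenize`, `sentence_chunks`, `bm25_scores`, `top_chunks`) and the parameter values `K1 = 1.2` and `B = 0.75` are assumptions chosen for illustration, and the crate's own chunker and ranker may differ in tokenization, scoring details, and API.

```rust
/// Lowercase a string and split it into alphanumeric tokens.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Naive sentence chunker (illustrative): split after '.', '!' or '?'.
fn sentence_chunks(text: &str) -> Vec<String> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect()
}

/// Okapi BM25 score of every chunk against the query.
/// Returns one score per chunk, in the same order as `chunks`.
fn bm25_scores(chunks: &[String], query: &str) -> Vec<f64> {
    const K1: f64 = 1.2; // term-frequency saturation (assumed default)
    const B: f64 = 0.75; // length normalization strength (assumed default)

    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    if docs.is_empty() {
        return Vec::new();
    }
    let n = docs.len() as f64;
    let avg_len = docs.iter().map(|d| d.len()).sum::<usize>() as f64 / n;
    if avg_len == 0.0 {
        // No chunk contains any token, so nothing can match.
        return vec![0.0; docs.len()];
    }

    // Deduplicate query terms so a repeated term is not counted twice.
    let mut terms = tokenize(query);
    terms.sort();
    terms.dedup();

    // Inverse document frequency per query term (non-negative variant).
    let idf: Vec<f64> = terms
        .iter()
        .map(|term| {
            let n_t = docs.iter().filter(|d| d.iter().any(|t| t == term)).count() as f64;
            ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln()
        })
        .collect();

    docs.iter()
        .map(|doc| {
            let len = doc.len() as f64;
            terms
                .iter()
                .zip(&idf)
                .map(|(term, idf)| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}

/// Chunk `text` by sentence and return the `k` best-scoring chunks.
fn top_chunks(text: &str, query: &str, k: usize) -> Vec<String> {
    let chunks = sentence_chunks(text);
    let scores = bm25_scores(&chunks, query);
    let mut ranked: Vec<(f64, String)> = scores.into_iter().zip(chunks).collect();
    // Highest score first; the sort is stable, so ties keep document order.
    ranked.sort_by(|a, b| b.0.total_cmp(&a.0));
    ranked.into_iter().take(k).map(|(_, c)| c).collect()
}
```

Calling `top_chunks(text, "rust memory", 2)` on plain text, for example the output of a plain-text conversion step, would keep the two sentences most relevant to the query. A cosine-similarity filter would follow the same shape, with the scoring function swapped for a vector comparison.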