crw-extract 0.0.3

HTML extraction and markdown conversion engine for the CRW web scraper

HTML content extraction and format conversion for the CRW web scraper.

Converts raw HTML into clean, structured output formats:

Markdown — via [markdown::html_to_markdown] (htmd)
Plain text — via [plaintext::html_to_plaintext]
Cleaned HTML — boilerplate removal with [clean::clean_html]
Readability — main-content extraction with text-density scoring
CSS/XPath selector — narrow content to a specific element
Chunking — split content into sentence/topic/regex chunks
Filtering — BM25 or cosine-similarity ranking of chunks
Structured JSON — LLM-based extraction with JSON Schema validation
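
The chunking and filtering steps listed above can be illustrated with a small, standard-library-only sketch. This is not the crate's API: the names `sentence_chunks`, `tokenize`, `bm25_scores`, and `top_chunks` are hypothetical, the sentence splitter is a naive punctuation rule, and the BM25 parameters (k1 = 1.2, b = 0.75) are common defaults assumed here rather than values taken from crw-extract.

```rust
// Illustrative sketch only: every function below is hypothetical and is
// not part of the crw-extract API.

/// Split text into sentence chunks on '.', '!' or '?' (a naive rule).
fn sentence_chunks(text: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            let trimmed = current.trim();
            if !trimmed.is_empty() {
                chunks.push(trimmed.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that lacks closing punctuation.
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        chunks.push(trimmed.to_string());
    }
    chunks
}

/// Lowercase the text and split it on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score each chunk against the query with Okapi BM25.
/// k1 and b are assumed defaults, not values read from the crate.
fn bm25_scores(chunks: &[String], query: &str) -> Vec<f64> {
    let (k1, b) = (1.2_f64, 0.75_f64);
    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    let n = docs.len() as f64;
    let total_len: usize = docs.iter().map(|d| d.len()).sum();
    if total_len == 0 {
        // No tokens at all: every chunk scores zero.
        return vec![0.0; docs.len()];
    }
    let avg_len = total_len as f64 / n;
    let query_terms = tokenize(query);

    // Number of chunks containing each query term.
    let doc_freq: Vec<f64> = query_terms
        .iter()
        .map(|term| docs.iter().filter(|d| d.contains(term)).count() as f64)
        .collect();

    docs.iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .zip(&doc_freq)
                .map(|(term, &n_t)| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    let idf = ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
                    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}

/// Return the k highest-scoring chunks, best first.
fn top_chunks(chunks: &[String], query: &str, k: usize) -> Vec<(f64, String)> {
    let scores = bm25_scores(chunks, query);
    let mut ranked: Vec<(f64, String)> =
        scores.into_iter().zip(chunks.iter().cloned()).collect();
    ranked.sort_by(|a, b| b.0.total_cmp(&a.0));
    ranked.truncate(k);
    ranked
}
```

In this sketch, calling `sentence_chunks` on extracted plain text and passing the result to `top_chunks` with a query keeps only the chunks most relevant to that query. A cosine-similarity filter would replace `bm25_scores` with a vector comparison while leaving the chunking and ranking steps unchanged.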