# crw-extract
HTML content extraction and format conversion engine for the CRW web scraper.
## Overview
crw-extract converts raw HTML into clean, structured output formats for LLM consumption, RAG pipelines, and data extraction.
- Markdown — High-fidelity HTML→Markdown via `htmd` (a Turndown.js port): tables, code blocks, nested lists. Indented code blocks are post-processed into fenced (```) blocks for better LLM compatibility
- Plain text — Tag-stripped, whitespace-normalized text
- Cleaned HTML — Boilerplate removal (scripts, styles, nav, footer, ads)
- Readability — Main-content extraction with text-density scoring and multi-selector fallback
- CSS selector & XPath — Narrow content to specific DOM elements before conversion
- Chunking — Split content into sentence, topic (heading-based), or regex-delimited chunks
- BM25 & cosine filtering — Rank chunks by relevance to a query, return top-K results
- Structured JSON — LLM-based extraction with JSON Schema validation (Anthropic tool_use + OpenAI function calling)
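For context on the BM25 ranking used by the chunk filter, here is a minimal standalone sketch of BM25 scoring. This is not the crate's internal implementation; the function name and parameters are illustrative:

```rust
// Minimal BM25 scoring sketch: scores each document's relevance to a query.
// k1 controls term-frequency saturation; b controls length normalization.
fn bm25_scores(docs: &[&str], query: &str, k1: f64, b: f64) -> Vec<f64> {
    let tokenized: Vec<Vec<String>> = docs
        .iter()
        .map(|d| d.to_lowercase().split_whitespace().map(String::from).collect())
        .collect();
    let n = docs.len() as f64;
    let avgdl = tokenized.iter().map(|t| t.len()).sum::<usize>() as f64 / n;
    let terms: Vec<String> = query
        .to_lowercase()
        .split_whitespace()
        .map(String::from)
        .collect();
    tokenized
        .iter()
        .map(|doc| {
            terms
                .iter()
                .map(|t| {
                    let tf = doc.iter().filter(|w| *w == t).count() as f64;
                    let df = tokenized.iter().filter(|d| d.contains(t)).count() as f64;
                    // IDF with +1 inside the log to keep scores non-negative
                    let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
                    idf * (tf * (k1 + 1.0))
                        / (tf + k1 * (1.0 - b + b * doc.len() as f64 / avgdl))
                })
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    let docs = [
        "rust is a systems programming language",
        "cats sleep most of the day",
    ];
    let scores = bm25_scores(&docs, "rust language", 1.2, 0.75);
    assert!(scores[0] > scores[1]); // the Rust doc matches the query better
    println!("{scores:?}");
}
```

Cosine filtering works the same way at the pipeline level, but scores chunks by the cosine of the angle between term-frequency vectors instead of BM25.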
## Installation
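Assuming the crate is published under its workspace name, add it to your `Cargo.toml` (the version shown is illustrative):

```toml
[dependencies]
crw-extract = "0.1"
```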
## Usage
### High-level extraction pipeline

The `extract()` function runs the full pipeline: clean → select → readability → convert → chunk → filter.
```rust
use crw_extract::{extract, OutputFormat};

let html = r#"<html><body><article><h1>Hello</h1><p>World</p></article></body></html>"#;
let result = extract(html, OutputFormat::Markdown).unwrap();
println!("{}", result);
// # Hello
//
// World
```
### HTML to Markdown
```rust
use crw_extract::html_to_markdown;

let md = html_to_markdown("<h1>Title</h1><p>Some <strong>bold</strong> text</p>");
assert!(md.contains("# Title"));
assert!(md.contains("**bold**"));
```
### HTML to plain text
```rust
use crw_extract::html_to_plaintext;

let text = html_to_plaintext("<p>Hello <b>world</b></p>");
assert_eq!(text, "Hello world");
```
### HTML cleaning

Remove boilerplate elements (scripts, styles, nav, footer, ads):
```rust
use crw_extract::{clean_html, CleanOptions};

let html = r#"<html><body><nav>Menu</nav><article><p>Content</p></article><footer>Footer</footer></body></html>"#;
let cleaned = clean_html(html, &CleanOptions::default()).unwrap();
// nav and footer are stripped, article content is preserved
assert!(cleaned.contains("Content"));
```
Filter by tag inclusion/exclusion:
```rust
use crw_extract::{clean_html, CleanOptions};

let html = "<div><p>Keep this</p><span>Remove this</span></div>";
let opts = CleanOptions {
    exclude_tags: vec!["span".to_string()],
    ..Default::default()
};
let result = clean_html(html, &opts).unwrap();
assert!(result.contains("Keep this"));
assert!(!result.contains("Remove this"));
```
### CSS selector extraction
```rust
use crw_extract::extract_by_css;

let html = r#"<div><article class="post"><p>Target content</p></article><aside>Sidebar</aside></div>"#;
let result = extract_by_css(html, "article.post").unwrap();
assert!(result.contains("Target content"));
```
### XPath extraction
```rust
use crw_extract::extract_by_xpath;

let html = "<html><body><h1>Title</h1><p>Text</p></body></html>";
let result = extract_by_xpath(html, "//h1").unwrap();
assert!(result.contains("Title"));
```
### Chunking

Split content into chunks for RAG pipelines:
```rust
use crw_extract::{chunk_text, ChunkStrategy};

let text = "# Introduction\nFirst section.\n# Methods\nSecond section.";
let strategy = ChunkStrategy::Topic;
let chunks = chunk_text(text, &strategy);
assert_eq!(chunks.len(), 2);
```
### Chunk filtering

Rank chunks by relevance using BM25 or cosine similarity:
```rust
use crw_extract::{filter_chunks, FilterMode};

let chunks = vec![
    "Rust is a systems programming language".to_string(),
    "Cats sleep most of the day".to_string(),
    "Rust guarantees memory safety".to_string(),
];
let top = filter_chunks(&chunks, "Rust", FilterMode::Bm25, 2);
assert_eq!(top.len(), 2);
// Chunks mentioning "Rust" are ranked higher
```
### Metadata extraction

Extract title, description, Open Graph metadata, and links:
```rust
use crw_extract::{extract_metadata, extract_links};

let html = r#"<html><head><title>My Page</title><meta name="description" content="A page"></head><body><a href="/about">About</a></body></html>"#;
let meta = extract_metadata(html);
assert_eq!(meta.title.as_deref(), Some("My Page"));
let links = extract_links(html);
assert!(links.iter().any(|l| l.contains("/about")));
```
## Part of CRW
This crate is part of the CRW workspace — a fast, lightweight, Firecrawl-compatible web scraper built in Rust.
| Crate | Description |
|---|---|
| crw-core | Core types, config, and error handling |
| crw-renderer | HTTP + CDP browser rendering engine |
| crw-extract | HTML → markdown/plaintext extraction (this crate) |
| crw-crawl | Async BFS crawler with robots.txt & sitemap |
| crw-server | Firecrawl-compatible API server |
| crw-cli | Standalone CLI (crw binary) |
| crw-mcp | MCP stdio proxy binary |
## License
AGPL-3.0 — see LICENSE.