crw-extract

HTML content extraction and format conversion engine for the CRW web scraper.

Overview

crw-extract converts raw HTML into clean, structured output formats for LLM consumption, RAG pipelines, and data extraction.

Markdown — High-fidelity HTML→Markdown via htmd (Turndown.js port): tables, code blocks, nested lists. Indented code blocks are post-processed into fenced (```) blocks for better LLM compatibility
Plain text — Tag-stripped, whitespace-normalized text
Cleaned HTML — Boilerplate removal (scripts, styles, nav, footer, ads)
Readability — Main-content extraction with text-density scoring and multi-selector fallback
CSS selector & XPath — Narrow content to specific DOM elements before conversion
Chunking — Split content into sentence, topic (heading-based), or regex-delimited chunks
BM25 & cosine filtering — Rank chunks by relevance to a query, return top-K results
Structured JSON — LLM-based extraction with JSON Schema validation (Anthropic tool_use + OpenAI function calling)

Installation

cargo add crw-extract

Usage

High-level extraction pipeline

The extract() function runs the full pipeline: clean → select → readability → convert → chunk → filter.

use crw_extract::{extract, ExtractOptions};
use crw_core::types::OutputFormat;

let html = r#"<html><body><article><h1>Hello</h1><p>World</p></article></body></html>"#;

let result = extract(ExtractOptions {
    raw_html: html,
    source_url: "https://example.com",
    status_code: 200,
    rendered_with: None,
    elapsed_ms: 42,
    formats: &[OutputFormat::Markdown, OutputFormat::Links],
    only_main_content: true,
    include_tags: &[],
    exclude_tags: &[],
    css_selector: None,
    xpath: None,
    chunk_strategy: None,
    query: None,
    filter_mode: None,
    top_k: None,
}).unwrap();

println!("{}", result.markdown.unwrap());
// # Hello
//
// World

HTML to Markdown

use crw_extract::markdown::html_to_markdown;

let md = html_to_markdown("<h1>Title</h1><p>Paragraph with <strong>bold</strong> text.</p>");
assert!(md.contains("# Title"));
assert!(md.contains("**bold**"));

HTML to plain text

use crw_extract::plaintext::html_to_plaintext;

let text = html_to_plaintext("<p>Hello <b>world</b></p>");
assert_eq!(text.trim(), "Hello world");

HTML cleaning

Remove boilerplate elements (scripts, styles, nav, footer, ads):

use crw_extract::clean::clean_html;

let html = r#"<html><body><nav>Menu</nav><article><p>Content</p></article><footer>Footer</footer></body></html>"#;
let cleaned = clean_html(html, true, &[], &[]).unwrap();
// nav and footer are stripped, article content is preserved

Filter by tag inclusion/exclusion:

use crw_extract::clean::clean_html;

let html = "<div><p>Keep this</p><span>Remove this</span></div>";
let result = clean_html(html, false, &["p".into()], &[]).unwrap();
assert!(result.contains("Keep this"));

CSS selector extraction

use crw_extract::selector::extract_by_css;

let html = r#"<div><article class="post"><p>Target content</p></article><aside>Sidebar</aside></div>"#;
let result = extract_by_css(html, "article.post").unwrap();
assert!(result.unwrap().contains("Target content"));

XPath extraction

use crw_extract::selector::extract_by_xpath;

let html = "<html><body><h1>Title</h1><p>Text</p></body></html>";
let result = extract_by_xpath(html, "//h1").unwrap();
assert_eq!(result.unwrap(), vec!["Title".to_string()]);

Chunking

Split content into chunks for RAG pipelines:

use crw_extract::chunking::chunk_text;
use crw_core::types::ChunkStrategy;

let text = "# Introduction\nFirst section.\n# Methods\nSecond section.";
let strategy = ChunkStrategy::Topic {
    max_chars: None,
    overlap_chars: None,
    dedupe: None,
};
let chunks = chunk_text(text, &strategy);
assert_eq!(chunks.len(), 2);

Chunk filtering

Rank chunks by relevance using BM25 or cosine similarity:

use crw_extract::filter::filter_chunks;
use crw_core::types::FilterMode;

let chunks = vec![
    "Rust is a systems programming language".to_string(),
    "The weather is sunny today".to_string(),
    "Rust provides memory safety without GC".to_string(),
];
let top = filter_chunks(&chunks, "Rust programming", &FilterMode::Bm25, 2);
assert_eq!(top.len(), 2);
// Chunks mentioning "Rust" are ranked higher

Metadata extraction

Extract title, description, Open Graph metadata, and links:

use crw_extract::readability::{extract_metadata, extract_links};

let html = r#"<html><head><title>My Page</title><meta name="description" content="A page"></head><body><a href="/about">About</a></body></html>"#;
let meta = extract_metadata(html);
assert_eq!(meta.title, Some("My Page".into()));

let links = extract_links(html, "https://example.com");
assert!(links.iter().any(|l| l.contains("/about")));

Part of CRW

This crate is part of the CRW workspace — a fast, lightweight, Firecrawl-compatible web scraper built in Rust.

Crate	Description
crw-core	Core types, config, and error handling
crw-renderer	HTTP + CDP browser rendering engine
crw-extract	HTML → markdown/plaintext extraction (this crate)
crw-crawl	Async BFS crawler with robots.txt & sitemap
crw-server	Firecrawl-compatible API server
crw-cli	Standalone CLI (`crw` binary)
crw-mcp	MCP stdio proxy binary

License

AGPL-3.0 — see LICENSE.

crw-extract 0.2.0