# crw-extract
HTML content extraction and format conversion engine for the [CRW](https://github.com/us/crw) web scraper.
[](https://crates.io/crates/crw-extract)
[](https://docs.rs/crw-extract)
[](https://github.com/us/crw/blob/main/LICENSE)
## Overview
`crw-extract` converts raw HTML into clean, structured output formats for LLM consumption, RAG pipelines, and data extraction.
- **Markdown** — High-fidelity HTML→Markdown via `htmd` (Turndown.js port): tables, code blocks, nested lists. Indented code blocks are post-processed into fenced (```) blocks for better LLM compatibility
- **Plain text** — Tag-stripped, whitespace-normalized text
- **Cleaned HTML** — Boilerplate removal (scripts, styles, nav, footer, ads)
- **Readability** — Main-content extraction with text-density scoring and multi-selector fallback
- **CSS selector & XPath** — Narrow content to specific DOM elements before conversion
- **Chunking** — Split content into sentence, topic (heading-based), or regex-delimited chunks
- **BM25 & cosine filtering** — Rank chunks by relevance to a query, return top-K results
- **Structured JSON** — LLM-based extraction with JSON Schema validation (Anthropic tool_use + OpenAI function calling)
## Installation
```bash
cargo add crw-extract
```
## Usage
### High-level extraction pipeline
The `extract()` function runs the full pipeline: clean → select → readability → convert → chunk → filter.
```rust,no_run
use crw_extract::{extract, ExtractOptions};
use crw_core::types::OutputFormat;
let html = r#"<html><body><article><h1>Hello</h1><p>World</p></article></body></html>"#;
let result = extract(ExtractOptions {
raw_html: html,
source_url: "https://example.com",
status_code: 200,
rendered_with: None,
elapsed_ms: 42,
formats: &[OutputFormat::Markdown, OutputFormat::Links],
only_main_content: true,
include_tags: &[],
exclude_tags: &[],
css_selector: None,
xpath: None,
chunk_strategy: None,
query: None,
filter_mode: None,
top_k: None,
}).unwrap();
println!("{}", result.markdown.unwrap());
// # Hello
//
// World
```
### HTML to Markdown
```rust
use crw_extract::markdown::html_to_markdown;
let md = html_to_markdown("<h1>Title</h1><p>Paragraph with <strong>bold</strong> text.</p>");
assert!(md.contains("# Title"));
assert!(md.contains("**bold**"));
```
### HTML to plain text
```rust
use crw_extract::plaintext::html_to_plaintext;
let text = html_to_plaintext("<p>Hello <b>world</b></p>");
assert_eq!(text.trim(), "Hello world");
```
### HTML cleaning
Remove boilerplate elements (scripts, styles, nav, footer, ads):
```rust
use crw_extract::clean::clean_html;
let html = r#"<html><body><nav>Menu</nav><article><p>Content</p></article><footer>Footer</footer></body></html>"#;
let cleaned = clean_html(html, true, &[], &[]).unwrap();
// nav and footer are stripped, article content is preserved
```
Filter by tag inclusion/exclusion:
```rust
use crw_extract::clean::clean_html;
let html = "<div><p>Keep this</p><span>Remove this</span></div>";
let result = clean_html(html, false, &["p".into()], &[]).unwrap();
assert!(result.contains("Keep this"));
```
### CSS selector extraction
```rust
use crw_extract::selector::extract_by_css;
let html = r#"<div><article class="post"><p>Target content</p></article><aside>Sidebar</aside></div>"#;
let result = extract_by_css(html, "article.post").unwrap();
assert!(result.unwrap().contains("Target content"));
```
### XPath extraction
```rust
use crw_extract::selector::extract_by_xpath;
let html = "<html><body><h1>Title</h1><p>Text</p></body></html>";
let result = extract_by_xpath(html, "//h1").unwrap();
assert_eq!(result.unwrap(), vec!["Title".to_string()]);
```
### Chunking
Split content into chunks for RAG pipelines:
```rust
use crw_extract::chunking::chunk_text;
use crw_core::types::ChunkStrategy;
let text = "# Introduction\nFirst section.\n# Methods\nSecond section.";
let strategy = ChunkStrategy::Topic {
max_chars: None,
overlap_chars: None,
dedupe: None,
};
let chunks = chunk_text(text, &strategy);
assert_eq!(chunks.len(), 2);
```
### Chunk filtering
Rank chunks by relevance using BM25 or cosine similarity:
```rust
use crw_extract::filter::filter_chunks;
use crw_core::types::FilterMode;
let chunks = vec![
"Rust is a systems programming language".to_string(),
"The weather is sunny today".to_string(),
"Rust provides memory safety without GC".to_string(),
];
let top = filter_chunks(&chunks, "Rust programming", &FilterMode::Bm25, 2);
assert_eq!(top.len(), 2);
// Chunks mentioning "Rust" are ranked higher
```
### Metadata extraction
Extract title, description, Open Graph metadata, and links:
```rust
use crw_extract::readability::{extract_metadata, extract_links};
let html = r#"<html><head><title>My Page</title><meta name="description" content="A page"></head><body><a href="/about">About</a></body></html>"#;
let meta = extract_metadata(html);
assert_eq!(meta.title, Some("My Page".into()));
let links = extract_links(html, "https://example.com");
## Part of CRW
This crate is part of the [CRW](https://github.com/us/crw) workspace — a fast, lightweight, Firecrawl-compatible web scraper built in Rust.
| [crw-core](https://crates.io/crates/crw-core) | Core types, config, and error handling |
| [crw-renderer](https://crates.io/crates/crw-renderer) | HTTP + CDP browser rendering engine |
| **crw-extract** | HTML → markdown/plaintext extraction (this crate) |
| [crw-crawl](https://crates.io/crates/crw-crawl) | Async BFS crawler with robots.txt & sitemap |
| [crw-server](https://crates.io/crates/crw-server) | Firecrawl-compatible API server |
| [crw-cli](https://crates.io/crates/crw-cli) | Standalone CLI (`crw` binary) |
| [crw-mcp](https://crates.io/crates/crw-mcp) | MCP stdio proxy binary |
## License
AGPL-3.0 — see [LICENSE](https://github.com/us/crw/blob/main/LICENSE).