§rs-trafilatura
A Rust port of trafilatura, a web content extraction library.
This library extracts clean, readable content from web pages by stripping navigation, advertisements, and boilerplate while preserving meaningful text, metadata, and document structure.
§Quick Start
use rs_trafilatura::{extract, Options};
let html = r#"<html><head><title>My Article</title></head>
<body><article><p>Main content here.</p></article></body></html>"#;
let result = extract(html)?;
println!("Title: {:?}", result.metadata.title);
println!("Content: {}", result.content_text);
§Features
- Content Extraction: Identifies and extracts the main article content
- Page Type Classification: XGBoost classifier detects 7 page types (article, forum, product, collection, listing, documentation, service)
- Per-Type Extraction Profiles: Type-specific boilerplate removal, content selectors, and extraction strategies
- Extraction Quality Predictor: ML confidence score (0.0 to 1.0) that predicts extraction F1, which enables hybrid pipelines that fall back to an LLM for low-confidence pages
- Metadata Extraction: Title, author, date, language, sitename, and more from JSON-LD, Open Graph, Dublin Core, and HTML meta tags
- Markdown Output: GitHub Flavored Markdown with headings, lists, tables, code blocks
- Configurable: Options to tune precision/recall tradeoff
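The hybrid pipeline mentioned under Extraction Quality Predictor can be sketched as a simple threshold check. This is a minimal illustration only: `ScoredExtraction`, `Route`, `route`, and the 0.7 threshold are hypothetical names and values chosen for the example, not part of this crate's API.

```rust
/// Hypothetical stand-in for an extraction result that carries a quality score.
/// The field names are assumptions, not the crate's actual types.
#[derive(Debug, Clone, PartialEq)]
struct ScoredExtraction {
    text: String,
    /// Predicted extraction quality in the range 0.0..=1.0.
    confidence: f64,
}

/// Where a page goes next in the pipeline.
#[derive(Debug, PartialEq)]
enum Route {
    /// Keep the rule-based extraction as-is.
    Accept(String),
    /// Hand the page to a slower fallback (for example, an LLM).
    Fallback,
}

/// Routes an extraction based on a caller-chosen confidence threshold.
fn route(extraction: &ScoredExtraction, threshold: f64) -> Route {
    if extraction.confidence >= threshold {
        Route::Accept(extraction.text.clone())
    } else {
        Route::Fallback
    }
}

fn main() {
    let pages = vec![
        ScoredExtraction { text: "Clean article body".to_string(), confidence: 0.93 },
        ScoredExtraction { text: "Nav | Login | Cart".to_string(), confidence: 0.41 },
    ];
    // 0.7 is an arbitrary illustrative threshold, not a recommended value.
    for page in &pages {
        match route(page, 0.7) {
            Route::Accept(text) => println!("accepted: {text}"),
            Route::Fallback => println!("low confidence, sending to fallback"),
        }
    }
}
```

In practice the threshold would be tuned against the cost of the fallback path: a higher value sends more pages to the slower extractor.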
§Accuracy
Achieves an F1 of 0.859 on a 1,497-page multi-type benchmark (WCEB), outperforming Trafilatura (0.792) and the neural approaches MinerU-HTML (0.827) and ReaderLM-v2 (0.741). An F1 of 0.893 on a 511-page held-out test set indicates that the result generalizes.
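For readers unfamiliar with the metric, the sketch below shows one common way to compute F1 for content extraction: the harmonic mean of precision and recall over whitespace-separated tokens. The exact tokenization and matching rules used by the benchmark are not stated here, so `token_counts` and `f1_score` are hypothetical helpers and may differ from the crate's `scoring` module.

```rust
use std::collections::HashMap;

/// Counts how many times each whitespace-separated token occurs.
fn token_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}

/// Token-level F1 between extracted text and a reference text.
/// Assumption: bag-of-words overlap; a real benchmark may tokenize differently.
fn f1_score(extracted: &str, reference: &str) -> f64 {
    let ext = token_counts(extracted);
    let refc = token_counts(reference);

    // Shared tokens, counted with multiplicity.
    let overlap: usize = ext
        .iter()
        .map(|(tok, &n)| n.min(refc.get(tok).copied().unwrap_or(0)))
        .sum();
    let ext_total: usize = ext.values().sum();
    let ref_total: usize = refc.values().sum();

    if overlap == 0 || ext_total == 0 || ref_total == 0 {
        return 0.0;
    }

    let precision = overlap as f64 / ext_total as f64;
    let recall = overlap as f64 / ref_total as f64;
    2.0 * precision * recall / (precision + recall)
}

fn main() {
    // The extraction kept two boilerplate tokens: precision 3/5, recall 3/3.
    let score = f1_score("Main content here. Subscribe now", "Main content here.");
    println!("F1 = {score:.3}");
}
```

Leftover boilerplate lowers precision and missing content lowers recall, so the score penalizes both over-extraction and under-extraction.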
Modules§
- encoding: Character encoding detection and transcoding.
- markdown: Markdown processing utilities (escaping, table conversion).
- page_type: Page type classification for web content extraction (URL heuristics, HTML signals, ML classifier).
- scoring: F-Score calculation for accuracy benchmarking.
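To give a feel for what a URL heuristic in page type classification might look like, here is a minimal sketch. The `PageType` enum mirrors the seven page types listed under Features, but the enum itself, the `guess_from_url` function, and the path rules are all assumptions made for illustration; the `page_type` module's real signals and API may look quite different.

```rust
/// The seven page types named in the feature list (hypothetical enum).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Guesses a page type from path fragments in the URL.
/// Returns `None` when no rule matches, so that other signals
/// (HTML structure, an ML classifier) can decide instead.
fn guess_from_url(url: &str) -> Option<PageType> {
    // Illustrative rules only; a real rule set would be far larger.
    const RULES: [(&str, PageType); 4] = [
        ("/forum/", PageType::Forum),
        ("/docs/", PageType::Documentation),
        ("/product/", PageType::Product),
        ("/blog/", PageType::Article),
    ];

    let lower = url.to_ascii_lowercase();
    RULES
        .iter()
        .find(|(needle, _)| lower.contains(*needle))
        .map(|&(_, page_type)| page_type)
}

fn main() {
    for url in ["https://example.com/docs/intro", "https://example.com/about"] {
        println!("{url} -> {:?}", guess_from_url(url));
    }
}
```

Returning `Option` rather than a default type keeps the heuristic honest: a URL with no recognizable pattern carries no evidence either way.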
Structs§
- ExtractResult: Result of content extraction from an HTML document.
- ImageData: Structured image data extracted from content.
- Metadata: Metadata extracted from an HTML document.
- Options: Configuration options for content extraction.
Enums§
- Error: Error type for extraction operations.
Functions§
- extract: Extracts main content from an HTML document using default options.
- extract_bytes: Extracts main content from HTML bytes with automatic encoding detection.
- extract_bytes_with_options: Extracts main content from HTML bytes with custom options and automatic encoding detection.
- extract_with_options: Extracts main content from an HTML document with custom options.
Type Aliases§
- Result: Result type alias for extraction operations.