§rs-trafilatura
Rust port of trafilatura - a web content extraction library.
This library extracts clean, readable content from web pages by stripping navigation, advertisements, and boilerplate while preserving meaningful text, metadata, and document structure.
§Quick Start
use rs_trafilatura::{extract, Options};
let html = r#"<html><head><title>My Article</title></head>
<body><article><p>Main content here.</p></article></body></html>"#;
let result = extract(html)?;
println!("Title: {:?}", result.metadata.title);
println!("Content: {}", result.content_text);
§Features
- Content Extraction: Identifies and extracts the main article content
- Metadata Extraction: Title, author, date, language, sitename, and more
- Boilerplate Removal: Strips navigation, ads, footers, and other noise
- Configurable: Options to tune precision/recall tradeoff
§Accuracy
Achieves an F1 score of 0.860 on a 1,502-page benchmark, using page type classification, ML-based content detection, and extraction quality confidence scoring.
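For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The following is a minimal sketch of that arithmetic using only the standard library. The names `FScore` and `f_score` are hypothetical and are not taken from this crate, and the source does not state which unit (tokens, characters, or something else) the benchmark counts, so the sketch works from plain true-positive, false-positive, and false-negative counts.

```rust
/// Precision, recall, and F1 derived from raw counts.
/// (Hypothetical type for illustration; not this crate's API.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FScore {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Computes precision, recall, and F1 from true positives (`tp`),
/// false positives (`fp`), and false negatives (`fn_`).
/// Any ratio with a zero denominator is reported as 0.0.
pub fn f_score(tp: u64, fp: u64, fn_: u64) -> FScore {
    let ratio = |num: u64, den: u64| {
        if den == 0 {
            0.0
        } else {
            num as f64 / den as f64
        }
    };
    let precision = ratio(tp, tp + fp);
    let recall = ratio(tp, tp + fn_);
    // F1 is the harmonic mean of precision and recall.
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    FScore { precision, recall, f1 }
}
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a high F1 requires both precision and recall to be high, which is why it suits the precision/recall tradeoff mentioned under Features.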
Modules§
- encoding: Character encoding detection and transcoding.
- markdown: Markdown processing utilities (escaping, table conversion).
- page_type: Page type classification for web content extraction (URL heuristics, HTML signals, ML classifier).
- scoring: F-Score calculation for accuracy benchmarking.
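The markdown module is described as covering escaping and table conversion. As a rough illustration of what those two tasks involve, here is a standard-library-only sketch. The function names `escape_markdown` and `to_markdown_table` and the chosen set of escaped characters are assumptions made for this example; they are not the module's actual API or behavior.

```rust
/// Backslash-escapes characters that Markdown would otherwise interpret
/// as formatting. The character set here is an illustrative assumption.
pub fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if matches!(ch, '\\' | '`' | '*' | '_' | '[' | ']' | '|') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Renders rows of cells as a pipe-delimited Markdown table.
/// The first row is treated as the header and is followed by a
/// `---` separator row. Every cell is escaped before it is written.
pub fn to_markdown_table(rows: &[Vec<&str>]) -> String {
    let mut out = String::new();
    for (i, row) in rows.iter().enumerate() {
        let cells: Vec<String> = row.iter().map(|c| escape_markdown(c)).collect();
        out.push_str(&format!("| {} |\n", cells.join(" | ")));
        if i == 0 {
            let sep = vec!["---"; row.len()];
            out.push_str(&format!("| {} |\n", sep.join(" | ")));
        }
    }
    out
}
```

Escaping the pipe character matters most inside tables, where an unescaped `|` in cell text would be read as a column boundary.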
Structs§
- ExtractResult: Result of content extraction from an HTML document.
- ImageData: Structured image data extracted from content.
- Metadata: Metadata extracted from an HTML document.
- Options: Configuration options for content extraction.
Enums§
- Error: Error type for extraction operations.
Functions§
- extract: Extracts main content from an HTML document using default options.
- extract_bytes: Extracts main content from HTML bytes with automatic encoding detection.
- extract_bytes_with_options: Extracts main content from HTML bytes with custom options and automatic encoding detection.
- extract_with_options: Extracts main content from an HTML document with custom options.
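The byte-oriented functions above mention automatic encoding detection. The source does not say how the crate performs it, so the following is only a sketch of one common first step: checking for a byte order mark (BOM) before falling back to UTF-8. The names `BomEncoding`, `sniff_bom`, and `decode_html_bytes` are hypothetical, and a real detector would typically also consult `<meta charset>` declarations and legacy single-byte encodings.

```rust
/// Encodings that can be identified from a byte order mark alone.
/// (Hypothetical type for illustration; not this crate's API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding implied by a leading BOM and the BOM length in bytes,
/// or `None` when the input does not start with a recognized BOM.
/// UTF-32 BOMs are deliberately not handled in this sketch.
pub fn sniff_bom(bytes: &[u8]) -> Option<(BomEncoding, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((BomEncoding::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else {
        None
    }
}

/// Decodes HTML bytes to a `String`, using the BOM when present and
/// assuming UTF-8 otherwise. Invalid sequences become U+FFFD.
pub fn decode_html_bytes(bytes: &[u8]) -> String {
    match sniff_bom(bytes) {
        Some((BomEncoding::Utf8, n)) => String::from_utf8_lossy(&bytes[n..]).into_owned(),
        Some((enc, n)) => {
            // Reassemble 16-bit code units in the byte order the BOM indicated;
            // a trailing odd byte is ignored.
            let units: Vec<u16> = bytes[n..]
                .chunks_exact(2)
                .map(|pair| match enc {
                    BomEncoding::Utf16Be => u16::from_be_bytes([pair[0], pair[1]]),
                    _ => u16::from_le_bytes([pair[0], pair[1]]),
                })
                .collect();
            String::from_utf16_lossy(&units)
        }
        None => String::from_utf8_lossy(bytes).into_owned(),
    }
}
```

A decoding step of this kind is what makes a bytes entry point useful for content fetched straight off the network, where the caller cannot assume the payload is already valid UTF-8.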
Type Aliases§
- Result: Result type alias for extraction operations.