Crate rs_trafilatura

§rs-trafilatura

Rust port of trafilatura - a web content extraction library.

This library extracts clean, readable content from web pages by stripping navigation, advertisements, and boilerplate while preserving meaningful text, metadata, and document structure.

§Quick Start

use rs_trafilatura::extract;

fn main() -> Result<(), rs_trafilatura::Error> {
    let html = r#"<html><head><title>My Article</title></head>
<body><article><p>Main content here.</p></article></body></html>"#;

    // Extract with default options; `?` propagates any extraction error.
    let result = extract(html)?;
    println!("Title: {:?}", result.metadata.title);
    println!("Content: {}", result.content_text);
    Ok(())
}

§Features

Content Extraction: Identifies and extracts the main article content
Page Type Classification: XGBoost classifier detects 7 page types (article, forum, product, collection, listing, documentation, service)
Per-Type Extraction Profiles: Type-specific boilerplate removal, content selectors, and extraction strategies
Extraction Quality Predictor: ML confidence score (0.0-1.0) that predicts extraction F1, enabling hybrid pipelines that fall back to an LLM for low-confidence pages
Metadata Extraction: Title, author, date, language, sitename, and more from JSON-LD, Open Graph, Dublin Core, and HTML meta tags
Markdown Output: GitHub Flavored Markdown with headings, lists, tables, code blocks
Configurable: Options to tune precision/recall tradeoff
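
The hybrid pipeline mentioned under Extraction Quality Predictor can be sketched as a simple routing step. This is a minimal, standalone illustration: the `Route` enum, the `route` function, and the 0.7 threshold are hypothetical and not part of the crate, and the sketch only assumes that a confidence score in the 0.0-1.0 range has already been obtained from an extraction.

```rust
/// Where a page goes next in a hybrid pipeline (hypothetical type, not part of the crate).
#[derive(Debug, PartialEq)]
enum Route {
    /// Keep the heuristic extraction as-is.
    KeepExtraction,
    /// Hand the page to a slower fallback, such as an LLM.
    Fallback,
}

/// Routes a page based on a predicted-quality score expected in 0.0..=1.0.
/// `threshold` is an assumed tuning knob; NaN or out-of-range scores go to the fallback.
fn route(confidence: f64, threshold: f64) -> Route {
    if (0.0..=1.0).contains(&confidence) && confidence >= threshold {
        Route::KeepExtraction
    } else {
        Route::Fallback
    }
}

fn main() {
    // Illustrative scores only; real values would come from the extraction result.
    for score in [0.92, 0.41] {
        println!("{score:.2} -> {:?}", route(score, 0.7));
    }
}
```

Sending malformed scores to the fallback is a deliberately conservative choice: a page whose quality cannot be judged is treated like a low-confidence page.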

§Accuracy

rs-trafilatura achieves an F1 of 0.859 on WCEB, a 1,497-page multi-type benchmark, outperforming Trafilatura (0.792) and the neural approaches MinerU-HTML (0.827) and ReaderLM-v2 (0.741). An F1 of 0.893 on a 511-page held-out test set indicates that the results generalize.
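
For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute F1 for text extraction: token-level overlap between the extracted text and a reference text. The `token_counts` and `token_f1` functions are hypothetical, and the tokenization (whitespace splitting) is an assumption; the benchmark and the crate's scoring module may define the score differently.

```rust
use std::collections::HashMap;

/// Counts whitespace-separated tokens in `text`.
fn token_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}

/// Token-level F1 between an extraction and a reference text.
/// Returns 0.0 when the two share no tokens (including when either is empty).
fn token_f1(extracted: &str, reference: &str) -> f64 {
    let ext = token_counts(extracted);
    let refc = token_counts(reference);
    let ext_total: usize = ext.values().sum();
    let ref_total: usize = refc.values().sum();

    // A token counts as matched at most as often as it appears on both sides.
    let overlap: usize = ext
        .iter()
        .map(|(tok, n)| (*n).min(*refc.get(*tok).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }

    let precision = overlap as f64 / ext_total as f64;
    let recall = overlap as f64 / ref_total as f64;
    2.0 * precision * recall / (precision + recall)
}

fn main() {
    // 3 shared tokens: precision 3/4, recall 3/5.
    let f1 = token_f1("the quick brown fox", "the quick red fox jumps");
    println!("F1 = {f1:.3}");
}
```

In this example precision is 3/4 and recall is 3/5, so the harmonic mean is 2/3. Boilerplate left in the output lowers precision, while dropped content lowers recall.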

Modules§

encoding: Character encoding detection and transcoding.
markdown: Markdown processing utilities (escaping, table conversion).
page_type: Page type classification for web content extraction (URL heuristics, HTML signals, ML classifier).
scoring: F-Score calculation for accuracy benchmarking.

Structs§

ExtractResult: Result of content extraction from an HTML document.
ImageData: Structured image data extracted from content.
Metadata: Metadata extracted from an HTML document.
Options: Configuration options for content extraction.

Enums§

Error: Error type for extraction operations.

Functions§

extract: Extracts main content from an HTML document using default options.
extract_bytes: Extracts main content from HTML bytes with automatic encoding detection.
extract_bytes_with_options: Extracts main content from HTML bytes with custom options and automatic encoding detection.
extract_with_options: Extracts main content from an HTML document with custom options.
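
The byte-oriented functions exist because HTML fetched from the network is not guaranteed to be valid UTF-8. The source does not describe how the crate detects encodings, so the following is only a standalone sketch of the simplest such technique, byte-order-mark sniffing with a lossy UTF-8 fallback. The `Bom`, `sniff_bom`, `decode`, and `decode_utf16` names are hypothetical; a real detector would typically also consult `<meta charset>` declarations and HTTP headers.

```rust
/// Encodings this sketch can recognize from a byte-order mark (hypothetical type).
#[derive(Debug, PartialEq)]
enum Bom {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding signalled by a leading BOM, if any.
/// UTF-32 BOMs are deliberately ignored in this sketch.
fn sniff_bom(bytes: &[u8]) -> Option<Bom> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(Bom::Utf8)
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(Bom::Utf16Le)
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(Bom::Utf16Be)
    } else {
        None
    }
}

/// Decodes UTF-16 code units using the given byte-order conversion.
/// A trailing odd byte is dropped; invalid surrogates become U+FFFD.
fn decode_utf16(bytes: &[u8], to_u16: fn([u8; 2]) -> u16) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| to_u16([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}

/// Decodes raw bytes to a `String`: honors a BOM if present, else lossy UTF-8.
fn decode(bytes: &[u8]) -> String {
    match sniff_bom(bytes) {
        Some(Bom::Utf8) => String::from_utf8_lossy(&bytes[3..]).into_owned(),
        Some(Bom::Utf16Le) => decode_utf16(&bytes[2..], u16::from_le_bytes),
        Some(Bom::Utf16Be) => decode_utf16(&bytes[2..], u16::from_be_bytes),
        None => String::from_utf8_lossy(bytes).into_owned(),
    }
}

fn main() {
    let utf16le = [0xFF, 0xFE, b'h', 0, b'i', 0];
    println!("{}", decode(&utf16le));
}
```

Once the bytes are decoded this way, the resulting string could be passed to a string-based entry point such as `extract`.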

Type Aliases§

Result: Result type alias for extraction operations.
