Crate rs_trafilatura

§rs-trafilatura

Rust port of trafilatura - a web content extraction library.

This library extracts clean, readable content from web pages by stripping navigation, advertisements, and boilerplate while preserving meaningful text, metadata, and document structure.

§Quick Start

use rs_trafilatura::extract;

fn main() -> Result<(), rs_trafilatura::Error> {
    let html = r#"<html><head><title>My Article</title></head>
<body><article><p>Main content here.</p></article></body></html>"#;

    // Extract with default options; `?` propagates any extraction error.
    let result = extract(html)?;
    println!("Title: {:?}", result.metadata.title);
    println!("Content: {}", result.content_text);
    Ok(())
}

§Features

  • Content Extraction: Identifies and extracts the main article content
  • Metadata Extraction: Title, author, date, language, sitename, and more
  • Boilerplate Removal: Strips navigation, ads, footers, and other noise
  • Configurable: Options to tune precision/recall tradeoff

§Accuracy

Achieves an F1 score of 0.860 on a 1,502-page benchmark, using page type classification, ML-based content detection, and extraction-quality confidence scoring.
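
For readers unfamiliar with the metric, the sketch below shows one common way an F1 score can be computed for a single page: token overlap between the extracted text and a reference text. It is an illustration only. How the benchmark above actually tokenizes and aggregates scores is not stated here, and `token_counts` and `f1_score` are hypothetical helper names, not part of this crate's `scoring` module.

```rust
use std::collections::HashMap;

/// Counts occurrences of each whitespace-separated token.
/// (Hypothetical helper; whitespace tokenization is an assumption.)
fn token_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok).or_insert(0) += 1;
    }
    counts
}

/// Token-level F1 between extracted text and a reference text.
/// Returns 0.0 when the two texts share no tokens.
fn f1_score(extracted: &str, reference: &str) -> f64 {
    let ext = token_counts(extracted);
    let refc = token_counts(reference);

    // Overlap counts each token at most as often as it appears in both texts.
    let overlap: usize = ext
        .iter()
        .map(|(tok, &n)| n.min(refc.get(tok).copied().unwrap_or(0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }

    let ext_total: usize = ext.values().sum();
    let ref_total: usize = refc.values().sum();

    // Precision: share of extracted tokens that belong to the reference.
    let precision = overlap as f64 / ext_total as f64;
    // Recall: share of reference tokens that were extracted.
    let recall = overlap as f64 / ref_total as f64;

    // F1 is the harmonic mean of precision and recall.
    2.0 * precision * recall / (precision + recall)
}
```

In this sketch, an extraction that keeps all three reference tokens but also lets two boilerplate tokens through has precision 0.6 and recall 1.0, so the harmonic mean penalizes the leftover noise even though nothing was missed.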

Modules§

encoding
Character encoding detection and transcoding.
markdown
Markdown processing utilities (escaping, table conversion).
page_type
Page type classification (URL heuristics, HTML signals, ML classifier).
scoring
F-Score calculation for accuracy benchmarking.

Structs§

ExtractResult
Result of content extraction from an HTML document.
ImageData
Structured image data extracted from content.
Metadata
Metadata extracted from an HTML document.
Options
Configuration options for content extraction.

Enums§

Error
Error type for extraction operations.

Functions§

extract
Extracts main content from an HTML document using default options.
extract_bytes
Extracts main content from HTML bytes with automatic encoding detection.
extract_bytes_with_options
Extracts main content from HTML bytes with custom options and automatic encoding detection.
extract_with_options
Extracts main content from an HTML document with custom options.

Type Aliases§

Result
Result type alias for extraction operations.