trafilatura
Extract readable content, comments, and metadata from web pages.
A Rust port of go-trafilatura, which itself ports the Python trafilatura library by Adrien Barbaresi.
Usage
Add to your Cargo.toml:
[dependencies]
= "0.2"
Library
use ;
let html = r#"<html><body>
<nav>Menu items</nav>
<article><p>This is the main article content.</p></article>
<footer>Copyright 2024</footer>
</body></html>"#;
let result = extract.unwrap;
println!; // "This is the main article content."
println!; // extracted <title> or og:title
With options
use ;
let opts = default
.with_fallback // use readability fallback
.with_links // preserve <a> tags in HTML output
.with_focus; // extract more content
let result = extract.unwrap;
CLI
# Extract from a URL
# Extract from a file
# Include links in output
What it extracts
- Content — main article body as both plain text and cleaned HTML
- Comments — user comments, separately from article content
- Metadata — title, author, date, description, site name, categories, tags, license, language, and image URL (from meta tags, OpenGraph, JSON-LD)
How it works
- Parse HTML and extract metadata from <meta>, OpenGraph, and JSON-LD
- Clean the DOM (remove scripts, styles, hidden elements, boilerplate)
- Score and select content using CSS selector rules and paragraph heuristics
- If primary extraction yields too little, fall back to readability-based extraction or baseline (last-resort) extraction
- Filter duplicates and check language constraints
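The fallback step in the list above can be pictured as a small cascade. The following is only a sketch of that idea, not the crate's actual code: `Strategy`, `MIN_CHARS`, `long_enough`, and `select_extraction` are hypothetical names, and the 200-character cutoff is an assumed placeholder for whatever "too little" means internally.

```rust
/// Which stage produced the final text (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Strategy {
    Primary,
    Readability,
    Baseline,
}

/// Assumed cutoff for "too little" text; the real threshold is not documented here.
const MIN_CHARS: usize = 200;

/// True when the trimmed text has at least `MIN_CHARS` characters.
fn long_enough(text: &str) -> bool {
    text.trim().chars().count() >= MIN_CHARS
}

/// Tries the stages in order and returns the first result that is long enough.
/// The fallbacks are closures, so they only run when an earlier stage fell short.
fn select_extraction(
    primary: String,
    readability: impl FnOnce() -> String,
    baseline: impl FnOnce() -> String,
) -> (Strategy, String) {
    if long_enough(&primary) {
        return (Strategy::Primary, primary);
    }
    let fallback = readability();
    if long_enough(&fallback) {
        return (Strategy::Readability, fallback);
    }
    // Last resort: accept whatever the baseline extractor returns.
    (Strategy::Baseline, baseline())
}
```

Passing the later stages as closures keeps the common case cheap: when primary extraction already yields enough text, neither fallback is evaluated.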
Benchmarks
Speed
Extraction time per document, Rust vs Go (go-trafilatura) vs Python (trafilatura):
| Document | Rust | Go | Python |
|---|---|---|---|
| small (6 KB) | 793 µs | 1.19 ms | 1.1 ms |
| medium (85 KB) | 5.7 ms | 5.6 ms | 6.2 ms |
| large (382 KB) | 3.6 ms | 4.9 ms | 4.7 ms |
| xlarge (906 KB) | 10.4 ms | 13.9 ms | 13.9 ms |
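For readers who want to sanity-check per-document timings like these on their own inputs, a mean-over-iterations loop is usually enough. This is a minimal sketch using only the standard library; it is not the benchmark harness behind the table, and `mean_time_per_run` is a hypothetical helper.

```rust
use std::time::{Duration, Instant};

/// Runs `work` once untimed, then `iterations` more times,
/// and returns the mean wall-clock time per timed run.
fn mean_time_per_run<F: FnMut()>(iterations: u32, mut work: F) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    // Untimed warm-up so caches and lazy initialisation do not skew the mean.
    work();
    let start = Instant::now();
    for _ in 0..iterations {
        work();
    }
    start.elapsed() / iterations
}
```

Pass a closure that runs the extraction on an HTML string already held in memory, so that file I/O is not included in the measurement. Wrapping the result in `std::hint::black_box` helps keep the optimizer from discarding the work.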
Extraction quality
Evaluated on a 960-entry dataset (strings expected to be present/absent in extracted text):
| Implementation | Precision | Recall | Accuracy | F-score |
|---|---|---|---|---|
| Rust (balanced + fallback) | 0.908 | 0.919 | 0.913 | 0.913 |
| Python trafilatura | 0.920 | 0.909 | 0.915 | 0.914 |
| Go go-trafilatura | 0.909 | 0.921 | 0.914 | 0.915 |
All three implementations produce near-identical quality scores. Minor differences stem from HTML parser handling and Unicode normalization.
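The four columns follow the usual definitions over true/false positives and negatives, and the F-score values in the table are consistent with the harmonic mean of precision and recall. The sketch below shows that arithmetic; how the evaluation script maps present/absent strings onto the four counts is an assumption here, and `Counts`, `Scores`, and `score` are hypothetical names.

```rust
/// Outcome counts, assuming this mapping from the present/absent string checks.
struct Counts {
    true_pos: u32,  // expected-present string was found
    false_neg: u32, // expected-present string was missing
    false_pos: u32, // expected-absent string was found
    true_neg: u32,  // expected-absent string was not found
}

/// The four metrics reported in the table.
struct Scores {
    precision: f64,
    recall: f64,
    accuracy: f64,
    f_score: f64,
}

/// Computes the metrics; yields NaN for a metric whose denominator is zero.
fn score(c: &Counts) -> Scores {
    let tp = f64::from(c.true_pos);
    let fneg = f64::from(c.false_neg);
    let fp = f64::from(c.false_pos);
    let tn = f64::from(c.true_neg);

    let precision = tp / (tp + fp);
    let recall = tp / (tp + fneg);
    let accuracy = (tp + tn) / (tp + tn + fp + fneg);
    // F1: harmonic mean of precision and recall.
    let f_score = 2.0 * precision * recall / (precision + recall);

    Scores { precision, recall, accuracy, f_score }
}
```

As a check against the table, precision 0.908 and recall 0.919 give 2 × 0.908 × 0.919 / (0.908 + 0.919) ≈ 0.913, matching the Rust row.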
Measured on Apple M4 Max, Rust 1.93, macOS 15.7.
Reproduce:
License
Apache-2.0