Web content extraction library.
trafilatura extracts the main text, comments, and metadata from web pages,
stripping boilerplate (navigation, ads, footers) while preserving the
article body. It is a faithful Rust port of
go-trafilatura.
§Quick start
use trafilatura::{extract, Options};
let html = r#"<html><body>
<nav>Menu items</nav>
<article><p>This is the main article content.</p></article>
<footer>Copyright 2024</footer>
</body></html>"#;
let result = extract(html, &Options::default()).unwrap();
assert!(result.content_text.contains("main article content"));
§Features
- Content extraction — identifies and extracts the main body text using CSS selector rules, paragraph scoring, and heuristic filters.
- Comment extraction — separately extracts user comments (optional).
- Metadata — extracts title, author, date, description, categories, tags, license, and more from meta tags, OpenGraph, and JSON-LD.
- Fallback strategies — when primary extraction yields too little content, falls back to readability-based or baseline extraction.
- Language filtering — optionally reject documents that don’t match a target language (detected via whatlang).
- Deduplication — LRU-based detection of duplicate content across multiple extractions.
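To make the deduplication bullet concrete, here is a minimal, standard-library-only sketch of LRU-style duplicate detection. It is not the crate's implementation: `SeenCache`, `fingerprint`, and `check_and_insert` are hypothetical names invented for illustration, and the whitespace and case normalisation is an assumption about what counts as a duplicate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Hypothetical bounded cache of text fingerprints (not part of the trafilatura API).
struct SeenCache {
    capacity: usize,
    /// Fingerprints ordered by recency; the front is the least recently seen.
    order: VecDeque<u64>,
    /// Same fingerprints, kept in a set for fast membership checks.
    members: HashSet<u64>,
}

impl SeenCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            capacity,
            order: VecDeque::new(),
            members: HashSet::new(),
        }
    }

    /// Hashes the text word by word, ignoring case and whitespace runs
    /// (an assumed normalisation, chosen only for this sketch).
    fn fingerprint(text: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        for word in text.split_whitespace() {
            word.to_lowercase().hash(&mut hasher);
        }
        hasher.finish()
    }

    /// Returns `true` if an equivalent text is still in the cache.
    /// The text is recorded (or its recency refreshed) either way.
    fn check_and_insert(&mut self, text: &str) -> bool {
        let key = Self::fingerprint(text);
        if self.members.contains(&key) {
            // Move the existing entry to the most-recent position.
            if let Some(pos) = self.order.iter().position(|k| *k == key) {
                self.order.remove(pos);
            }
            self.order.push_back(key);
            return true;
        }
        // Evict the least recently seen entry once the cache is full.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.members.remove(&oldest);
            }
        }
        self.order.push_back(key);
        self.members.insert(key);
        false
    }
}
```

A caller would keep one `SeenCache` alive across several extractions and skip any text for which `check_and_insert` returns `true`. The linear scan used to refresh recency is acceptable for small capacities; a production cache would more likely pair a hash map with a linked list.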
§Builder-style options
use trafilatura::{extract, Options, ExtractionFocus};
let html = "<html><body><article><p>Hello world</p></article></body></html>";
let opts = Options::default()
.with_fallback(true)
.with_links(true)
.with_focus(ExtractionFocus::FavorRecall);
let result = extract(html, &opts).unwrap();
assert_eq!(result.content_text, "Hello world");
§Markdown output (requires markdown feature)
use trafilatura::{extract, create_markdown_document, Options};
let html = "<html><body><article><p>Hello world</p></article></body></html>";
let result = extract(html, &Options::default()).unwrap();
// Just the content as markdown:
let md = result.content_markdown();
// Full document with YAML front matter + content + comments:
let doc = create_markdown_document(&result);
§Related crates
- libreadability — Mozilla Readability port for extracting a clean article DOM subtree.
- justext — paragraph-level boilerplate removal using stopword density.
- html2markdown — converts HTML to Markdown via an intermediate AST.
Re-exports§
- pub use error::TrafilaturaError;
- pub use options::Config;
- pub use options::ExtractionFocus;
- pub use options::FallbackCandidates;
- pub use options::HtmlDateMode;
- pub use options::Options;
- pub use result::ExtractResult;
- pub use result::Metadata;
Modules§
Functions§
- create_readable_document: Creates a complete, self-contained HTML document from an ExtractResult.
- extract: Parse an HTML string and extract its main readable content.
- extract_document: Extract readable content from an already-parsed Document.