Crate trafilatura


Web content extraction library.

trafilatura extracts the main text, comments, and metadata from web pages, stripping boilerplate (navigation, ads, footers) while preserving the article body. It is a faithful Rust port of go-trafilatura.

§Quick start

use trafilatura::{extract, Options};

let html = r#"<html><body>
  <nav>Menu items</nav>
  <article><p>This is the main article content.</p></article>
  <footer>Copyright 2024</footer>
</body></html>"#;

let result = extract(html, &Options::default()).unwrap();
assert!(result.content_text.contains("main article content"));

§Features

Content extraction — identifies and extracts the main body text using CSS selector rules, paragraph scoring, and heuristic filters.
Comment extraction — separately extracts user comments (optional).
Metadata — extracts title, author, date, description, categories, tags, license, and more from meta tags, OpenGraph, and JSON-LD.
Fallback strategies — when primary extraction yields too little content, falls back to readability-based or baseline extraction.
Language filtering — optionally reject documents that don’t match a target language (detected via whatlang).
Deduplication — LRU-based detection of duplicate content across multiple extractions.
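The deduplication feature above is described only as "LRU-based". As a rough illustration of that idea, and not the crate's actual implementation, the sketch below keeps a bounded set of text fingerprints and evicts the least recently seen one when full. The type `SeenTexts` and its method `check_and_insert` are hypothetical names introduced for this example; only the standard library is used.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Hypothetical helper (not part of trafilatura): remembers the
/// fingerprints of the most recently seen texts, up to `capacity`.
pub struct SeenTexts {
    capacity: usize,
    order: VecDeque<u64>, // front = least recently seen
    known: HashSet<u64>,
}

impl SeenTexts {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            order: VecDeque::new(),
            known: HashSet::new(),
        }
    }

    /// Returns `true` if `text` was seen recently. New texts are
    /// recorded; known texts have their recency refreshed.
    pub fn check_and_insert(&mut self, text: &str) -> bool {
        // Fingerprint the trimmed text so surrounding whitespace is ignored.
        let mut hasher = DefaultHasher::new();
        text.trim().hash(&mut hasher);
        let key = hasher.finish();

        if self.known.contains(&key) {
            // Move the key to the back (most recently seen).
            if let Some(pos) = self.order.iter().position(|k| *k == key) {
                self.order.remove(pos);
            }
            self.order.push_back(key);
            return true;
        }

        if self.capacity == 0 {
            return false; // nothing can be remembered
        }
        if self.order.len() == self.capacity {
            // Evict the least recently seen fingerprint.
            if let Some(oldest) = self.order.pop_front() {
                self.known.remove(&oldest);
            }
        }
        self.order.push_back(key);
        self.known.insert(key);
        false
    }
}
```

A caller would create one `SeenTexts` per crawl and pass each extracted paragraph or document text through `check_and_insert`, skipping those that return `true`. The linear scan on refresh keeps the sketch short; a production cache would likely use a linked structure for constant-time updates.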

§Builder-style options

use trafilatura::{extract, Options, ExtractionFocus};

let html = "<html><body><article><p>Hello world</p></article></body></html>";
let opts = Options::default()
    .with_fallback(true)
    .with_links(true)
    .with_focus(ExtractionFocus::FavorRecall);
let result = extract(html, &opts).unwrap();
assert_eq!(result.content_text, "Hello world");
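The chained `with_*` calls above follow the common consuming-builder pattern, where each method takes `self` by value and returns it. The sketch below shows that pattern on a stand-in type, `DemoOptions`, which is hypothetical and is not trafilatura's `Options`; the real struct may have different fields and method signatures.

```rust
/// Stand-in for illustration only; not trafilatura's `Options`.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DemoOptions {
    pub fallback: bool,
    pub links: bool,
}

impl DemoOptions {
    /// Takes `self` by value and returns it, so calls can be chained.
    pub fn with_fallback(mut self, enabled: bool) -> Self {
        self.fallback = enabled;
        self
    }

    /// Same pattern for a second setting.
    pub fn with_links(mut self, enabled: bool) -> Self {
        self.links = enabled;
        self
    }
}
```

With this shape, `DemoOptions::default().with_fallback(true).with_links(true)` builds a value in one expression, and any setting that is not mentioned keeps its `Default` value.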

§Markdown output (requires `markdown` feature)

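use trafilatura::{extract, create_markdown_document, Options};

// Sample input so the snippet is self-contained.
let html = "<html><body><article><p>Hello world</p></article></body></html>";
let result = extract(html, &Options::default()).unwrap();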

// Just the content as markdown:
let md = result.content_markdown();

// Full document with YAML front matter + content + comments:
let doc = create_markdown_document(&result);

§Related crates

libreadability — Mozilla Readability port for extracting a clean article DOM subtree.
justext — paragraph-level boilerplate removal using stopword density.
html2markdown — converts HTML to Markdown via an intermediate AST.
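The justext entry above mentions stopword density as the signal for boilerplate removal. As a minimal sketch of that signal, assuming a caller-supplied stopword set and an illustrative threshold, the code below computes the fraction of words in a paragraph that are stopwords. The functions `stopword_density` and `looks_like_prose` are hypothetical names for this example and are not the justext crate's API, which presumably combines this with other signals.

```rust
use std::collections::HashSet;

/// Fraction of words in `paragraph` that appear in `stopwords`.
/// Returns 0.0 for a paragraph with no words.
pub fn stopword_density(paragraph: &str, stopwords: &HashSet<&str>) -> f64 {
    // Lowercase each token and strip surrounding punctuation.
    let words: Vec<String> = paragraph
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter(|w| !w.is_empty())
        .collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words
        .iter()
        .filter(|w| stopwords.contains(w.as_str()))
        .count();
    hits as f64 / words.len() as f64
}

/// Illustrative classifier: running prose tends to contain many
/// stopwords, while menus and link lists contain few.
pub fn looks_like_prose(paragraph: &str, stopwords: &HashSet<&str>, threshold: f64) -> bool {
    stopword_density(paragraph, stopwords) >= threshold
}
```

With a small English stopword set, a sentence such as "The cat is a friend of the dog." scores well above a navigation string such as "Home | Products | Contact | Login", which contains no stopwords at all. The threshold is a tuning parameter and the value used by any real implementation is not stated in this document.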

Re-exports§

pub use error::TrafilaturaError;
pub use options::Config;
pub use options::ExtractionFocus;
pub use options::FallbackCandidates;
pub use options::HtmlDateMode;
pub use options::Options;
pub use result::ExtractResult;
pub use result::Metadata;

Modules§

dom
error
metadata
options
result
utils

Functions§

create_readable_document: Creates a complete, self-contained HTML document from an ExtractResult.
extract: Parse an HTML string and extract its main readable content.
extract_document: Extract readable content from an already-parsed Document.
