Crate trafilatura

Web content extraction library.

trafilatura extracts the main text, comments, and metadata from web pages, stripping boilerplate (navigation, ads, footers) while preserving the article body. It is a faithful Rust port of go-trafilatura.

§Quick start

use trafilatura::{extract, Options};

let html = r#"<html><body>
  <nav>Menu items</nav>
  <article><p>This is the main article content.</p></article>
  <footer>Copyright 2024</footer>
</body></html>"#;

let result = extract(html, &Options::default()).unwrap();
assert!(result.content_text.contains("main article content"));

§Features

  • Content extraction — identifies and extracts the main body text using CSS selector rules, paragraph scoring, and heuristic filters.
  • Comment extraction — separately extracts user comments (optional).
  • Metadata — extracts title, author, date, description, categories, tags, license, and more from meta tags, OpenGraph, and JSON-LD.
  • Fallback strategies — when primary extraction yields too little content, falls back to readability-based or baseline extraction.
  • Language filtering — optionally reject documents that don’t match a target language (detected via whatlang).
  • Deduplication — LRU-based detection of duplicate content across multiple extractions.
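
The deduplication bullet above can be illustrated with a small, standard-library-only sketch. It is not the crate's implementation: `LruDedup`, `fingerprint`, and `is_duplicate` are hypothetical names, and hashing the trimmed text is an assumption about what counts as "the same content".

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Hypothetical helper (not part of the trafilatura API): remembers the
/// fingerprints of the most recently seen text segments.
struct LruDedup {
    capacity: usize,
    order: VecDeque<u64>, // front = least recently used
    seen: HashSet<u64>,
}

impl LruDedup {
    fn new(capacity: usize) -> Self {
        LruDedup { capacity, order: VecDeque::new(), seen: HashSet::new() }
    }

    /// Assumption: two segments are "the same" if their trimmed text hashes equal.
    fn fingerprint(text: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        text.trim().hash(&mut hasher);
        hasher.finish()
    }

    /// Returns true if `text` is still in the cache; otherwise records it.
    fn is_duplicate(&mut self, text: &str) -> bool {
        let key = Self::fingerprint(text);
        if self.seen.contains(&key) {
            // Refresh recency: move the key to the back of the queue.
            if let Some(pos) = self.order.iter().position(|k| *k == key) {
                self.order.remove(pos);
            }
            self.order.push_back(key);
            return true;
        }
        if self.capacity == 0 {
            return false;
        }
        if self.order.len() == self.capacity {
            // Evict the least recently used fingerprint.
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(key);
        self.seen.insert(key);
        false
    }
}

fn main() {
    let mut dedup = LruDedup::new(2);
    assert!(!dedup.is_duplicate("Subscribe to our newsletter"));
    assert!(dedup.is_duplicate("Subscribe to our newsletter"));
    assert!(!dedup.is_duplicate("First paragraph"));
    // The cache is full, so this insert evicts the newsletter line.
    assert!(!dedup.is_duplicate("Second paragraph"));
    assert!(!dedup.is_duplicate("Subscribe to our newsletter"));
}
```

The bounded cache is the point of the LRU design: boilerplate repeated across many pages of a crawl stays cached because it keeps being refreshed, while memory use stays fixed.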

§Builder-style options

use trafilatura::{extract, Options, ExtractionFocus};

let html = "<html><body><article><p>Hello world</p></article></body></html>";
let opts = Options::default()
    .with_fallback(true)
    .with_links(true)
    .with_focus(ExtractionFocus::FavorRecall);
let result = extract(html, &opts).unwrap();
assert_eq!(result.content_text, "Hello world");
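
To make the idea behind `with_fallback(true)` concrete, here is a toy sketch of "try the primary strategy, fall back when it yields too little". Everything in it is an assumption for illustration: `first_sufficient`, `article_only`, `baseline`, `strip_tags`, and the `min_chars` threshold are hypothetical and do not reflect how the crate selects or scores its fallback candidates.

```rust
/// A hypothetical extraction strategy: returns text, or None if it finds nothing.
type Strategy = fn(&str) -> Option<String>;

/// Naive tag stripper for the sketch only; real extraction parses the DOM.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    // Collapse runs of whitespace into single spaces.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// "Primary" strategy: only the text inside the first <article> element.
fn article_only(html: &str) -> Option<String> {
    let start = html.find("<article>")? + "<article>".len();
    let end = html[start..].find("</article>")? + start;
    Some(strip_tags(&html[start..end]))
}

/// "Baseline" strategy: all visible text in the document.
fn baseline(html: &str) -> Option<String> {
    let text = strip_tags(html);
    if text.is_empty() { None } else { Some(text) }
}

/// Runs strategies in order and returns the first result with at least
/// `min_chars` characters; otherwise the longest result seen.
fn first_sufficient(html: &str, strategies: &[Strategy], min_chars: usize) -> Option<String> {
    let mut best: Option<String> = None;
    for strategy in strategies {
        if let Some(text) = strategy(html) {
            if text.chars().count() >= min_chars {
                return Some(text);
            }
            if best.as_ref().map_or(true, |b| text.len() > b.len()) {
                best = Some(text);
            }
        }
    }
    best
}

fn main() {
    let strategies: [Strategy; 2] = [article_only, baseline];

    // The <article> is too short, so the baseline result is used instead.
    let thin = "<article><p>Hi</p></article><p>Footer text that is fairly long</p>";
    let text = first_sufficient(thin, &strategies, 10).unwrap();
    assert_eq!(text, "Hi Footer text that is fairly long");

    // A long enough <article> wins without consulting the baseline.
    let rich = "<nav>Menu</nav><article><p>Hello wide world</p></article>";
    assert_eq!(first_sufficient(rich, &strategies, 10).unwrap(), "Hello wide world");
}
```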

§Markdown output (requires markdown feature)

use trafilatura::{extract, create_markdown_document, Options};

let html = "<html><body><article><p>Hello world</p></article></body></html>";
let result = extract(html, &Options::default()).unwrap();

// Just the content as markdown:
let md = result.content_markdown();

// Full document with YAML front matter + content + comments:
let doc = create_markdown_document(&result);

§Related crates

  • libreadability — Mozilla Readability port for extracting a clean article DOM subtree.
  • justext — paragraph-level boilerplate removal using stopword density.
  • html2markdown — converts HTML to Markdown via an intermediate AST.
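
The "stopword density" signal mentioned for justext can be sketched in a few lines. This is an illustration of the general idea, not the justext crate's code: `stopword_density` is a hypothetical function, the stopword list is a tiny made-up sample, and any classification threshold is left to the caller.

```rust
use std::collections::HashSet;

/// Fraction of words in `paragraph` that appear in `stopwords`.
/// Tokens with no alphanumeric characters (such as "|") are ignored.
fn stopword_density(paragraph: &str, stopwords: &HashSet<&str>) -> f64 {
    let mut total = 0usize;
    let mut hits = 0usize;
    for token in paragraph.split_whitespace() {
        // Strip surrounding punctuation and normalize case before lookup.
        let word = token
            .trim_matches(|c: char| !c.is_alphanumeric())
            .to_lowercase();
        if word.is_empty() {
            continue;
        }
        total += 1;
        if stopwords.contains(word.as_str()) {
            hits += 1;
        }
    }
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64
    }
}

fn main() {
    // Tiny illustrative list; real stopword lists are per-language and much larger.
    let stopwords: HashSet<&str> = ["the", "a", "of", "and", "is", "in", "to", "that"]
        .iter()
        .copied()
        .collect();

    let prose = "The author argues that the cost of storage is falling in most regions.";
    let menu = "Home | Products | Pricing | Contact";

    // Running prose is dense in function words; navigation text is not.
    assert!(stopword_density(prose, &stopwords) > 0.3);
    assert_eq!(stopword_density(menu, &stopwords), 0.0);
}
```

Natural-language paragraphs tend to contain many function words, while menus, link lists, and copyright lines contain few, which is why this ratio is a useful boilerplate signal.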

Re-exports§

pub use error::TrafilaturaError;
pub use options::Config;
pub use options::ExtractionFocus;
pub use options::FallbackCandidates;
pub use options::HtmlDateMode;
pub use options::Options;
pub use result::ExtractResult;
pub use result::Metadata;

Modules§

dom
error
metadata
options
result
utils

Functions§

create_readable_document
Creates a complete, self-contained HTML document from an ExtractResult.
extract
Parse an HTML string and extract its main readable content.
extract_document
Extract readable content from an already-parsed Document.