kawat

A Rust library for web content extraction, inspired by trafilatura.

Kawat is Indonesian for "wire". The name follows the same metallurgical metaphor as trafilatura (Italian for "wire drawing"): refining raw HTML into clean, structured text.

Features

Main text extraction with multi-algorithm fallback cascade
Metadata extraction: title, author, date, categories, tags, license
Comment extraction separated from main content
Date extraction via htmldate-rs (standalone crate)
Deduplication at sentence, paragraph, and document level
Multiple output formats: TXT, Markdown, JSON, XML, XML-TEI, CSV, HTML
XPath evaluation on HTML via sxd_html + sxd_xpath
Language detection (optional feature)

Installation

Add to your Cargo.toml:

[dependencies]
kawat = "0.1"

Usage

Basic extraction

use kawat::{extract, ExtractorOptions};

let html = std::fs::read_to_string("page.html")?;
let text = extract(&html, &ExtractorOptions::default())?;
println!("{}", text);

With metadata

use kawat::{bare_extraction, ExtractorOptions};

let html = std::fs::read_to_string("page.html")?;
let mut options = ExtractorOptions::default();
options.with_metadata = true;

let doc = bare_extraction(&html, &options)?;
println!("Title: {}", doc.metadata.title.unwrap_or_default());
println!("Author: {}", doc.metadata.author.unwrap_or_default());
println!("Date: {}", doc.metadata.date.unwrap_or_default());
println!("Body:\n{}", doc.body);

Fetch from URL

use kawat::{fetch_url, extract, ExtractorOptions};

let html = fetch_url("https://example.org/article")?;
let text = extract(&html, &ExtractorOptions::default())?;
println!("{}", text);

Async URL fetching

use kawat::{fetch_url_async, extract, ExtractorOptions};

// `.await` is only valid inside an async function or block.
let html = fetch_url_async("https://example.org/article").await?;
let text = extract(&html, &ExtractorOptions::default())?;
println!("{}", text);

Extraction Cascade

The extraction process follows this pipeline:

HTML → parse → metadata → clean → convert tags → extract comments
  → kawat sequence:
      extract_content (BODY_XPATH, first match)
      → if not fast: compare with readability + justext fallbacks
      → if still short: baseline (JSON-LD → <article> → <p> → body text)
  → size checks → dedup → language filter → output format
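
The fallback idea behind this pipeline can be illustrated in plain Rust. This is only a sketch under stated assumptions, not kawat's implementation: the `Extractor` type, the `cascade` function, the `min_len` threshold, and the two toy extractors are all hypothetical names invented for the example. It tries extractors in order, accepts the first result that is long enough, and otherwise keeps the longest result seen.

```rust
/// Hypothetical extractor signature (not part of kawat's API):
/// returns extracted text, or None if nothing usable was found.
type Extractor = fn(&str) -> Option<String>;

/// Run extractors in order and return the first result with at least
/// `min_len` characters. If none qualifies, return the longest result seen.
fn cascade(html: &str, extractors: &[Extractor], min_len: usize) -> Option<String> {
    let mut best: Option<String> = None;
    for extract in extractors {
        if let Some(text) = extract(html) {
            let len = text.chars().count();
            if len >= min_len {
                return Some(text);
            }
            // Too short: remember it only if it beats the current best.
            let is_longer = best.as_ref().map_or(true, |b| len > b.chars().count());
            if is_longer {
                best = Some(text);
            }
        }
    }
    best
}

/// Toy stand-in for a precise extractor: text inside the first <article>.
fn article_only(html: &str) -> Option<String> {
    let start = html.find("<article>")? + "<article>".len();
    let end = start + html[start..].find("</article>")?;
    let text = html[start..end].trim();
    if text.is_empty() {
        None
    } else {
        Some(text.to_string())
    }
}

/// Toy stand-in for a baseline: drop every tag and keep all remaining text.
fn strip_tags(html: &str) -> Option<String> {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' '); // keep words on either side of a tag apart
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    // Collapse runs of whitespace into single spaces.
    let text = out.split_whitespace().collect::<Vec<_>>().join(" ");
    if text.is_empty() {
        None
    } else {
        Some(text)
    }
}
```

Passing `[article_only, strip_tags]` as the extractor list puts the precise strategy first and the permissive one last, which mirrors the ordering of the pipeline above. The real extractors work on a parsed tree rather than on raw strings.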

Configuration

use kawat::{ExtractorOptions, Focus, OutputFormat};

let options = ExtractorOptions {
    format: OutputFormat::Markdown,
    fast: false,                    // Use fallback algorithms
    focus: Focus::Balanced,         // Balanced precision/recall
    comments: true,                 // Extract comments
    formatting: true,               // Preserve formatting
    links: false,                   // Omit links
    images: false,                  // Omit images
    tables: true,                   // Include tables
    dedup: true,                    // Deduplicate content
    target_language: Some("en".to_string()),
    with_metadata: true,
    ..Default::default()
};

// `html` is the page source loaded earlier (see Basic extraction).
let text = kawat::extract(&html, &options)?;

Feature Flags

language-detection: Enable language filtering via the lingua crate

[dependencies]
kawat = { version = "0.1", features = ["language-detection"] }

Workspace Structure

Crate	Purpose
`kawat`	Public facade, re-exports
`kawat-core`	Extraction cascade orchestrator
`kawat-html`	Tree cleaning, tag normalization
`kawat-xpath`	XPath on HTML (sxd_html + sxd_xpath)
`kawat-extract`	Main content extractor
`kawat-readability`	Readability fallback (dom_smoothie)
`kawat-justext`	Pure Rust justext port
`kawat-metadata`	Title, author, OG, JSON-LD
`kawat-dedup`	Simhash + LRU deduplication
`kawat-output`	Format converters
`htmldate-rs`	Standalone date extraction
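
For readers unfamiliar with the Simhash technique named for `kawat-dedup`, here is a minimal standard-library sketch. It is an illustration under assumptions, not the crate's code: the function names, the whitespace tokenization, the use of `DefaultHasher`, and the distance threshold are all choices made for this example, and the LRU part is not covered.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit Simhash over lowercased, whitespace-separated tokens.
/// Each token votes +1 or -1 on every bit position; the sign of the
/// total decides the corresponding bit of the fingerprint.
fn simhash(text: &str) -> u64 {
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        let mut hasher = DefaultHasher::new();
        token.to_lowercase().hash(&mut hasher);
        let h = hasher.finish();
        for (bit, count) in counts.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *count += 1;
            } else {
                *count -= 1;
            }
        }
    }
    let mut fingerprint = 0u64;
    for (bit, count) in counts.iter().enumerate() {
        if *count > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of bit positions in which two fingerprints differ.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Treat two texts as near-duplicates when their fingerprints differ
/// in at most `max_distance` bits (threshold chosen by the caller).
fn is_near_duplicate(a: &str, b: &str, max_distance: u32) -> bool {
    hamming(simhash(a), simhash(b)) <= max_distance
}
```

Because the fingerprint depends only on the multiset of tokens, texts that differ in case, spacing, or word order hash identically, while texts sharing most of their tokens tend to differ in only a few bits. A production version would more likely weight tokens or hash shingles rather than single words.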

Acknowledgments

This project is a Rust reimplementation inspired by trafilatura by Adrien Barbaresi. The extraction heuristics, XPath expressions, and cascade architecture are derived from trafilatura's published algorithms.

Barbaresi, A. "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction", Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, pp. 122-131.
Barbaresi, A. "htmldate: A Python package to extract publication dates from web pages", JOSS 5(51), 2439, 2020.

License

Apache-2.0

kawat 0.1.2