kawat
A Rust library for web content extraction, inspired by trafilatura.
Kawat is Indonesian for "wire" — the same metallurgical metaphor as trafilatura (Italian for "wire drawing"), symbolizing the refinement of raw HTML into clean, structured text.
Features
- Main text extraction with multi-algorithm fallback cascade
- Metadata extraction: title, author, date, categories, tags, license
- Comment extraction separated from main content
- Date extraction via htmldate-rs (standalone crate)
- Deduplication at sentence, paragraph, and document level
- Multiple output formats: TXT, Markdown, JSON, XML, XML-TEI, CSV, HTML
- XPath evaluation on HTML via sxd_html + sxd_xpath
- Language detection (optional feature)
Installation
Add to your Cargo.toml:
[dependencies]
kawat = "0.1"
Usage
Basic extraction
use ;
let html = read_to_string?;
let text = extract?;
println!;
With metadata
use ;
let html = read_to_string?;
let mut options = default;
options.with_metadata = true;
let doc = bare_extraction?;
println!;
println!;
println!;
println!;
Fetch from URL
use ;
let html = fetch_url?;
let text = extract?;
println!;
Async URL fetching
use ;
let html = fetch_url_async.await?;
let text = extract?;
println!;
Extraction Cascade
The extraction process follows this pipeline:
HTML → parse → metadata → clean → convert tags → extract comments
→ kawat sequence:
extract_content (BODY_XPATH, first match)
→ if not fast: compare with readability + justext fallbacks
→ if still short: baseline (JSON-LD → <article> → <p> → body text)
→ size checks → dedup → language filter → output format
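
The pipeline above can be read as a fallback cascade. The following is a minimal sketch of that control flow only, not kawat's implementation: `Extractor`, `CascadeConfig`, `run_cascade`, and the `min_len` threshold are hypothetical names introduced for illustration, and "keep the longer candidate" is a simplifying stand-in for whatever comparison the library actually performs.

```rust
/// Hypothetical extractor signature for this sketch: HTML in, extracted text out
/// (an empty string means "nothing found").
type Extractor = fn(&str) -> String;

/// Illustrative settings; these are NOT kawat's real option names.
struct CascadeConfig {
    /// When true, skip the comparison with the fallback extractors.
    fast: bool,
    /// Below this many characters the result counts as "still short".
    min_len: usize,
}

/// Run the primary extractor, optionally compare against fallbacks,
/// and fall back to a baseline extractor if the result is still short.
fn run_cascade(
    html: &str,
    primary: Extractor,
    fallbacks: &[Extractor],
    baseline: Extractor,
    cfg: &CascadeConfig,
) -> String {
    // Step 1: primary extraction.
    let mut best = primary(html);

    // Step 2: unless in fast mode, compare with each fallback.
    // Assumption: the longer candidate wins.
    if !cfg.fast {
        for fallback in fallbacks {
            let candidate = fallback(html);
            if candidate.chars().count() > best.chars().count() {
                best = candidate;
            }
        }
    }

    // Step 3: if the result is still short, try the baseline.
    if best.chars().count() < cfg.min_len {
        let rescued = baseline(html);
        if rescued.chars().count() > best.chars().count() {
            best = rescued;
        }
    }

    best
}
```

In this sketch the fallbacks would correspond to the readability and justext stages, and the baseline to the JSON-LD → <article> → <p> → body text step. Size checks, deduplication, and language filtering would run on the returned text afterwards.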
Configuration
use ;
let options = ExtractorOptions ;
let text = extract?;
Features
language-detection: Enable language filtering via the lingua crate
[dependencies]
kawat = { version = "0.1", features = ["language-detection"] }
Workspace Structure
| Crate | Purpose |
|---|---|
| kawat | Public facade, re-exports |
| kawat-core | Extraction cascade orchestrator |
| kawat-html | Tree cleaning, tag normalization |
| kawat-xpath | XPath on HTML (sxd_html + sxd_xpath) |
| kawat-extract | Main content extractor |
| kawat-readability | Readability fallback (dom_smoothie) |
| kawat-justext | Pure Rust justext port |
| kawat-metadata | Title, author, OG, JSON-LD |
| kawat-dedup | Simhash + LRU deduplication |
| kawat-output | Format converters |
| htmldate-rs | Standalone date extraction |
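
To make the "Simhash + LRU deduplication" entry for kawat-dedup more concrete, here is a minimal sketch of the general technique using only the standard library. It does not reflect the crate's actual code: `simhash`, `hamming_distance`, and `RecentFingerprints` are hypothetical names, tokens are plain whitespace-separated words, and the bounded cache evicts the oldest entry (FIFO), which is a simpler stand-in for a true LRU policy.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::VecDeque;
use std::hash::{Hash, Hasher};

/// Compute a 64-bit Simhash fingerprint over lowercased whitespace tokens.
fn simhash(text: &str) -> u64 {
    let mut weights = [0i32; 64];
    for token in text.split_whitespace() {
        let mut hasher = DefaultHasher::new();
        token.to_lowercase().hash(&mut hasher);
        let h = hasher.finish();
        // Each token votes +1 or -1 on every bit position.
        for (bit, weight) in weights.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *weight += 1;
            } else {
                *weight -= 1;
            }
        }
    }
    // A bit is set in the fingerprint when its votes sum to a positive value.
    let mut fingerprint = 0u64;
    for (bit, weight) in weights.iter().enumerate() {
        if *weight > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of differing bits between two fingerprints.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Bounded memory of recent fingerprints (oldest-first eviction, not true LRU).
struct RecentFingerprints {
    capacity: usize,
    seen: VecDeque<u64>,
}

impl RecentFingerprints {
    fn new(capacity: usize) -> Self {
        Self { capacity, seen: VecDeque::with_capacity(capacity) }
    }

    /// Returns true if `text` is within `max_distance` bits of a remembered
    /// fingerprint; otherwise remembers it and returns false.
    fn check_and_insert(&mut self, text: &str, max_distance: u32) -> bool {
        let fp = simhash(text);
        if self.seen.iter().any(|&old| hamming_distance(old, fp) <= max_distance) {
            return true;
        }
        if self.capacity == 0 {
            return false;
        }
        if self.seen.len() >= self.capacity {
            self.seen.pop_front();
        }
        self.seen.push_back(fp);
        false
    }
}
```

A caller could feed each sentence, paragraph, or whole document through `check_and_insert` and drop the segments for which it returns true. The distance threshold trades recall against false positives and would need tuning on real data.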
Acknowledgments
This project is a Rust reimplementation inspired by trafilatura by Adrien Barbaresi. The extraction heuristics, XPath expressions, and cascade architecture are derived from trafilatura's published algorithms.
- Barbaresi, A. "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction", Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, pp. 122-131.
- Barbaresi, A. "htmldate: A Python package to extract publication dates from web pages", JOSS 5(51), 2439, 2020.
License
Apache-2.0