§kawat
A Rust library for web content extraction, inspired by trafilatura.
Extracts main text, metadata, and comments from HTML documents with a multi-algorithm fallback cascade.
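To illustrate what a fallback cascade means in general, here is a minimal standard-library sketch. It is not kawat's implementation: the `Strategy` type, the `between`, `article_tag`, `body_tag`, `default_cascade`, and `extract_with_fallback` names, and the naive tag matching are all hypothetical and chosen only for illustration. The idea is to try extractors in priority order and accept the first result that passes a quality check (here, a minimum length).

```rust
/// Hypothetical strategy signature: returns `Some(text)` when it finds content.
type Strategy = fn(&str) -> Option<String>;

/// Return the trimmed text between the first `open` and the next `close`.
/// Naive string search for illustration only; a real extractor parses the DOM.
fn between<'a>(html: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = html.find(open)? + open.len();
    let end = start + html[start..].find(close)?;
    Some(html[start..end].trim())
}

/// Hypothetical preferred strategy: contents of the first `<article>` element.
fn article_tag(html: &str) -> Option<String> {
    between(html, "<article>", "</article>").map(str::to_string)
}

/// Hypothetical last-resort strategy: contents of `<body>`.
fn body_tag(html: &str) -> Option<String> {
    between(html, "<body>", "</body>").map(str::to_string)
}

/// Strategies in priority order, most precise first.
fn default_cascade() -> Vec<(&'static str, Strategy)> {
    vec![
        ("article", article_tag as Strategy),
        ("body", body_tag as Strategy),
    ]
}

/// Try each strategy in order and return the name and output of the first
/// one whose result has at least `min_len` characters.
fn extract_with_fallback(
    html: &str,
    strategies: &[(&'static str, Strategy)],
    min_len: usize,
) -> Option<(&'static str, String)> {
    strategies.iter().find_map(|(name, strategy)| {
        strategy(html)
            .filter(|text| text.chars().count() >= min_len)
            .map(|text| (*name, text))
    })
}
```

Calling `extract_with_fallback(html, &default_cascade(), 5)` on a page with an `<article>` element returns the `"article"` result; on a page without one, the cascade falls through to `"body"`, and it returns `None` when no strategy produces enough text.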
§Usage
use kawat::{extract, fetch_url, ExtractorOptions};
// From URL
let html = fetch_url("https://example.org/article").unwrap();
let text = extract(&html, &ExtractorOptions::default()).unwrap();
// With options
let options = ExtractorOptions {
    with_metadata: true,
    ..Default::default()
};
let text = extract(&html, &options).unwrap();
§Name
Kawat is Indonesian for “wire” — the same metallurgical metaphor as trafilatura (Italian for “wire drawing”), symbolizing the refinement of raw HTML into clean, structured text.
Re-exports§
pub use htmldate_rs;
Structs§
- Document
- A fully extracted document with text, metadata, and comments.
- ExtractorOptions
- Complete extraction configuration. Equivalent to trafilatura’s Extractor class.
Enums§
- ExtractionError
- OutputFormat
- Supported output formats.
Functions§
- bare_extraction
- Extract content from an HTML document.
- extract
- Extract and format content, equivalent to trafilatura’s extract().
- fetch_url
- Fetch a URL and return the HTML content.
- fetch_url_async
- Async version of fetch_url.