Crate kawat

§kawat

A Rust library for web content extraction, inspired by trafilatura.

Extracts main text, metadata, and comments from HTML documents with a multi-algorithm fallback cascade.
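The idea of a fallback cascade can be illustrated with a minimal, self-contained sketch. It does not use kawat and does not reproduce kawat's actual algorithms: the `Strategy` type, the two strategies, the naive `strip_tags` helper, and the `min_len` threshold are all hypothetical names chosen for illustration. The sketch only shows the general pattern of trying extractors in order and accepting the first result that looks good enough.

```rust
/// Hypothetical extraction strategy: returns the text it found, if any.
type Strategy = fn(&str) -> Option<String>;

/// Naive tag stripper for illustration only; a real extractor would use an HTML parser.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' '); // keep words on either side of a tag apart
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    // Collapse runs of whitespace into single spaces.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical strategy 1: use the contents of the first `<article>` element.
fn article_tag(html: &str) -> Option<String> {
    let start = html.find("<article>")? + "<article>".len();
    let end = start + html[start..].find("</article>")?;
    Some(strip_tags(&html[start..end]))
}

/// Hypothetical strategy 2 (last resort): strip every tag from the whole document.
fn whole_document(html: &str) -> Option<String> {
    let text = strip_tags(html);
    if text.is_empty() { None } else { Some(text) }
}

/// Try each strategy in order and return the first result with at least
/// `min_len` characters. Iterators are lazy, so later strategies are not
/// run once an earlier one has succeeded.
fn extract_with_fallback(html: &str, strategies: &[Strategy], min_len: usize) -> Option<String> {
    strategies
        .iter()
        .filter_map(|strategy| strategy(html))
        .find(|text| text.chars().count() >= min_len)
}
```

Calling `extract_with_fallback` with `[article_tag, whole_document]` prefers the `<article>` contents and falls back to the whole document only when the first result is missing or shorter than `min_len`.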

§Usage

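use kawat::{extract, fetch_url, ExtractorOptions};

// Fetch the page. `unwrap` panics on failure; handle the error in real code.
let html = fetch_url("https://example.org/article").unwrap();

// Extract with the default options.
let text = extract(&html, &ExtractorOptions::default()).unwrap();

// Extract again with metadata enabled, keeping all other options at their defaults.
let options = ExtractorOptions {
    with_metadata: true,
    ..Default::default()
};
let text_with_metadata = extract(&html, &options).unwrap();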

§Name

Kawat is Indonesian for “wire”. The name follows the same metallurgical metaphor as trafilatura (Italian for “wire drawing”): refining raw HTML into clean, structured text.

Re-exports§

pub use htmldate_rs;

Structs§

Document
A fully extracted document with text, metadata, and comments.
ExtractorOptions
Complete extraction configuration. Equivalent to trafilatura’s Extractor class.

Enums§

ExtractionError
OutputFormat
Supported output formats.

Functions§

bare_extraction
Extract content from an HTML document.
extract
Extract and format content, equivalent to trafilatura’s extract().
fetch_url
Fetch a URL and return the HTML content.
fetch_url_async
Async version of fetch_url.