kawat-core 0.1.1

Core extraction cascade orchestrator for kawat

kawat-core

Core extraction orchestrator for the kawat web content extraction library.

Implements the full trafilatura extraction cascade with multi-algorithm fallback:

HTML parsing & metadata extraction
Tree cleaning & tag normalization
Comment extraction
Content extraction (BODY_XPATH → readability → justext → baseline)
Size checks & deduplication
Language filtering & output formatting
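The fallback in the content extraction step can be pictured as a list of extractors tried in order, where the first sufficiently long result wins. The following is a minimal sketch of that idea using only the standard library. It is not the kawat-core API: `Extractor`, `run_cascade`, `xpath_stage`, `baseline_stage`, and the `min_len` threshold are all hypothetical names introduced for illustration, and the two stages are toy stand-ins for the real algorithms.

```rust
/// Hypothetical stage signature: returns extracted text, or None on failure.
type Extractor = fn(&str) -> Option<String>;

/// Try each stage in order and return the name and text of the first stage
/// whose trimmed output has at least `min_len` characters.
fn run_cascade(
    html: &str,
    stages: &[(&'static str, Extractor)],
    min_len: usize,
) -> Option<(&'static str, String)> {
    stages.iter().find_map(|(name, extract)| {
        extract(html)
            .filter(|text| text.trim().chars().count() >= min_len)
            .map(|text| (*name, text))
    })
}

/// Toy stand-in for a targeted stage: takes the text inside <article>.
fn xpath_stage(html: &str) -> Option<String> {
    let start = html.find("<article>")? + "<article>".len();
    let end = start + html[start..].find("</article>")?;
    Some(html[start..end].to_string())
}

/// Toy stand-in for the baseline stage: strips tags and keeps all text.
fn baseline_stage(html: &str) -> Option<String> {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' '); // keep words on either side of a tag apart
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    let text = out.split_whitespace().collect::<Vec<_>>().join(" ");
    if text.is_empty() { None } else { Some(text) }
}
```

Calling `run_cascade` with the stages ordered from most to least precise gives the fallback behavior: a page without an `<article>` element, or with one that is too short, falls through to the baseline stage.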

Features

Extraction cascade: Multi-algorithm fallback for robust content extraction
Configurable focus modes: Balanced, Precision, or Recall
Metadata support: Title, author, date, URL, categories, tags, license
Comment extraction: Separate user comments from main content
Deduplication: Simhash + LRU cache for duplicate detection
Language detection: Optional language filtering via the lingua crate
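To illustrate how the Simhash and LRU cache pairing can detect duplicates, here is a small standard-library sketch. It is an assumption-laden illustration, not the crate's implementation: `simhash`, `DedupCache`, `is_duplicate`, and the `max_distance` parameter are hypothetical names, tokens are plain whitespace-separated words, and `DefaultHasher` stands in for whatever hash function the crate actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::VecDeque;
use std::hash::{Hash, Hasher};

/// 64-bit Simhash over lowercased whitespace tokens: each token hash votes
/// +1 or -1 per bit position, and the sign of each tally sets the output bit.
fn simhash(text: &str) -> u64 {
    let mut counts = [0i32; 64];
    for token in text.split_whitespace() {
        let mut hasher = DefaultHasher::new();
        token.to_lowercase().hash(&mut hasher);
        let bits = hasher.finish();
        for (i, count) in counts.iter_mut().enumerate() {
            if (bits >> i) & 1 == 1 { *count += 1 } else { *count -= 1 }
        }
    }
    counts
        .iter()
        .enumerate()
        .fold(0u64, |acc, (i, &c)| if c > 0 { acc | (1u64 << i) } else { acc })
}

/// Small LRU of recent fingerprints (front = most recently used).
struct DedupCache {
    capacity: usize,
    max_distance: u32, // Hamming distance at or below which texts count as duplicates
    recent: VecDeque<u64>,
}

impl DedupCache {
    fn new(capacity: usize, max_distance: u32) -> Self {
        Self { capacity, max_distance, recent: VecDeque::new() }
    }

    /// Returns true if a near-duplicate fingerprint is cached; otherwise
    /// records this text and evicts the least recently used entry if full.
    fn is_duplicate(&mut self, text: &str) -> bool {
        let fp = simhash(text);
        let max = self.max_distance;
        let pos = self.recent.iter().position(|&seen| (seen ^ fp).count_ones() <= max);
        if let Some(pos) = pos {
            // Move the matching fingerprint to the front to mark it as recent.
            if let Some(hit) = self.recent.remove(pos) {
                self.recent.push_front(hit);
            }
            return true;
        }
        self.recent.push_front(fp);
        self.recent.truncate(self.capacity);
        false
    }
}
```

The linear scan keeps the sketch short and is reasonable only for small caches. Because the fingerprint is built from per-token votes, it ignores word order and letter case, which is what lets lightly edited copies land within a small Hamming distance of each other.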

License

Apache-2.0