kawat-core
Core extraction orchestrator for the kawat web content extraction library.
Implements the full trafilatura extraction cascade with multi-algorithm fallback:
- HTML parsing & metadata extraction
- Tree cleaning & tag normalization
- Comment extraction
- Content extraction (BODY_XPATH → readability → justext → baseline)
- Size checks & deduplication
- Language filtering & output formatting
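The fallback order in the content extraction step can be pictured as a list of stages tried in sequence until one yields enough text. The following is only a minimal sketch of that idea, not kawat-core's actual API: `Extractor`, `run_cascade`, `strict_stage`, `baseline_stage`, and the `min_len` threshold are all hypothetical names introduced for illustration, and the two stages are toy stand-ins for the real algorithms.

```rust
/// Hypothetical stage signature: returns extracted text, or None on failure.
type Extractor = fn(&str) -> Option<String>;

/// Tries each stage in order and returns the name and output of the first
/// stage whose text is at least `min_len` characters long.
fn run_cascade<'a>(
    html: &str,
    stages: &[(&'a str, Extractor)],
    min_len: usize,
) -> Option<(&'a str, String)> {
    for (name, extract) in stages {
        if let Some(text) = extract(html) {
            if text.chars().count() >= min_len {
                return Some((*name, text));
            }
        }
    }
    None
}

/// Toy "strict" stage: succeeds only when an <article> element is present.
fn strict_stage(html: &str) -> Option<String> {
    let start = html.find("<article>")? + "<article>".len();
    let end = html[start..].find("</article>")? + start;
    Some(html[start..end].trim().to_string())
}

/// Toy "baseline" stage: drops everything between '<' and '>' and
/// collapses whitespace in what remains.
fn baseline_stage(html: &str) -> Option<String> {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    let text = out.split_whitespace().collect::<Vec<_>>().join(" ");
    if text.is_empty() { None } else { Some(text) }
}
```

Calling `run_cascade` with the stages ordered from strictest to most permissive returns the output of the first stage that passes the size check, which mirrors how a later fallback only runs when the earlier stages come up short.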
Features
- Extraction cascade: Multi-algorithm fallback for robust content extraction
- Configurable focus modes: Balanced, Precision, or Recall
- Metadata support: Title, author, date, URL, categories, tags, license
- Comment extraction: Extracts user comments separately from the main content
- Deduplication: Simhash + LRU cache for duplicate detection
- Language detection: Optional language filtering via the lingua crate
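To illustrate the deduplication feature, here is a rough sketch of how a Simhash fingerprint can be combined with a bounded recency cache. It is an assumption-laden illustration, not the crate's implementation: `simhash`, `hamming`, `RecentFingerprints`, and `check_and_insert` are hypothetical names, tokens are plain whitespace-separated words, and a linearly scanned `VecDeque` stands in for a real LRU cache.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::VecDeque;
use std::hash::{Hash, Hasher};

/// 64-bit Simhash over lowercased whitespace tokens: every token hash
/// votes +1 or -1 on each bit, and the sign of the total sets that bit.
fn simhash(text: &str) -> u64 {
    let mut votes = [0i32; 64];
    for token in text.split_whitespace() {
        let mut hasher = DefaultHasher::new();
        token.to_lowercase().hash(&mut hasher);
        let h = hasher.finish();
        for (bit, vote) in votes.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *vote += 1;
            } else {
                *vote -= 1;
            }
        }
    }
    votes
        .iter()
        .enumerate()
        .fold(0u64, |acc, (bit, &v)| if v > 0 { acc | (1u64 << bit) } else { acc })
}

/// Number of differing bits between two fingerprints.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Bounded list of recent fingerprints, least recently used at the front.
struct RecentFingerprints {
    capacity: usize,
    max_distance: u32,
    seen: VecDeque<u64>,
}

impl RecentFingerprints {
    fn new(capacity: usize, max_distance: u32) -> Self {
        // Keep at least one slot so eviction logic stays simple.
        let capacity = capacity.max(1);
        Self { capacity, max_distance, seen: VecDeque::with_capacity(capacity) }
    }

    /// Returns true if `text` is within `max_distance` bits of a cached
    /// fingerprint; otherwise caches it, evicting the oldest entry if full.
    fn check_and_insert(&mut self, text: &str) -> bool {
        let fp = simhash(text);
        let hit = self
            .seen
            .iter()
            .position(|&old| hamming(old, fp) <= self.max_distance);
        if let Some(pos) = hit {
            // Refresh recency by moving the matching entry to the back.
            if let Some(old) = self.seen.remove(pos) {
                self.seen.push_back(old);
            }
            return true;
        }
        if self.seen.len() == self.capacity {
            self.seen.pop_front();
        }
        self.seen.push_back(fp);
        false
    }
}
```

A small `max_distance` treats only near-identical texts as duplicates, while a larger value trades precision for recall. The linear scan is fine for a sketch but would likely be replaced by an indexed structure for large caches.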
License
Apache-2.0