kawat-core 0.1.1

Core extraction cascade orchestrator for kawat
kawat-core

Core extraction orchestrator for the kawat web content extraction library.

Implements the full trafilatura extraction cascade with multi-algorithm fallback:

  1. HTML parsing & metadata extraction
  2. Tree cleaning & tag normalization
  3. Comment extraction
  4. Content extraction (BODY_XPATH → readability → justext → baseline)
  5. Size checks & deduplication
  6. Language filtering & output formatting
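The fallback idea in step 4 can be sketched in a few lines of Rust. This is only an illustration of the pattern, not kawat-core's API: `Extractor`, `run_cascade`, `article_only`, `strip_tags`, and the `min_len` threshold are all hypothetical names, and the two stand-in stages are far simpler than the real BODY_XPATH, readability, justext, and baseline extractors.

```rust
/// Hypothetical extractor signature: takes raw HTML, returns text if any was found.
type Extractor = fn(&str) -> Option<String>;

/// Try each stage in order and return the name and output of the first stage
/// whose trimmed text is at least `min_len` characters long.
fn run_cascade(
    html: &str,
    stages: &[(&'static str, Extractor)],
    min_len: usize,
) -> Option<(&'static str, String)> {
    for &(name, extract) in stages {
        if let Some(text) = extract(html) {
            let text = text.trim().to_string();
            if text.chars().count() >= min_len {
                return Some((name, text));
            }
        }
    }
    None
}

/// Stand-in for a precise stage: only accepts content inside <article>.
fn article_only(html: &str) -> Option<String> {
    let start = html.find("<article>")? + "<article>".len();
    let end = html[start..].find("</article>")? + start;
    Some(html[start..end].to_string())
}

/// Stand-in for a baseline stage: drops every tag and keeps the remaining text.
fn strip_tags(html: &str) -> Option<String> {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    Some(out)
}
```

Ordering the stages from most to least precise means the looser extractors run only when the stricter ones return nothing or too little text, which is the role the size check plays in this sketch.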

Features

  • Extraction cascade: Multi-algorithm fallback for robust content extraction
  • Configurable focus modes: Balanced, Precision, or Recall
  • Metadata support: Title, author, date, URL, categories, tags, license
  • Comment extraction: Separate user comments from main content
  • Deduplication: Simhash + LRU cache for duplicate detection
  • Language detection: Optional language filtering (lingua crate)
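
As a rough illustration of the deduplication feature, the following Rust sketch computes a 64-bit Simhash fingerprint over whitespace-separated tokens and compares two fingerprints by Hamming distance. It is a sketch under stated assumptions rather than kawat-core's implementation: the function names, the tokenization, the use of the standard library's `DefaultHasher`, and the `max_distance` threshold are all hypothetical, and the LRU cache that would hold recent fingerprints is omitted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one token to 64 bits (assumption: any stable 64-bit hash would do).
fn hash_token(token: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    token.hash(&mut hasher);
    hasher.finish()
}

/// Simhash: every token votes +1 or -1 on each of the 64 bit positions;
/// the fingerprint keeps the bits whose total vote is positive.
fn simhash(text: &str) -> u64 {
    let mut weights = [0i32; 64];
    for token in text.split_whitespace() {
        let h = hash_token(&token.to_lowercase());
        for (bit, w) in weights.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *w += 1;
            } else {
                *w -= 1;
            }
        }
    }
    let mut fingerprint = 0u64;
    for (bit, w) in weights.iter().enumerate() {
        if *w > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of bit positions in which two fingerprints differ.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Treat two texts as near-duplicates when their fingerprints differ
/// in at most `max_distance` bits.
fn is_near_duplicate(a: &str, b: &str, max_distance: u32) -> bool {
    hamming(simhash(a), simhash(b)) <= max_distance
}
```

Because similar texts share most tokens, their fingerprints tend to differ in only a few bits, so a small `max_distance` catches near-duplicates that an exact hash comparison would miss.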

License

Apache-2.0