Document ingestion module for converting human-readable documents into LLM-optimized structured formats.
This module provides:
- Type system: Document, Section, ContentBlock for representing document structure
- Parsers: Format-specific parsers (Markdown, HTML, plain text, CSV, DOCX, PDF)
- Distillation: Content compression pipeline that removes filler and optimizes for LLM attention
- Output: Document-specific formatters for Claude (XML), GPT (Markdown), agents (JSON)
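To make the type system concrete, here is a minimal sketch of how a Document / Section / ContentBlock hierarchy could be modeled. Only the three type names come from this module; every field, variant, and the full_text helper are assumptions made for illustration, and the real definitions in types may differ.

```rust
// Illustrative only: field names, variants, and `full_text` are hypothetical,
// not the definitions exported by this module's `types`.

/// A leaf unit of content inside a section.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentBlock {
    Paragraph(String),
    Code { language: Option<String>, source: String },
}

/// A headed group of content blocks.
#[derive(Debug, Clone, PartialEq)]
pub struct Section {
    pub heading: String,
    pub blocks: Vec<ContentBlock>,
}

/// A whole document: a title plus ordered sections.
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub title: String,
    pub sections: Vec<Section>,
}

impl Document {
    /// Concatenate the title, headings, and block text, one item per line.
    pub fn full_text(&self) -> String {
        let mut lines = vec![self.title.clone()];
        for section in &self.sections {
            lines.push(section.heading.clone());
            for block in &section.blocks {
                match block {
                    ContentBlock::Paragraph(text) => lines.push(text.clone()),
                    ContentBlock::Code { source, .. } => lines.push(source.clone()),
                }
            }
        }
        lines.join("\n")
    }
}
```

A flat text view like this is the kind of input a token counter or distillation pass would plausibly consume, which is why the sketch includes it.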
Re-exports§
pub use types::*;
Modules§
- chunking
- Document chunking for multi-turn LLM conversations.
- distillation
- Content distillation pipeline for LLM attention and token optimization.
- output
- Document-specific output formatters for LLM consumption.
- parsers
- Format-specific document parsers.
- pii
- PII (Personally Identifiable Information) detection for documents.
- types
- Core type definitions for document ingestion.
Structs§
- ParseOptions
- Options for document parsing.
Functions§
- count_document_tokens
- Count tokens for a document’s full text content across all model families.
- count_output_tokens
- Count tokens for formatted output text across all model families.
- parse_content
- Parse document content from a string with a known format.
- parse_document
- Parse a document from a file path, auto-detecting the format.
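The page does not say how parse_document detects the format. One common approach is to dispatch on the file extension, sketched below. The DocFormat enum and detect_format function are hypothetical names introduced for this example and are not part of the module's documented API; the actual implementation may inspect file contents instead.

```rust
use std::path::Path;

/// Hypothetical format tag covering the parsers listed for this module.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFormat {
    Markdown,
    Html,
    PlainText,
    Csv,
    Docx,
    Pdf,
}

/// Guess the format from the file extension, ignoring case.
/// Returns `None` when there is no extension or it is not recognized.
pub fn detect_format(path: &Path) -> Option<DocFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "md" | "markdown" => Some(DocFormat::Markdown),
        "html" | "htm" => Some(DocFormat::Html),
        "txt" => Some(DocFormat::PlainText),
        "csv" => Some(DocFormat::Csv),
        "docx" => Some(DocFormat::Docx),
        "pdf" => Some(DocFormat::Pdf),
        _ => None,
    }
}
```

Returning an Option leaves the fallback policy to the caller: an unknown extension could be treated as plain text, rejected with an error, or passed to content sniffing.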