Module document

Document ingestion module for converting human-readable documents into LLM-optimized structured formats.

This module provides:

  • Type system: Document, Section, ContentBlock for representing document structure
  • Parsers: Format-specific parsers (Markdown, HTML, plain text, CSV, DOCX, PDF)
  • Distillation: Content compression pipeline that removes filler and optimizes for LLM attention
  • Output: Document-specific formatters for Claude (XML), GPT (Markdown), and agents (JSON)
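
To make the type system in the list above more concrete, here is a minimal sketch of how Document, Section, and ContentBlock might nest. Only the three type names come from this page; every field, variant, and the `full_text` helper are assumptions made for illustration, and the actual definitions in `types` may look quite different.

```rust
// Hypothetical shapes for illustration only. The crate's real
// definitions live in `types` and may differ in fields and variants.

/// One unit of content inside a section (assumed variants).
#[derive(Debug, Clone, PartialEq)]
pub enum ContentBlock {
    Paragraph(String),
    Code { language: Option<String>, source: String },
    List(Vec<String>),
}

/// A heading plus its content and any nested subsections (assumed fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Section {
    pub heading: String,
    pub level: u8,
    pub blocks: Vec<ContentBlock>,
    pub children: Vec<Section>,
}

/// The parsed document as a tree of sections (assumed fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub title: Option<String>,
    pub sections: Vec<Section>,
}

impl Section {
    /// Concatenate the heading, block text, and nested sections,
    /// one item per line, in document order.
    pub fn full_text(&self) -> String {
        let mut parts = vec![self.heading.clone()];
        for block in &self.blocks {
            match block {
                ContentBlock::Paragraph(text) => parts.push(text.clone()),
                ContentBlock::Code { source, .. } => parts.push(source.clone()),
                ContentBlock::List(items) => parts.extend(items.iter().cloned()),
            }
        }
        for child in &self.children {
            parts.push(child.full_text());
        }
        parts.join("\n")
    }
}

impl Document {
    /// Flatten all top-level sections into a single string.
    pub fn full_text(&self) -> String {
        self.sections
            .iter()
            .map(Section::full_text)
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

A flattened text view like the hypothetical `full_text` is the kind of input a token counter or distillation pass would plausibly consume, which is why the sketch includes it.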

Re-exports§

pub use types::*;

Modules§

chunking
Document chunking for multi-turn LLM conversations.
distillation
Content distillation pipeline for LLM attention and token optimization.
output
Document-specific output formatters for LLM consumption.
parsers
Format-specific document parsers.
pii
PII (Personally Identifiable Information) detection for documents.
types
Core type definitions for document ingestion.

Structs§

ParseOptions
Options for document parsing.

Functions§

count_document_tokens
Count tokens for a document’s full text content across all model families.
count_output_tokens
Count tokens for formatted output text across all model families.
parse_content
Parse document content from a string with a known format.
parse_document
Parse a document from a file path, auto-detecting the format.
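
This page does not say how `parse_document` detects the format. One common approach is to map the file extension to a format, sketched below for the formats listed under Parsers. The `DocFormat` enum and `detect_format` function are hypothetical names introduced for this example and are not part of the documented API; the real implementation may also inspect file contents.

```rust
use std::path::Path;

/// Hypothetical format tag covering the parser formats named on this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFormat {
    Markdown,
    Html,
    PlainText,
    Csv,
    Docx,
    Pdf,
}

/// Guess the format from the file extension, case-insensitively.
/// Returns `None` when there is no extension or it is not recognized.
pub fn detect_format(path: &Path) -> Option<DocFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "md" | "markdown" => Some(DocFormat::Markdown),
        "html" | "htm" => Some(DocFormat::Html),
        "txt" => Some(DocFormat::PlainText),
        "csv" => Some(DocFormat::Csv),
        "docx" => Some(DocFormat::Docx),
        "pdf" => Some(DocFormat::Pdf),
        _ => None,
    }
}
```

Returning `Option` rather than defaulting to plain text leaves the fallback decision to the caller, who can then use `parse_content` with an explicitly chosen format when detection fails.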