Text processing and chunking
Modules
- analysis
- Text analysis utilities for document structure detection
- boundary_detection
- Semantic boundary detection for boundary-aware chunking (BAR-RAG)
- chunk_enricher
- Chunk enrichment pipeline
- chunking
- Text chunking utilities
- chunking_strategies
- Trait-based chunking strategy implementations
- contextual_enricher
- LLM-based contextual chunk enrichment (Anthropic Contextual Retrieval pattern)
- document_structure
- Document structure representation for hierarchical parsing
- extractive_summarizer
- Extractive summarization with sentence ranking
- keyword_extraction
- TF-IDF keyword extraction
- late_chunking
- Late chunking for context-preserving embeddings in RAG (Jina AI technique)
- layout_parser
- Layout parser trait and factory for document structure detection
- parsers
- Document layout parsers
- semantic_chunking
- Semantic chunking for RAG, based on embedding similarity
- semantic_coherence
- Semantic coherence scoring for boundary-aware chunking (BAR-RAG)
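The `semantic_chunking` module is described as splitting text based on embedding similarity. As a rough illustration of that general idea only, the sketch below starts a new chunk wherever the cosine similarity between consecutive sentence embeddings drops below a threshold. The function names `cosine_similarity` and `split_by_similarity`, the `threshold` parameter, and the use of plain `Vec<f32>` embeddings are all assumptions made for this example; they are not taken from this crate's API, and the crate's actual breakpoint logic may differ.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
/// (Hypothetical helper, not part of this crate.)
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Groups sentence indices into chunks. A new chunk starts whenever the
/// similarity between a sentence and its predecessor falls below `threshold`.
/// (Hypothetical function; the threshold rule is an assumption.)
fn split_by_similarity(embeddings: &[Vec<f32>], threshold: f32) -> Vec<Vec<usize>> {
    let mut chunks: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    for i in 0..embeddings.len() {
        // Compare each sentence with the one before it; close the current
        // chunk when the two are too dissimilar.
        if i > 0 && cosine_similarity(&embeddings[i - 1], &embeddings[i]) < threshold {
            chunks.push(std::mem::take(&mut current));
        }
        current.push(i);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Given one embedding per sentence, `split_by_similarity(&embeddings, 0.5)` returns groups of sentence indices, which a caller could map back to the sentence text to assemble chunks.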
Structs
- Boundary
- Represents a detected boundary in text
- BoundaryAwareChunkingStrategy
- Boundary-Aware Chunking Strategy (BAR-RAG)
- BoundaryDetectionConfig
- Configuration for boundary detection
- BoundaryDetector
- Boundary detector for semantic text segmentation
- ChunkEnricher
- Chunk enricher that adds semantic metadata to text chunks
- CoherenceConfig
- Configuration for semantic coherence scoring
- ContextualEnricher
- LLM-based contextual chunk enricher (Anthropic Contextual Retrieval pattern)
- ContextualEnricherConfig
- Configuration for contextual chunk enrichment
- DocumentStructure
- Complete document structure with headings and sections
- EnrichmentStatistics
- Statistics about chunk enrichment
- ExtractiveSummarizer
- Extractive summarizer using sentence scoring
- Heading
- A heading in a document (e.g., chapter, section, subsection)
- HeadingHierarchy
- Hierarchical structure of a document
- HierarchicalChunkingStrategy
- Hierarchical chunking strategy wrapper
- JinaLateChunkingClient
- Jina AI embeddings client with native late chunking support
- LanguageDetector
- Language detection utilities
- LateChunkingConfig
- Configuration for the late chunking strategy
- LateChunkingStrategy
- Context-aware chunking strategy for use with late-chunking embedding models
- LayoutParserFactory
- Factory for creating layout parsers based on document type
- OptimalSplit
- Result of split-point optimization
- ScoredChunk
- Represents a candidate chunk with coherence score
- Section
- A section in a document, defined by a heading and its content range
- SectionNumber
- Parsed section number with format information
- SemanticChunk
- Chunk of semantically similar sentences
- SemanticChunker
- Semantic text chunker that splits based on embedding similarity
- SemanticChunkerConfig
- Configuration for semantic chunking
- SemanticChunkingStrategy
- Semantic chunking strategy wrapper
- SemanticCoherenceScorer
- Semantic coherence scorer using sentence embeddings
- StructureStatistics
- Statistics about document structure
- TextAnalyzer
- Text analyzer for structural analysis
- TextProcessor
- Text processing utilities for chunking and preprocessing
- TextStats
- Text statistics
- TfIdfKeywordExtractor
- TF-IDF based keyword extractor
Enums
- BoundaryType
- Type of boundary detected
- BreakpointStrategy
- Strategy for determining chunk breakpoints
- SectionNumberFormat
- Section numbering format (e.g., “1.2.3”, “Chapter 1”, “I.A.1”)
Traits
- LayoutParser
- Trait for document layout parsers