Late Chunking: context-preserving embeddings for RAG (Jina AI technique)
Standard RAG embeds each chunk in isolation, losing cross-chunk context. Late Chunking (Jina AI, 2024) fixes this by encoding the whole document first, then extracting per-chunk embeddings via span pooling:
```text
Standard: chunk₁ → embed₁   chunk₂ → embed₂              (context-blind)
Late:     [chunk₁ | chunk₂ | …] → model → pool spans → embed₁, embed₂
```

Each chunk’s embedding “sees” the entire document during the attention pass, which yields roughly 5-10% better retrieval accuracy than standard chunking.
§Two usage modes
- `LateChunkingStrategy` is a `ChunkingStrategy` that splits text and records precise byte spans. Use this when you will pass the chunks to a late-chunking-aware embedding provider separately.
- `JinaLateChunkingClient` calls the Jina embeddings API with `late_chunking=true` to get document-context-aware embeddings directly.
§Model context limits
| Model | Max tokens | Notes |
|---|---|---|
| Jina v3 (default) | 8,192 | Good for most documents |
| gte-Qwen2-7B-instruct | 32,768 | Better quality, needs more GPU |
For documents exceeding the limit, use `LateChunkingStrategy::split_into_sections` to pre-divide the document and apply late chunking section by section.
Structs§

- `JinaLateChunkingClient`: Jina AI embeddings client with native late chunking support
- `LateChunkingConfig`: configuration for the late chunking strategy
- `LateChunkingStrategy`: context-aware chunking strategy for use with late-chunking embedding models