Module late_chunking

Late Chunking: context-preserving embeddings for RAG (Jina AI technique)

Standard RAG embeds each chunk in isolation, losing cross-chunk context. Late Chunking (Jina AI, 2024) fixes this by encoding the whole document first and then extracting per-chunk embeddings via span pooling:

Standard:  chunk₁ → embed₁   chunk₂ → embed₂   (context-blind)
Late:       [chunk₁ | chunk₂ | …] → model → pool spans → embed₁, embed₂

Each chunk’s embedding “sees” the entire document during the attention pass, which is reported to improve retrieval accuracy by roughly 5-10% over standard chunking.
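The span-pooling step can be sketched as follows. This is a minimal illustration, not this module's implementation: it assumes mean pooling, assumes the model has already produced one contextual embedding per token for the whole document, and uses a hypothetical function name (`pool_spans`) with half-open token ranges.

```rust
/// Hypothetical helper: mean-pool token embeddings over each chunk's token span.
///
/// `token_embeddings[i]` is the contextual embedding of token `i`, taken from a
/// single forward pass over the whole document. `spans` holds half-open
/// `(start, end)` token ranges, one per chunk.
fn pool_spans(token_embeddings: &[Vec<f32>], spans: &[(usize, usize)]) -> Vec<Vec<f32>> {
    let dim = token_embeddings.first().map_or(0, |v| v.len());
    spans
        .iter()
        .map(|&(start, end)| {
            // Clamp so a malformed span cannot index out of bounds.
            let end = end.min(token_embeddings.len());
            let start = start.min(end);

            // Sum the token vectors that fall inside this chunk.
            let mut pooled = vec![0.0f32; dim];
            for token in &token_embeddings[start..end] {
                for (acc, x) in pooled.iter_mut().zip(token) {
                    *acc += *x;
                }
            }

            // Divide by the token count; an empty span stays all zeros.
            let count = (end - start) as f32;
            if count > 0.0 {
                for acc in pooled.iter_mut() {
                    *acc /= count;
                }
            }
            pooled
        })
        .collect()
}
```

Because every token vector was computed with attention over the full document, each pooled chunk vector carries cross-chunk context even though it averages only that chunk's tokens.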

§Two usage modes

  1. LateChunkingStrategy — a ChunkingStrategy that splits text and records precise byte spans. Use this when you will pass the chunks to a late-chunking-aware embedding provider separately.

  2. JinaLateChunkingClient — calls the Jina embeddings API with late_chunking=true to get document-context-aware embeddings directly.
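To illustrate what "records precise byte spans" means in the first mode, here is a rough sketch. It is not the API of LateChunkingStrategy; `word_spans` and `chunk_byte_spans` are hypothetical names, and the greedy word-packing rule is an assumption chosen for brevity.

```rust
/// Hypothetical helper: byte spans of whitespace-separated words in `text`.
fn word_spans(text: &str) -> Vec<(usize, usize)> {
    let mut words = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            if let Some(s) = start.take() {
                words.push((s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        words.push((s, text.len()));
    }
    words
}

/// Hypothetical helper: greedily pack words into chunks of at most `max_bytes`,
/// returning half-open byte spans so that `&text[start..end]` is the chunk.
fn chunk_byte_spans(text: &str, max_bytes: usize) -> Vec<(usize, usize)> {
    let mut spans: Vec<(usize, usize)> = Vec::new();
    for (start, end) in word_spans(text) {
        // Does the current chunk still fit the budget if extended to this word?
        let fits = spans.last().map_or(false, |&(s, _)| end - s <= max_bytes);
        if fits {
            if let Some(last) = spans.last_mut() {
                last.1 = end;
            }
        } else {
            // Start a new chunk. A single word longer than `max_bytes`
            // becomes its own oversized chunk rather than being cut.
            spans.push((start, end));
        }
    }
    spans
}
```

Spans always fall on word edges, so slicing the original string with them never splits a multi-byte UTF-8 character. A late-chunking-aware provider can then map these byte spans to token ranges for pooling.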

§Model context limits

Model                    Max tokens   Notes
Jina v3 (default)        8,192        Good for most documents
gte-Qwen2-7B-instruct    32,768       Better quality, needs more GPU

For documents that exceed the limit, use LateChunkingStrategy::split_into_sections to pre-divide the document and apply late chunking section by section.
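The idea of pre-dividing a long document can be sketched as below. This does not reproduce the signature or behaviour of LateChunkingStrategy::split_into_sections, which this page does not show; `split_sections` is a hypothetical function that packs blank-line-separated paragraphs into sections, and the caller supplies `count_tokens`, which should wrap the real tokenizer of the target model.

```rust
/// Hypothetical helper: pack paragraphs (separated by a blank line) into
/// sections whose token count, as measured by `count_tokens`, stays within
/// `max_tokens`. Each returned section is a contiguous slice of `doc`.
fn split_sections<'a>(
    doc: &'a str,
    max_tokens: usize,
    count_tokens: impl Fn(&str) -> usize,
) -> Vec<&'a str> {
    let mut sections = Vec::new();
    // Byte range of the section currently being built, if any.
    let mut current: Option<(usize, usize)> = None;
    let mut cursor = 0;

    for para in doc.split("\n\n") {
        let para_start = cursor;
        let para_end = para_start + para.len();
        cursor = para_end + 2; // step over the "\n\n" separator
        if para.trim().is_empty() {
            continue;
        }
        current = match current {
            // The paragraph still fits: extend the current section.
            Some((start, _)) if count_tokens(&doc[start..para_end]) <= max_tokens => {
                Some((start, para_end))
            }
            // It does not fit: close the current section and start a new one.
            Some((start, end)) => {
                sections.push(&doc[start..end]);
                Some((para_start, para_end))
            }
            None => Some((para_start, para_end)),
        };
    }
    if let Some((start, end)) = current {
        sections.push(&doc[start..end]);
    }
    sections
}
```

A single paragraph larger than the budget is emitted as its own oversized section in this sketch; a production splitter would need a fallback such as sentence-level splitting. Late chunking is then applied to each section independently, so context is shared within a section but not across sections.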

Structs§

JinaLateChunkingClient
Jina AI embeddings client with native late chunking support
LateChunkingConfig
Configuration for the late chunking strategy
LateChunkingStrategy
Context-aware chunking strategy for use with late-chunking embedding models