Late Chunking: context-preserving embeddings for RAG (Jina AI technique)
Standard RAG embeds each chunk in isolation, losing cross-chunk context. Late Chunking (Jina AI, 2024) fixes this by encoding the whole document first, then extracting per-chunk embeddings via span pooling:
```text
Standard: chunk₁ → embed₁   chunk₂ → embed₂              (context-blind)
Late:     [chunk₁ | chunk₂ | …] → model → pool spans → embed₁, embed₂
```

Each chunk’s embedding “sees” the entire document during the attention pass, which yields roughly 5-10% better retrieval accuracy than standard chunking.
§Two usage modes
- `LateChunkingStrategy` is a `ChunkingStrategy` that splits text and records precise byte spans. Use this when you will pass the chunks to a late-chunking-aware embedding provider separately.
- `JinaLateChunkingClient` calls the Jina embeddings API with `late_chunking=true` to get document-context-aware embeddings directly.
§Model context limits
| Model | Max tokens | Notes |
|---|---|---|
| Jina v3 (default) | 8,192 | Good for most documents |
| gte-Qwen2-7B-instruct | 32,768 | Better quality, needs more GPU |
For documents exceeding the limit, use `LateChunkingStrategy::split_into_sections` to pre-divide the document and apply late chunking section by section.
Structs§

- `JinaLateChunkingClient`: Jina AI embeddings client with native late chunking support
- `LateChunkingConfig`: configuration for the late chunking strategy
- `LateChunkingStrategy`: context-aware chunking strategy for use with late-chunking embedding models