§slabs
Text chunking for retrieval-augmented generation (RAG) pipelines.
§The Problem
Language models have context windows. Documents don’t fit. You need to split them into pieces (“chunks”) small enough to embed and retrieve, but large enough to preserve meaning.
This sounds trivial—just split every N characters, right? But consider:
- A sentence split mid-word is garbage
- A paragraph split mid-argument loses coherence
- A code block split mid-function is useless
- Overlap is needed for context continuity, but how much?
The right chunking strategy depends on your content and retrieval needs.
§Chunking Strategies
§Fixed Size (Baseline)
The simplest approach: split every N characters with M overlap.
Document: "The quick brown fox jumps over the lazy dog."
Size: 20, Overlap: 5
Chunk 0: "The quick brown fox " [0..20]
Chunk 1: " fox jumps over the " [15..35] <- overlap preserves "fox"
Chunk 2: " the lazy dog." [30..44]When to use: Homogeneous content (logs, code), baseline comparisons. Weakness: Ignores linguistic boundaries—splits mid-sentence.
§Sentence-Based
Split on sentence boundaries, group N sentences per chunk.
The key insight: sentence boundaries are surprisingly hard to detect. “Dr. Smith went to Washington D.C. on Jan. 15th.” is 1 sentence, not 4. We use Unicode segmentation (UAX #29), which handles most edge cases.
When to use: Prose, articles, documentation. Weakness: Very short or very long sentences cause imbalanced chunks.
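A sketch of the grouping step built on the unicode-segmentation crate (assuming that dependency; this illustrates the idea, not SentenceChunker's internals):

```rust
use unicode_segmentation::UnicodeSegmentation;

/// Group UAX #29 sentences into chunks of `n` sentences each
/// (a sketch of the idea, not the crate's `SentenceChunker`).
fn sentence_chunks(text: &str, n: usize) -> Vec<String> {
    let sentences: Vec<&str> = text.unicode_sentences().collect();
    sentences.chunks(n).map(|group| group.concat()).collect()
}

fn main() {
    let text = "The quick brown fox jumps over the lazy dog. \
                Pack my box with five dozen liquor jugs. \
                How vexingly quick daft zebras jump!";
    // Three sentences, two per chunk -> 2 chunks (the last holds one sentence).
    for chunk in sentence_chunks(text, 2) {
        println!("{chunk:?}");
    }
}
```

Grouping sentences rather than characters means chunk sizes vary, which is exactly the imbalance weakness noted above.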
§Recursive (LangChain-style)
Try splitting on paragraph breaks first. If chunks are still too large, split on sentence breaks. If still too large, split on words. Last resort: split on characters.
Separators: ["\n\n", "\n", ". ", " ", ""]
1. Try splitting on "\n\n" (paragraphs)
2. Any chunk > max_size? Split that chunk on "\n" (lines)
3. Still > max_size? Split on ". " (sentences)
4. Still > max_size? Split on " " (words)
5. Still > max_size? Split on "" (characters)

When to use: General-purpose, mixed content. Weakness: Separator hierarchy is heuristic, not semantic.
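A minimal sketch of the descent (an illustration, not the crate's RecursiveChunker; production implementations also merge small adjacent pieces back together up to max_size, which is omitted here):

```rust
/// Split on the first separator, then recurse into oversized pieces
/// with the remaining separators. Assumes the separator list ends with "".
fn recursive_split<'a>(text: &'a str, seps: &[&str], max: usize) -> Vec<&'a str> {
    if text.chars().count() <= max || seps.is_empty() {
        return vec![text]; // small enough, or nothing left to split on
    }
    let sep = seps[0];
    if sep.is_empty() {
        return hard_split(text, max); // last resort: split every `max` chars
    }
    text.split_inclusive(sep) // keep separators attached so no text is lost
        .flat_map(|piece| recursive_split(piece, &seps[1..], max))
        .collect()
}

/// Hard character split that respects UTF-8 char boundaries.
fn hard_split(text: &str, max: usize) -> Vec<&str> {
    let mut out = Vec::new();
    let (mut start, mut count) = (0, 0);
    for (i, _) in text.char_indices() {
        if count == max {
            out.push(&text[start..i]);
            start = i;
            count = 0;
        }
        count += 1;
    }
    out.push(&text[start..]);
    out
}
```

Called with the separator list above and max 100, oversized paragraphs fall through to lines, then sentences, words, and finally characters, exactly as numbered.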
§Semantic (Embedding-Based)
Embed each sentence, compute similarity between adjacent sentences, split where similarity drops below a threshold.
Sentences: [S1, S2, S3, S4, S5, S6]
Embeddings: [E1, E2, E3, E4, E5, E6]
Similarities: sim(1,2)=0.9, sim(2,3)=0.8,
              sim(3,4)=0.3   <- topic shift!
              sim(4,5)=0.85, sim(5,6)=0.7
Chunks: [S1, S2, S3] | [S4, S5, S6]

When to use: When topic coherence matters more than size uniformity. Weakness: Requires an embedding model, is slower, and the threshold is a hyperparameter.
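A sketch of the splitting rule, assuming sentence embeddings are already produced by some external model (the crate's SemanticChunker wraps the model; this shows only the boundary logic):

```rust
/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Start a new chunk wherever adjacent-sentence similarity drops below
/// `threshold`. Assumes `embeddings[i]` embeds `sentences[i]`.
fn semantic_chunks(sentences: &[&str], embeddings: &[Vec<f32>], threshold: f32) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for (i, sentence) in sentences.iter().enumerate() {
        if i > 0 && cosine(&embeddings[i - 1], &embeddings[i]) < threshold {
            // Similarity dropped: close the current chunk at the topic shift.
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(sentence);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With the similarities above and a threshold of 0.5 (the value passed to SemanticChunker::new below), the drop at sim(3,4)=0.3 produces the two chunks shown.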
§Quick Start
use slabs::{Chunker, FixedChunker, SentenceChunker, RecursiveChunker};
let text = "The quick brown fox jumps over the lazy dog. \
Pack my box with five dozen liquor jugs.";
// Fixed size
let chunker = FixedChunker::new(50, 10);
let slabs = chunker.chunk(text);
// Sentence-based (2 sentences per chunk)
let chunker = SentenceChunker::new(2);
let slabs = chunker.chunk(text);
// Recursive with custom separators
let chunker = RecursiveChunker::new(100, &["\n\n", "\n", ". ", " "]);
let slabs = chunker.chunk(text);

§Semantic Chunking (requires semantic feature)
use slabs::{Chunker, SemanticChunker};
let chunker = SemanticChunker::new(0.5)?; // threshold
let slabs = chunker.chunk(long_document);

§Late Chunking
Late chunking embeds the full document first, then pools token embeddings for each chunk. This preserves document-wide context that traditional chunking loses (e.g., pronouns referring to earlier entities).
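Conceptually, the pooling step is a span-wise average over token vectors. A minimal sketch, assuming mean pooling and a precomputed chunk-to-token-span mapping (both are assumptions here; the crate's LateChunkingPooler defines the actual contract):

```rust
/// Mean-pool token embeddings over each chunk's token span.
/// `spans[i]` is a hypothetical [start, end) token range for chunk i.
fn pool_chunks(token_embeddings: &[Vec<f32>], spans: &[(usize, usize)]) -> Vec<Vec<f32>> {
    let dim = token_embeddings.first().map_or(0, |t| t.len());
    spans
        .iter()
        .map(|&(start, end)| {
            let mut acc = vec![0.0f32; dim];
            // Sum the token vectors inside the chunk's span...
            for tok in &token_embeddings[start..end] {
                for (a, t) in acc.iter_mut().zip(tok) {
                    *a += *t;
                }
            }
            // ...then divide by the span length to get the mean.
            let len = (end - start).max(1) as f32;
            for a in &mut acc {
                *a /= len;
            }
            acc
        })
        .collect()
}
```

The crate-level API wires the same pieces together: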
use slabs::{LateChunker, SentenceChunker, Chunker};
// Wrap any chunker with late chunking
let late = LateChunker::new(SentenceChunker::new(3), 384);
// Get chunk boundaries
let chunks = late.chunk(&document);
// Get token embeddings from your embedding model (full document)
let token_embeddings = embed_document_tokens(&document);
// Pool into contextualized chunk embeddings
let chunk_embeddings = late.pool(&token_embeddings, &chunks, document.len());

§Performance Considerations
| Strategy | Speed | Quality | Memory |
|---|---|---|---|
| Fixed | O(n) | Low | O(1) |
| Sentence | O(n) | Medium | O(n) |
| Recursive | O(n log n) | Medium | O(n) |
| Semantic | O(n × d) | High | O(n × d) |
Where n = document length, d = embedding dimension.
For most RAG applications, Recursive is the sweet spot. Use Semantic when retrieval quality justifies the cost.
Structs§
- ChunkCapacity - Configuration for chunk size with flexible target and hard limit.
- FixedChunker - Fixed-size chunker with configurable overlap.
- LateChunker - Wrapper that applies late chunking to any base chunker.
- LateChunkingPooler - Late chunking pooler: pools token embeddings into chunk embeddings.
- ModelChunker - A chunker that uses a machine learning model to predict boundaries.
- RecursiveChunker - Recursive character splitter.
- SentenceChunker - Sentence-based chunker.
- Slab - A chunk of text with its position in the original document.

Enums§
- ChunkCapacityError - Error when configuring chunk capacity.
- Error - Errors that can occur during chunking.

Traits§
- Chunker - A text chunking strategy.

Type Aliases§
- Result - Result type for slabs operations.