Crate slabs


§slabs

Text chunking for retrieval-augmented generation (RAG) pipelines.

§The Problem

Language models have context windows. Documents don’t fit. You need to split them into pieces (“chunks”) small enough to embed and retrieve, but large enough to preserve meaning.

This sounds trivial—just split every N characters, right? But consider:

  • A sentence split mid-word is garbage
  • A paragraph split mid-argument loses coherence
  • A code block split mid-function is useless
  • Overlap is needed for context continuity, but how much?

The right chunking strategy depends on your content and retrieval needs.

§Chunking Strategies

§Fixed Size (Baseline)

The simplest approach: split every N characters with M overlap.

Document: "The quick brown fox jumps over the lazy dog."
Size: 20, Overlap: 5

Chunk 0: "The quick brown fox "  [0..20]
Chunk 1: " fox jumps over the "  [15..35]  <- overlap preserves "fox"
Chunk 2: " the lazy dog."        [30..44]

When to use: Homogeneous content (logs, code), baseline comparisons. Weakness: Ignores linguistic boundaries—splits mid-sentence.
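The stride arithmetic behind the worked example is small enough to sketch directly. This is an illustrative helper, not the crate's `FixedChunker` implementation; `fixed_chunks` is a hypothetical function returning character ranges:

```rust
/// Hypothetical helper: (start, end) character ranges for chunks of
/// `size` chars, each advancing by `size - overlap` from the last.
fn fixed_chunks(text: &str, size: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let len = text.chars().count();
    let stride = size - overlap; // how far each new chunk advances
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + size).min(len);
        ranges.push((start, end));
        if end == len {
            break; // final chunk reached the end of the document
        }
        start += stride;
    }
    ranges
}

fn main() {
    let text = "The quick brown fox jumps over the lazy dog.";
    // Matches the worked example: [0..20], [15..35], [30..44]
    println!("{:?}", fixed_chunks(text, 20, 5));
}
```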

§Sentence-Based

Split on sentence boundaries, group N sentences per chunk.

The key insight: sentence boundaries are surprisingly hard to detect. “Dr. Smith went to Washington D.C. on Jan. 15th.” has 1 sentence, not 4. We use Unicode segmentation (UAX #29) which handles most edge cases.
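A naive splitter makes the failure mode concrete. This sketch (plain std, not the crate's UAX #29 segmenter) treats every `". "` as a boundary and shreds the example into four fragments:

```rust
// Naive sentence splitting: every ". " is assumed to end a sentence.
// Abbreviations like "Dr." and "Jan." defeat it immediately.
fn naive_sentences(text: &str) -> Vec<&str> {
    text.split(". ").collect()
}

fn main() {
    let text = "Dr. Smith went to Washington D.C. on Jan. 15th.";
    let pieces = naive_sentences(text);
    // One real sentence, but four fragments:
    // ["Dr", "Smith went to Washington D.C", "on Jan", "15th."]
    assert_eq!(pieces.len(), 4);
}
```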

When to use: Prose, articles, documentation. Weakness: Very short or very long sentences cause imbalanced chunks.

§Recursive (LangChain-style)

Try splitting on paragraph breaks first. If chunks are still too large, split on sentence breaks. If still too large, split on words. Last resort: split on characters.

Separators: ["\n\n", "\n", ". ", " ", ""]

1. Try splitting on "\n\n" (paragraphs)
2. Any chunk > max_size? Split that chunk on "\n" (lines)
3. Still > max_size? Split on ". " (sentences)
4. Still > max_size? Split on " " (words)
5. Still > max_size? Split on "" (characters)

When to use: General-purpose, mixed content. Weakness: Separator hierarchy is heuristic, not semantic.
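The separator descent can be sketched as a short recursion. This is a simplified illustration, not `RecursiveChunker` itself: real implementations also re-merge adjacent small pieces back up toward `max_size` and handle the final `""` separator, both of which this sketch replaces with a plain character-level fallback:

```rust
/// Simplified sketch of the recursive strategy: try the coarsest
/// separator first, recurse with finer ones on oversized pieces.
fn split_recursive(text: &str, max_size: usize, seps: &[&str]) -> Vec<String> {
    if text.chars().count() <= max_size {
        return vec![text.to_string()];
    }
    match seps.split_first() {
        // Split on the current separator, then recurse with the rest.
        Some((sep, rest)) => text
            .split(sep)
            .flat_map(|piece| split_recursive(piece, max_size, rest))
            .collect(),
        // Last resort: split on character boundaries.
        None => text
            .chars()
            .collect::<Vec<char>>()
            .chunks(max_size)
            .map(|c| c.iter().collect())
            .collect(),
    }
}

fn main() {
    let text = "aaa bbb\n\nccc ddd eee";
    // "aaa bbb" fits under max_size; "ccc ddd eee" is re-split on spaces.
    let chunks = split_recursive(text, 8, &["\n\n", " "]);
    println!("{:?}", chunks); // ["aaa bbb", "ccc", "ddd", "eee"]
}
```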

§Semantic (Embedding-Based)

Embed each sentence, compute similarity between adjacent sentences, split where similarity drops below a threshold.

Sentences:  [S1, S2, S3, S4, S5, S6]
Embeddings: [E1, E2, E3, E4, E5, E6]
Similarities: [sim(1,2)=0.9, sim(2,3)=0.8, sim(3,4)=0.3, sim(4,5)=0.85, sim(5,6)=0.7]
                                             ↑
                                        Topic shift!

Chunks: [S1, S2, S3] | [S4, S5, S6]

When to use: When topic coherence matters more than size uniformity. Weakness: Requires embedding model, slower, threshold is a hyperparameter.
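The breakpoint rule reduces to a filter over adjacent similarities. A minimal sketch using the values from the diagram above (`cosine` and `breakpoints` are illustrative helpers, not the crate's API):

```rust
/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b))
}

/// Sentence indices where a new chunk starts: wherever the similarity
/// between sentence i and i+1 drops below `threshold`.
fn breakpoints(sims: &[f32], threshold: f32) -> Vec<usize> {
    sims.iter()
        .enumerate()
        .filter(|&(_, &s)| s < threshold)
        .map(|(i, _)| i + 1) // the break falls *after* sentence i
        .collect()
}

fn main() {
    // Orthogonal vectors have cosine similarity 0.
    assert!(cosine(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);

    // Adjacent similarities from the diagram above; threshold 0.5
    // places the only break before sentence index 3:
    // chunks [S1, S2, S3] | [S4, S5, S6].
    let sims = [0.9, 0.8, 0.3, 0.85, 0.7];
    println!("{:?}", breakpoints(&sims, 0.5)); // [3]
}
```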

§Quick Start

use slabs::{Chunker, FixedChunker, SentenceChunker, RecursiveChunker};

let text = "The quick brown fox jumps over the lazy dog. \
            Pack my box with five dozen liquor jugs.";

// Fixed size
let chunker = FixedChunker::new(50, 10);
let slabs = chunker.chunk(text);

// Sentence-based (2 sentences per chunk)
let chunker = SentenceChunker::new(2);
let slabs = chunker.chunk(text);

// Recursive with custom separators
let chunker = RecursiveChunker::new(100, &["\n\n", "\n", ". ", " "]);
let slabs = chunker.chunk(text);

§Semantic Chunking (requires semantic feature)

use slabs::{Chunker, SemanticChunker};

let chunker = SemanticChunker::new(0.5)?; // threshold
let slabs = chunker.chunk(long_document);

§Late Chunking

Late chunking embeds the full document first, then pools token embeddings for each chunk. This preserves document-wide context that traditional chunking loses (e.g., pronouns referring to earlier entities).

use slabs::{LateChunker, SentenceChunker, Chunker};

// Wrap any chunker with late chunking
let late = LateChunker::new(SentenceChunker::new(3), 384);

// Get chunk boundaries
let chunks = late.chunk(&document);

// Get token embeddings from your embedding model (full document)
let token_embeddings = embed_document_tokens(&document);

// Pool into contextualized chunk embeddings
let chunk_embeddings = late.pool(&token_embeddings, &chunks, document.len());

§Performance Considerations

| Strategy  | Speed      | Quality | Memory   |
|-----------|------------|---------|----------|
| Fixed     | O(n)       | Low     | O(1)     |
| Sentence  | O(n)       | Medium  | O(n)     |
| Recursive | O(n log n) | Medium  | O(n)     |
| Semantic  | O(n × d)   | High    | O(n × d) |

Where n = document length, d = embedding dimension.

For most RAG applications, Recursive is the sweet spot. Use Semantic when retrieval quality justifies the cost.

Structs§

ChunkCapacity
Configuration for chunk size with flexible target and hard limit.
FixedChunker
Fixed-size chunker with configurable overlap.
LateChunker
Wrapper that applies late chunking to any base chunker.
LateChunkingPooler
Late chunking pooler: pools token embeddings into chunk embeddings.
ModelChunker
A chunker that uses a machine learning model to predict boundaries.
RecursiveChunker
Recursive character splitter.
SentenceChunker
Sentence-based chunker.
Slab
A chunk of text with its position in the original document.

Enums§

ChunkCapacityError
Error when configuring chunk capacity.
Error
Errors that can occur during chunking.

Traits§

Chunker
A text chunking strategy.

Type Aliases§

Result
Result type for slabs operations.