slabs/lib.rs
//! # slabs
//!
//! Text chunking for retrieval-augmented generation (RAG) pipelines.
//!
//! ## The Problem
//!
//! Language models have context windows. Documents don't fit. You need to split
//! them into pieces ("chunks") small enough to embed and retrieve, but large
//! enough to preserve meaning.
//!
//! This sounds trivial: just split every N characters, right? But consider:
//!
//! - A sentence split mid-word is garbage
//! - A paragraph split mid-argument loses coherence
//! - A code block split mid-function is useless
//! - Overlap is needed for context continuity, but how much?
//!
//! The right chunking strategy depends on your content and retrieval needs.
//!
//! ## Chunking Strategies
//!
//! ### Fixed Size (Baseline)
//!
//! The simplest approach: split every N characters with M overlap.
//!
//! ```text
//! Document: "The quick brown fox jumps over the lazy dog."
//! Size: 20, Overlap: 5
//!
//! Chunk 0: "The quick brown fox "  [0..20]
//! Chunk 1: " fox jumps over the "  [15..35]  <- overlap preserves "fox"
//! Chunk 2: " the lazy dog."        [30..44]
//! ```
//!
//! **When to use**: Homogeneous content (logs, code), baseline comparisons.
//! **Weakness**: Ignores linguistic boundaries; splits mid-sentence.
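//!
//! The stepping logic behind this is tiny. A sketch in byte offsets (a correct
//! implementation must also avoid splitting inside a multi-byte UTF-8
//! character; that handling is omitted here):
//!
//! ```rust
//! // Sketch: chunk spans for `len` bytes, `size`-byte chunks, `overlap` bytes shared.
//! fn fixed_offsets(len: usize, size: usize, overlap: usize) -> Vec<(usize, usize)> {
//!     let step = size - overlap; // each chunk starts `size - overlap` after the last
//!     let mut spans = Vec::new();
//!     let mut start = 0;
//!     while start < len {
//!         spans.push((start, (start + size).min(len)));
//!         if start + size >= len {
//!             break; // this chunk already reached the end
//!         }
//!         start += step;
//!     }
//!     spans
//! }
//!
//! // Reproduces the example above: a 44-byte document, size 20, overlap 5.
//! assert_eq!(fixed_offsets(44, 20, 5), vec![(0, 20), (15, 35), (30, 44)]);
//! ```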
//!
//! ### Sentence-Based
//!
//! Split on sentence boundaries, group N sentences per chunk.
//!
//! The key insight: sentence boundaries are surprisingly hard to detect.
//! "Dr. Smith went to Washington D.C. on Jan. 15th." is 1 sentence, not 4.
//! We use Unicode segmentation (UAX #29), which handles most edge cases.
//!
//! **When to use**: Prose, articles, documentation.
//! **Weakness**: Very short or very long sentences cause imbalanced chunks.
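//!
//! One way to get UAX #29 sentence boundaries in Rust is the
//! `unicode-segmentation` crate. A sketch of the grouping step
//! (`group_sentences` is illustrative, not part of the slabs API):
//!
//! ```rust,ignore
//! use unicode_segmentation::UnicodeSegmentation;
//!
//! fn group_sentences(text: &str, per_chunk: usize) -> Vec<String> {
//!     // `split_sentence_bounds` yields subslices covering the whole input,
//!     // so concatenating a group preserves the original spacing.
//!     let sentences: Vec<&str> = text.split_sentence_bounds().collect();
//!     sentences
//!         .chunks(per_chunk)           // N sentences per chunk
//!         .map(|group| group.concat())
//!         .collect()
//! }
//! ```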
//!
//! ### Recursive (LangChain-style)
//!
//! Try splitting on paragraph breaks first. If chunks are still too large,
//! split on sentence breaks. If still too large, split on words. Last resort:
//! split on characters.
//!
//! ```text
//! Separators: ["\n\n", "\n", ". ", " ", ""]
//!
//! 1. Try splitting on "\n\n" (paragraphs)
//! 2. Any chunk > max_size? Split that chunk on "\n" (lines)
//! 3. Still > max_size? Split on ". " (sentences)
//! 4. Still > max_size? Split on " " (words)
//! 5. Still > max_size? Split on "" (characters)
//! ```
//!
//! **When to use**: General-purpose, mixed content.
//! **Weakness**: Separator hierarchy is heuristic, not semantic.
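//!
//! The descent itself is a short recursion. A condensed sketch (a production
//! splitter would also merge adjacent small pieces back toward `max` and
//! track byte offsets; both are omitted here):
//!
//! ```rust
//! fn split_recursive<'a>(text: &'a str, max: usize, seps: &[&str]) -> Vec<&'a str> {
//!     if text.len() <= max {
//!         return vec![text]; // small enough: stop descending
//!     }
//!     let Some((sep, rest)) = seps.split_first() else {
//!         return vec![text]; // ran out of separators: give up
//!     };
//!     text.split(sep)
//!         .filter(|piece| !piece.is_empty()) // splitting on "" yields empties
//!         .flat_map(|piece| split_recursive(piece, max, rest)) // next separator down
//!         .collect()
//! }
//!
//! let parts = split_recursive("a b c", 1, &[" "]);
//! assert_eq!(parts, vec!["a", "b", "c"]);
//! ```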
//!
//! ### Semantic (Embedding-Based)
//!
//! Embed each sentence, compute similarity between adjacent sentences,
//! split where similarity drops below a threshold.
//!
//! ```text
//! Sentences:    [S1, S2, S3, S4, S5, S6]
//! Embeddings:   [E1, E2, E3, E4, E5, E6]
//! Similarities: [sim(1,2)=0.9, sim(2,3)=0.8, sim(3,4)=0.3, sim(4,5)=0.85, sim(5,6)=0.7]
//!                                             ↑
//!                                             Topic shift!
//!
//! Chunks: [S1, S2, S3] | [S4, S5, S6]
//! ```
//!
//! **When to use**: When topic coherence matters more than size uniformity.
//! **Weakness**: Requires an embedding model, slower, threshold is a hyperparameter.
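//!
//! Given sentence embeddings from any model, the split rule itself is small.
//! A sketch (`cosine` and `split_points` are illustrative names):
//!
//! ```rust
//! fn cosine(a: &[f32], b: &[f32]) -> f32 {
//!     let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
//!     let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
//!     let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
//!     dot / (na * nb)
//! }
//!
//! // Indices of sentences that should start a new chunk.
//! fn split_points(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
//!     embeddings
//!         .windows(2)
//!         .enumerate()
//!         .filter(|(_, pair)| cosine(&pair[0], &pair[1]) < threshold)
//!         .map(|(i, _)| i + 1) // split before the second sentence of the pair
//!         .collect()
//! }
//!
//! let e = vec![vec![1.0, 0.0], vec![1.0, 0.0], vec![0.0, 1.0]];
//! assert_eq!(split_points(&e, 0.5), vec![2]); // split before the third sentence
//! ```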
//!
//! ## Quick Start
//!
//! ```rust
//! use slabs::{Chunker, FixedChunker, SentenceChunker, RecursiveChunker};
//!
//! let text = "The quick brown fox jumps over the lazy dog. \
//!             Pack my box with five dozen liquor jugs.";
//!
//! // Fixed size
//! let chunker = FixedChunker::new(50, 10);
//! let slabs = chunker.chunk(text);
//!
//! // Sentence-based (2 sentences per chunk)
//! let chunker = SentenceChunker::new(2);
//! let slabs = chunker.chunk(text);
//!
//! // Recursive with custom separators
//! let chunker = RecursiveChunker::new(100, &["\n\n", "\n", ". ", " "]);
//! let slabs = chunker.chunk(text);
//! ```
//!
//! ## Semantic Chunking (requires `semantic` feature)
//!
//! ```rust,ignore
//! use slabs::{Chunker, SemanticChunker};
//!
//! let chunker = SemanticChunker::new(0.5)?; // threshold
//! let slabs = chunker.chunk(long_document);
//! ```
//!
//! ## Late Chunking
//!
//! Late chunking embeds the full document first, then pools token embeddings
//! for each chunk. This preserves document-wide context that traditional
//! chunking loses (e.g., pronouns referring to earlier entities).
//!
//! ```rust,ignore
//! use slabs::{LateChunker, SentenceChunker, Chunker};
//!
//! // Wrap any chunker with late chunking
//! let late = LateChunker::new(SentenceChunker::new(3), 384);
//!
//! // Get chunk boundaries
//! let chunks = late.chunk(&document);
//!
//! // Get token embeddings from your embedding model (full document)
//! let token_embeddings = embed_document_tokens(&document);
//!
//! // Pool into contextualized chunk embeddings
//! let chunk_embeddings = late.pool(&token_embeddings, &chunks, document.len());
//! ```
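//!
//! Pooling here typically means averaging the token vectors whose offsets fall
//! inside a chunk's span. A sketch of mean pooling (token offsets are assumed
//! to be byte positions in the document; the tuple layout is illustrative):
//!
//! ```rust
//! // tokens[i] = (byte offset of token i, its embedding of length `dim`)
//! fn mean_pool(tokens: &[(usize, Vec<f32>)], start: usize, end: usize, dim: usize) -> Vec<f32> {
//!     let mut acc = vec![0.0f32; dim];
//!     let mut count = 0usize;
//!     for (offset, emb) in tokens {
//!         if (start..end).contains(offset) {
//!             for (a, e) in acc.iter_mut().zip(emb) {
//!                 *a += e; // accumulate per dimension
//!             }
//!             count += 1;
//!         }
//!     }
//!     if count > 0 {
//!         for a in &mut acc {
//!             *a /= count as f32; // average over the chunk's tokens
//!         }
//!     }
//!     acc
//! }
//!
//! let toks = vec![(0, vec![2.0]), (5, vec![4.0]), (9, vec![8.0])];
//! assert_eq!(mean_pool(&toks, 0, 8, 1), vec![3.0]); // averages tokens at 0 and 5
//! ```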
//!
//! ## Performance Considerations
//!
//! | Strategy  | Speed      | Quality | Memory   |
//! |-----------|------------|---------|----------|
//! | Fixed     | O(n)       | Low     | O(1)     |
//! | Sentence  | O(n)       | Medium  | O(n)     |
//! | Recursive | O(n log n) | Medium  | O(n)     |
//! | Semantic  | O(n × d)   | High    | O(n × d) |
//!
//! Where n is the document length and d the embedding dimension.
//!
//! For most RAG applications, **Recursive** is the sweet spot.
//! Use **Semantic** when retrieval quality justifies the cost.

mod capacity;
mod error;
mod fixed;
mod late;
mod recursive;
mod sentence;
mod slab;

#[cfg(feature = "semantic")]
mod semantic;

#[cfg(feature = "code")]
mod code;

mod model; // Model-based chunking

pub use capacity::{ChunkCapacity, ChunkCapacityError};
pub use error::{Error, Result};
pub use fixed::FixedChunker;
pub use late::{LateChunker, LateChunkingPooler};
pub use recursive::RecursiveChunker;
pub use sentence::SentenceChunker;
pub use slab::Slab;

#[cfg(feature = "semantic")]
pub use semantic::SemanticChunker;

#[cfg(feature = "code")]
pub use code::{CodeChunker, CodeLanguage};

pub use model::ModelChunker;

/// A text chunking strategy.
///
/// All chunkers implement this trait, enabling polymorphic usage:
///
/// ```rust
/// use slabs::{Chunker, FixedChunker, SentenceChunker};
///
/// fn chunk_document(chunker: &dyn Chunker, text: &str) -> Vec<slabs::Slab> {
///     chunker.chunk(text)
/// }
///
/// let fixed = FixedChunker::new(100, 20);
/// let sentence = SentenceChunker::new(3);
///
/// let text = "Hello world. This is a test.";
/// let slabs1 = chunk_document(&fixed, text);
/// let slabs2 = chunk_document(&sentence, text);
/// ```
pub trait Chunker: Send + Sync {
    /// Split text into chunks.
    ///
    /// Each chunk is a [`Slab`] containing the text and its byte offsets
    /// in the original document.
    fn chunk(&self, text: &str) -> Vec<Slab>;

    /// Estimate the number of chunks for a given text length.
    ///
    /// Useful for pre-allocation. May be approximate.
    fn estimate_chunks(&self, text_len: usize) -> usize {
        // Conservative default
        (text_len / 500).max(1)
    }
}