pub trait Chunker: Send + Sync {
// Required method
fn chunk_bytes(&self, text: &str) -> Vec<Slab>;
// Provided methods
fn chunk(&self, text: &str) -> Vec<Slab> { ... }
fn estimate_chunks(&self, text_len: usize) -> usize { ... }
}
A chunking strategy: text in, Slabs out.
Implementors provide chunk_bytes; the default
chunk method adds Unicode character offsets.
This crate ships only one public chunker, [CodeChunker], but the
trait is public so users can wrap external chunkers (text-splitter,
regex, custom logic) and feed the output into LateChunkingPooler.
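As a rough illustration of wrapping custom logic, the sketch below implements the required method for a blank-line splitter. It is not taken from this crate: the `Slab` struct and the one-method `Chunker` trait here are stand-ins declared locally so the example compiles on its own, the `text`, `start`, and `end` field names are assumptions, and `ParagraphChunker` is a hypothetical name.

```rust
// Stand-ins so the sketch is self-contained. The crate's real `Slab`
// may have different fields; `text`, `start`, and `end` are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Slab {
    pub text: String,
    pub start: usize, // byte offset into the source, inclusive
    pub end: usize,   // byte offset into the source, exclusive
}

// Local mirror of the trait, reduced to its required method.
pub trait Chunker: Send + Sync {
    fn chunk_bytes(&self, text: &str) -> Vec<Slab>;
}

/// Hypothetical custom chunker: one slab per blank-line-separated paragraph.
pub struct ParagraphChunker;

impl Chunker for ParagraphChunker {
    fn chunk_bytes(&self, text: &str) -> Vec<Slab> {
        const SEP: &str = "\n\n";
        let mut slabs = Vec::new();
        let mut offset = 0;
        for part in text.split(SEP) {
            let start = offset;
            let end = start + part.len();
            // Skip empty or whitespace-only paragraphs, but keep counting bytes.
            if !part.trim().is_empty() {
                slabs.push(Slab { text: part.to_string(), start, end });
            }
            offset = end + SEP.len();
        }
        slabs
    }
}
```

Because the byte offsets index into the original string, `&text[slab.start..slab.end]` recovers each slab's text, which is the property a downstream pooler would presumably rely on.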
Required Methods
fn chunk_bytes(&self, text: &str) -> Vec<Slab>
Provided Methods
fn chunk(&self, text: &str) -> Vec<Slab>
Split text into chunks with both byte and character offsets.
This calls chunk_bytes and then computes Unicode character offsets
for every slab, so callers get correct char_start and char_end
without converting byte offsets themselves.
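The byte-to-character conversion can be sketched as below. This is an illustration only, not the crate's implementation: it assumes "Unicode character" means a Unicode scalar value (Rust's `char`), and the helper names `byte_to_char_offset` and `char_span` are hypothetical.

```rust
/// Number of Unicode scalar values that precede `byte_idx` in `text`.
/// Panics if `byte_idx` does not fall on a char boundary.
fn byte_to_char_offset(text: &str, byte_idx: usize) -> usize {
    text[..byte_idx].chars().count()
}

/// Convert a half-open byte range into the matching half-open char range.
fn char_span(text: &str, byte_start: usize, byte_end: usize) -> (usize, usize) {
    let char_start = byte_to_char_offset(text, byte_start);
    // Count only the chars inside the slab instead of rescanning the prefix.
    let char_end = char_start + text[byte_start..byte_end].chars().count();
    (char_start, char_end)
}
```

For ASCII input the two ranges coincide; they diverge as soon as the text before or inside a slab contains multi-byte characters, which is why the byte offsets alone are not enough for character-indexed consumers.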
fn estimate_chunks(&self, text_len: usize) -> usize
Estimate the number of chunks for a given text length.
Useful for pre-allocation. May be approximate.
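A minimal sketch of how such an estimate might feed pre-allocation follows. The heuristic (ceiling division by a target chunk length) and the `target_chunk_len` parameter are assumptions made for illustration; the crate's default estimate may be computed differently.

```rust
/// Hypothetical estimator: ceiling division of the text length by an
/// assumed target chunk length. Returns 0 for empty input.
fn estimate_chunks(text_len: usize, target_chunk_len: usize) -> usize {
    let target = target_chunk_len.max(1); // guard against division by zero
    (text_len + target - 1) / target
}

/// Reserve room for the expected number of chunks up front, so pushing
/// them later is less likely to trigger reallocation.
fn preallocate_chunk_buffer(text: &str, target_chunk_len: usize) -> Vec<String> {
    Vec::with_capacity(estimate_chunks(text.len(), target_chunk_len))
}
```

Since the estimate is only approximate, it should be treated as a capacity hint: the vector still grows if the real chunk count turns out higher.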