
Trait Chunker 

pub trait Chunker: Send + Sync {
    // Required method
    fn chunk_bytes(&self, text: &str) -> Vec<Slab>;

    // Provided methods
    fn chunk(&self, text: &str) -> Vec<Slab> { ... }
    fn estimate_chunks(&self, text_len: usize) -> usize { ... }
}

A chunking strategy: text in, Slabs out.

Implementors override chunk_bytes; the default chunk method adds Unicode character offsets.

This crate ships only one public chunker, CodeChunker, but the trait is public so that users can wrap external chunkers (text-splitter, regex, custom logic) and feed their output into LateChunkingPooler.
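As a rough illustration of wrapping custom logic, the sketch below implements the required method for a paragraph splitter. The `Slab` struct and its field names (`text`, `start`, `end`) are hypothetical stand-ins so the snippet compiles on its own, and the trait is redeclared locally with only the required method; the crate's real `Slab` may be shaped differently. `ParagraphChunker` is likewise an invented name.

```rust
// Hypothetical stand-in for the crate's Slab type; the real fields may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Slab {
    pub text: String,
    pub start: usize, // byte offset, inclusive (assumed field name)
    pub end: usize,   // byte offset, exclusive (assumed field name)
}

// Local copy of the trait's required method, mirroring the signature above.
pub trait Chunker: Send + Sync {
    fn chunk_bytes(&self, text: &str) -> Vec<Slab>;
}

/// Illustrative custom chunker: one slab per blank-line-separated paragraph.
pub struct ParagraphChunker;

impl Chunker for ParagraphChunker {
    fn chunk_bytes(&self, text: &str) -> Vec<Slab> {
        let mut slabs = Vec::new();
        let mut offset = 0; // byte position of the current part within `text`
        for part in text.split("\n\n") {
            // Skip empty or whitespace-only parts, but still advance the offset.
            if !part.trim().is_empty() {
                slabs.push(Slab {
                    text: part.to_string(),
                    start: offset,
                    end: offset + part.len(),
                });
            }
            offset += part.len() + 2; // account for the "\n\n" separator
        }
        slabs
    }
}
```

Because only byte offsets are produced here, an implementation written against the real trait would rely on the provided chunk method to fill in character offsets.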

Required Methods

fn chunk_bytes(&self, text: &str) -> Vec<Slab>

Core chunking implementation returning Slabs with byte offsets only.

Implementors override this method. Users should call chunk instead, which adds character offsets automatically.

Provided Methods

fn chunk(&self, text: &str) -> Vec<Slab>

Split text into chunks with both byte and character offsets.

This calls chunk_bytes and then computes Unicode character offsets on every slab. Users get correct char_start and char_end without manual work.
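The conversion from byte offsets to character offsets can be sketched as follows. This is not the crate's implementation; it assumes that "character" means a Unicode scalar value (a Rust `char`), and the helper names `byte_to_char_offset` and `char_span` are hypothetical.

```rust
/// Convert a byte offset into a character (Unicode scalar value) offset.
/// `byte_offset` must lie on a char boundary of `text`; otherwise the
/// slice below panics.
pub fn byte_to_char_offset(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].chars().count()
}

/// Map a slab's byte span to the corresponding (char_start, char_end) pair.
pub fn char_span(text: &str, byte_start: usize, byte_end: usize) -> (usize, usize) {
    let char_start = byte_to_char_offset(text, byte_start);
    // Count only the characters inside the span rather than rescanning
    // the text from the beginning.
    let char_end = char_start + text[byte_start..byte_end].chars().count();
    (char_start, char_end)
}
```

For ASCII-only input the two kinds of offset coincide; they diverge as soon as the text before a slab contains multi-byte characters such as `é`.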

fn estimate_chunks(&self, text_len: usize) -> usize

Estimate the number of chunks for a given text length.

Useful for pre-allocating the output vector. The result may be approximate.
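A minimal sketch of how such an estimate might feed pre-allocation is shown below. `FixedSize`, its `max_bytes` field, and the ceiling-division heuristic are all assumptions made for illustration; the trait's default estimate is not documented here and may be computed differently.

```rust
/// Hypothetical fixed-size strategy, used only to illustrate the estimate.
pub struct FixedSize {
    pub max_bytes: usize,
}

impl FixedSize {
    /// Assumed heuristic: ceiling of text_len / max_bytes.
    pub fn estimate_chunks(&self, text_len: usize) -> usize {
        let step = self.max_bytes.max(1); // guard against division by zero
        (text_len + step - 1) / step
    }

    /// Produce byte ranges, reserving capacity from the estimate up front.
    /// Works on lengths only, so it ignores char boundaries.
    pub fn ranges(&self, text_len: usize) -> Vec<(usize, usize)> {
        let mut out = Vec::with_capacity(self.estimate_chunks(text_len));
        let step = self.max_bytes.max(1);
        let mut start = 0;
        while start < text_len {
            let end = (start + step).min(text_len);
            out.push((start, end));
            start = end;
        }
        out
    }
}
```

An over- or under-estimate affects only how much memory is reserved up front, not correctness, since the vector still grows on demand.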

Implementors