§code-chunker
AST-aware code chunking and late chunking for RAG pipelines.
§Two primitives
§CodeChunker — split source code at AST boundaries
Tree-sitter walks the parse tree and produces chunks aligned to
function, class, impl, and module boundaries. When a node fits the
configured size budget it is kept intact; oversize nodes are split
recursively at structural separators. Supports Rust, Python,
TypeScript/JavaScript, and Go (behind the code feature).
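The keep-or-recurse rule described above can be sketched in a few lines. This is a minimal illustration, not the crate's implementation: `Node` and `split_node` are hypothetical names, the tree is hand-built rather than produced by tree-sitter, sizes are measured in bytes, and text that falls between sibling nodes is ignored for brevity.

```rust
/// Hypothetical stand-in for a parse-tree node (not a tree-sitter type).
#[derive(Debug, Clone)]
struct Node {
    start: usize,        // byte offset, inclusive
    end: usize,          // byte offset, exclusive
    children: Vec<Node>, // structural children, in source order
}

/// Collect chunk byte ranges into `out`.
/// A node that fits `max_bytes` is kept intact; an oversize node is split by
/// recursing into its children. An oversize leaf has no finer structure to
/// split at, so this sketch emits it whole.
fn split_node(node: &Node, max_bytes: usize, out: &mut Vec<(usize, usize)>) {
    let size = node.end - node.start;
    if size <= max_bytes || node.children.is_empty() {
        out.push((node.start, node.end));
        return;
    }
    for child in &node.children {
        split_node(child, max_bytes, out);
    }
}
```

With a 100-byte root holding a 30-byte child and a 70-byte child (itself made of a 30-byte and a 40-byte node), a budget of 40 yields three chunks, while a budget of 100 keeps the root as a single chunk.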
§LateChunkingPooler — pool token embeddings into chunk embeddings
Late chunking (Günther et al. 2024, arXiv:2409.04701) embeds the full document first so every token attends to the rest of the document, then mean-pools token embeddings inside each chunk’s byte span. The result is a per-chunk embedding that carries document-wide context — pronouns, anaphora, and acronym definitions are no longer lost at chunk boundaries.
LateChunkingPooler is span-only: bring your own boundaries from any
source — CodeChunker, text-splitter, regex, or hand-built Slabs.
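To make the pooling step concrete, here is a minimal sketch of mean-pooling token embeddings over chunk byte spans. It is an illustration under stated assumptions, not the crate's `LateChunkingPooler`: `mean_pool` is a hypothetical helper, it assumes the tokenizer reports a byte range for every token, and it assigns a token to a chunk when the token's start offset lies inside the chunk span.

```rust
/// Mean-pool token embeddings into one embedding per chunk.
///
/// Assumptions (not taken from the crate): `token_spans[i]` is the byte range
/// of token `i`, `chunk_spans` are half-open byte ranges, and every embedding
/// has length `dim`.
fn mean_pool(
    token_embeddings: &[Vec<f32>],
    token_spans: &[(usize, usize)],
    chunk_spans: &[(usize, usize)],
    dim: usize,
) -> Vec<Vec<f32>> {
    chunk_spans
        .iter()
        .map(|&(c_start, c_end)| {
            let mut sum = vec![0.0f32; dim];
            let mut count = 0usize;
            for (emb, &(t_start, _)) in token_embeddings.iter().zip(token_spans) {
                // A token belongs to the chunk if it starts inside the span.
                if t_start >= c_start && t_start < c_end {
                    for (s, v) in sum.iter_mut().zip(emb) {
                        *s += *v;
                    }
                    count += 1;
                }
            }
            // Divide by the token count; a chunk with no tokens stays all zeros.
            if count > 0 {
                for s in sum.iter_mut() {
                    *s /= count as f32;
                }
            }
            sum
        })
        .collect()
}
```

Because the token embeddings come from a single forward pass over the whole document, each pooled vector already reflects context from outside its chunk; the pooling itself is a plain average.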
§What this crate does not do
- General-purpose text chunking. Use text-splitter for fixed/sentence/recursive prose splitting; it’s the de-facto Rust standard with broader Unicode and tokenizer support.
- Format conversion (PDF, HTML, DOCX). Input is &str. Use deformat or pdf-extract upstream.
- Embedding generation. LateChunkingPooler consumes pre-computed token embeddings; bring your own long-context model (Jina v2/v3, nomic-embed-text, candle, ort).
- Vector store integration. Slab is the boundary; enable the serde feature and wire to qdrant-client, lancedb, sqlx, etc. yourself.
§Quick start (code chunking)
use code_chunker::{Chunker, CodeChunker, CodeLanguage};
let chunker = CodeChunker::new(CodeLanguage::Rust, 1500, 0);
let slabs = chunker.chunk(source_code);
§Quick start (late chunking)
use code_chunker::{LateChunkingPooler, Slab};
// Bring your own chunk boundaries (text-splitter, CodeChunker, ...).
let chunks: Vec<Slab> = my_chunker(&document);
// Embed the full document with a long-context model.
let token_embeddings: Vec<Vec<f32>> = my_model.embed_tokens(&document);
// Pool token embeddings into per-chunk embeddings.
let pooler = LateChunkingPooler::new(384);
let chunk_embeddings = pooler.pool(&token_embeddings, &chunks, document.len());
Structs§
- ByteSizer - Default sizer: returns the byte length of the chunk text.
- LateChunkingPooler - Late chunking pooler: pools token embeddings into chunk embeddings.
- Slab - A chunk of text with its position in the original document.
Enums§
- Error - Errors that can occur during chunking.
Traits§
- ChunkSizer - Measures the size of a chunk for size-budget comparisons.
- Chunker - A chunking strategy: text in, Slabs out.
Functions§
- compute_char_offsets - Compute character offsets for a batch of slabs from the same document.
Type Aliases§
- Result - Result type for code-chunker operations.