Crate code_chunker

§code-chunker

AST-aware code chunking and late chunking for RAG pipelines.

§Two primitives

§`CodeChunker` — split source code at AST boundaries

The chunker parses the source with tree-sitter, then walks the parse tree and emits chunks aligned to function, class, impl, and module boundaries. A node that fits the configured size budget is kept intact; an oversize node is split recursively at structural separators. Supported languages are Rust, Python, TypeScript/JavaScript, and Go. CodeChunker is available behind the code feature.
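
The keep-or-descend rule described above can be sketched without tree-sitter. This is a minimal illustration, not the crate's implementation: `Node` and `split` are hypothetical names, and the tree is built by hand rather than produced by a parser.

```rust
/// Hypothetical stand-in for a parse-tree node: a byte range plus child nodes.
#[derive(Debug, Clone)]
struct Node {
    start: usize,
    end: usize,
    children: Vec<Node>,
}

/// Collect byte ranges that respect `max_bytes` where the tree allows it.
/// A node that fits the budget is emitted whole; an oversize node is
/// replaced by whatever its children produce. An oversize leaf is emitted
/// as-is, and text between sibling nodes is ignored in this sketch.
fn split(node: &Node, max_bytes: usize, out: &mut Vec<(usize, usize)>) {
    let size = node.end - node.start;
    if size <= max_bytes || node.children.is_empty() {
        out.push((node.start, node.end));
        return;
    }
    for child in &node.children {
        split(child, max_bytes, out);
    }
}
```

For a 26-byte root with two 12-byte function nodes as children, a budget of 16 yields the two function ranges, while a budget of 64 yields the root range unchanged. A production chunker would also need a fallback for oversize leaves, such as splitting on blank lines.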

§`LateChunkingPooler` — pool token embeddings into chunk embeddings

Late chunking (Günther et al. 2024, arXiv:2409.04701) embeds the full document first so every token attends to the rest of the document, then mean-pools token embeddings inside each chunk’s byte span. The result is a per-chunk embedding that carries document-wide context — pronouns, anaphora, and acronym definitions are no longer lost at chunk boundaries.

LateChunkingPooler is span-only: bring your own boundaries from any source — CodeChunker, text-splitter, regex, or hand-built Slabs.
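
The pooling step of late chunking can be sketched as follows. This is an illustration under stated assumptions, not how LateChunkingPooler works internally: the function name `mean_pool_by_span` is hypothetical, and so is the `token_spans` input, which assumes per-token byte offsets (for example from a tokenizer's offset mapping). The crate's own `pool` call takes the document length instead.

```rust
/// Mean-pool token embeddings whose byte spans overlap each chunk's byte span.
/// `token_spans[i]` is the half-open (start, end) byte range of token `i`.
/// A chunk that overlaps no token gets an all-zero vector.
fn mean_pool_by_span(
    token_embeddings: &[Vec<f32>],
    token_spans: &[(usize, usize)],
    chunk_spans: &[(usize, usize)],
    dim: usize,
) -> Vec<Vec<f32>> {
    chunk_spans
        .iter()
        .map(|&(c_start, c_end)| {
            let mut sum = vec![0.0f32; dim];
            let mut count = 0usize;
            for (emb, &(t_start, t_end)) in token_embeddings.iter().zip(token_spans) {
                // A token belongs to the chunk if the two half-open ranges overlap.
                if t_start < c_end && t_end > c_start {
                    for (s, v) in sum.iter_mut().zip(emb) {
                        *s += *v;
                    }
                    count += 1;
                }
            }
            if count > 0 {
                for s in sum.iter_mut() {
                    *s /= count as f32;
                }
            }
            sum
        })
        .collect()
}
```

Because the token embeddings come from a single forward pass over the whole document, each pooled vector reflects context from outside its own span. Only the averaging is restricted to the chunk.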

§What this crate does not do

General-purpose text chunking. Use text-splitter for fixed/sentence/recursive prose splitting; it’s the de-facto Rust standard with broader Unicode and tokenizer support.
Format conversion (PDF, HTML, DOCX). Input is &str. Use deformat or pdf-extract upstream.
Embedding generation. LateChunkingPooler consumes pre-computed token embeddings; bring your own long-context model (Jina v2/v3, nomic-embed-text, candle, ort).
Vector store integration. Slab is the boundary; enable the serde feature and wire to qdrant-client, lancedb, sqlx, etc. yourself.

§Quick start (code chunking)

use code_chunker::{Chunker, CodeChunker, CodeLanguage};

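// Placeholder input; any Rust source text works here.
let source_code = "fn main() { println!(\"hello\"); }";

let chunker = CodeChunker::new(CodeLanguage::Rust, 1500, 0);
let slabs = chunker.chunk(source_code);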

§Quick start (late chunking)

use code_chunker::{LateChunkingPooler, Slab};

// Bring your own chunk boundaries (text-splitter, CodeChunker, ...).
let chunks: Vec<Slab> = my_chunker(&document);

// Embed the full document with a long-context model.
let token_embeddings: Vec<Vec<f32>> = my_model.embed_tokens(&document);

// Pool token embeddings into per-chunk embeddings.
let pooler = LateChunkingPooler::new(384);
let chunk_embeddings = pooler.pool(&token_embeddings, &chunks, document.len());

Structs§

ByteSizer: Default sizer: returns the byte length of the chunk text.
LateChunkingPooler: Late chunking pooler: pools token embeddings into chunk embeddings.
Slab: A chunk of text with its position in the original document.

Enums§

Error: Errors that can occur during chunking.

Traits§

ChunkSizer: Measures the size of a chunk for size-budget comparisons.
Chunker: A chunking strategy: text in, Slabs out.

Functions§

compute_char_offsets: Compute character offsets for a batch of slabs from the same document.

Type Aliases§

Result: Result type for code-chunker operations.