# code-chunker

AST-aware code chunking and late chunking for RAG.
Two primitives:

- `CodeChunker` — split source code at function/class/impl boundaries via tree-sitter. Rust, Python, TypeScript/JavaScript, Go. Optional import-context injection. Pluggable size metric (bytes by default; bring your own tokenizer).
- `LateChunkingPooler` — pool full-document token embeddings into per-chunk vectors (Günther et al. 2024). Bring your own boundaries from any source.
Successor to `slabs` 0.1.x. Dual-licensed under MIT or Apache-2.0.
## Install

```toml
[dependencies]
code-chunker = { version = "0.2", features = ["code"] }
```
Features:

| Feature | What it enables |
|---|---|
| `code` | `CodeChunker` via tree-sitter (Rust, Python, TypeScript, Go) |
| `serde` | `Serialize`/`Deserialize` on `Slab` for storage backends |
## Code chunking
Splits source files at AST-defined boundaries — keeping functions, classes, and impl blocks atomic when they fit the size budget. Oversize nodes are split recursively at structural separators; unparseable leaves fall back to recursive text splitting.
```rust
use code_chunker::{CodeChunker, Language};

// Constructor arguments (language, max chunk size in bytes) per the docs below.
let chunker = CodeChunker::new(Language::Rust, 2000);
let slabs = chunker.chunk(&source);
for slab in &slabs {
    // Inspect each chunk (field names illustrative).
    println!("{}", slab.text);
}
```
Language can also be inferred from a file extension:
```rust
use code_chunker::{CodeChunker, Language};

let lang = Language::from_extension("rs").unwrap();
let chunker = CodeChunker::new(lang, 2000);
```
### Import-context injection
Method chunks lose the surrounding `use`/`import` statements that name the
types they reference. `with_imports(true)` walks the AST once, collects every
import node, and prepends them to each chunk that doesn't already contain
them. Retrievers see imports next to call sites instead of stranded at the
file head.
```rust
use code_chunker::{CodeChunker, Language};

let chunker = CodeChunker::new(Language::Python, 2000)
    .with_imports(true);
let slabs = chunker.chunk(&source);
```
Per-language import nodes:
| Language | Nodes treated as imports |
|---|---|
| Rust | `use_declaration`, `extern_crate_declaration` |
| Python | `import_statement`, `import_from_statement` |
| TypeScript | `import_statement` |
| Go | `import_declaration` |
### Pluggable size metric
`CodeChunker` sizes chunks in bytes by default. To target a model's token
context limit, plug in your tokenizer through the `ChunkSizer` trait:
```rust
use code_chunker::{ChunkSizer, CodeChunker, Language};

// MyTokenizerSizer is a hypothetical user type implementing ChunkSizer.
let chunker = CodeChunker::new(Language::Rust, 512)
    .with_sizer(MyTokenizerSizer::new());
```
The `max_chunk_size` argument is interpreted in whatever unit the sizer
returns — bytes for the default `ByteSizer`, tokens for a tokenizer-backed
sizer.
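To illustrate what a custom sizer looks like, here is a minimal sketch using a local stand-in for the trait (the crate's actual `ChunkSizer` method name and signature may differ): a sizer that counts whitespace-separated tokens, so the chunk budget becomes a rough token count rather than a byte count.

```rust
/// Local stand-in for the crate's ChunkSizer trait — method name
/// and signature are assumptions for illustration.
trait ChunkSizer {
    fn size(&self, text: &str) -> usize;
}

/// Counts whitespace-separated tokens instead of bytes, so a
/// max_chunk_size of e.g. 512 means roughly 512 tokens.
struct WhitespaceSizer;

impl ChunkSizer for WhitespaceSizer {
    fn size(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

fn main() {
    let sizer = WhitespaceSizer;
    // 5 whitespace-separated tokens, versus 21 bytes.
    println!("{}", sizer.size("fn add(a: i32) -> i32")); // → 5
}
```

A real implementation would wrap an actual tokenizer (e.g. a BPE vocabulary) behind the same interface; the point is only that the unit of `size` defines the unit of the budget.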
### AST node types kept atomic
| Language | Block types |
|---|---|
| Rust | `function_item`, `impl_item`, `struct_item`, `enum_item`, `trait_item`, `mod_item` |
| Python | `function_definition`, `class_definition` |
| TypeScript | `function_declaration`, `class_declaration`, `method_definition`, `interface_declaration`, `enum_declaration` |
| Go | `function_declaration`, `method_declaration`, `type_declaration` |
## Late chunking
Traditional chunking embeds chunks independently, so cross-chunk references — "He became famous" loses the antecedent "Einstein" — degrade retrieval. Late chunking embeds the full document first so every token attends to the rest of the document, then pools token-level embeddings into per-chunk vectors. The result preserves document-wide context.
`LateChunkingPooler` is a primitive: it takes pre-computed token embeddings
plus chunk boundaries and returns pooled chunk embeddings. Bring your own
boundaries from any source.
```rust
use code_chunker::{LateChunkingPooler, Slab};

// `my_chunker` and `my_model` are placeholders for your own components.

// 1. Chunk boundaries from any source — text-splitter, CodeChunker, regex, manual.
let chunks: Vec<Slab> = my_chunker.chunk(&document);

// 2. Embed the FULL document with a long-context model
//    (Jina v2/v3, nomic-embed-text, etc.) to get [n_tokens, dim] embeddings.
let token_embeddings: Vec<Vec<f32>> = my_model.embed_tokens(&document);

// 3. Pool token embeddings inside each chunk's byte span.
let pooler = LateChunkingPooler::new(768); // dim
let chunk_embeddings = pooler.pool(&token_embeddings, &chunks);
```
If you have exact token offsets from the tokenizer, use `pool_with_offsets`
for precise boundary mapping instead of the default linear approximation.
Late chunking requires holding full-document token embeddings in memory and a model whose context window covers the document.
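To make the pooling step concrete, here is a self-contained sketch of the two operations involved: mapping a chunk's byte span to a token range via a linear approximation (tokens assumed evenly spread over the document), then mean-pooling the token vectors in that range. All names here are illustrative, not the crate's API.

```rust
/// Approximate a byte offset as a token index by assuming tokens are
/// spread evenly across the document. Exact tokenizer offsets are
/// more precise; this mirrors the fallback approach.
fn byte_to_token(byte: usize, doc_len: usize, n_tokens: usize) -> usize {
    if doc_len == 0 {
        return 0;
    }
    (byte * n_tokens / doc_len).min(n_tokens)
}

/// Mean-pool token embeddings over the half-open range [start, end).
fn mean_pool(tokens: &[Vec<f32>], start: usize, end: usize) -> Vec<f32> {
    let dim = tokens[0].len();
    let mut out = vec![0.0; dim];
    let span = &tokens[start..end];
    if span.is_empty() {
        return out;
    }
    for tok in span {
        for (o, v) in out.iter_mut().zip(tok) {
            *o += v;
        }
    }
    for o in out.iter_mut() {
        *o /= span.len() as f32;
    }
    out
}

fn main() {
    // 4 tokens of dim 2; the chunk covers bytes 0..4 of an 8-byte document,
    // which the linear approximation maps to tokens 0..2.
    let toks = vec![vec![1.0, 0.0], vec![3.0, 0.0], vec![0.0, 2.0], vec![0.0, 4.0]];
    let (s, e) = (byte_to_token(0, 8, 4), byte_to_token(4, 8, 4));
    println!("{:?}", mean_pool(&toks, s, e)); // → [2.0, 0.0]
}
```

The key property of late chunking is that each of those token vectors was produced with attention over the whole document, so the pooled chunk vector carries document-wide context even though the pooling itself is a simple mean.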
## What this crate does not do
- **General-purpose text chunking.** Use `text-splitter` (1.2M+ downloads) for fixed/sentence/recursive prose splitting. It has broader Unicode handling, token-count sizing, and is the de facto Rust standard. Wrap its output in `Slab` if you want to feed it to `LateChunkingPooler`.
- **Format conversion (PDF, HTML, DOCX).** Input is `&str`. Use `deformat` or `pdf-extract` upstream.
- **Embedding generation.** `LateChunkingPooler` consumes pre-computed token embeddings. Bring your own model.
- **Vector store integration.** `Slab` is the boundary; enable the `serde` feature and wire to `qdrant-client`, `lancedb`, `sqlx`, etc. yourself.
- **Cross-file analysis (LSP, type resolution, dependency graphs).** This crate operates on one document at a time. See `tree-sitter-stack-graphs` and `ast-grep` for code-graph tools.
## Examples
## Migrating from slabs
This crate is the renamed and narrowed successor to `slabs`.
The 0.1.x `slabs` releases bundled four general-text chunkers and a CLI;
those are gone. Replace `slabs = "0.1"` with:

```toml
[dependencies]
code-chunker = { version = "0.2", features = ["code"] }
```
Removed (use `text-splitter` for prose chunking):

- `FixedChunker`, `SentenceChunker`, `RecursiveChunker`, `SemanticChunker`
- `LateChunker<C>` wrapper → use `LateChunkingPooler` directly with `Vec<Slab>` from any source
- `ChunkCapacity` (was unused by any constructor)
- the `slabs` CLI binary