code-chunker-0.2.0 has been yanked.

code-chunker


AST-aware code chunking and late chunking for RAG.

Two primitives:

  • CodeChunker — split source code at function/class/impl boundaries via tree-sitter. Rust, Python, TypeScript/JavaScript, Go. Optional import-context injection. Pluggable size metric (bytes by default; bring your own tokenizer).
  • LateChunkingPooler — pool full-document token embeddings into per-chunk vectors (Günther et al. 2024). Bring your own boundaries from any source.

Successor to slabs 0.1.x. Dual-licensed under MIT or Apache-2.0.

Install

[dependencies]
code-chunker = { version = "0.2", features = ["code"] }

Features:

  • code: CodeChunker via tree-sitter (Rust, Python, TypeScript, Go)
  • serde: Serialize/Deserialize on Slab for storage backends

Code chunking

Splits source files at AST-defined boundaries — keeping functions, classes, and impl blocks atomic when they fit the size budget. Oversize nodes are split recursively at structural separators; unparseable leaves fall back to recursive text splitting.

use code_chunker::{Chunker, CodeChunker, CodeLanguage};

let chunker = CodeChunker::new(CodeLanguage::Rust, 1500, 0);
let slabs = chunker.chunk(source_code);

for slab in &slabs {
    println!("[{}..{}]\n{}\n", slab.start, slab.end, slab.text);
}
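The oversize fallback can be sketched without the crate. A minimal, stdlib-only recursive text splitter in the spirit described above; the separator list, byte budget, and `split_recursive` helper are illustrative, not the crate's actual internals:

```rust
// Minimal sketch of the fallback: recursively split text at coarser-to-finer
// separators until every piece fits the budget (sizes in bytes here).
fn split_recursive(text: &str, max: usize, seps: &[&str]) -> Vec<String> {
    if text.len() <= max {
        return vec![text.to_string()];
    }
    if let Some((sep, rest)) = seps.split_first() {
        return text
            .split(sep)
            .filter(|p| !p.is_empty())
            .flat_map(|p| split_recursive(p, max, rest))
            .collect();
    }
    // No separators left: hard-split on the byte budget (ASCII-safe only).
    text.as_bytes()
        .chunks(max)
        .map(|c| String::from_utf8_lossy(c).into_owned())
        .collect()
}

fn main() {
    let text = "fn a() {}\n\nfn b() {}";
    let parts = split_recursive(text, 10, &["\n\n", "\n"]);
    assert_eq!(parts, vec!["fn a() {}", "fn b() {}"]);
}
```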

Language can also be inferred from a file extension:

use code_chunker::{CodeChunker, CodeLanguage};
let lang = CodeLanguage::from_extension("py").unwrap();
let chunker = CodeChunker::new(lang, 1500, 0);

Import-context injection

Method chunks lose the surrounding use/import statements that name the types they reference. with_imports(true) walks the AST once, collects every import node, and prepends them to each chunk that doesn't already contain them. Retrievers see imports next to call sites instead of stranded at the file head.

use code_chunker::{CodeChunker, CodeLanguage};

let chunker = CodeChunker::new(CodeLanguage::Rust, 1500, 0)
    .with_imports(true);
let slabs = chunker.chunk(source_code);

Per-language import nodes:

  • Rust: use_declaration, extern_crate_declaration
  • Python: import_statement, import_from_statement
  • TypeScript: import_statement
  • Go: import_declaration
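The injection idea can be sketched with plain string handling (the crate walks the AST via tree-sitter instead; `inject_imports` here is a hypothetical helper, not part of the crate):

```rust
// Line-based sketch of import injection: collect `use` lines from the source
// and prepend them to each chunk that does not already contain them.
fn inject_imports(source: &str, chunks: Vec<String>) -> Vec<String> {
    let imports: Vec<&str> = source
        .lines()
        .filter(|l| l.trim_start().starts_with("use "))
        .collect();
    let header = imports.join("\n");
    chunks
        .into_iter()
        .map(|c| {
            if !header.is_empty() && !c.contains(&header) {
                format!("{header}\n\n{c}")
            } else {
                c
            }
        })
        .collect()
}

fn main() {
    let src = "use std::fmt;\n\nfn show() {}";
    let out = inject_imports(src, vec!["fn show() {}".to_string()]);
    assert_eq!(out[0], "use std::fmt;\n\nfn show() {}");
}
```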

Pluggable size metric

CodeChunker sizes chunks in bytes by default. To target a model's token context limit, plug in your tokenizer through the ChunkSizer trait:

use code_chunker::{ChunkSizer, CodeChunker, CodeLanguage};

struct TiktokenSizer { /* your tokenizer */ }

impl ChunkSizer for TiktokenSizer {
    fn size(&self, text: &str) -> usize {
        // Count tokens with your tokenizer; a whitespace split stands in here.
        text.split_whitespace().count()
    }
}

let chunker = CodeChunker::new(CodeLanguage::Rust, 8000, 0)
    .with_sizer(TiktokenSizer { /* ... */ });

The max_chunk_size argument is interpreted in whatever unit the sizer returns — bytes for the default ByteSizer, tokens for a tokenizer-backed sizer.

AST node types kept atomic

  • Rust: function_item, impl_item, struct_item, enum_item, trait_item, mod_item
  • Python: function_definition, class_definition
  • TypeScript: function_declaration, class_declaration, method_definition, interface_declaration, enum_declaration
  • Go: function_declaration, method_declaration, type_declaration

Late chunking

Traditional chunking embeds chunks independently, so cross-chunk references — "He became famous" loses the antecedent "Einstein" — degrade retrieval. Late chunking embeds the full document first so every token attends to the rest of the document, then pools token-level embeddings into per-chunk vectors. The result preserves document-wide context.

LateChunkingPooler is a primitive: it takes pre-computed token embeddings plus chunk boundaries and returns pooled chunk embeddings. Bring your own boundaries from any source.

use code_chunker::{LateChunkingPooler, Slab};

// 1. Chunk boundaries from any source — text-splitter, CodeChunker, regex, manual.
let chunks: Vec<Slab> = my_chunker(&document);

// 2. Embed the FULL document with a long-context model
//    (Jina v2/v3, nomic-embed-text, etc.) to get [n_tokens, dim] embeddings.
let token_embeddings: Vec<Vec<f32>> = my_model.embed_tokens(&document);

// 3. Pool token embeddings inside each chunk's byte span.
let pooler = LateChunkingPooler::new(384); // dim
let chunk_embeddings = pooler.pool(&token_embeddings, &chunks, document.len());

If you have exact token offsets from the tokenizer, use pool_with_offsets for precise boundary mapping instead of the default linear approximation.

Late chunking requires holding full-document token embeddings in memory and a model whose context window covers the document.
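Under the hood the pooling step is simple. A stdlib-only sketch of mean pooling under a linear byte-to-token approximation, following Günther et al. 2024; `pool_chunk` is illustrative, and the crate's exact mapping and pooling details may differ:

```rust
// Map a chunk's byte span [byte_start, byte_end) onto token indices
// proportionally, then average the token vectors in that range.
fn pool_chunk(
    token_embeddings: &[Vec<f32>],
    byte_start: usize,
    byte_end: usize,
    doc_len: usize,
) -> Vec<f32> {
    let n = token_embeddings.len();
    let t0 = byte_start * n / doc_len;
    let t1 = (byte_end * n / doc_len).max(t0 + 1).min(n);
    let dim = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dim];
    for tok in &token_embeddings[t0..t1] {
        for (p, v) in pooled.iter_mut().zip(tok) {
            *p += v;
        }
    }
    let count = (t1 - t0) as f32;
    pooled.iter_mut().for_each(|p| *p /= count);
    pooled
}

fn main() {
    // Four 2-dim token embeddings over an 8-byte document; a chunk spanning
    // bytes 0..4 maps to tokens 0..2 under the linear approximation.
    let toks = vec![
        vec![1.0, 0.0],
        vec![3.0, 2.0],
        vec![5.0, 4.0],
        vec![7.0, 6.0],
    ];
    let v = pool_chunk(&toks, 0, 4, 8);
    assert_eq!(v, vec![2.0, 1.0]);
}
```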

What this crate does not do

  • General-purpose text chunking. Use text-splitter (1.2M+ downloads) for fixed/sentence/recursive prose splitting. It has broader Unicode handling, token-count sizing, and is the de-facto Rust standard. Wrap its output in Slab if you want to feed it to LateChunkingPooler.
  • Format conversion (PDF, HTML, DOCX). Input is &str. Use deformat or pdf-extract upstream.
  • Embedding generation. LateChunkingPooler consumes pre-computed token embeddings. Bring your own model.
  • Vector store integration. Slab is the boundary; enable the serde feature and wire to qdrant-client, lancedb, sqlx, etc. yourself.
  • Cross-file analysis (LSP, type resolution, dependency graphs). This crate operates on one document at a time. See tree-sitter-stack-graphs and ast-grep for code-graph tools.
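On the first point: feeding another splitter's output to LateChunkingPooler means recovering byte spans, the shape Slab carries (per its start/end/text fields). A stdlib-only sketch; `spans` is a hypothetical helper that assumes each chunk occurs in the document in order:

```rust
// Recover (start, end) byte spans for substring chunks produced by any
// splitter, scanning forward so repeated chunks map to successive matches.
fn spans<'a>(document: &'a str, chunks: &[&'a str]) -> Vec<(usize, usize, &'a str)> {
    let mut cursor = 0;
    chunks
        .iter()
        .map(|c| {
            let start = cursor + document[cursor..].find(c).expect("chunk not found");
            cursor = start + c.len();
            (start, cursor, *c)
        })
        .collect()
}

fn main() {
    let doc = "fn a() {}\nfn b() {}";
    let out = spans(doc, &["fn a() {}", "fn b() {}"]);
    assert_eq!(out[0], (0, 9, "fn a() {}"));
    assert_eq!(out[1], (10, 19, "fn b() {}"));
}
```

From there, map each tuple into a Slab (or, with the serde feature, straight into your storage layer).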

Examples

cargo run --example code_chunking --features code
cargo run --example late_chunking

Migrating from slabs

This crate is the renamed and narrowed successor to slabs. The 0.1.x slabs releases bundled four general-text chunkers and a CLI; those are gone. Replace slabs = "0.1" with:

code-chunker = { version = "0.2", features = ["code"] }

Removed (use text-splitter for prose chunking):

  • FixedChunker, SentenceChunker, RecursiveChunker, SemanticChunker
  • LateChunker<C> wrapper → use LateChunkingPooler directly with Vec<Slab> from any source
  • ChunkCapacity (was unused by any constructor)
  • slabs CLI binary