Module chunker

CST-aware code chunking.

Splits source files into semantically meaningful chunks using the concrete syntax tree (CST) produced by ast-grep/tree-sitter. The algorithm:

  1. If a CST node fits within max_chunk_size (measured in non-whitespace characters), emit it as a single chunk.
  2. If the node is too large, recurse into its named children, preferring semantic boundaries (function/class/impl definitions) as split points.
  3. Adjacent small siblings are merged greedily, but only when they share the same semantic category (e.g., imports with imports, declarations with declarations).
  4. When a chunk comes from inside a function/class, a truncated signature header is prepended so the chunk is self-contextualizing for embeddings.
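
Steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration over a toy tree, not the module's implementation: `Node`, `Category`, `size`, `leaf`, and `chunk` are hypothetical names, the real chunker walks ast-grep/tree-sitter nodes rather than owned strings, and the signature header from step 4 is omitted.

```rust
/// Hypothetical semantic categories; the categories the real chunker
/// uses are not listed in these docs.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Category {
    Import,
    Declaration,
    Other,
}

/// Toy stand-in for a CST node (assumption: the real code uses tree-sitter nodes).
struct Node {
    text: String,
    category: Category,
    children: Vec<Node>,
}

/// Convenience constructor for a childless node.
#[allow(dead_code)]
fn leaf(text: &str, category: Category) -> Node {
    Node { text: text.to_string(), category, children: Vec::new() }
}

/// Size metric from step 1: the number of non-whitespace characters.
fn size(text: &str) -> usize {
    text.chars().filter(|c| !c.is_whitespace()).count()
}

/// Recursively split `node` into chunks of at most `max` non-whitespace chars.
fn chunk(node: &Node, max: usize, out: &mut Vec<String>) {
    // Step 1: the node fits, so emit it whole. An oversized leaf is also
    // emitted as-is, because this sketch has no finer-grained fallback.
    if size(&node.text) <= max || node.children.is_empty() {
        out.push(node.text.clone());
        return;
    }
    // Steps 2 and 3: walk the children, greedily merging adjacent small
    // siblings that share a category.
    let mut pending: Option<(Category, String)> = None;
    for child in &node.children {
        if size(&child.text) > max {
            // Flush whatever was being merged, then recurse into the big child.
            if let Some((_, text)) = pending.take() {
                out.push(text);
            }
            chunk(child, max, out);
            continue;
        }
        match pending.take() {
            // Same category and the merged chunk still fits: extend it.
            Some((cat, text))
                if cat == child.category && size(&text) + size(&child.text) <= max =>
            {
                pending = Some((cat, format!("{}\n{}", text, child.text)));
            }
            // Otherwise flush the previous group and start a new one.
            other => {
                if let Some((_, text)) = other {
                    out.push(text);
                }
                pending = Some((child.category, child.text.clone()));
            }
        }
    }
    if let Some((_, text)) = pending {
        out.push(text);
    }
}
```

With a limit of 12, two five-character imports merge into one chunk, while two seven-character function declarations stay separate because their combined size exceeds the limit.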

Each chunk records its parent symbol (resolved by line-range containment).
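
Line-range containment can be illustrated with a small lookup. The sketch below assumes a hypothetical `Symbol` record with inclusive line ranges, and it assumes that when several symbols contain a chunk the innermost (shortest) one wins; the docs do not state how the module breaks such ties.

```rust
/// Hypothetical symbol record; the real type is not shown in these docs.
struct Symbol {
    name: String,
    start_line: usize, // inclusive
    end_line: usize,   // inclusive
}

/// Return the name of the innermost symbol whose line range fully
/// contains the chunk's range, or `None` for top-level chunks.
fn parent_symbol<'a>(
    symbols: &'a [Symbol],
    chunk_start: usize,
    chunk_end: usize,
) -> Option<&'a str> {
    symbols
        .iter()
        // Containment test: the symbol spans the whole chunk.
        .filter(|s| s.start_line <= chunk_start && chunk_end <= s.end_line)
        // Assumption: prefer the tightest enclosing symbol.
        .min_by_key(|s| s.end_line - s.start_line)
        .map(|s| s.name.as_str())
}
```

For a chunk nested inside a method of a class, this returns the method rather than the class, which is usually the more useful context to attach to an embedding.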

Structs

ChunkConfig
Configuration for the chunker.
CodeChunk
A code chunk produced by the CST-aware chunker.

Functions

chunk_file
Chunk a file using its CST.