CST-aware code chunking.
Splits source files into semantically meaningful chunks using the concrete syntax tree (CST) produced by ast-grep/tree-sitter. The algorithm:
- If a CST node fits within max_chunk_size (non-whitespace chars) -> emit it as a chunk.
- If too large -> recurse into named children, preferring semantic boundaries (function/class/impl definitions) as split points.
- Adjacent small siblings are merged greedily, but only when they share the same semantic category (e.g., imports with imports, declarations with declarations).
- When a chunk comes from inside a function/class, a truncated signature header is prepended so the chunk is self-contextualizing for embeddings.
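The split-and-merge steps above can be sketched on a toy tree. This is a minimal illustration, not the crate's implementation: `Node`, `non_ws_len`, `category`, and `chunk` are hypothetical names, the kind-to-category mapping is assumed, and the sketch omits the signature header and drops any text that sits between children.

```rust
/// Hypothetical stand-in for a CST node; the real chunker walks
/// ast-grep/tree-sitter nodes instead.
#[derive(Debug, Clone)]
struct Node {
    kind: &'static str,  // e.g. "use", "fn", "impl"
    text: String,        // full source text of the node
    children: Vec<Node>, // named children only
}

/// Size measure from the description: count of non-whitespace characters.
fn non_ws_len(s: &str) -> usize {
    s.chars().filter(|c| !c.is_whitespace()).count()
}

/// Coarse semantic category deciding which siblings may merge (assumed mapping).
fn category(kind: &str) -> &'static str {
    match kind {
        "use" => "import",
        "fn" | "struct" | "impl" => "declaration",
        _ => "other",
    }
}

/// Emit a node whole if it fits (or cannot be split); otherwise recurse into
/// its children and greedily merge adjacent pieces of the same category.
fn chunk(node: &Node, max: usize) -> Vec<(&'static str, String)> {
    if non_ws_len(&node.text) <= max || node.children.is_empty() {
        return vec![(category(node.kind), node.text.clone())];
    }
    let mut out: Vec<(&'static str, String)> = Vec::new();
    for child in &node.children {
        for (cat, text) in chunk(child, max) {
            // Merge only when the category matches and the result still fits.
            let can_merge = match out.last() {
                Some((last_cat, last_text)) => {
                    *last_cat == cat && non_ws_len(last_text) + non_ws_len(&text) <= max
                }
                None => false,
            };
            if can_merge {
                let last = out.last_mut().unwrap();
                last.1.push('\n');
                last.1.push_str(&text);
            } else {
                out.push((cat, text));
            }
        }
    }
    out
}
```

With a budget of 12 non-whitespace characters, two short `use` items would merge into one chunk, while two 7-character functions would stay separate because their combined size exceeds the budget.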
Each chunk records its parent symbol (resolved by line-range containment).
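Line-range containment can be sketched as follows. The `Symbol` type and `parent_symbol` function are hypothetical, and picking the innermost (shortest) containing symbol is an assumption; the description only states that containment is used.

```rust
/// Hypothetical symbol record: a name plus an inclusive line range.
struct Symbol {
    name: &'static str,
    start: usize,
    end: usize,
}

/// Return the innermost symbol whose line range fully contains the chunk's
/// range. "Innermost" is an assumed tie-break, not taken from the crate.
fn parent_symbol<'a>(symbols: &'a [Symbol], start: usize, end: usize) -> Option<&'a Symbol> {
    symbols
        .iter()
        .filter(|s| s.start <= start && end <= s.end)
        .min_by_key(|s| s.end - s.start)
}
```

Under this sketch, a chunk that spans exactly one symbol's range would resolve to that symbol itself; a real implementation may treat that case differently.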
Structs
- ChunkConfig - Configuration for the chunker.
- CodeChunk - A code chunk produced by the CST-aware chunker.
Functions
- chunk_file - Chunk a file using its CST tree.