Tree-sitter AST-merge chunker (semble flavor).
Port of ~/src/semble/src/semble/chunking/{core,chunking}.py.
Greedily walks AST children, packing siblings up to a 1500-char
target before recursing into oversized nodes. Falls back to
line-merge chunking for unsupported languages.
The algorithm differs from ripvec’s query-based chunker in
crate::chunk: rather than extracting specific definitions, it merges
AST subtrees until they fill the size budget. This produces fewer,
larger, more contextually coherent chunks for semantic embedding.
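The greedy merge described above can be sketched on a toy tree. This is a minimal illustration under stated assumptions, not the module's implementation: `Node`, `merge_children`, and `chunk_tree` are hypothetical names, spans are half-open byte ranges, and the real chunker walks tree-sitter nodes rather than this stand-in type.

```rust
/// Toy stand-in for a syntax-tree node (hypothetical; the real chunker
/// operates on tree-sitter nodes). Spans are half-open: `start..end`.
#[derive(Debug, Clone)]
struct Node {
    start: usize,
    end: usize,
    children: Vec<Node>,
}

impl Node {
    fn leaf(start: usize, end: usize) -> Self {
        Node { start, end, children: Vec::new() }
    }

    fn len(&self) -> usize {
        self.end - self.start
    }
}

/// Greedily pack adjacent siblings into spans of at most `budget` bytes.
/// A child that exceeds the budget and has children of its own is
/// recursed into; an oversized leaf becomes its own (oversized) span.
fn merge_children(node: &Node, budget: usize, out: &mut Vec<(usize, usize)>) {
    let mut current: Option<(usize, usize)> = None;
    for child in &node.children {
        if child.len() > budget && !child.children.is_empty() {
            // Flush whatever has been packed so far, then descend.
            if let Some(span) = current.take() {
                out.push(span);
            }
            merge_children(child, budget, out);
            continue;
        }
        current = match current.take() {
            // The child still fits: extend the open span.
            Some((s, _)) if child.end - s <= budget => Some((s, child.end)),
            // The child does not fit: close the open span, start a new one.
            Some(span) => {
                out.push(span);
                Some((child.start, child.end))
            }
            None => Some((child.start, child.end)),
        };
    }
    if let Some(span) = current {
        out.push(span);
    }
}

/// Convenience wrapper returning the merged spans for a whole tree.
fn chunk_tree(root: &Node, budget: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    merge_children(root, budget, &mut out);
    out
}
```

With a 30-byte budget, two small siblings are merged into one span, while a 60-byte sibling is split along its own children. The sketch ignores bytes that fall between a parent's start and its first child, which a real implementation would need to account for.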
§Boundary shape
ChunkBoundary reports byte offsets (matching tree-sitter’s
native units) and 1-based line numbers (matching CodeChunk’s
convention). Callers usually pass ChunkBoundary::content and the
line numbers into CodeChunk construction.
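To illustrate how byte offsets relate to 1-based line numbers, here is a small sketch. The helpers `line_of` and `line_range` are hypothetical and not part of this module; the sketch assumes a half-open byte span `start..end`, which may differ from the convention ChunkBoundary actually uses.

```rust
/// 1-based line number of the byte at `offset` in `source`
/// (hypothetical helper, not part of the module's API).
fn line_of(source: &str, offset: usize) -> usize {
    // Count the newlines strictly before `offset`, then add one.
    source.as_bytes()[..offset]
        .iter()
        .filter(|&&b| b == b'\n')
        .count()
        + 1
}

/// 1-based, inclusive line range covered by the half-open byte span
/// `start..end` (assumed convention for this sketch).
fn line_range(source: &str, start: usize, end: usize) -> (usize, usize) {
    // The last byte inside the span; an empty span collapses to `start`.
    let last = end.saturating_sub(1).max(start);
    (line_of(source, start), line_of(source, last))
}
```

A span that ends on a trailing newline is attributed to the line that newline terminates, so a chunk covering exactly one line reports the same start and end line.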
Structs§
- ChunkBoundary - Inclusive byte span and (1-based) line range for one chunk.
Constants§
- DEFAULT_DESIRED_CHUNK_CHARS - Default desired chunk length used by chunk_source when callers don’t specify one (mirrors _DESIRED_CHUNK_LENGTH_CHARS from chunking.py:10).
Functions§
- chunk_lines - Chunk source by lines, merging adjacent lines up to desired_length.
- chunk_source - Chunk source text via tree-sitter AST-merge or line-merge fallback.
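The line-merge fallback can be sketched as follows. This is an assumption-laden illustration, not the signature or behavior of chunk_lines itself: `merge_lines` is a hypothetical function, it returns owned strings rather than boundaries, and it measures length in bytes rather than characters.

```rust
/// Merge adjacent lines into chunks of at most `desired_length` bytes
/// (hypothetical sketch; a single line longer than the budget is kept
/// whole as its own chunk).
fn merge_lines(source: &str, desired_length: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    // `split_inclusive` keeps each line's trailing newline, so the
    // chunks concatenate back to the original source.
    for line in source.split_inclusive('\n') {
        if !current.is_empty() && current.len() + line.len() > desired_length {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(line);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Because every line is kept intact with its newline, joining the returned chunks reproduces the input exactly, which makes the result easy to check.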