Module chunking

Tree-sitter AST-merge chunker (semble flavor).

Port of ~/src/semble/src/semble/chunking/{core,chunking}.py. Greedily walks AST children, packing siblings up to a 1500-char target before recursing into oversized nodes. Falls back to line-merge chunking for unsupported languages.

The algorithm differs from ripvec’s query-based chunker in crate::chunk: rather than extracting specific definitions, it merges AST subtrees until they fill the size budget. This produces fewer, larger, more contextually coherent chunks for semantic embedding.
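The greedy packing described above can be sketched without tree-sitter. This is a minimal illustration only, not the module's implementation: `ToyNode` and `merge_children` are hypothetical names, spans are half-open byte ranges, and the budget is measured in bytes rather than chars.

```rust
/// Hypothetical stand-in for a tree-sitter node: a byte span plus children.
#[derive(Debug, Clone)]
struct ToyNode {
    start: usize,
    end: usize, // exclusive
    children: Vec<ToyNode>,
}

impl ToyNode {
    fn leaf(start: usize, end: usize) -> Self {
        ToyNode { start, end, children: Vec::new() }
    }

    fn len(&self) -> usize {
        self.end - self.start
    }
}

/// Greedily pack sibling spans up to `budget` bytes, recursing into
/// children that are too large to fit in a single chunk.
fn merge_children(node: &ToyNode, budget: usize, out: &mut Vec<(usize, usize)>) {
    // Span of the siblings accumulated so far, if any.
    let mut current: Option<(usize, usize)> = None;

    for child in &node.children {
        if child.len() > budget {
            // Flush the pending chunk, then split the oversized child.
            if let Some(span) = current.take() {
                out.push(span);
            }
            if child.children.is_empty() {
                // A leaf cannot be split further; emit it as-is.
                out.push((child.start, child.end));
            } else {
                merge_children(child, budget, out);
            }
            continue;
        }

        current = match current {
            // Absorb the sibling while the merged span stays within budget.
            Some((s, _)) if child.end - s <= budget => Some((s, child.end)),
            // Otherwise flush and start a new chunk at this sibling.
            Some(span) => {
                out.push(span);
                Some((child.start, child.end))
            }
            None => Some((child.start, child.end)),
        };
    }

    if let Some(span) = current {
        out.push(span);
    }
}
```

With a budget of 30 bytes, two small leading siblings are merged into one span, while a 55-byte sibling is flushed around and split along its own children.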

§Boundary shape

ChunkBoundary reports byte offsets (matching tree-sitter’s native units) and 1-based line numbers (matching CodeChunk’s convention). Callers usually pass ChunkBoundary::content and the line numbers into CodeChunk construction.
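To make the two unit systems concrete, the sketch below derives 1-based line numbers from a byte span. `ToyBoundary` and `boundary_for` are hypothetical names, and the end offset is treated as exclusive (tree-sitter's convention); the real ChunkBoundary fields and span convention may differ.

```rust
/// Hypothetical mirror of the boundary shape: byte offsets plus 1-based lines.
#[derive(Debug, PartialEq)]
struct ToyBoundary {
    start_byte: usize,
    end_byte: usize, // exclusive in this sketch (an assumption)
    start_line: usize,
    end_line: usize,
}

/// Build a boundary for `source[start_byte..end_byte]`.
/// Panics if the offsets are out of range for `source`.
fn boundary_for(source: &str, start_byte: usize, end_byte: usize) -> ToyBoundary {
    let bytes = source.as_bytes();
    // 1-based line of an offset = number of newlines before it, plus one.
    let line_at = |offset: usize| {
        bytes[..offset].iter().filter(|&&b| b == b'\n').count() + 1
    };
    // The end line is the line holding the last byte of the span.
    let last = end_byte.saturating_sub(1).max(start_byte);
    ToyBoundary {
        start_byte,
        end_byte,
        start_line: line_at(start_byte),
        end_line: line_at(last),
    }
}
```

Slicing the source with the byte offsets recovers the chunk text, and the line numbers can be stored alongside it for display.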

Structs§

ChunkBoundary
Inclusive byte span and (1-based) line range for one chunk.

Constants§

DEFAULT_DESIRED_CHUNK_CHARS
Default desired chunk length used by chunk_source when callers don’t specify one (mirrors _DESIRED_CHUNK_LENGTH_CHARS from chunking.py:10).

Functions§

chunk_lines
Chunk source by lines, merging adjacent lines up to desired_length.
chunk_source
Chunk source text via tree-sitter AST-merge or line-merge fallback.
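As a rough illustration of the line-merge fallback, adjacent lines can be accumulated until the next line would exceed the desired length. This is a sketch under assumptions, not the signature or behaviour of chunk_lines: `merge_lines` is a hypothetical name, it returns plain tuples rather than ChunkBoundary values, and it measures length in bytes rather than chars.

```rust
/// Merge adjacent lines into chunks of at most `desired_length` bytes where
/// possible. Returns (start_line, end_line, text) with 1-based, inclusive lines.
/// A single line longer than the budget becomes its own chunk.
fn merge_lines(source: &str, desired_length: usize) -> Vec<(usize, usize, String)> {
    let mut chunks = Vec::new();
    let mut buf = String::new();
    let mut start_line = 1;
    let mut last_line = 0;

    // split_inclusive keeps the trailing '\n' so chunk text is lossless.
    for (idx, line) in source.split_inclusive('\n').enumerate() {
        let line_no = idx + 1;
        // Flush before this line would push a non-empty buffer past the budget.
        if !buf.is_empty() && buf.len() + line.len() > desired_length {
            chunks.push((start_line, last_line, std::mem::take(&mut buf)));
            start_line = line_no;
        }
        buf.push_str(line);
        last_line = line_no;
    }

    // Emit whatever remains after the final line.
    if !buf.is_empty() {
        chunks.push((start_line, last_line, buf));
    }
    chunks
}
```

Concatenating the returned chunk texts in order reproduces the original source, which is a useful property to check when adapting this idea.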