Token-aware chunking utilities for bodies that exceed the embedding window. Semantic chunking for embedding inputs (Markdown-aware, 512-token limit).
Splits bodies using text_splitter::MarkdownSplitter with overlap so
multi-chunk memories preserve context across chunk boundaries.
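To make the role of overlap concrete, the following is a minimal standard-library sketch of overlapping windows over a string. It is not the module's implementation and does not use text_splitter::MarkdownSplitter; the function name `overlapping_windows`, its byte-based `size` and `overlap` parameters, and the boundary-snapping policy are all assumptions made for illustration.

```rust
/// Hypothetical helper (not part of this module): returns `(start, end)` byte
/// offsets of windows of at most `size` bytes, where each window begins
/// roughly `overlap` bytes before the previous window ends.
fn overlapping_windows(body: &str, size: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");
    let mut windows = Vec::new();
    let mut start = 0;
    while start < body.len() {
        // Tentative end, snapped backward so we never cut a UTF-8 character.
        let mut end = (start + size).min(body.len());
        while !body.is_char_boundary(end) {
            end -= 1;
        }
        // If a single character is wider than `size`, take it whole.
        if end <= start {
            end = start + 1;
            while !body.is_char_boundary(end) {
                end += 1;
            }
        }
        windows.push((start, end));
        if end == body.len() {
            break;
        }
        // Step back by `overlap` bytes, again snapping to a char boundary.
        let mut next = end.saturating_sub(overlap);
        while !body.is_char_boundary(next) {
            next -= 1;
        }
        // Always make forward progress, even if the overlap cannot be honored.
        start = if next > start { next } else { end };
    }
    windows
}
```

With `size = 4` and `overlap = 1`, the ten-byte string `"abcdefghij"` yields the windows `(0, 4)`, `(3, 7)` and `(6, 10)`: each window repeats the last byte of its predecessor, which is the context-preserving effect the overlap is meant to provide.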
Structs§
- BodyBudget - Budget assessment of a body for the auto-split and dry-run paths (GAP-SG-05/06).
- Chunk - A contiguous slice of a body string identified by byte offsets.
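As an illustration of what "identified by byte offsets" can look like in practice, here is a small sketch. The type `ChunkSketch`, its field names `start` and `end`, and the helper `chunk_text_sketch` are hypothetical; the real Chunk fields and the real chunk_text signature are not shown on this page and may differ.

```rust
/// Hypothetical stand-in for `Chunk`; the field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ChunkSketch {
    /// Inclusive start byte offset into the body.
    start: usize,
    /// Exclusive end byte offset into the body.
    end: usize,
}

/// Returns the slice of `body` covered by `chunk`, or `None` when the offsets
/// are out of range, reversed, or fall inside a multi-byte UTF-8 character.
fn chunk_text_sketch<'a>(body: &'a str, chunk: &ChunkSketch) -> Option<&'a str> {
    body.get(chunk.start..chunk.end)
}
```

Using `str::get` rather than direct indexing is a design choice in this sketch: direct indexing panics on an offset that is not a character boundary, whereas `get` lets the caller decide how to handle a bad offset.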
Constants§
- CHUNK_OVERLAP_CHARS - Character overlap between consecutive chunks to preserve cross-boundary context.
- CHUNK_SIZE_CHARS - Maximum character length of a single chunk (derived from token limit × chars-per-token).
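The arithmetic behind these two constants can be sketched as follows. Only the 512-token limit comes from the module description; the ratio of 4 characters per token, the overlap of 200 characters, and every name ending in `SKETCH` or `_sketch` are assumptions for illustration, not the crate's actual values or API.

```rust
/// Token limit stated in the module description.
const TOKEN_LIMIT: usize = 512;
/// ASSUMPTION: a rough chars-per-token ratio; the real value is not shown here.
const ASSUMED_CHARS_PER_TOKEN: usize = 4;
/// Chunk size derived as token limit × chars-per-token (2048 under the assumption).
const CHUNK_SIZE_CHARS_SKETCH: usize = TOKEN_LIMIT * ASSUMED_CHARS_PER_TOKEN;
/// ASSUMPTION: an illustrative overlap; the real constant may differ.
const CHUNK_OVERLAP_CHARS_SKETCH: usize = 200;

/// Sketch of a size check: true when `body` has more characters than one chunk holds.
/// ASSUMPTION: length is measured in characters rather than bytes.
fn needs_chunking_sketch(body: &str) -> bool {
    body.chars().count() > CHUNK_SIZE_CHARS_SKETCH
}

/// Number of fixed-size windows needed to cover `len` characters when each
/// window advances by `size - overlap`. This is plain sliding-window
/// arithmetic, not the hierarchical splitter the module uses.
fn window_count_sketch(len: usize) -> usize {
    if len == 0 {
        return 0;
    }
    if len <= CHUNK_SIZE_CHARS_SKETCH {
        return 1;
    }
    let stride = CHUNK_SIZE_CHARS_SKETCH - CHUNK_OVERLAP_CHARS_SKETCH;
    // One full window, plus enough strides (rounded up) to cover the remainder.
    1 + (len - CHUNK_SIZE_CHARS_SKETCH + stride - 1) / stride
}
```

Under these assumed values, a body of 2048 characters fits in one window, while 2049 characters already needs two, because the second window re-reads the last 200 characters of the first.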
Functions§
- aggregate_embeddings - Computes the mean of chunk_embeddings and L2-normalizes the result.
- assess_body_budget - Assesses body against the single-memory budgets (GAP-SG-05/06).
- chunk_text - Returns the string slice of body described by chunk's byte offsets.
- estimate_chunk_count - Returns the number of embedding chunks body splits into, using the same hierarchical splitter the persistence path uses (GAP-SG-05).
- needs_chunking - Returns true when body exceeds CHUNK_SIZE_CHARS and must be split.
- split_body_by_sections - Splits a large body into sub-memory partitions at Markdown section boundaries (ATX headers), keeping each partition below the byte, chunk and token budgets (GAP-SG-04/07).
- split_into_chunks - Splits body into overlapping Chunks using a character-based heuristic.
- split_into_chunks_by_token_offsets - Splits body into Chunks using pre-computed token byte-offsets.
- split_into_chunks_hierarchical - Splits body into chunks using MarkdownSplitter with a real tokenizer. Respects Markdown semantic boundaries (H1-H6, paragraphs, blocks). For plain text without Markdown markers, falls back to paragraph and sentence breaks.
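The mean-then-normalize step described for aggregate_embeddings can be sketched with the standard library alone. The function name `aggregate_embeddings_sketch`, the `&[Vec<f32>]` input type, and the `Option` return for empty or mismatched input are assumptions; the real function's signature and error handling are not shown on this page.

```rust
/// Hypothetical sketch: element-wise mean of the chunk vectors, then L2
/// normalization. Returns `None` for empty input, zero-length vectors, or
/// vectors of differing dimension.
fn aggregate_embeddings_sketch(chunk_embeddings: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = chunk_embeddings.first()?.len();
    if dim == 0 || chunk_embeddings.iter().any(|e| e.len() != dim) {
        return None;
    }

    // Element-wise sum, then divide by the number of chunks to get the mean.
    let mut mean = vec![0.0f32; dim];
    for embedding in chunk_embeddings {
        for (m, x) in mean.iter_mut().zip(embedding) {
            *m += *x;
        }
    }
    let n = chunk_embeddings.len() as f32;
    for m in &mut mean {
        *m /= n;
    }

    // L2-normalize; an all-zero mean is left as-is to avoid dividing by zero.
    let norm = mean.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for m in &mut mean {
            *m /= norm;
        }
    }
    Some(mean)
}
```

For example, averaging `[1.0, 0.0]` and `[0.0, 1.0]` gives `[0.5, 0.5]`, which normalizes to roughly `[0.7071, 0.7071]`. Normalizing after averaging keeps the aggregated vector at unit length, so it can be compared with single-chunk embeddings under cosine or dot-product similarity on the same scale.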