Token-aware chunking utilities for bodies that exceed the embedding window.

Semantic chunking for embedding inputs (Markdown-aware, 512-token limit). Splits bodies using `text_splitter::MarkdownSplitter` with overlap so multi-chunk memories preserve context across chunk boundaries.
Structs

- `Chunk`: A contiguous slice of a body string identified by byte offsets.
Constants

- `CHUNK_OVERLAP_CHARS`: Character overlap between consecutive chunks to preserve cross-boundary context.
- `CHUNK_SIZE_CHARS`: Maximum character length of a single chunk (derived from token limit × chars-per-token).
Functions

- `aggregate_embeddings`: Computes the mean of `chunk_embeddings` and L2-normalizes the result.
- `chunk_text`: Returns the string slice of `body` described by `chunk`'s byte offsets.
- `needs_chunking`: Returns `true` when `body` exceeds `CHUNK_SIZE_CHARS` and must be split.
- `split_into_chunks`: Splits `body` into overlapping `Chunk`s using a character-based heuristic.
- `split_into_chunks_by_token_offsets`: Splits `body` into `Chunk`s using pre-computed token byte offsets.
- `split_into_chunks_hierarchical`: Splits `body` into chunks using `MarkdownSplitter` with a real tokenizer. Respects Markdown semantic boundaries (H1-H6, paragraphs, blocks); for plain text without Markdown markers, falls back to paragraph and sentence breaks.