
Module chunking


Token-aware, Markdown-aware semantic chunking utilities for bodies that exceed the embedding window (512-token limit).

Splits bodies using text_splitter::MarkdownSplitter with overlap so multi-chunk memories preserve context across chunk boundaries.
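To make the idea of overlap concrete, here is a minimal sketch of overlapping windows over a string. It is a simplification: it is not the `text_splitter::MarkdownSplitter` algorithm, and the function name `overlapping_windows` and its parameters are hypothetical, chosen for illustration only.

```rust
/// Hypothetical helper (not part of this module): returns `(start, end)` byte
/// ranges covering `body`, each at most `size` characters long, where
/// consecutive ranges share `overlap` characters.
fn overlapping_windows(body: &str, size: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(size > 0 && overlap < size, "overlap must be smaller than size");

    // Byte offset of every char boundary, plus the end of the string, so that
    // every emitted range can be sliced without splitting a UTF-8 sequence.
    let mut bounds: Vec<usize> = body.char_indices().map(|(i, _)| i).collect();
    bounds.push(body.len());
    let n_chars = bounds.len() - 1;

    let mut ranges = Vec::new();
    let mut start = 0;
    while start < n_chars {
        let end = (start + size).min(n_chars);
        ranges.push((bounds[start], bounds[end]));
        if end == n_chars {
            break;
        }
        // Step back by `overlap` characters so the next window repeats the
        // tail of this one; `overlap < size` guarantees forward progress.
        start = end - overlap;
    }
    ranges
}
```

With a 10-character ASCII body, `size = 4` and `overlap = 1`, the windows start at characters 0, 3 and 6, so each chunk begins with the last character of the previous one.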

Structs§

BodyBudget
Budget assessment of a body for the auto-split and dry-run paths (GAP-SG-05/06).
Chunk
A contiguous slice of a body string identified by byte offsets.
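As an illustration of what "identified by byte offsets" means in practice, the sketch below models a chunk as a half-open byte range and resolves it back to a string slice. The field names `start` and `end` are assumptions, and returning `Option` is a choice made for this sketch; the real `Chunk` and `chunk_text` may differ.

```rust
/// Illustrative stand-in for the module's `Chunk`; field names are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Chunk {
    /// Byte offset of the first byte of the chunk (inclusive).
    start: usize,
    /// Byte offset one past the last byte of the chunk (exclusive).
    end: usize,
}

/// Resolves a chunk to the slice of `body` it describes.
/// Returns `None` when the offsets are out of range or do not fall on
/// UTF-8 character boundaries, instead of panicking.
fn chunk_text<'a>(body: &'a str, chunk: &Chunk) -> Option<&'a str> {
    body.get(chunk.start..chunk.end)
}
```

Storing offsets rather than owned strings keeps chunks cheap to copy and lets the caller borrow text from the original body only when it is needed.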

Constants§

CHUNK_OVERLAP_CHARS
Character overlap between consecutive chunks to preserve cross-boundary context.
CHUNK_SIZE_CHARS
Maximum character length of a single chunk (derived from token limit × chars-per-token).
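The derivation "token limit × chars-per-token" can be sketched as follows. The 512-token limit comes from the module description; the ratio of 4 characters per token is an assumption made purely for this example, as is measuring the body in characters rather than bytes. The actual constant values are not shown on this page.

```rust
/// Token limit of the embedding window, as stated in the module description.
const TOKEN_LIMIT: usize = 512;
/// ASSUMPTION for illustration only: average characters per token.
const ASSUMED_CHARS_PER_TOKEN: usize = 4;
/// Illustrative value; the module's real `CHUNK_SIZE_CHARS` may differ.
const CHUNK_SIZE_CHARS: usize = TOKEN_LIMIT * ASSUMED_CHARS_PER_TOKEN;

/// Returns true when `body` is longer than one chunk and must be split.
/// This sketch counts Unicode scalar values, which is an assumption.
fn needs_chunking(body: &str) -> bool {
    body.chars().count() > CHUNK_SIZE_CHARS
}
```

A character budget is only a cheap pre-check: a body that passes it can still exceed the token limit if its text tokenizes densely, which is why the module also offers tokenizer-based splitting.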

Functions§

aggregate_embeddings
Computes the mean of chunk_embeddings and L2-normalizes the result.
assess_body_budget
Assesses body against the single-memory budgets (GAP-SG-05/06).
chunk_text
Returns the string slice of body described by chunk’s byte offsets.
estimate_chunk_count
Returns the number of embedding chunks body splits into, using the same hierarchical splitter the persistence path uses (GAP-SG-05).
needs_chunking
Returns true when body exceeds CHUNK_SIZE_CHARS and must be split.
split_body_by_sections
Splits a large body into sub-memory partitions at Markdown section boundaries (ATX headers), keeping each partition below the byte, chunk and token budgets (GAP-SG-04/07).
split_into_chunks
Splits body into overlapping Chunks using a character-based heuristic.
split_into_chunks_by_token_offsets
Splits body into Chunks using pre-computed token byte-offsets.
split_into_chunks_hierarchical
Splits body into chunks using MarkdownSplitter with a real tokenizer. Respects Markdown semantic boundaries (H1-H6, paragraphs, blocks). For plain text without Markdown markers, falls back to paragraph and sentence breaks.
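The aggregation described for `aggregate_embeddings` (mean of the chunk vectors, then L2 normalization) can be sketched as below. The signature, the `Option` return type and the handling of empty or mismatched input are assumptions for this example, not the module's confirmed API.

```rust
/// Sketch of mean pooling followed by L2 normalization.
/// Returns `None` for empty input, zero-dimensional vectors, or vectors of
/// differing lengths (error handling here is an assumption).
fn aggregate_embeddings(chunk_embeddings: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = chunk_embeddings.first()?.len();
    if dim == 0 || chunk_embeddings.iter().any(|e| e.len() != dim) {
        return None;
    }

    // Component-wise sum, then divide by the number of chunks to get the mean.
    let mut mean = vec![0.0f32; dim];
    for embedding in chunk_embeddings {
        for (m, x) in mean.iter_mut().zip(embedding) {
            *m += *x;
        }
    }
    let n = chunk_embeddings.len() as f32;
    for m in mean.iter_mut() {
        *m /= n;
    }

    // Scale to unit length; an all-zero mean is returned unchanged to avoid
    // dividing by zero.
    let norm = mean.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for m in mean.iter_mut() {
            *m /= norm;
        }
    }
    Some(mean)
}
```

For the two chunk vectors `[3, 0]` and `[0, 4]`, the mean is `[1.5, 2.0]` with norm 2.5, so the aggregated embedding is approximately `[0.6, 0.8]`. Normalizing after averaging keeps cosine-similarity comparisons consistent between single-chunk and multi-chunk memories.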