Module chunking


Text chunking for embeddings.

Splits large text into overlapping chunks for embedding generation. This ensures full semantic coverage while respecting model token limits.

Design Decisions

  • Character-based chunking: Simple, predictable, works with any language. Token-based would be more accurate but requires the model’s tokenizer.
  • Word boundary splitting: Avoids breaking mid-word, which can degrade embedding quality.
  • Overlapping windows: Maintains context at chunk boundaries for better retrieval.
  • Configurable parameters: Different models have different optimal chunk sizes.
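The decisions above can be sketched as a small character-based chunker. This is a hedged illustration, not the crate's implementation: the `ChunkConfig` field names (`max_chars`, `overlap_chars`) and the `chunk_text` signature are assumptions, and word lengths are measured in bytes, which matches character counts only for ASCII text.

```rust
/// Hypothetical mirror of `ChunkConfig`; field names are assumptions.
struct ChunkConfig {
    max_chars: usize,     // character budget per chunk
    overlap_chars: usize, // approximate characters repeated between chunks
}

/// Split `text` into overlapping chunks on word boundaries (sketch).
fn chunk_text(text: &str, cfg: &ChunkConfig) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut chunks = Vec::new();
    let mut start = 0; // index of the first word in the current chunk
    while start < words.len() {
        let mut end = start;
        let mut len = 0;
        // Grow the chunk word by word until the character budget is spent.
        while end < words.len() {
            let add = words[end].len() + if len > 0 { 1 } else { 0 }; // +1 for the space
            if len + add > cfg.max_chars && len > 0 {
                break;
            }
            len += add;
            end += 1;
        }
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break;
        }
        // Step back so roughly `overlap_chars` of text is shared with the next chunk.
        let mut back = end;
        let mut overlap = 0;
        while back > start + 1 && overlap < cfg.overlap_chars {
            back -= 1;
            overlap += words[back].len() + 1;
        }
        start = back;
    }
    chunks
}

fn main() {
    let cfg = ChunkConfig { max_chars: 20, overlap_chars: 5 };
    for chunk in chunk_text("the quick brown fox jumps over the lazy dog", &cfg) {
        println!("{chunk}");
    }
}
```

Because chunks never split mid-word, a single word longer than `max_chars` still becomes its own chunk; overlap is approximate, rounded to whole words.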

Structs

ChunkConfig
Configuration for text chunking.
TextChunk
A text chunk with its index.

Functions

chunk_text
Split text into overlapping chunks.
prepare_item_text
Prepare text for embedding by concatenating key and value.
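A plausible shape for the key/value concatenation step, hedged: the actual signature and separator used by `prepare_item_text` are assumptions here, shown only to illustrate the idea of joining both fields into one embedding input.

```rust
/// Hypothetical sketch of `prepare_item_text`: concatenate a key and
/// value into a single string for embedding. The `": "` separator is
/// an assumption, not the crate's documented behavior.
fn prepare_item_text(key: &str, value: &str) -> String {
    format!("{key}: {value}")
}

fn main() {
    let text = prepare_item_text("title", "Module chunking");
    println!("{text}"); // prints "title: Module chunking"
}
```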