Text chunking for embeddings.
Splits large text into overlapping chunks for embedding generation. This ensures full semantic coverage while respecting model token limits.
§Design Decisions
- Character-based chunking: Simple, predictable, works with any language. Token-based would be more accurate but requires the model’s tokenizer.
- Word boundary splitting: Avoids breaking mid-word, which can degrade embedding quality.
- Overlapping windows: Maintains context at chunk boundaries for better retrieval.
- Configurable parameters: Different models have different optimal chunk sizes.
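The design decisions above can be sketched as a minimal character-based chunker. This is a hypothetical illustration, not the module's actual implementation; the real `chunk_text` and `TextChunk` may differ in signature and boundary handling:

```rust
/// A text chunk with its index (hypothetical mirror of this module's `TextChunk`).
#[derive(Debug)]
struct TextChunk {
    index: usize,
    content: String,
}

/// Split `text` into overlapping character-based chunks, backing up to the
/// nearest whitespace so chunks do not end mid-word. `chunk_size` and
/// `overlap` are measured in characters. A sketch of the documented design.
fn chunk_text(text: &str, chunk_size: usize, overlap: usize) -> Vec<TextChunk> {
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut index = 0;
    while start < chars.len() {
        let mut end = (start + chunk_size).min(chars.len());
        // If we are not at the end of the text, pull the cut point back to the
        // last whitespace inside the window to avoid splitting a word.
        if end < chars.len() {
            if let Some(ws) = chars[start..end].iter().rposition(|c| c.is_whitespace()) {
                if ws > 0 {
                    end = start + ws;
                }
            }
        }
        let content: String = chars[start..end].iter().collect();
        chunks.push(TextChunk { index, content: content.trim().to_string() });
        index += 1;
        if end >= chars.len() {
            break;
        }
        // Step forward, keeping `overlap` characters of trailing context,
        // while guaranteeing forward progress.
        start = end.saturating_sub(overlap).max(start + 1);
    }
    chunks
}

fn main() {
    let chunks = chunk_text("the quick brown fox jumps over the lazy dog", 20, 5);
    for c in &chunks {
        println!("{}: {:?}", c.index, c.content);
    }
}
```

Because chunking is character-based, the same logic works for any language without a tokenizer, at the cost of only approximating the model's true token limit.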
Structs§
- ChunkConfig - Configuration for text chunking.
- TextChunk - A text chunk with its index.
Functions§
- chunk_text - Split text into overlapping chunks.
- prepare_item_text - Prepare text for embedding by concatenating key and value.
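A plausible sketch of the key/value concatenation step; the actual separator used by `prepare_item_text` is an assumption here:

```rust
/// Concatenate a key and value into a single embedding input string.
/// Hypothetical sketch; the real `prepare_item_text` may format differently.
fn prepare_item_text(key: &str, value: &str) -> String {
    format!("{}: {}", key, value)
}

fn main() {
    // The combined string is what would then be passed to chunking.
    println!("{}", prepare_item_text("title", "Text chunking for embeddings"));
}
```

Embedding the key alongside the value keeps the field's context attached to every chunk derived from it.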