Module splitting

Structs§

CharRatioTokenizer
A simple character-based tokenizer that approximates token count. Useful as a fallback when no real tokenizer is available.
TextSplit
TextSplitter

Enums§

Separator
SeparatorGroup

Statics§

TWO_PLUS_NEWLINE_REGEX

Traits§

Tokenizer
Trait for counting tokens in text. Implement this to integrate with specific tokenizers (e.g., tiktoken, HuggingFace tokenizers).
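As a rough illustration of how a token-counting trait and a character-ratio fallback can fit together, here is a minimal, self-contained sketch. It does not reproduce this module's actual signatures: `TokenCounter`, `count_tokens`, `RatioEstimator`, and `chars_per_token` are hypothetical names chosen for the example, and the ratio of 4 characters per token mentioned below is only an assumed rule of thumb.

```rust
/// Hypothetical stand-in for a token-counting trait. The real `Tokenizer`
/// trait in this module may use a different method name or signature.
pub trait TokenCounter {
    /// Returns the (possibly approximate) number of tokens in `text`.
    fn count_tokens(&self, text: &str) -> usize;
}

/// Illustrative estimator that assumes a fixed average number of
/// characters per token. Not the module's `CharRatioTokenizer`.
pub struct RatioEstimator {
    chars_per_token: usize,
}

impl RatioEstimator {
    /// Builds an estimator. A ratio of 0 is clamped to 1 so that
    /// `count_tokens` never divides by zero.
    pub fn new(chars_per_token: usize) -> Self {
        Self {
            chars_per_token: chars_per_token.max(1),
        }
    }
}

impl TokenCounter for RatioEstimator {
    fn count_tokens(&self, text: &str) -> usize {
        // Count Unicode scalar values rather than bytes, so multi-byte
        // characters are not over-counted.
        let chars = text.chars().count();
        // Ceiling division: any non-empty text counts as at least one token.
        (chars + self.chars_per_token - 1) / self.chars_per_token
    }
}
```

Under these assumptions, `RatioEstimator::new(4)` would stand in wherever a `TokenCounter` is expected, and a wrapper around a real tokenizer (for example tiktoken or HuggingFace tokenizers) could implement the same trait without changing the calling code.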

Functions§

split_text_into_indices
Split text into sentence indices using Unicode-aware sentence boundary detection.
split_text_into_sentences
Split text into sentences using Unicode-aware sentence boundary detection.