Text tokenization functionality.
Provides methods for splitting text into word-level and punctuation-level tokens using regular expressions. Tokenization is necessary both for aligning extracted data with the source text and for finding sentence boundaries during LLM information extraction.
Structs§
- CharInterval - Represents a character interval in text
- SentenceIterator - Iterator for processing sentences in tokenized text
- Token - Represents a token extracted from text
- TokenInterval - Represents an interval over tokens in tokenized text
- TokenizedText - Holds the result of tokenizing a text string
- Tokenizer - Text tokenizer for splitting text into tokens
Enums§
- TokenType - Enumeration of token types produced during tokenization
Functions§
- tokenize - Convenience function for creating a tokenizer and tokenizing text
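To illustrate the concepts behind these types, here is a minimal, self-contained sketch of word/punctuation tokenization with character intervals. The `Token`, `CharInterval`, and `TokenType` definitions below are hand-rolled stand-ins modeled on the item names above, not the crate's actual API; field names and the `tokenize` signature are assumptions for illustration only.

```rust
// Illustrative stand-ins for the module's types (not the real definitions).
#[derive(Debug, PartialEq)]
enum TokenType {
    Word,
    Punctuation,
}

// A half-open byte range [start, end) into the source string, so a token
// can be mapped back to the exact span of text it came from.
#[derive(Debug, PartialEq)]
struct CharInterval {
    start: usize,
    end: usize,
}

#[derive(Debug, PartialEq)]
struct Token {
    text: String,
    kind: TokenType,
    interval: CharInterval,
}

/// Split `text` into word and punctuation tokens, recording each token's
/// character interval so extracted data can be aligned to the source text.
fn tokenize(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            // Start of a new word, or continuation of the current one.
            word_start.get_or_insert(i);
        } else {
            // Flush any in-progress word before handling the delimiter.
            if let Some(s) = word_start.take() {
                tokens.push(Token {
                    text: text[s..i].to_string(),
                    kind: TokenType::Word,
                    interval: CharInterval { start: s, end: i },
                });
            }
            // Non-whitespace delimiters become punctuation tokens.
            if !c.is_whitespace() {
                tokens.push(Token {
                    text: c.to_string(),
                    kind: TokenType::Punctuation,
                    interval: CharInterval { start: i, end: i + c.len_utf8() },
                });
            }
        }
    }
    // Flush a trailing word that runs to the end of the input.
    if let Some(s) = word_start {
        tokens.push(Token {
            text: text[s..].to_string(),
            kind: TokenType::Word,
            interval: CharInterval { start: s, end: text.len() },
        });
    }
    tokens
}

fn main() {
    let tokens = tokenize("Hello, world!");
    // Yields "Hello", ",", "world", "!" with their byte intervals.
    for t in &tokens {
        println!("{:?} ({:?}) [{}..{}]", t.text, t.kind, t.interval.start, t.interval.end);
    }
}
```

Because each interval is a half-open byte range into the original string, `&text[t.interval.start..t.interval.end]` always recovers the token's exact source span, which is what makes alignment between extraction output and source text possible.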