Module tokenizer

Text tokenization functionality.

Provides methods to split text into word-level and punctuation-level tokens using regular expressions. Tokenization is required both to align extracted data with the source text and to form sentence boundaries for LLM-based information extraction.

Structs§

CharInterval
Represents a character interval in text
SentenceIterator
Iterator for processing sentences in tokenized text
Token
Represents a token extracted from text
TokenInterval
Represents a token interval over tokens in tokenized text
TokenizedText
Holds the result of tokenizing a text string
Tokenizer
Text tokenizer for splitting text into tokens

Enums§

TokenType
Enumeration of token types produced during tokenization

Functions§

tokenize
Convenience function for creating a tokenizer and tokenizing text
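To make the concepts above concrete, here is a minimal, self-contained sketch of a word/punctuation tokenizer in plain Rust. It is an illustrative re-implementation, not this module's actual API: the real `Tokenizer` is regex-based, and the exact fields and signatures of `Token`, `TokenType`, and `tokenize` may differ from what is assumed here.

```rust
// Illustrative sketch only: the names Token, TokenType, and tokenize mirror
// the items listed above, but their definitions here are assumptions.

#[derive(Debug, PartialEq)]
enum TokenType {
    Word,
    Punctuation,
}

#[derive(Debug, PartialEq)]
struct Token {
    token_type: TokenType,
    // Character interval [start, end) of the token in the source text,
    // analogous to CharInterval above.
    start: usize,
    end: usize,
}

// Splits `text` into alphanumeric word tokens and single-character
// punctuation tokens, skipping whitespace.
fn tokenize(text: &str) -> Vec<Token> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let (pos, c) = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c.is_alphanumeric() {
            // Consume a maximal run of alphanumeric characters as one word.
            let start = pos;
            while i < chars.len() && chars[i].1.is_alphanumeric() {
                i += 1;
            }
            let end = if i < chars.len() { chars[i].0 } else { text.len() };
            tokens.push(Token { token_type: TokenType::Word, start, end });
        } else {
            // Any other non-whitespace character is a punctuation token.
            tokens.push(Token {
                token_type: TokenType::Punctuation,
                start: pos,
                end: pos + c.len_utf8(),
            });
            i += 1;
        }
    }
    tokens
}

fn main() {
    // "Hello, world!" yields four tokens: Hello / , / world / !
    for t in tokenize("Hello, world!") {
        println!("{:?}", t);
    }
}
```

Keeping character intervals on each token is what makes alignment possible: an extraction can be mapped back to an exact span of the source text rather than just a token index.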