Module tokenizer

Text tokenization functionality.

Provides methods to split text into word-level and punctuation-level tokens using regular expressions. Tokenization is required both to align extracted data with the source text and to form sentence boundaries for LLM-based information extraction.

Structs§

CharInterval
Represents a character interval in text
SentenceIterator
Iterator for processing sentences in tokenized text
Token
Represents a token extracted from text
TokenInterval
Represents a token interval over tokens in tokenized text
TokenizedText
Holds the result of tokenizing a text string
Tokenizer
Text tokenizer for splitting text into tokens

Enums§

TokenType
Enumeration of token types produced during tokenization

Functions§

tokenize
Convenience function for creating a tokenizer and tokenizing text
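To make the concepts above concrete, here is a minimal, self-contained sketch of a word/punctuation tokenizer in plain Rust. It is an illustrative re-implementation, not this module's actual API: the real `Tokenizer` is regex-based, and the exact fields and signatures of `Token`, `TokenType`, and `tokenize` may differ from what is assumed here.

```rust
// Illustrative sketch only: the names Token, TokenType, and tokenize mirror
// the items listed above, but their definitions here are assumptions.

#[derive(Debug, PartialEq)]
enum TokenType {
    Word,
    Punctuation,
}

#[derive(Debug, PartialEq)]
struct Token {
    token_type: TokenType,
    // Character interval [start, end) of the token in the source text,
    // analogous to CharInterval above.
    start: usize,
    end: usize,
}

// Splits `text` into alphanumeric word tokens and single-character
// punctuation tokens, skipping whitespace.
fn tokenize(text: &str) -> Vec<Token> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let (pos, c) = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c.is_alphanumeric() {
            // Consume a maximal run of alphanumeric characters as one word.
            let start = pos;
            while i < chars.len() && chars[i].1.is_alphanumeric() {
                i += 1;
            }
            let end = if i < chars.len() { chars[i].0 } else { text.len() };
            tokens.push(Token { token_type: TokenType::Word, start, end });
        } else {
            // Any other non-whitespace character is a punctuation token.
            tokens.push(Token {
                token_type: TokenType::Punctuation,
                start: pos,
                end: pos + c.len_utf8(),
            });
            i += 1;
        }
    }
    tokens
}

fn main() {
    // "Hello, world!" yields four tokens: Hello / , / world / !
    for t in tokenize("Hello, world!") {
        println!("{:?}", t);
    }
}
```

Keeping character intervals on each token is what makes alignment possible: an extraction can be mapped back to an exact span of the source text rather than just a token index.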