Module tokenizer

Tokenizer API for text processing

Structs

HfTokenizer: Cached HuggingFace tokenizer
IdfWeights: Pre-computed IDF weights indexed by token_id
IdfWeightsCache: Global cache for IDF weights, keyed by model name. Caches both successful loads and failures to avoid repeated download attempts.
LanguageAwareTokenizer: Language-aware tokenizer that can be configured per-field
MultiLanguageStemmer: Multi-language stemmer that can select language dynamically
RawCiTokenizer: Raw case-insensitive tokenizer that lowercases the entire input without splitting.
RawTokenizer: Raw tokenizer that performs no tokenization at all.
SimpleTokenizer: Simple tokenizer that splits on whitespace, strips non-alphanumeric characters, and lowercases.
StemmerTokenizer: Stemming tokenizer that splits on whitespace, lowercases, and applies stemming.
StopWordTokenizer: Stop-word filter tokenizer that wraps another tokenizer and filters out stop words.
Token: A token produced by tokenization
TokenizerCache: Global tokenizer cache for reuse across queries
TokenizerRegistry: Registry for named tokenizers
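
The behaviours described for RawTokenizer, RawCiTokenizer, SimpleTokenizer, and StopWordTokenizer can be illustrated with a minimal standalone sketch. This is not this module's API: the free functions `raw_tokens`, `raw_ci_tokens`, `simple_tokens`, and `filter_stop_words` are hypothetical names introduced here. The sketch also makes three assumptions that the descriptions above do not confirm: a raw tokenizer emits the whole input as one token, non-alphanumeric characters are removed from inside words as well as at their edges, and tokens left empty after stripping are dropped.

```rust
use std::collections::HashSet;

// Hypothetical helpers for illustration only; they do not use or mirror
// the actual struct APIs listed above.

/// Raw: no tokenization; the input is returned unchanged as one token
/// (assumption: a single token is emitted).
fn raw_tokens(input: &str) -> Vec<String> {
    vec![input.to_string()]
}

/// Raw case-insensitive: lowercase the entire input without splitting.
fn raw_ci_tokens(input: &str) -> Vec<String> {
    vec![input.to_lowercase()]
}

/// Simple: split on whitespace, strip non-alphanumeric characters, lowercase.
/// Assumption: tokens that end up empty after stripping are discarded.
fn simple_tokens(input: &str) -> Vec<String> {
    input
        .split_whitespace()
        .map(|word| {
            word.chars()
                .filter(|c| c.is_alphanumeric())
                .flat_map(|c| c.to_lowercase())
                .collect::<String>()
        })
        .filter(|token| !token.is_empty())
        .collect()
}

/// Stop-word filter: takes the output of another tokenizer and removes
/// every token found in the stop-word set.
fn filter_stop_words(tokens: Vec<String>, stop_words: &HashSet<&str>) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|token| !stop_words.contains(token.as_str()))
        .collect()
}
```

Composing `filter_stop_words(simple_tokens(text), &stop_words)` mirrors the wrapping relationship described for StopWordTokenizer: the filter does not care how the inner tokens were produced.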