Module tokenizer

Tokenizer API for text processing

Structs

HfTokenizer
Cached HuggingFace tokenizer
IdfWeights
Pre-computed IDF weights indexed by token_id
IdfWeightsCache
Global cache for IDF weights, keyed by model name. Caches both successful loads and failures to avoid repeated download attempts.
LanguageAwareTokenizer
Language-aware tokenizer that can be configured per-field
MultiLanguageStemmer
Multi-language stemmer that can select language dynamically
RawCiTokenizer
Raw case-insensitive tokenizer — lowercases the entire input without splitting.
RawTokenizer
Raw tokenizer — no tokenization at all.
SimpleTokenizer
Simple tokenizer — splits on whitespace, strips non-alphanumeric, and lowercases.
StemmerTokenizer
Stemming tokenizer - splits on whitespace, lowercases, and applies stemming
StopWordTokenizer
Stop word filter tokenizer - wraps another tokenizer and filters out stop words
Token
A token produced by tokenization
TokenizerCache
Global tokenizer cache for reuse across queries
TokenizerRegistry
Registry for named tokenizers
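
The one-line descriptions of RawTokenizer, RawCiTokenizer, and SimpleTokenizer can be made concrete with a small sketch. The free functions below are hypothetical stand-ins written only from those descriptions; they are not the module's API, and the real structs presumably implement the Tokenizer trait and produce Token values rather than plain strings. Dropping words that become empty after stripping is an assumption.

```rust
/// Hypothetical stand-in for `RawTokenizer`: the whole input is kept as one token.
fn raw(text: &str) -> Vec<String> {
    vec![text.to_string()]
}

/// Hypothetical stand-in for `RawCiTokenizer`: one token, lowercased, no splitting.
fn raw_ci(text: &str) -> Vec<String> {
    vec![text.to_lowercase()]
}

/// Hypothetical stand-in for `SimpleTokenizer`: split on whitespace,
/// strip non-alphanumeric characters, lowercase.
fn simple(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|word| {
            word.chars()
                .filter(|c| c.is_alphanumeric())
                .flat_map(|c| c.to_lowercase())
                .collect::<String>()
        })
        // Assumption: words that are empty after stripping are dropped.
        .filter(|word| !word.is_empty())
        .collect()
}
```

Under these assumptions, the input `"Hello, World!"` stays a single token for the two raw variants (lowercased in the case-insensitive one) and becomes the two tokens `hello` and `world` for the simple variant.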

Enums

Language
Supported stemmer languages
TokenizerSource
Tokenizer source - where to load the tokenizer from

Traits

Tokenizer
Trait for tokenizers
TokenizerClone

Functions

idf_weights_cache
Get the global IDF weights cache
parse_language
Parse a language string into a Language enum
tokenizer_cache
Get the global tokenizer cache
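
IdfWeightsCache is described as a global cache keyed by model name that remembers failures as well as successes. One way such a cache could be built is sketched below. Every name here (`Weights`, `WeightsCache`, `get_or_load`, `weights_cache`) is hypothetical, and the actual signatures of `idf_weights_cache` and `IdfWeightsCache` are not shown on this page.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for `IdfWeights`: one weight per token_id.
#[derive(Debug)]
struct Weights(Vec<f32>);

/// Hypothetical cache keyed by model name. A stored `None` records a
/// failed load, so the loader is not retried for that model.
struct WeightsCache {
    entries: Mutex<HashMap<String, Option<Arc<Weights>>>>,
}

impl WeightsCache {
    /// Returns the cached entry, running `load` only on the first request.
    /// Simplification: the lock is held while `load` runs.
    fn get_or_load<F>(&self, model: &str, load: F) -> Option<Arc<Weights>>
    where
        F: FnOnce() -> Option<Weights>,
    {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(model.to_string())
            .or_insert_with(|| load().map(Arc::new))
            .clone()
    }
}

/// Hypothetical accessor for a process-wide cache instance.
fn weights_cache() -> &'static WeightsCache {
    static CACHE: OnceLock<WeightsCache> = OnceLock::new();
    CACHE.get_or_init(|| WeightsCache {
        entries: Mutex::new(HashMap::new()),
    })
}
```

Storing `Option` values rather than only successes is what prevents repeated download attempts: a second request for a model that failed once returns `None` without calling the loader again.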

Type Aliases

BoxedTokenizer
Boxed tokenizer for dynamic dispatch
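
The combination of a tokenizer trait, a boxed alias for dynamic dispatch, and a wrapper such as StopWordTokenizer can be illustrated with a minimal sketch. The trait `Tokenize`, the alias `BoxedTokenize`, and the structs `Whitespace` and `StopWords` below are hypothetical; the real Tokenizer trait, BoxedTokenizer alias, and Token type may differ in method names, bounds, and return types.

```rust
use std::collections::HashSet;

/// Hypothetical trait standing in for `Tokenizer`.
trait Tokenize {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

/// Hypothetical alias standing in for `BoxedTokenizer`.
type BoxedTokenize = Box<dyn Tokenize + Send + Sync>;

/// Base tokenizer: split on whitespace and lowercase.
struct Whitespace;

impl Tokenize for Whitespace {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(|w| w.to_lowercase()).collect()
    }
}

/// Wrapper that delegates to an inner tokenizer and drops stop words.
struct StopWords {
    inner: BoxedTokenize,
    stop: HashSet<String>,
}

impl Tokenize for StopWords {
    fn tokenize(&self, text: &str) -> Vec<String> {
        self.inner
            .tokenize(text)
            .into_iter()
            .filter(|token| !self.stop.contains(token))
            .collect()
    }
}
```

Because the wrapper holds a boxed trait object, it can sit on top of any tokenizer chosen at runtime, for example one looked up by name in a registry.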