Module tokenizer

Tokenizer API for text processing

Structs

HfTokenizer
Cached HuggingFace tokenizer
LanguageAwareTokenizer
Language-aware tokenizer that can be configured per-field
LowercaseTokenizer
Lowercase tokenizer: splits on whitespace and lowercases each token
MultiLanguageStemmer
Multi-language stemmer that can select language dynamically
SimpleTokenizer
Simple whitespace tokenizer
StemmerTokenizer
Stemming tokenizer: splits on whitespace, lowercases each token, and applies stemming
StopWordTokenizer
Stop word filter tokenizer: wraps another tokenizer and filters out stop words
Token
A token produced by tokenization
TokenizerCache
Global tokenizer cache for reuse across queries
TokenizerRegistry
Registry for named tokenizers
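
The descriptions above suggest a layered design: a base tokenizer splits and normalizes text, and a wrapper such as StopWordTokenizer filters the result. The following is a minimal sketch of that composition, not this module's actual code. The `Token` fields, the `tokenize` method signature, and the `new` constructor are all assumptions made for illustration; only the type names come from the list above.

```rust
use std::collections::HashSet;

/// Assumed shape of a token: its text and its position in the stream.
/// The real `Token` struct may carry different fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub text: String,
    pub position: usize,
}

/// Assumed minimal tokenizer trait (hypothetical signature).
pub trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<Token>;
}

/// Splits on whitespace and lowercases each piece.
pub struct LowercaseTokenizer;

impl Tokenizer for LowercaseTokenizer {
    fn tokenize(&self, text: &str) -> Vec<Token> {
        text.split_whitespace()
            .enumerate()
            .map(|(position, word)| Token {
                text: word.to_lowercase(),
                position,
            })
            .collect()
    }
}

/// Wraps another tokenizer and drops tokens found in the stop word set.
pub struct StopWordTokenizer<T: Tokenizer> {
    inner: T,
    stop_words: HashSet<String>,
}

impl<T: Tokenizer> StopWordTokenizer<T> {
    /// Hypothetical constructor taking the wrapped tokenizer and a word list.
    pub fn new(inner: T, stop_words: &[&str]) -> Self {
        Self {
            inner,
            stop_words: stop_words.iter().map(|w| w.to_string()).collect(),
        }
    }
}

impl<T: Tokenizer> Tokenizer for StopWordTokenizer<T> {
    fn tokenize(&self, text: &str) -> Vec<Token> {
        // Run the inner tokenizer first, then filter its output.
        self.inner
            .tokenize(text)
            .into_iter()
            .filter(|t| !self.stop_words.contains(&t.text))
            .collect()
    }
}
```

In this sketch the filter keeps each surviving token's original position, so a downstream phrase query could still see the gap left by a removed stop word. Whether the real implementation preserves positions this way is not stated on this page.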

Enums

Language
Supported stemmer languages
TokenizerSource
Tokenizer source: where to load the tokenizer from
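
To illustrate how a language string might map onto an enum like Language, here is a small sketch. The variants, the accepted spellings, and the `Option` return type are assumptions; the function is named `parse_language_sketch` to make clear that it is not the module's own `parse_language`.

```rust
/// Hypothetical subset of stemmer languages; the real enum likely has more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    English,
    French,
    German,
}

/// Maps a language code or name to a `Language`, ignoring case and
/// surrounding whitespace. Returns `None` for unrecognized input.
pub fn parse_language_sketch(s: &str) -> Option<Language> {
    match s.trim().to_ascii_lowercase().as_str() {
        "en" | "english" => Some(Language::English),
        "fr" | "french" => Some(Language::French),
        "de" | "german" => Some(Language::German),
        _ => None,
    }
}
```

Returning `Option` forces the caller to decide what happens with an unknown language, for example falling back to a non-stemming tokenizer.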

Traits

Tokenizer
Trait for tokenizers
TokenizerClone

Functions

parse_language
Parse a language string into a Language enum
tokenizer_cache
Get the global tokenizer cache
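
A global cache accessor like tokenizer_cache is often built on a lazily initialized static. The sketch below shows one common way to do that with the standard library's `OnceLock`; it is an assumption about the general technique, not a description of how this module implements it. `LoadedTokenizer`, `Cache`, `get_or_load`, and `global_cache` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Placeholder for a tokenizer that is expensive to load (hypothetical).
pub struct LoadedTokenizer {
    pub name: String,
}

/// A thread-safe map from tokenizer name to a shared, loaded instance.
#[derive(Default)]
pub struct Cache {
    entries: Mutex<HashMap<String, Arc<LoadedTokenizer>>>,
}

impl Cache {
    /// Returns the cached instance for `name`, loading it on first use.
    pub fn get_or_load(&self, name: &str) -> Arc<LoadedTokenizer> {
        let mut entries = self.entries.lock().unwrap();
        entries
            .entry(name.to_string())
            // The "load" here is a stand-in for reading a tokenizer file.
            .or_insert_with(|| Arc::new(LoadedTokenizer { name: name.to_string() }))
            .clone()
    }
}

/// Returns the process-wide cache, creating it on the first call.
pub fn global_cache() -> &'static Cache {
    static CACHE: OnceLock<Cache> = OnceLock::new();
    CACHE.get_or_init(Cache::default)
}
```

Handing out `Arc` clones lets several queries share one loaded tokenizer without holding the lock while they tokenize.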

Type Aliases

BoxedTokenizer
Boxed tokenizer for dynamic dispatch
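
TokenizerClone is listed without a description. A helper trait with that kind of name commonly exists so that a boxed trait object can be cloned, since `Clone` itself cannot be used as a supertrait of an object-safe trait. The sketch below shows that general pattern under that assumption; the method name `clone_box`, the `tokenize` signature, and the alias definition are hypothetical and may not match this module.

```rust
/// Helper trait that lets a trait object produce a boxed copy of itself.
pub trait TokenizerClone {
    fn clone_box(&self) -> Box<dyn Tokenizer>;
}

/// Assumed tokenizer trait; requiring `TokenizerClone` keeps it object-safe
/// while still allowing clones through the box.
pub trait Tokenizer: TokenizerClone {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

/// Blanket impl: any concrete tokenizer that is `Clone` gets `clone_box`.
impl<T> TokenizerClone for T
where
    T: Tokenizer + Clone + 'static,
{
    fn clone_box(&self) -> Box<dyn Tokenizer> {
        Box::new(self.clone())
    }
}

/// Assumed definition of the alias for dynamic dispatch.
pub type BoxedTokenizer = Box<dyn Tokenizer>;

impl Clone for Box<dyn Tokenizer> {
    fn clone(&self) -> Self {
        // Dereference to the trait object so `clone_box` is dispatched
        // to the concrete type inside the box.
        (**self).clone_box()
    }
}

/// Simple whitespace tokenizer used to exercise the pattern.
#[derive(Clone)]
pub struct SimpleTokenizer;

impl Tokenizer for SimpleTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_string).collect()
    }
}
```

With this in place, a struct holding a `BoxedTokenizer` field can derive `Clone`, which is useful when a registry hands out copies of a named tokenizer.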