pub trait Tokenizer: Send + Sync {
// Required methods
fn tokenize(&self, text: &str, language: Option<&Language>) -> Vec<Token>;
fn name(&self) -> &'static str;
// Provided method
fn is_stopword(&self, token: &Token, language: Option<&Language>) -> bool { ... }
}
Trait for language-specific tokenization.
Implementations should handle:
- Word segmentation (whitespace, morphological, statistical)
- Stopword detection
- Case normalization (if applicable)
- Script-specific rules (CJK, Arabic, etc.)
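
A minimal sketch of an implementation, assuming hypothetical stand-in definitions for the crate's `Token` and `Language` types (shown here only so the example compiles on its own; the real types will differ):

```rust
// Hypothetical stand-ins for the crate's `Token` and `Language` types.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub text: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Language {
    English,
}

pub trait Tokenizer: Send + Sync {
    fn tokenize(&self, text: &str, language: Option<&Language>) -> Vec<Token>;
    fn name(&self) -> &'static str;

    // Provided method: by default, nothing is treated as a stopword.
    fn is_stopword(&self, _token: &Token, _language: Option<&Language>) -> bool {
        false
    }
}

/// A naive whitespace tokenizer that lowercases each word.
pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str, _language: Option<&Language>) -> Vec<Token> {
        // Word segmentation by whitespace, with case normalization.
        text.split_whitespace()
            .map(|w| Token { text: w.to_lowercase() })
            .collect()
    }

    fn name(&self) -> &'static str {
        "whitespace"
    }

    // Override the provided method with a tiny English stopword list.
    fn is_stopword(&self, token: &Token, _language: Option<&Language>) -> bool {
        matches!(token.text.as_str(), "the" | "a" | "an" | "of")
    }
}

fn main() {
    let tok = WhitespaceTokenizer;
    let tokens = tok.tokenize("The quick brown fox", Some(&Language::English));
    let kept: Vec<&str> = tokens
        .iter()
        .filter(|t| !tok.is_stopword(t, Some(&Language::English)))
        .map(|t| t.text.as_str())
        .collect();
    println!("{:?}", kept); // ["quick", "brown", "fox"]
}
```

A whitespace splitter like this only works for space-delimited scripts; a CJK implementation would instead segment morphologically or statistically inside `tokenize`, which is why the trait takes the full text rather than pre-split words.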