pub trait Bm25Tokenizer {
    // Required method
    fn tokenize(&self, input_text: &str) -> Vec<String>;
}
Trait for tokenizing text into individual terms for BM25 processing.
Implementors of this trait define how input text is broken down into individual tokens. This is an important step in the BM25 algorithm, since it determines how documents are analysed and indexed.
Common tokenization strategies include:
- Whitespace splitting: Split text on whitespace, optionally stripping punctuation
- Stemming/Lemmatization: Reduce words to their root forms
- N-gram generation: Create overlapping sequences of words
- Language-specific processing: Handle specific language features
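As a concrete illustration of the n-gram strategy above, here is a sketch of a word-bigram tokenizer. The `BigramTokenizer` type is hypothetical, not part of the crate; the trait definition is repeated inline (matching the signature documented above) so the example compiles on its own, whereas real code would `use bm25_vectorizer::Bm25Tokenizer;` instead.

```rust
// Repeated inline so the example is self-contained; normally imported
// from the crate: use bm25_vectorizer::Bm25Tokenizer;
pub trait Bm25Tokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String>;
}

// Hypothetical tokenizer: lowercases, splits on whitespace, then emits
// overlapping pairs of adjacent words (word bigrams).
struct BigramTokenizer;

impl Bm25Tokenizer for BigramTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        let words: Vec<String> = input_text
            .split_whitespace()
            .map(|w| w.to_lowercase())
            .collect();
        // windows(2) yields each adjacent pair; join them with a space.
        words
            .windows(2)
            .map(|pair| pair.join(" "))
            .collect()
    }
}

fn main() {
    let tokenizer = BigramTokenizer;
    let tokens = tokenizer.tokenize("Rust makes BM25 fast");
    assert_eq!(tokens, vec!["rust makes", "makes bm25", "bm25 fast"]);
    println!("{:?}", tokens);
}
```

Bigram tokens let BM25 score matches on short phrases rather than isolated words, at the cost of a larger vocabulary.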
§Examples
use bm25_vectorizer::Bm25Tokenizer;

struct WhitespaceTokenizer;

impl Bm25Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        input_text
            .split_whitespace()
            .map(|token| token.to_lowercase())
            .collect()
    }
}

let tokenizer = WhitespaceTokenizer;
let tokens = tokenizer.tokenize("Hello World Example");
assert_eq!(tokens, vec!["hello", "world", "example"]);
§Required Methods
fn tokenize(&self, input_text: &str) -> Vec<String>
Tokenizes the input text into a vector of string tokens.
This method takes a string slice and returns a vector of tokens that will be used for BM25 scoring.
§Arguments
input_text - The text to be tokenized
§Returns
A vector of string tokens extracted from the input text
§Examples
use bm25_vectorizer::Bm25Tokenizer;
struct SimpleTokenizer;
impl Bm25Tokenizer for SimpleTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        input_text
            .split_whitespace()
            .map(String::from)
            .collect()
    }
}
let tokenizer = SimpleTokenizer;
let tokens = tokenizer.tokenize("rust is awesome");
assert_eq!(tokens, vec!["rust", "is", "awesome"]);