Trait Bm25Tokenizer
pub trait Bm25Tokenizer {
    // Required method
    fn tokenize(&self, input_text: &str) -> Vec<String>;
}

Trait for tokenizing text into individual terms for BM25 processing.

Implementors of this trait define how input text is broken down into individual tokens. Tokenization is an important step in the BM25 algorithm, as it determines how documents are analyzed and indexed.

Common tokenization strategies include:

  • Whitespace splitting: Split on spaces and other whitespace characters
  • Stemming/Lemmatization: Reduce words to their root forms
  • N-gram generation: Create overlapping sequences of words
  • Language-specific processing: Handle specific language features

§Examples

use bm25_vectorizer::Bm25Tokenizer;

struct WhitespaceTokenizer;

impl Bm25Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        input_text
            .split_whitespace()
            .map(|token| token.to_lowercase())
            .collect()
    }
}

let tokenizer = WhitespaceTokenizer;
let tokens = tokenizer.tokenize("Hello World Example");
assert_eq!(tokens, vec!["hello", "world", "example"]);

Required Methods§

fn tokenize(&self, input_text: &str) -> Vec<String>

Tokenizes the input text into a vector of string tokens.

This method takes a string slice and returns a vector of tokens that will be used for BM25 scoring.

§Arguments
  • input_text - The text to be tokenized
§Returns

A vector of string tokens extracted from the input text

§Examples
use bm25_vectorizer::Bm25Tokenizer;

struct SimpleTokenizer;
impl Bm25Tokenizer for SimpleTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        input_text.split_whitespace()
                  .map(String::from)
                  .collect()
    }
}

let tokenizer = SimpleTokenizer;
let tokens = tokenizer.tokenize("rust is awesome");
assert_eq!(tokens, vec!["rust", "is", "awesome"]);

Implementors§