pub trait Bm25Tokenizer {
    // Required method
    fn tokenize(&self, input_text: &str) -> Vec<String>;
}
Trait for tokenizing text into individual terms for BM25 processing.
Implementors of this trait define how input text is broken down into individual tokens. This is an important step in the BM25 algorithm, since it determines how documents are analysed and indexed.
Common tokenization strategies include:
- Whitespace splitting: Split text on whitespace, optionally stripping punctuation
- Stemming/Lemmatization: Reduce words to their root forms
- N-gram generation: Create overlapping sequences of words
- Language-specific processing: Handle specific language features
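As a concrete illustration of the n-gram strategy above, here is a sketch of a word-bigram tokenizer. The `BigramTokenizer` type is hypothetical, not part of the crate; the trait definition is repeated inline (matching the signature documented above) so the example compiles on its own, whereas real code would `use bm25_vectorizer::Bm25Tokenizer;` instead.

```rust
// Repeated inline so the example is self-contained; normally imported
// from the crate: use bm25_vectorizer::Bm25Tokenizer;
pub trait Bm25Tokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String>;
}

// Hypothetical tokenizer: lowercases, splits on whitespace, then emits
// overlapping pairs of adjacent words (word bigrams).
struct BigramTokenizer;

impl Bm25Tokenizer for BigramTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        let words: Vec<String> = input_text
            .split_whitespace()
            .map(|w| w.to_lowercase())
            .collect();
        // windows(2) yields each adjacent pair; join them with a space.
        words
            .windows(2)
            .map(|pair| pair.join(" "))
            .collect()
    }
}

fn main() {
    let tokenizer = BigramTokenizer;
    let tokens = tokenizer.tokenize("Rust makes BM25 fast");
    assert_eq!(tokens, vec!["rust makes", "makes bm25", "bm25 fast"]);
    println!("{:?}", tokens);
}
```

Bigram tokens let BM25 score matches on short phrases rather than isolated words, at the cost of a larger vocabulary.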
§Examples
use bm25_vectorizer::Bm25Tokenizer;

struct WhitespaceTokenizer;

impl Bm25Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        input_text
            .split_whitespace()
            .map(|token| token.to_lowercase())
            .collect()
    }
}

let tokenizer = WhitespaceTokenizer;
let tokens = tokenizer.tokenize("Hello World Example");
assert_eq!(tokens, vec!["hello", "world", "example"]);
§Required Methods
fn tokenize(&self, input_text: &str) -> Vec<String>
Tokenizes the input text into a vector of string tokens.
This method takes a string slice and returns a vector of tokens that will be used for BM25 scoring.
§Arguments
input_text - The text to be tokenized
§Returns
A vector of string tokens extracted from the input text
§Examples
use bm25_vectorizer::Bm25Tokenizer;
struct SimpleTokenizer;
impl Bm25Tokenizer for SimpleTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        input_text
            .split_whitespace()
            .map(String::from)
            .collect()
    }
}
let tokenizer = SimpleTokenizer;
let tokens = tokenizer.tokenize("rust is awesome");
assert_eq!(tokens, vec!["rust", "is", "awesome"]);