pub struct Bm25Vectorizer<TokenIndexer, Tokenizer> { /* private fields */ }
The main BM25 vectorizer that converts text into sparse vector representations.
This struct encapsulates all the parameters and components needed to perform BM25 vectorization. It uses a tokenizer to break text into tokens and a token indexer to map tokens to indices.
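As a rough illustration of the two components described above, the following sketch tokenizes text on whitespace, hashes each token to an index, and counts occurrences per index. The names `tokenize`, `index_token`, and `term_counts` are hypothetical and are not part of this crate; real implementations of `Bm25Tokenizer` and `Bm25TokenIndexer` may behave differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical tokenizer: splits on whitespace and lowercases each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Hypothetical token indexer: hashes a token down to a u32 index.
fn index_token(token: &str) -> u32 {
    let mut hasher = DefaultHasher::new();
    token.hash(&mut hasher);
    hasher.finish() as u32
}

/// Counts how often each token index occurs in the text.
fn term_counts(text: &str) -> BTreeMap<u32, u32> {
    let mut counts = BTreeMap::new();
    for token in tokenize(text) {
        *counts.entry(index_token(&token)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Three tokens in total; "hello" occurs twice.
    let counts = term_counts("hello world hello");
    println!("{counts:?}");
}
```

The raw counts produced this way are what a BM25 weighting step would then turn into per-index values.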
§Type Parameters
TokenIndexer: Implementation of the Bm25TokenIndexer trait for mapping tokens to indices
Tokenizer: Implementation of the Bm25Tokenizer trait for text tokenization
§Examples
use bm25_vectorizer::{Bm25VectorizerBuilder, MockWhitespaceTokenizer, MockHashTokenIndexer};
let corpus = vec!["hello world", "world of rust"];
let vectorizer = Bm25VectorizerBuilder::new()
    .tokenizer(MockWhitespaceTokenizer)
    .token_indexer(MockHashTokenIndexer)
    .fit(&corpus)?
    .build()?;
let result = vectorizer.vectorize("hello rust");
§Implementations
impl<TokenIndexer, Tokenizer> Bm25Vectorizer<TokenIndexer, Tokenizer>
pub fn avgdl(&self) -> f32
Returns the average document length used for normalisation.
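For orientation, the average document length is the mean number of tokens per document in the fitted corpus. The sketch below assumes whitespace tokenization and uses a hypothetical helper, `average_document_length`; it is not the crate's implementation.

```rust
/// Mean number of whitespace-separated tokens per document.
/// Returns 0.0 for an empty corpus to avoid dividing by zero.
fn average_document_length(corpus: &[&str]) -> f32 {
    if corpus.is_empty() {
        return 0.0;
    }
    let total_tokens: usize = corpus
        .iter()
        .map(|doc| doc.split_whitespace().count())
        .sum();
    total_tokens as f32 / corpus.len() as f32
}

fn main() {
    let corpus = ["hello world", "world of rust"];
    // The documents have 2 and 3 tokens, so the average is 2.5.
    assert_eq!(average_document_length(&corpus), 2.5);
}
```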
§Examples
assert_eq!(vectorizer.avgdl(), 10.5);
pub fn k1(&self) -> f32
Returns the k1 parameter controlling term frequency saturation.
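To show what saturation means, the sketch below uses the textbook BM25 term-frequency factor with length normalisation left out. The function `saturated_tf` is hypothetical, and the exact expression used by this crate may differ.

```rust
/// Textbook BM25 saturation factor without length normalisation:
/// tf * (k1 + 1) / (tf + k1). It grows with tf but never reaches k1 + 1.
fn saturated_tf(tf: f32, k1: f32) -> f32 {
    tf * (k1 + 1.0) / (tf + k1)
}

fn main() {
    let k1 = 1.5;
    // Each additional occurrence adds less than the previous one.
    for tf in [1.0, 2.0, 5.0, 50.0] {
        println!("tf = {tf:>4}: {:.3}", saturated_tf(tf, k1));
    }
}
```

A smaller k1 makes the curve flatten sooner, so repeated occurrences of a term add less to its value.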
§Examples
assert_eq!(vectorizer.k1(), 1.5);
pub fn b(&self) -> f32
Returns the b parameter controlling length normalisation.
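The effect of b can be seen in the textbook BM25 length-normalisation factor, sketched below with a hypothetical function `length_norm`. This follows the standard formula and is an assumption about, not a copy of, the crate's internals.

```rust
/// Textbook BM25 length-normalisation factor:
/// 1 - b + b * (doc_len / avgdl).
fn length_norm(doc_len: f32, avgdl: f32, b: f32) -> f32 {
    1.0 - b + b * (doc_len / avgdl)
}

fn main() {
    let avgdl = 10.0;
    // b = 0 disables normalisation; b = 1 scales fully with relative length.
    for b in [0.0, 0.5, 1.0] {
        println!("b = {b}: factor for a 30-token document = {}", length_norm(30.0, avgdl, b));
    }
}
```

The factor multiplies k1 in the denominator of the term-frequency expression, so documents longer than average receive lower term values as b increases.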
§Examples
assert_eq!(vectorizer.b(), 0.8);
pub fn delta(&self) -> f32
Returns the delta parameter used as a lower bound for term values.
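A lower bound of this kind matches the BM25+ variant, in which delta is added to the length-normalised term-frequency value so that a matching term in a very long document still contributes at least delta. The sketch below assumes that formulation; `tf_component` is a hypothetical name and the crate's actual computation may differ.

```rust
/// BM25+ style term value (an assumed formulation):
/// tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl)) + delta.
fn tf_component(tf: f32, doc_len: f32, avgdl: f32, k1: f32, b: f32, delta: f32) -> f32 {
    let norm = 1.0 - b + b * (doc_len / avgdl);
    tf * (k1 + 1.0) / (tf + k1 * norm) + delta
}

fn main() {
    // One occurrence in a document 100 times longer than average.
    let without_delta = tf_component(1.0, 1000.0, 10.0, 1.5, 0.8, 0.0);
    let with_delta = tf_component(1.0, 1000.0, 10.0, 1.5, 0.8, 0.25);
    // Without delta the value is close to zero; with delta it stays above 0.25.
    println!("without delta: {without_delta:.4}, with delta: {with_delta:.4}");
}
```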
§Examples
assert_eq!(vectorizer.delta(), 0.25);
pub fn vectorize(
    &self,
    text: &str,
) -> SparseRepresentation<TokenIndexer::Bm25TokenIndex>
where
    TokenIndexer: Bm25TokenIndexer,
    TokenIndexer::Bm25TokenIndex: Eq + Hash + Clone + Debug + Ord,
    Tokenizer: Bm25Tokenizer,
Converts input text into a sparse BM25 vector representation.
This method tokenizes the input text and computes a BM25 term-frequency value for each token, producing a sparse vector representation that can then be uploaded to a vector database.
NOTE: This implementation produces only the normalised term frequency (TF) component in document vectors and expects the vector database to compute the inverse document frequency (IDF). Some vector databases require an IDF modifier to be specified when the vector store is set up, so that they calculate IDF statistics automatically.
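The division of labour can be read off the textbook BM25 score, which factors into an IDF term and a TF term for each query token. The formula below is the standard formulation, not taken from this crate's source; the symbols are assumptions introduced for illustration, and the IDF variant applied by a given database may differ.

```latex
\mathrm{score}(q, d) \;=\; \sum_{t \in q}
  \underbrace{\mathrm{IDF}(t)}_{\text{computed by the vector database}}
  \cdot
  \underbrace{\mathrm{TF}(t, d)}_{\text{value stored in the document vector}}
```

Here q is the query, d is a document, and TF(t, d) stands for the per-token value that vectorize places in the sparse vector.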
§Arguments
text: The input text to vectorize
§Returns
A SparseRepresentation containing token indices and their BM25 values
§Examples
use bm25_vectorizer::{Bm25VectorizerBuilder, MockWhitespaceTokenizer, MockHashTokenIndexer};
let corpus = vec!["hello world", "world rust"];
let vectorizer = Bm25VectorizerBuilder::new()
    .tokenizer(MockWhitespaceTokenizer)
    .token_indexer(MockHashTokenIndexer)
    .fit(&corpus)?
    .build()?;
let result = vectorizer.vectorize("hello world");
// Result contains BM25 values for tokens "hello" and "world"
assert_eq!(result.0.len(), 2);