Struct Bm25Vectorizer 

pub struct Bm25Vectorizer<TokenIndexer, Tokenizer> { /* private fields */ }
The main BM25 vectorizer that converts text into sparse vector representations.

This struct encapsulates all the parameters and components needed to perform BM25 vectorization. It uses a tokenizer to break text into tokens and a token indexer to map tokens to indices.

§Type Parameters

  • TokenIndexer: An implementation of the Bm25TokenIndexer trait, which maps tokens to indices
  • Tokenizer: An implementation of the Bm25Tokenizer trait, which splits text into tokens
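
The two components described above can be pictured with a minimal, standalone sketch. It does not use the crate's traits: the functions tokenize, index_of and term_counts are hypothetical stand-ins, and hashing tokens to u64 values is only one possible indexing strategy, assumed here for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a tokenizer: lowercase and split on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

// Hypothetical stand-in for a token indexer: hash each token to a u64 index.
fn index_of(token: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    token.hash(&mut hasher);
    hasher.finish()
}

// Count how often each token index occurs in the text.
fn term_counts(text: &str) -> BTreeMap<u64, u32> {
    let mut counts = BTreeMap::new();
    for token in tokenize(text) {
        *counts.entry(index_of(&token)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = term_counts("hello rust hello");
    println!("{} distinct indices", counts.len());
    println!("count for \"hello\": {:?}", counts.get(&index_of("hello")));
}
```

The raw counts per index are the input that a BM25 weighting step would then turn into sparse vector values.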

§Examples

use bm25_vectorizer::{Bm25VectorizerBuilder, MockWhitespaceTokenizer, MockHashTokenIndexer};

// The `?` operator below assumes an enclosing function that returns a compatible `Result`.
let corpus = vec!["hello world", "world of rust"];
let vectorizer = Bm25VectorizerBuilder::new()
    .tokenizer(MockWhitespaceTokenizer)
    .token_indexer(MockHashTokenIndexer)
    .fit(&corpus)?
    .build()?;

let result = vectorizer.vectorize("hello rust");

Implementations§

impl<TokenIndexer, Tokenizer> Bm25Vectorizer<TokenIndexer, Tokenizer>

pub fn avgdl(&self) -> f32

Returns the average document length used for normalisation.

§Examples
// Assumes `vectorizer` was fitted on a corpus whose average document length is 10.5.
assert_eq!(vectorizer.avgdl(), 10.5);

pub fn k1(&self) -> f32

Returns the k1 parameter controlling term frequency saturation.

§Examples
// Assumes `vectorizer` was configured with k1 = 1.5.
assert_eq!(vectorizer.k1(), 1.5);

pub fn b(&self) -> f32

Returns the b parameter controlling length normalisation.

§Examples
// Assumes `vectorizer` was configured with b = 0.8.
assert_eq!(vectorizer.b(), 0.8);

pub fn delta(&self) -> f32

Returns the delta parameter used as a lower bound for term values.

§Examples
// Assumes `vectorizer` was configured with delta = 0.25.
assert_eq!(vectorizer.delta(), 0.25);

pub fn vectorize( &self, text: &str, ) -> SparseRepresentation<TokenIndexer::Bm25TokenIndex>
where TokenIndexer: Bm25TokenIndexer, TokenIndexer::Bm25TokenIndex: Eq + Hash + Clone + Debug + Ord, Tokenizer: Bm25Tokenizer,

Converts input text into a sparse BM25 vector representation.

This method tokenizes the input text and computes BM25 term-frequency weights, producing a sparse vector representation that can then be uploaded to a vector database.

NOTE: Some vector databases require an IDF modifier to be specified when the vector store is set up, which instructs them to compute IDF statistics automatically. This implementation produces only the normalised term frequency (TF) component of each document vector and expects the vector database to compute the inverse document frequency (IDF).
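
To make the TF-only output concrete, here is a minimal sketch of how avgdl, k1, b and delta could combine into a single term weight. It assumes the standard BM25+ formulation, in which delta is added after saturation; the helper name bm25_tf is hypothetical, and the crate's internal formula may differ. The parameter values are the illustrative ones from the accessor examples.

```rust
/// Normalised BM25 term-frequency weight for a single token.
/// Hypothetical helper, not part of the crate's API; assumes the BM25+ form.
fn bm25_tf(tf: f32, doc_len: f32, avgdl: f32, k1: f32, b: f32, delta: f32) -> f32 {
    // Length normalisation: b = 0 ignores document length, b = 1 applies it fully.
    let norm = 1.0 - b + b * (doc_len / avgdl);
    // Saturating term frequency: grows with tf but stays below k1 + 1.
    let saturated = tf * (k1 + 1.0) / (tf + k1 * norm);
    // delta lifts the weight so that a term that is present never scores below it.
    saturated + delta
}

fn main() {
    let (k1, b, delta, avgdl) = (1.5, 0.8, 0.25, 10.5);
    // Same raw term count in a short and a long document.
    let short = bm25_tf(2.0, 5.0, avgdl, k1, b, delta);
    let long = bm25_tf(2.0, 40.0, avgdl, k1, b, delta);
    println!("short doc: {short:.3}, long doc: {long:.3}");
}
```

Under these assumptions the same term count weighs more in the shorter document, and no IDF factor appears anywhere, which is why the vector database has to supply it.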

§Arguments
  • text: The input text to vectorize
§Returns

A SparseRepresentation containing token indices and their BM25 values

§Examples
use bm25_vectorizer::{Bm25VectorizerBuilder, MockWhitespaceTokenizer, MockHashTokenIndexer};

// The `?` operator below assumes an enclosing function that returns a compatible `Result`.
let corpus = vec!["hello world", "world rust"];
let vectorizer = Bm25VectorizerBuilder::new()
    .tokenizer(MockWhitespaceTokenizer)
    .token_indexer(MockHashTokenIndexer)
    .fit(&corpus)?
    .build()?;

let result = vectorizer.vectorize("hello world");
// Result contains BM25 values for tokens "hello" and "world"
assert_eq!(result.0.len(), 2);

Trait Implementations§

impl<TokenIndexer: Debug, Tokenizer: Debug> Debug for Bm25Vectorizer<TokenIndexer, Tokenizer>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations§

impl<TokenIndexer, Tokenizer> Freeze for Bm25Vectorizer<TokenIndexer, Tokenizer>
where Tokenizer: Freeze, TokenIndexer: Freeze,

impl<TokenIndexer, Tokenizer> RefUnwindSafe for Bm25Vectorizer<TokenIndexer, Tokenizer>
where Tokenizer: RefUnwindSafe, TokenIndexer: RefUnwindSafe,

impl<TokenIndexer, Tokenizer> Send for Bm25Vectorizer<TokenIndexer, Tokenizer>
where Tokenizer: Send, TokenIndexer: Send,

impl<TokenIndexer, Tokenizer> Sync for Bm25Vectorizer<TokenIndexer, Tokenizer>
where Tokenizer: Sync, TokenIndexer: Sync,

impl<TokenIndexer, Tokenizer> Unpin for Bm25Vectorizer<TokenIndexer, Tokenizer>
where Tokenizer: Unpin, TokenIndexer: Unpin,

impl<TokenIndexer, Tokenizer> UnwindSafe for Bm25Vectorizer<TokenIndexer, Tokenizer>
where Tokenizer: UnwindSafe, TokenIndexer: UnwindSafe,

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.