Module bm25

BM25 Scoring for Lexical Search (Task 4)

This module implements BM25 (Best Matching 25) scoring for keyword search. BM25 is a widely used ranking function for lexical retrieval. It balances three factors:

  • Term frequency (TF): How often a term appears in a document
  • Inverse document frequency (IDF): How rare a term is across all documents
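  • Document length normalization: Scaling term frequency by document length relative to the collection average, so long documents are not favored just for containing more words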

§BM25 Formula

score(q, d) = Σ_{t ∈ q} IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d|/avgdl))

Where:

  • TF(t,d) = term frequency of term t in document d
  • IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
  • N = total number of documents
  • df(t) = number of documents containing term t
  • |d| = length of document d
  • avgdl = average document length
  • k1 = term frequency saturation parameter (typically 1.2)
  • b = length normalization parameter (typically 0.75; 0 disables length normalization, 1 applies it fully)
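
The formula can be sketched in a few lines of Rust. This is an illustrative sketch only, not this module's actual implementation: `SketchParams`, `idf`, and `bm25_score` are hypothetical names and do not reflect the API of `BM25Config` or `BM25Scorer`. It assumes documents and queries are already tokenized, that |d| is counted in tokens, and that `log` means the natural logarithm.

```rust
use std::collections::HashMap;

/// Hypothetical parameter bundle for this sketch (not the crate's `BM25Config`).
#[derive(Debug, Clone, Copy)]
pub struct SketchParams {
    /// Term frequency saturation parameter.
    pub k1: f64,
    /// Length normalization parameter.
    pub b: f64,
}

impl Default for SketchParams {
    fn default() -> Self {
        Self { k1: 1.2, b: 0.75 }
    }
}

/// IDF(t) = ln((N - df(t) + 0.5) / (df(t) + 0.5) + 1).
/// The natural logarithm is an assumption; the formula only says `log`.
pub fn idf(n_docs: usize, df: usize) -> f64 {
    let n = n_docs as f64;
    let df = df as f64;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Scores one tokenized document against a tokenized query.
/// `corpus` supplies N, df(t), and avgdl. Document frequency is recomputed
/// by scanning the corpus for each query term, which keeps the sketch short;
/// a real scorer would precompute these statistics.
pub fn bm25_score(
    query: &[&str],
    doc: &[&str],
    corpus: &[Vec<&str>],
    params: SketchParams,
) -> f64 {
    let n_docs = corpus.len();
    if n_docs == 0 {
        return 0.0;
    }
    let total_len: usize = corpus.iter().map(|d| d.len()).sum();
    let avgdl = total_len as f64 / n_docs as f64;
    if avgdl == 0.0 {
        return 0.0;
    }

    // TF(t, d): count each token in the document.
    let mut tf: HashMap<&str, usize> = HashMap::new();
    for &tok in doc {
        *tf.entry(tok).or_insert(0) += 1;
    }

    // 1 - b + b * |d| / avgdl, shared by every term in this document.
    let len_norm = 1.0 - params.b + params.b * (doc.len() as f64 / avgdl);

    query
        .iter()
        .map(|&term| {
            let f = *tf.get(term).unwrap_or(&0) as f64;
            if f == 0.0 {
                // Terms absent from the document contribute nothing.
                return 0.0;
            }
            let df = corpus.iter().filter(|d| d.contains(&term)).count();
            idf(n_docs, df) * (f * (params.k1 + 1.0)) / (f + params.k1 * len_norm)
        })
        .sum()
}
```

With the default parameters, two documents that each contain a query term once receive the same IDF weight, but the shorter document scores higher because its `len_norm` factor is smaller. A document containing none of the query terms scores 0.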

Structs§

BM25Config
BM25 scoring parameters
BM25Scorer
BM25 scorer for a document collection
BM25Stats
BM25 scorer statistics

Functions§

tokenize
Simple whitespace + lowercase tokenizer
tokenize_minimal
Tokenize with minimal normalization
tokenize_query
Tokenize query (keeps original for exact matching, adds lowercase)