BM25 Scoring for Lexical Search (Task 4)
This module implements BM25 (Best Matching 25) scoring for keyword search. BM25 is a standard ranking function for lexical retrieval, balancing:
- Term frequency (TF): How often a term appears in a document
- Inverse document frequency (IDF): How rare a term is across all documents
- Document length normalization: Penalizing very long documents
§BM25 Formula
score(q, d) = Σ IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d|/avgdl))
Where:
- TF(t,d) = term frequency of term t in document d
- IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
- N = total number of documents
- df(t) = number of documents containing term t
- |d| = length of document d
- avgdl = average document length
- k1 = term frequency saturation parameter (typically 1.2)
- b = length normalization parameter (typically 0.75)
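The formula can be sketched directly in Rust. This is a minimal illustration under stated assumptions, not the module's actual implementation: `Params`, `idf`, `term_score`, and `score` are hypothetical names, `log` is taken to be the natural logarithm, and document length is counted in tokens.

```rust
use std::collections::HashMap;

/// Hypothetical parameter bundle for this sketch (not the module's own config type).
struct Params {
    k1: f64, // term frequency saturation, typically 1.2
    b: f64,  // length normalization, typically 0.75
}

/// IDF(t) = ln((N - df(t) + 0.5) / (df(t) + 0.5) + 1).
/// Assumption: natural log; the "+ 1" inside the log keeps the value positive.
fn idf(n_docs: usize, df: usize) -> f64 {
    let n = n_docs as f64;
    let df = df as f64;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Contribution of a single query term to score(q, d).
fn term_score(tf: f64, doc_len: f64, avgdl: f64, idf: f64, p: &Params) -> f64 {
    // A term that is absent from the document (or an empty corpus) contributes nothing.
    if tf == 0.0 || avgdl == 0.0 {
        return 0.0;
    }
    // Length normalization: 1.0 for an average-length document, larger for longer ones.
    let norm = 1.0 - p.b + p.b * doc_len / avgdl;
    idf * (tf * (p.k1 + 1.0)) / (tf + p.k1 * norm)
}

/// score(q, d): sum of per-term contributions over the query terms.
/// `df` maps each term to the number of documents containing it.
fn score(
    query: &[&str],
    doc: &[&str],
    df: &HashMap<String, usize>,
    n_docs: usize,
    avgdl: f64,
    p: &Params,
) -> f64 {
    let doc_len = doc.len() as f64; // assumption: |d| is the token count
    query
        .iter()
        .map(|t| {
            let tf = doc.iter().filter(|w| *w == t).count() as f64;
            let d = df.get(*t).copied().unwrap_or(0);
            term_score(tf, doc_len, avgdl, idf(n_docs, d), p)
        })
        .sum()
}
```

With `tf = 1` and `|d| = avgdl`, the fraction reduces to `(k1 + 1) / (1 + k1) = 1`, so the term contributes exactly `IDF(t)`. Longer documents increase the denominator and therefore score lower at the same term frequency.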
Structs§
- BM25Config - BM25 scoring parameters
- BM25Scorer - BM25 scorer for a document collection
- BM25Stats - BM25 scorer statistics
Functions§
- tokenize - Simple whitespace + lowercase tokenizer
- tokenize_minimal - Tokenize with minimal normalization
- tokenize_query - Tokenize query (keeps original for exact matching, adds lowercase)
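
The one-line summaries of the tokenizers above can be illustrated as follows. The signatures of the real functions are not shown on this page, so `sketch_tokenize` and `sketch_tokenize_query` are hypothetical stand-ins whose behavior is only inferred from those summaries; in particular, emitting the lowercase form only when it differs from the original is an assumption.

```rust
/// Hypothetical stand-in: split on whitespace and lowercase every token.
fn sketch_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// Hypothetical stand-in: keep each original token for exact matching and
/// add its lowercase form when that differs from the original (assumption).
fn sketch_tokenize_query(query: &str) -> Vec<String> {
    let mut out = Vec::new();
    for w in query.split_whitespace() {
        out.push(w.to_string());
        let lower = w.to_lowercase();
        if lower != w {
            out.push(lower);
        }
    }
    out
}
```

Under these assumptions, a query such as `"BM25 rust"` yields both `BM25` and `bm25` for the first word, so it can match case-sensitive and lowercased index entries alike.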