Expand description
BM25 relevance scoring — the standard ranking function for keyword retrieval. Pure math; no SQL coupling.
Resolves Phase 8 plan Q4 + Q5: no stemming and no stop-list. The
caller is responsible for tokenizing both the query and the document
(see super::tokenizer::tokenize); this module just consumes term
frequencies + corpus stats and produces a score.
§Formula (Robertson/Spärck Jones BM25)
For a document d and query q:
score(d, q) = Σ_{t ∈ q} idf(t) · (tf(t,d) · (k1 + 1)) /
(tf(t,d) + k1 · (1 - b + b · |d| / avgdl))
idf(t) = ln(1 + (N - n(t) + 0.5) / (n(t) + 0.5))N= total documents in corpusn(t)= number of documents containing termttf(t,d)= frequency oftind|d|= length ofdin tokensavgdl= average document length across the corpusk1,b= tuning constants (Q4 — fixed at SQLite FTS5 defaults)
The + 1 inside the IDF log keeps the term non-negative even when
n(t) > N/2, which would otherwise give the classic BM25 negative
IDF and require clipping. This is the “BM25+” / Lucene variant.
Structs§
- Bm25
Params - Tuning parameters for BM25. Per Phase 8 Q4 the public surface still
exposes these as a struct so we can grow per-call overrides later
without breaking signatures, but the
Bm25Params::default()values (k1 = 1.5,b = 0.75) are fixed for the MVP and match SQLite FTS5.
Functions§
- score
- Compute the BM25 score for a single (document, query) pair.