Skip to main content

Module bm25

Module bm25 

Source
Expand description

BM25 relevance scoring — the standard ranking function for keyword retrieval. Pure math; no SQL coupling.

Resolves Phase 8 plan Q4 + Q5: no stemming and no stop-list. The caller is responsible for tokenizing both the query and the document (see super::tokenizer::tokenize); this module just consumes term frequencies + corpus stats and produces a score.

§Formula (Robertson/Spärck Jones BM25)

For a document d and query q:

score(d, q) = Σ_{t ∈ q} idf(t) · (tf(t,d) · (k1 + 1)) /
                                 (tf(t,d) + k1 · (1 - b + b · |d| / avgdl))

idf(t) = ln(1 + (N - n(t) + 0.5) / (n(t) + 0.5))
  • N = total documents in corpus
  • n(t) = number of documents containing term t
  • tf(t,d) = frequency of t in d
  • |d| = length of d in tokens
  • avgdl = average document length across the corpus
  • k1, b = tuning constants (Q4 — fixed at SQLite FTS5 defaults)

The + 1 inside the IDF log keeps the term non-negative even when n(t) > N/2, which would otherwise give the classic BM25 negative IDF and require clipping. This is the “BM25+” / Lucene variant.

Structs§

Bm25Params
Tuning parameters for BM25. Per Phase 8 Q4 the public surface still exposes these as a struct so we can grow per-call overrides later without breaking signatures, but the Bm25Params::default() values (k1 = 1.5, b = 0.75) are fixed for the MVP and match SQLite FTS5.

Functions§

score
Compute the BM25 score for a single (document, query) pair.