Module bm25


BM25 with ripvec’s stem-doubled path enrichment.

Port of ~/src/semble/src/semble/index/sparse.py (enrich_for_bm25 and selector_to_mask) plus the BM25 scoring loop used in ~/src/semble/src/semble/search.py:search_bm25. The enrichment appends the file stem twice and the last three directory components to chunk content before tokenization, so path-based queries hit even when the query terms aren’t in the chunk text.
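The enrichment step described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the name `enrich_sketch`, its signature, the space separator, and the ordering of the appended parts are all assumptions made for the example.

```rust
use std::path::{Component, Path};

/// Hypothetical sketch: append the file stem twice and the last three
/// directory components to the chunk text. Separator and ordering are
/// assumptions, not taken from the real `enrich_for_bm25`.
fn enrich_sketch(content: &str, path: &str) -> String {
    let p = Path::new(path);
    let stem = p.file_stem().and_then(|s| s.to_str()).unwrap_or("");

    // Collect the plain directory names of the parent path.
    let dirs: Vec<&str> = match p.parent() {
        Some(dir) => dir
            .components()
            .filter_map(|c| match c {
                Component::Normal(s) => s.to_str(),
                _ => None,
            })
            .collect::<Vec<&str>>(),
        None => Vec::new(),
    };
    let start = dirs.len().saturating_sub(3);

    // Stem twice (up-weighting), then at most three trailing directories.
    let mut extras: Vec<&str> = vec![stem, stem];
    extras.extend_from_slice(&dirs[start..]);

    let mut out = String::from(content);
    for part in extras {
        if !part.is_empty() {
            out.push(' ');
            out.push_str(part);
        }
    }
    out
}
```

Under these assumptions, a chunk from `a/b/c/d/foo.rs` gains the tokens `foo foo b c d`, so a query for `foo` can match even if the chunk body never mentions it.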

Python uses the bm25s library; this port hand-rolls Okapi BM25 (k1=1.5, b=0.75) to avoid another dependency. The output ordering matches bm25s’s descending-score semantics with zero-score exclusion as in search.py:search_bm25.
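For orientation, a hand-rolled Okapi BM25 scorer with k1=1.5 and b=0.75 might look like the sketch below. The tokenizer, the function names, and the IDF variant (the non-negative `ln(1 + (N - df + 0.5) / (df + 0.5))` form) are assumptions chosen for illustration; the module may use a different tokenizer or IDF formula.

```rust
use std::collections::HashMap;

const K1: f64 = 1.5;
const B: f64 = 0.75;

/// Hypothetical tokenizer: lowercase, split on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against the query; returns one score per document.
fn bm25_scores(docs: &[&str], query: &str) -> Vec<f64> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    let avgdl = if tokenized.is_empty() { 0.0 } else { total_len as f64 / n };

    // Per-document term frequencies and corpus-wide document frequencies.
    let mut df: HashMap<&str, usize> = HashMap::new();
    let mut tfs: Vec<HashMap<&str, usize>> = Vec::new();
    for doc in &tokenized {
        let mut tf: HashMap<&str, usize> = HashMap::new();
        for t in doc {
            *tf.entry(t.as_str()).or_insert(0) += 1;
        }
        for t in tf.keys() {
            *df.entry(*t).or_insert(0) += 1;
        }
        tfs.push(tf);
    }

    let q = tokenize(query);
    tokenized
        .iter()
        .zip(&tfs)
        .map(|(doc, tf)| {
            // Length normalization term shared by all query terms.
            let norm = K1 * (1.0 - B + B * doc.len() as f64 / avgdl);
            q.iter()
                .map(|term| {
                    let f = *tf.get(term.as_str()).unwrap_or(&0) as f64;
                    if f == 0.0 {
                        return 0.0;
                    }
                    let d = df[term.as_str()] as f64;
                    let idf = (1.0 + (n - d + 0.5) / (d + 0.5)).ln();
                    idf * f * (K1 + 1.0) / (f + norm)
                })
                .sum::<f64>()
        })
        .collect()
}
```

Documents that share no term with the query score exactly zero in this sketch, which is what makes zero-score exclusion a simple `score > 0.0` filter.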

Structs§

Bm25Index
Hand-rolled Okapi BM25 index over a set of enriched documents.

Functions§

enrich_for_bm25
Append the file stem (twice, for up-weight) and the last three directory components to a chunk’s text content. Mirrors enrich_for_bm25 from sparse.py:18.
search_bm25
Top-k BM25 search with optional selector mask and zero-score exclusion. Mirrors search.py:search_bm25.
selector_to_mask
Convert a sparse selector (chunk indices to keep) into a dense boolean mask of the given size. Mirrors selector_to_mask from sparse.py:9. Returns None when selector is None.
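The mask conversion and the top-k selection with zero-score exclusion can be illustrated together. The names `selector_to_mask_sketch` and `top_k_sketch`, their signatures, the handling of out-of-range indices, and the tie-break by index are assumptions for this example and are not the module's real API.

```rust
/// Hypothetical sketch: turn a list of chunk indices to keep into a dense
/// boolean mask. Returns None when no selector is given. Silently ignoring
/// out-of-range indices is an assumption.
fn selector_to_mask_sketch(selector: Option<&[usize]>, size: usize) -> Option<Vec<bool>> {
    let selector = selector?;
    let mut mask = vec![false; size];
    for &i in selector {
        if i < size {
            mask[i] = true;
        }
    }
    Some(mask)
}

/// Hypothetical sketch: keep masked-in documents with a strictly positive
/// score, order them by descending score, and return at most `k` hits.
fn top_k_sketch(scores: &[f64], mask: Option<&[bool]>, k: usize) -> Vec<(usize, f64)> {
    let mut hits: Vec<(usize, f64)> = scores
        .iter()
        .copied()
        .enumerate()
        .filter(|&(i, s)| {
            s > 0.0 && mask.map_or(true, |m| m.get(i).copied().unwrap_or(false))
        })
        .collect();
    // Descending score; ties broken by ascending index (an assumption).
    hits.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    hits.truncate(k);
    hits
}
```

With scores `[0.0, 2.0, 1.0, 3.0]` and no mask, asking for two hits yields indices 3 and 1 in that order. Applying a mask built from the selector `[0, 2]` leaves only index 2, because index 0 has a zero score and is excluded.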