BM25 with ripvec’s stem-doubled path enrichment.
Port of ~/src/semble/src/semble/index/sparse.py (enrich_for_bm25
and selector_to_mask) plus the BM25 scoring loop used in
~/src/semble/src/semble/search.py:search_bm25. The enrichment
appends the file stem twice and the last three directory components
to chunk content before tokenization, so path-based queries hit
even when the query terms aren’t in the chunk text.
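The enrichment step described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the function name `enrich_sketch`, the single-space separator, and the order in which the stem and directory components are appended are all assumptions.

```rust
use std::path::{Component, Path};

/// Hypothetical sketch: append the file stem twice and the last three
/// directory components to the chunk content (separator and ordering assumed).
fn enrich_sketch(content: &str, file_path: &str) -> String {
    let path = Path::new(file_path);
    // File name without its extension, e.g. "sparse" for "sparse.py".
    let stem = path.file_stem().and_then(|s| s.to_str()).unwrap_or("");
    // Keep only ordinary directory names; skip root, "." and ".." entries.
    let dirs: Vec<&str> = path
        .parent()
        .map(|p| {
            p.components()
                .filter_map(|c| match c {
                    Component::Normal(s) => s.to_str(),
                    _ => None,
                })
                .collect()
        })
        .unwrap_or_default();
    // Last three directory components (fewer if the path is shallow).
    let start = dirs.len().saturating_sub(3);
    let dir_text = dirs[start..].join(" ");
    format!("{} {} {} {}", content, stem, stem, dir_text)
}
```

Under these assumptions, a chunk from `a/b/c/d/sparse_index.rs` gains the extra tokens `sparse_index sparse_index b c d`, so a query for `sparse_index` can match the chunk even if the word never appears in its body.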
Python uses the bm25s library; this port hand-rolls Okapi BM25
(k1=1.5, b=0.75) to avoid another dependency. The output ordering
matches bm25s’s descending-score semantics with zero-score
exclusion as in search.py:search_bm25.
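For orientation, a hand-rolled Okapi BM25 scorer with these parameters might look like the sketch below. The whitespace tokenizer, the Lucene-style non-negative IDF, and the name `bm25_top_k` are assumptions made for illustration; the source states only the values of k1 and b, the descending order, and the zero-score exclusion.

```rust
use std::collections::HashMap;

const K1: f64 = 1.5;
const B: f64 = 0.75;

/// Hypothetical tokenizer: lowercase, split on whitespace (an assumption).
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Returns (document index, score) pairs in descending score order,
/// omitting documents whose score is zero, truncated to `top_k`.
fn bm25_top_k(docs: &[&str], query: &str, top_k: usize) -> Vec<(usize, f64)> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    if tokenized.is_empty() {
        return Vec::new();
    }
    let n = tokenized.len() as f64;
    let avgdl = tokenized.iter().map(|d| d.len()).sum::<usize>() as f64 / n;

    // Document frequency: number of documents containing each term.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &tokenized {
        let mut seen: Vec<&str> = doc.iter().map(|s| s.as_str()).collect();
        seen.sort_unstable();
        seen.dedup();
        for t in seen {
            *df.entry(t).or_insert(0) += 1;
        }
    }

    let q = tokenize(query);
    let mut scored = Vec::new();
    for (i, doc) in tokenized.iter().enumerate() {
        let dl = doc.len() as f64;
        let mut score = 0.0;
        for term in &q {
            let tf = doc.iter().filter(|t| *t == term).count() as f64;
            if tf == 0.0 {
                continue;
            }
            let d = *df.get(term.as_str()).unwrap_or(&0) as f64;
            // Assumed IDF variant: ln(1 + (N - df + 0.5) / (df + 0.5)).
            let idf = (1.0 + (n - d + 0.5) / (d + 0.5)).ln();
            score += idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * dl / avgdl));
        }
        // Zero-score exclusion: documents sharing no query term are dropped.
        if score > 0.0 {
            scored.push((i, score));
        }
    }
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(top_k);
    scored
}
```

With equal term frequency, the length normalization controlled by b ranks a shorter document above a longer one, and a query with no matching term yields an empty result rather than a list of zero scores.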
Structs
- Bm25Index - Hand-rolled Okapi BM25 index over a set of enriched documents.
Functions
- enrich_for_bm25 - Append the file stem (twice, for up-weight) and the last three directory components to a chunk’s text content. Mirrors enrich_for_bm25 from sparse.py:18.
- search_bm25 - Top-k BM25 search with optional selector mask and zero-score exclusion. Mirrors search.py:search_bm25.
- selector_to_mask - Convert a sparse selector (chunk indices to keep) into a dense boolean mask of the given size. Mirrors selector_to_mask from sparse.py:9. Returns None when selector is None.
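
The selector-to-mask conversion can be sketched as below. The signature and the name `selector_to_mask_sketch` are assumptions and may differ from the crate's; silently ignoring out-of-range indices is also an assumption, since the source does not say how they are handled.

```rust
/// Hypothetical sketch: expand a sparse list of kept chunk indices into a
/// dense boolean mask of length `size`. Returns None when no selector is given.
fn selector_to_mask_sketch(selector: Option<&[usize]>, size: usize) -> Option<Vec<bool>> {
    let selector = selector?;
    let mut mask = vec![false; size];
    for &i in selector {
        // Assumption: indices outside 0..size are skipped rather than panicking.
        if i < size {
            mask[i] = true;
        }
    }
    Some(mask)
}
```

A dense mask lets the scoring loop test membership with a single index lookup per chunk, at the cost of allocating one boolean per chunk.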