# bm25x
A fast, streaming-friendly BM25 search engine written in Rust with memory-mapped (mmap) index support.
Inspired by bm25s, but designed from the ground up for incremental indexing — add, delete, and update documents without rebuilding the entire index.
## Features
- All 5 BM25 variants: Lucene (default), Robertson, ATIRE, BM25L, BM25+
- Streaming index: Add documents incrementally, delete by ID, update in-place
- Pre-filtered search: Score only a subset of documents — up to 600x faster
- Memory-mapped storage: Load large indices with minimal RAM via mmap
- Auto-persistence: Point to a directory and the index saves/loads automatically
- Python bindings: Via maturin / PyO3
## Python
### Install
With a local checkout, build and install the bindings with maturin:

```shell
pip install maturin && maturin develop --release
```
### Usage
```python
import bm25x

# Create a persistent index (auto-saves on every mutation)
index = bm25x.BM25("./my_index")
ids = index.add(["the quick brown fox", "jumps over the lazy dog"])

# Search
results = index.search("quick fox", k=10)  # list of (doc_id, score)

# Pre-filtered search — only score a subset of documents
results = index.search("quick fox", k=10, subset=ids)

# Streaming mutations (auto-saved to disk)
index.add(["another document"])
index.delete([ids[0]])
index.update(ids[1], "replacement text")

# Reload later — just point to the same directory
index = bm25x.BM25("./my_index")  # loads existing index, ready to search
```
### Constructor

The constructor takes an optional index directory; when one is given, every mutation is persisted automatically and a later construction with the same path reloads the existing index. Without a directory, the index lives in memory until an explicit `save`.

### Methods
| Method | Description |
|---|---|
| `add(docs) -> list[int]` | Add documents (list of strings); returns assigned indices |
| `search(query, k, subset=None) -> list[tuple[int, float]]` | Top-k search. Pass `subset` to restrict scoring to specific doc IDs |
| `delete(doc_ids)` | Delete documents by their indices |
| `update(doc_id, text)` | Replace a document's text |
| `save(index)` | Explicit save (only needed for in-memory indices) |
| `load(index, mmap=False)` | Load from directory (static method) |
| `len(index)` | Number of active documents |
## Rust
### Add to your project

```toml
[dependencies]
bm25x = "0.1"
```
### Usage
```rust
use bm25x::BM25;

// Create an index
let mut index = BM25::new();

// Add documents (returns assigned indices)
let ids = index.add(&["the quick brown fox", "jumps over the lazy dog"]);

// Search — returns Vec<SearchResult> with .index and .score
let results = index.search("quick fox", 10);
for r in &results {
    println!("doc {}: {:.4}", r.index, r.score);
}

// Pre-filtered search — only score documents in the subset
let results = index.search_filtered("quick fox", 10, &ids);

// Streaming mutations
index.add(&["another document"]);
index.delete(&[ids[0]]);
index.update(ids[1], "replacement text");

// Save / load with mmap
index.save("./my_index").unwrap();
let index = BM25::load("./my_index", true).unwrap(); // mmap = true
```
### API

```rust
// Constructor
BM25::new()

// Documents
index.add(&docs);            // returns assigned indices
index.delete(&doc_ids);
index.update(doc_id, text);

// Search
index.search(query, k);
index.search_filtered(query, k, &subset);

// Persistence
index.save(path);
BM25::load(path, mmap);
```
## Benchmarks
### BEIR SciFact — 5k documents, 300 queries
| Metric | bm25s | bm25x |
|---|---|---|
| NDCG@10 | 0.6617 | 0.6650 |
| Index time | 0.581s | 0.190s (3.1x faster) |
| Search time | 0.031s | 0.011s (2.8x faster) |
### BEIR MS MARCO — 8.8M documents, 6,980 queries

| Metric | bm25s | bm25x |
|---|---|---|
| NDCG@10 | 0.2124 | 0.2186 |
| Index time | 377.9s | 106.6s (3.5x faster) |
| Index throughput | 23,395 docs/s | 82,910 docs/s |
| Search throughput | 16 q/s | 65 q/s (4x faster) |
| Mmap memory delta | 153 MB | 109 MB |
### Pre-filtered search — 100k documents, 1k queries
| Filter size | Throughput | Speedup |
|---|---|---|
| No filter | 592 q/s | baseline |
| 1,000 docs | 4,286 q/s | 7x |
| 100 docs | 52,000 q/s | 88x |
| 10 docs | 366,000 q/s | 618x |
## Design
bm25s pre-computes all BM25 scores at index time (eager scoring). This makes queries fast, but adding or removing documents requires rebuilding the entire index.
bm25x uses lazy scoring — it stores raw term frequencies in an inverted index and computes BM25 scores at query time. This makes the index naturally streaming-friendly: add, delete, and update are cheap operations that don't require a full rebuild.
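The lazy-scoring idea can be sketched in a few lines of Python. This is an independent toy model (the class name, whitespace tokenization, and Lucene-style idf are illustrative), not bm25x's implementation:

```python
from collections import Counter, defaultdict
import math

class LazyBM25:
    """Toy lazy-scoring index: stores only raw term frequencies;
    BM25 scores are computed at query time."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)   # term -> {doc_id: tf}
        self.doc_len = {}                   # doc_id -> token count

    def add(self, docs):
        ids = []
        for text in docs:
            doc_id = len(self.doc_len)
            tokens = text.lower().split()
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
            ids.append(doc_id)
        return ids

    def delete(self, doc_ids):
        # Cheap: drop the doc from each posting list, no rebuild needed.
        for doc_id in doc_ids:
            for plist in self.postings.values():
                plist.pop(doc_id, None)
            self.doc_len.pop(doc_id, None)

    def search(self, query, k):
        n = len(self.doc_len)
        avgdl = sum(self.doc_len.values()) / max(n, 1)
        scores = defaultdict(float)
        for term in query.lower().split():
            plist = self.postings.get(term, {})
            if not plist:
                continue
            df = len(plist)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))  # Lucene-style idf
            for doc_id, tf in plist.items():
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avgdl)
                scores[doc_id] += idf * tf / (tf + norm)
        return sorted(scores.items(), key=lambda x: -x[1])[:k]
```

Because only raw frequencies are stored, `delete` and `update` never touch precomputed scores; idf and length normalization always reflect the current collection.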
Pre-filtered search uses a doc-centric approach: instead of scanning full posting lists, it iterates only the documents in the subset and binary-searches each document's term list for the query terms. The cost becomes O(|subset| * |query_terms| * log n), where n is the number of distinct terms in a document, instead of O(|posting_list|).
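A sketch of the doc-centric lookup, assuming each document keeps its term ids in sorted order; `score_fn` is a placeholder for the per-term BM25 contribution, not part of bm25x's API:

```python
import bisect

def filtered_scores(doc_terms, doc_tfs, subset, query_term_ids, score_fn):
    """doc_terms[d]: sorted term ids of document d; doc_tfs[d]: parallel
    term frequencies. Iterates only the subset and binary-searches each
    document's term list — O(|subset| * |query| * log terms_per_doc)."""
    results = []
    for d in subset:
        terms, tfs = doc_terms[d], doc_tfs[d]
        score = 0.0
        for t in query_term_ids:
            i = bisect.bisect_left(terms, t)
            if i < len(terms) and terms[i] == t:
                score += score_fn(d, t, tfs[i])
        results.append((d, score))
    return sorted(results, key=lambda x: -x[1])
```

For small subsets this skips almost all of the inverted index, which is why the speedup in the benchmark above grows as the filter shrinks.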
On-disk, the index uses a flat binary format with memory-mapped postings and document lengths, keeping RAM usage minimal for large indices.
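The principle can be illustrated with Python's stdlib `mmap`: fixed-width records in a flat file, with the OS faulting in only the pages actually touched. The layout below (little-endian u32 document lengths) is invented for illustration and is not bm25x's actual file format:

```python
import mmap
import struct

def write_doc_lens(path, lens):
    # Flat binary file: one little-endian u32 per document length.
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(lens)}I", *lens))

def open_doc_lens(path):
    # Memory-map the file read-only; no record is loaded until accessed.
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    n = len(mm) // 4
    def get(i):
        # Random access by offset — only the touched page is paged in.
        return struct.unpack_from("<I", mm, i * 4)[0]
    return get, n
```

Random access by offset, with no deserialization step, is what keeps the resident-memory delta small even for multi-million-document indices.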
## License
MIT