# bm25

[![Docs](https://docs.rs/bm25/badge.svg)](https://docs.rs/bm25/)
[![Crates.io Version](https://img.shields.io/crates/v/bm25)](https://crates.io/crates/bm25)
[![Crates.io Total Downloads](https://img.shields.io/crates/d/bm25)](https://crates.io/crates/bm25)

A Rust crate for everything [BM25](https://en.wikipedia.org/wiki/Okapi_BM25). This crate provides
utilities at three levels of abstraction:
1. **BM25 Embedder**: Embeds text into a sparse vector space for information retrieval. You can use
    these embeddings with vector databases such as Qdrant, Pinecone, or Milvus.
2. **BM25 Scorer**: Efficiently scores the relevance of a query embedding against document embeddings.
3. **BM25 Search Engine**: A fast, lightweight, in-memory full-text search engine built on top of
    the embedder and scorer.

## Features

- Fast
- Language-detecting tokenizer using industry-standard NLP techniques
- Parallelism for fast batch-fitting
- Full access to BM25 parameters
- Modular and customisable
- Configurable via compile-time features

## The BM25 algorithm 

[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is an algorithm for scoring the relevance of a
query to documents in a corpus. You can make this scoring more efficient by pre-computing a
'sparse embedding' of each document. You can use these sparse embeddings directly, or upload them
to a vector database and query them from there.

BM25 assumes that you know the average (meaningful) word count of your documents ahead of time. This
crate provides utilities to compute this. If this assumption doesn't hold for your use-case, you
have two options: (1) make a sensible guess (e.g. based on a sample); or (2) configure the algorithm
to disregard document length. The former is recommended if most of your documents are around the
same size.
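
As a rough starting point, you can estimate `avgdl` from a sample yourself. The helper below is a
hypothetical sketch (not part of the crate) that averages whitespace-separated word counts; the
crate's own fitting also strips punctuation and stop words, so treat the result as a guess rather
than an exact `avgdl`.

```rust
/// Hypothetical helper: estimate avgdl from a sample of documents by
/// averaging whitespace-separated word counts. This over-counts compared
/// to the crate's notion of 'meaningful' words (which excludes stop words
/// and punctuation), so use the result only as a starting guess.
fn estimate_avgdl(sample: &[&str]) -> f32 {
    let total_words: usize = sample
        .iter()
        .map(|doc| doc.split_whitespace().count())
        .sum();
    total_words as f32 / sample.len() as f32
}

fn main() {
    let sample = [
        "The sky blushed pink as the sun dipped below the horizon.",
        "A single drop of rain fell, followed by a thousand more.",
    ];
    // Both sentences contain 11 whitespace-separated words.
    assert_eq!(estimate_avgdl(&sample), 11.0);
}
```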

BM25 has three parameters: `b`, `k1` and `avgdl`. These names match the formula given on
Wikipedia. `avgdl` ('average document length') is the aforementioned average meaningful word count;
you should always provide a value for this, and the crate can fit it for you. `b` controls
document length normalisation: `0` means no normalisation (length will not affect the score) while
`1` means full normalisation. If you know `avgdl`, `0.75` is typically a good choice for `b`. If
you're guessing `avgdl`, use a slightly lower `b` to reduce the effect of document length on the
score. If you have no idea what `avgdl` is, set `b` to `0`. `k1` controls how much weight is given
to recurring tokens; for almost all use-cases, a value of `1.2` is suitable.
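
For reference, these parameters slot into the BM25 scoring formula as given on Wikipedia, where
`f(q_i, D)` is the frequency of query token `q_i` in document `D` and `|D|` is the meaningful word
count of `D`:

```math
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

You can see from the denominator why `b = 0` makes `avgdl` irrelevant: the length-normalisation
term collapses to `1`.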

## Getting started

Add `bm25` to your project with

```sh
cargo add bm25
```

Depending on your use-case, you may want to read more about the [Embedder](#embed), [Scorer](#score)
or [SearchEngine](#search).

### Embed

The best way to embed some text is to fit an embedder to your corpus. 
```rust
use bm25::{Embedder, EmbedderBuilder, Embedding, TokenEmbedding, Language};

let corpus = [
    "The sky blushed pink as the sun dipped below the horizon.",
    "Apples, oranges, papayas, and more papayas.",
    "She found a forgotten letter tucked inside an old book.",
    "A single drop of rain fell, followed by a thousand more.",
];

let embedder: Embedder = EmbedderBuilder::with_fit_to_corpus(Language::English, &corpus).build();

assert_eq!(embedder.avgdl(), 5.75);

let embedding = embedder.embed(corpus[1]);

assert_eq!(
    embedding,
    Embedding(vec![
        TokenEmbedding {
            index: 1777144781,
            value: 1.1422123,
        },
        TokenEmbedding {
            index: 3887370161,
            value: 1.1422123,
        },
        TokenEmbedding {
            index: 2177600299,
            value: 1.5037148,
        },
    ])
)
```

#### BM25 parameters

For cases where you don't have the full corpus ahead of time, but have an approximate idea of the
average meaningful word count you expect, you can construct an embedder with your `avgdl` guess.

```rust
use bm25::{Embedder, EmbedderBuilder};

let embedder: Embedder = EmbedderBuilder::with_avgdl(7.0)
    .build();
```

If you want to disregard document length altogether, set `b` to 0.

```rust
use bm25::{Embedder, EmbedderBuilder};

let embedder: Embedder = EmbedderBuilder::with_avgdl(1.0)
    .b(0.0) // if b = 0, avgdl has no effect
    .build();
```

#### Language

By default, the embedder uses an English `DefaultTokenizer`. If you are working with a different
language, you can configure the embedder to tokenize accordingly.

```rust
use bm25::{Embedder, EmbedderBuilder, Language};

let embedder: Embedder = EmbedderBuilder::with_avgdl(256.0)
    .language_mode(Language::German)
    .build();
```

If your corpus is multilingual, or you don't know the language ahead of time, you can enable the
`language_detection` feature.

```sh
cargo add bm25 --features language_detection
```

This unlocks the `LanguageMode::Detect` variant. In this mode, the tokenizer will try to detect
the language of each piece of input text before tokenizing. Note that there is a small performance
overhead when embedding in this mode.

```rust
use bm25::{Embedder, EmbedderBuilder, LanguageMode};

let embedder: Embedder = EmbedderBuilder::with_avgdl(64.0)
    .language_mode(LanguageMode::Detect)
    .build();
```

#### Tokenizer

The embedder uses a tokenizer to convert text into a sequence of tokens to embed. The default
tokenizer detects language, normalises unicode, splits on whitespace and punctuation, removes stop
words, and stems the remaining words. You can customise its behaviour using the builder.

```rust
use bm25::{DefaultTokenizer, Language, Tokenizer};

let tokenizer = DefaultTokenizer::builder()
    .language_mode(Language::English)
    .normalization(true) // Normalize unicode (e.g., 'é' -> 'e', '🍕' -> 'pizza', etc.)
    .stopwords(false) // Remove common words with little meaning (e.g., 'the', 'and', etc.)
    .stemming(false) // Reduce words to their root form (e.g., 'running' -> 'run')
    .build();

let text = "Slice of 🍕";

let tokens = tokenizer.tokenize(text);

assert_eq!(tokens, vec!["slice", "of", "pizza"]);
```

While this works well for most languages and use-cases, this crate makes it easy for you to provide
your own tokenizer. All you have to do is implement the `Tokenizer` trait.

```rust
use bm25::{EmbedderBuilder, Embedding, Tokenizer};

#[derive(Default)]
struct MyTokenizer {}

// Tokenize on occurrences of "T"
impl Tokenizer for MyTokenizer {
    fn tokenize(&self, input_text: &str) -> Vec<String> {
        input_text
            .split("T")
            .filter(|s| !s.is_empty())
            .map(str::to_string)
            .collect()
    }
}

let embedder = EmbedderBuilder::<u32, MyTokenizer>::with_avgdl(1.0).build();

let embedding = embedder.embed("CupTofTtea");

assert_eq!(
    embedding.indices().cloned().collect::<Vec<_>>(),
    vec![3568447556, 3221979461, 415655421]
);
```

If you're not using the `DefaultTokenizer` at all, you can disable the `default_tokenizer` feature
to remove some dependencies from your project.

```sh
cargo add bm25 --no-default-features
```

#### Embedding space

You can customise the dimensionality of your sparse vector via the generic parameter. Supported
values are `usize`, `u32` and `u64`. You can also use your own type (and inject your own embedding
function) by implementing the `TokenEmbedder` trait.

```rust
use bm25::{EmbedderBuilder, TokenEmbedder};

let text = "cup of tea";

// Embed into a u32-dimensional space
let embedder = EmbedderBuilder::<u32>::with_avgdl(2.0).build();
let embedding = embedder.embed(text);
assert_eq!(
    embedding.indices().cloned().collect::<Vec<_>>(),
    [2070875659, 415655421]
);

// Embed into a u64-dimensional space
let embedder = EmbedderBuilder::<u64>::with_avgdl(2.0).build();
let embedding = embedder.embed(text);
assert_eq!(
    embedding.indices().cloned().collect::<Vec<_>>(),
    [3288102823240002853, 7123809554392261272]
);

#[derive(Eq, PartialEq, Hash, Clone, Debug)]
struct MyType(u32);
impl TokenEmbedder for MyType {
    type EmbeddingSpace = Self;
    fn embed(_token: &str) -> Self {
        MyType(42)
    }
}

// Embed into a MyType-dimensional space
let embedder = EmbedderBuilder::<MyType>::with_avgdl(2.0).build();
let embedding = embedder.embed(text);
assert_eq!(
    embedding.indices().cloned().collect::<Vec<_>>(),
    [MyType(42), MyType(42)]
);
```

### Score

This crate provides a BM25 scorer that can efficiently score the relevance of a query embedding to
document embeddings. The scorer manages the complexity of maintaining token frequencies and indexes,
as well as the actual scoring.

```rust
use bm25::{Embedder, EmbedderBuilder, Language, Scorer, ScoredDocument};

let corpus = [
    "The sky blushed pink as the sun dipped below the horizon.",
    "She found a forgotten letter tucked inside an old book.",
    "Apples, oranges, pink grapefruits, and more pink grapefruits.",
    "A single drop of rain fell, followed by a thousand more.",
];
let query = "pink";

let mut scorer = Scorer::<usize>::new();

let embedder: Embedder =
    EmbedderBuilder::with_fit_to_corpus(Language::English, &corpus).build();

for (i, document) in corpus.iter().enumerate() {
    let document_embedding = embedder.embed(document);
    scorer.upsert(&i, document_embedding);
}

let query_embedding = embedder.embed(query);

let score = scorer.score(&0, &query_embedding);
assert_eq!(score, Some(0.36260858));

let matches = scorer.matches(&query_embedding);
assert_eq!(
    matches,
    vec![
        ScoredDocument {
            id: 2,
            score: 0.4960082
        },
        ScoredDocument {
            id: 0,
            score: 0.36260858
        }
    ]
);
```

### Search

This crate includes a lightweight, in-memory full-text search engine built on top of the embedder
and scorer.

```rust
use bm25::{Document, Language, SearchEngineBuilder, SearchResult};

let corpus = [
    "The rabbit munched the orange carrot.",
    "The snake hugged the green lizard.",
    "The hedgehog impaled the orange orange.",
    "The squirrel buried the brown nut.",
];

let search_engine = SearchEngineBuilder::<u32>::with_corpus(Language::English, corpus).build();

let limit = 3;
let search_results = search_engine.search("orange", limit);

assert_eq!(
    search_results,
    vec![
        SearchResult {
            document: Document {
                id: 2,
                contents: String::from("The hedgehog impaled the orange orange."),
            },
            score: 0.4904281,
        },
        SearchResult {
            document: Document {
                id: 0,
                contents: String::from("The rabbit munched the orange carrot."),
            },
            score: 0.35667497,
        },
    ]
);
```

You can construct a search engine with documents (allowing you to customise the id type and
value), or with an average document length.

```rust
use bm25::{Document, Language, SearchEngineBuilder};

// Build a search engine from documents
let search_engine = SearchEngineBuilder::<&str>::with_documents(
    Language::English,
    [
        Document {
            id: "Guacamole",
            contents: String::from("avocado, lime juice, salt, onion, tomatoes, coriander."),
        },
        Document {
            id: "Hummus",
            contents: String::from("chickpeas, tahini, olive oil, garlic, lemon juice, salt."),
        },
    ],
)
.build();

// Build a search engine from avgdl
let search_engine = SearchEngineBuilder::<u32>::with_avgdl(128.0)
    .build();
```

You can upsert or remove documents from the search engine. Note that upserting or removing
documents changes the true average document length of the corpus, while the engine keeps the
`avgdl` it was configured with. The more the configured `avgdl` drifts from the true value, the
less accurate the BM25 scores will be.

```rust
use bm25::{Document, SearchEngineBuilder};

let mut search_engine = SearchEngineBuilder::<u32>::with_avgdl(10.0)
    .build();

let document_id = 42;
let document = Document {
    id: document_id,
    contents: String::from(
        "A breeze carried the scent of blooming jasmine through the open window.",
    ),
};

search_engine.upsert(document.clone());
assert_eq!(search_engine.get(&document_id), Some(document));

search_engine.remove(&document_id);
assert_eq!(search_engine.get(&document_id), None);
```

### Working with a large corpus

If your corpus is large, fitting an embedder can be slow. Fortunately, you can trivially
parallelise this via the `parallelism` feature, which implements data parallelism using
[Rayon](https://crates.io/crates/rayon).

```sh
cargo add bm25 --features parallelism
```

## License

[MIT License](https://github.com/Michael-JB/bm25/blob/main/LICENSE)