langchainrust 0.2.13

BM25 algorithm documentation

BM25 stands for Best Matching 25, a ranking function for text retrieval.
BM25 was developed by Stephen Robertson, Karen Spärck Jones, and colleagues.
BM25 builds on the earlier BM11 and BM15 weighting functions of the Okapi retrieval system.
BM25 is widely used in search engines and information retrieval systems.

The BM25 formula calculates document relevance scores.
The score depends on term frequency in the document.
The score also considers inverse document frequency (IDF).
Document length normalization is applied through parameter b.
Term frequency saturation is controlled by parameter k1.
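
Written out in full, the commonly cited form of the score sums one contribution per query term q for a document D. The notation here is an assumption chosen to match the component formulas in this document: tf(q, D) is the term frequency, dl the document length, and avgdl the average document length.

```latex
\[
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
\frac{\mathrm{tf}(q, D) \cdot (k_1 + 1)}
     {\mathrm{tf}(q, D) + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}
\]
```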

BM25 formula components explained

IDF stands for inverse document frequency.
IDF measures how rare or common a term is across all documents.
IDF formula: log((N - n + 0.5) / (n + 0.5) + 1).
N is the total number of documents in the collection.
n is the number of documents containing the term.
Rare terms have higher IDF values.
Common terms have lower IDF values.
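
The IDF formula above can be sketched as a small Rust function. The function and parameter names are illustrative assumptions, not part of any crate API, and the natural logarithm is assumed for log.

```rust
/// Smoothed BM25 inverse document frequency (natural logarithm assumed).
/// `total_docs` is N and `docs_with_term` is n; both names are illustrative.
fn idf(total_docs: usize, docs_with_term: usize) -> f64 {
    let n_total = total_docs as f64;
    let n_term = docs_with_term as f64;
    // The "+ 1" inside the logarithm keeps the value positive even when
    // the term appears in every document.
    ((n_total - n_term + 0.5) / (n_term + 0.5) + 1.0).ln()
}
```

With 1000 documents, a term found in a single document receives a much larger IDF than a term found in 500 of them, which is the rare-versus-common behavior described above.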

TF stands for term frequency.
TF counts how many times a term appears in a document.
BM25 normalizes TF to prevent long documents from dominating.
The normalization uses document length and average document length.
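The normalized TF formula: (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl/avgdl)).
dl is the length of the document and avgdl is the average document length in the collection.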
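
A minimal sketch of the normalized term frequency in Rust, assuming document lengths are measured in tokens. The function name and parameter names are illustrative, not taken from any library.

```rust
/// Length-normalized, saturating term-frequency component of BM25.
/// All names are illustrative; lengths are assumed to be token counts.
fn tf_component(tf: f64, doc_len: f64, avg_doc_len: f64, k1: f64, b: f64) -> f64 {
    // Values above 1.0 penalize documents longer than average,
    // values below 1.0 favor shorter ones.
    let length_norm = 1.0 - b + b * doc_len / avg_doc_len;
    tf * (k1 + 1.0) / (tf + k1 * length_norm)
}
```

For a document of exactly average length, a single occurrence of the term yields 1.0 regardless of k1 and b, and the value never exceeds k1 + 1 however often the term repeats.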

Parameter k1 controls term frequency saturation.
k1 typically ranges from 1.2 to 2.0.
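Higher k1 means slower saturation, so repeated occurrences of a term keep adding to the score for longer.
Lower k1 means the score levels off after only a few occurrences.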
Default k1 value is often 1.5.
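
One way to see the effect of k1 is to measure how close the term-frequency weight gets to its ceiling of k1 + 1. The sketch below assumes length normalization is switched off (b = 0), and the function name is illustrative.

```rust
/// Fraction of the maximum term-frequency weight (k1 + 1) reached at `tf`,
/// assuming b = 0 so that document length plays no role.
fn saturation_fraction(tf: f64, k1: f64) -> f64 {
    let weight = tf * (k1 + 1.0) / (tf + k1);
    weight / (k1 + 1.0)
}
```

At tf = 3, k1 = 1.2 reaches about 0.71 of the ceiling while k1 = 2.0 reaches 0.60, so the smaller k1 saturates sooner.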

Parameter b controls document length normalization.
b typically ranges from 0 to 1.
Setting b to 0 disables length normalization.
Setting b to 1 applies full length normalization.
Default b value is often 0.75.
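
Putting IDF, TF, k1, and b together, a per-document score could be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: `SketchParams` and `bm25_score` are hypothetical names, documents and queries are assumed to be already tokenized, and `doc_freq` is assumed to map each term to the number of documents containing it.

```rust
use std::collections::HashMap;

/// Hypothetical parameter bundle for this sketch (not the crate's `BM25Params`).
struct SketchParams {
    k1: f64,
    b: f64,
}

/// Scores one tokenized document against a tokenized query.
/// `doc_freq` maps a term to the number of documents that contain it.
fn bm25_score(
    query: &[&str],
    doc: &[&str],
    doc_freq: &HashMap<String, usize>,
    total_docs: usize,
    avg_doc_len: f64,
    params: &SketchParams,
) -> f64 {
    let doc_len = doc.len() as f64;
    let length_norm = 1.0 - params.b + params.b * doc_len / avg_doc_len;
    query
        .iter()
        .map(|term| {
            // Term frequency: occurrences of the query term in this document.
            let tf = doc.iter().filter(|t| *t == term).count() as f64;
            // Document frequency: unseen terms are treated as n = 0.
            let n = doc_freq.get(*term).copied().unwrap_or(0) as f64;
            let idf = ((total_docs as f64 - n + 0.5) / (n + 0.5) + 1.0).ln();
            idf * tf * (params.k1 + 1.0) / (tf + params.k1 * length_norm)
        })
        .sum()
}
```

With k1 = 1.5 and b = 0.75, a short document containing the query term scores higher than a longer document with the same term count, and a query term absent from the document contributes zero.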

BM25 advantages and use cases

BM25 works well for keyword-based search.
BM25 does not require embeddings or neural networks.
BM25 is computationally efficient because scoring needs only term counts and document lengths.
BM25 handles rare and common terms appropriately.
BM25 works well for exact term matching.

BM25 limitations and considerations

BM25 does not capture semantic similarity.
BM25 may miss synonyms and related concepts.
BM25 requires proper tokenization for best results.
BM25 may need tuning for specific domains.
BM25 can be combined with vector search for hybrid retrieval.
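
One common way to combine a BM25 ranking with a vector-search ranking is reciprocal rank fusion (RRF). The source does not say which fusion method is used, so the sketch below is only one option. The function name and the document ids are hypothetical, and the smoothing constant k is left to the caller (60 is a frequently used value).

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion: every ranked list adds 1 / (k + rank) to a
/// document's fused score, where rank starts at 1 for the top result.
fn reciprocal_rank_fusion(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut fused: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (position, doc_id) in ranking.iter().enumerate() {
            let rank = (position + 1) as f64;
            *fused.entry((*doc_id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut results: Vec<(String, f64)> = fused.into_iter().collect();
    // Highest fused score first; ties are broken by document id for determinism.
    results.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    results
}
```

RRF uses only rank positions, so BM25 scores and vector similarities never need to be put on a common scale.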

BM25 implementation in LangChainRust

BM25Retriever provides BM25 search functionality.
BM25Index stores documents and builds term frequency tables.
Tokenizer handles text segmentation for Chinese and English.
BM25Params allows customization of k1 and b parameters.
The implementation supports both Chinese and English text.
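
To illustrate why tokenization matters for mixed Chinese and English text, here is a simplified tokenizer sketch. It is not the crate's Tokenizer, whose behavior is not described here. The function names are hypothetical, and treating every CJK ideograph as its own token is a simplifying assumption; real Chinese segmentation usually relies on a dictionary-based segmenter.

```rust
/// Returns true for characters in the CJK Unified Ideographs block.
fn is_cjk(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Lowercases alphanumeric runs into word tokens and emits each CJK
/// ideograph as a single-character token (a deliberate simplification).
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            // Flush any pending word before emitting the ideograph.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            current.extend(c.to_lowercase());
        } else if !current.is_empty() {
            // Whitespace or punctuation ends the current word.
            tokens.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Query text and document text must pass through the same tokenizer, otherwise term frequencies and document frequencies are counted over mismatched terms.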