Anda-DB BM25 Full-Text Search Library
anda_db_tfs is a full-text search library implementing the BM25 ranking algorithm in Rust. BM25 (Best Matching 25) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It extends the TF-IDF model with term-frequency saturation and document-length normalization.
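For reference, the commonly cited form of the BM25 score is sketched below. This is the textbook formulation, not a statement of the exact variant anda_db_tfs implements (the IDF smoothing in particular differs between implementations). Here f(q, D) is the frequency of term q in document D, |D| is the document length in tokens, avgdl is the average document length in the index, N is the number of indexed documents, and n(q) is the number of documents containing q.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q) = \ln\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
```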
Features
- High Performance: Optimized for speed with parallel processing using Rayon.
- Customizable Tokenization: Support for various tokenizers including Chinese text via jieba.
- BM25 Ranking: Industry-standard relevance scoring algorithm.
- Serialization: Save and load indices in CBOR format with optional compression.
- Incremental Persistence: Supports persisting incremental index updates (insertions and deletions).
- Thread-safe Concurrent Access: Safely use the index from multiple threads.
Installation
Add this to your Cargo.toml:
[dependencies]
anda_db_tfs = "0.4"
For full features including tantivy tokenizers and jieba support:
[dependencies]
anda_db_tfs = { version = "0.4", features = ["full"] }
Quick Start
use ;
use ;
// Create a new index with a simple tokenizer
let index = new;
// Add documents to the index
index.insert.unwrap;
index.insert.unwrap;
index.insert.unwrap;
// Search for documents containing "fox"
let results = index.search;
for in results
// Remove a document
index.remove;
// Store the index
// Load the index from a file
let metadata = open?;
let loaded_index = load_all
.await?;
println!;
Chinese Text Support
With the tantivy-jieba feature enabled, you can use the jieba tokenizer for Chinese text:
use ;
// Create an index with jieba tokenizer
let index = new;
// Add documents with Chinese text
index.insert.unwrap;
index.insert.unwrap;
// Search for documents
let results = index.search;
Advanced Usage
Custom Tokenizer and BM25 Parameters
use ;
use ;
// Create an index with custom BM25 parameters
let params = BM25Config ;
let index_name = "my_custom_index".to_string;
let tokenizer = builder
.filter
.filter
.filter
.build;
let index = new;
API Documentation
BM25Config
Parameters for the BM25 ranking algorithm.
Default values: k1 = 1.2, b = 0.75
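To illustrate what the two parameters control, here is a minimal, standalone sketch of a per-term BM25 score. It is not the library's code: `Bm25Params`, `idf`, and `term_score` are hypothetical names introduced for illustration, and the IDF formula is the common Lucene-style variant, which may differ from the one the library uses. Only the default values k1 = 1.2 and b = 0.75 come from this document.

```rust
/// Hypothetical stand-in for the BM25 parameters (not the library's `BM25Config`).
#[derive(Clone, Copy, Debug)]
struct Bm25Params {
    /// Term-frequency saturation: higher values let repeated terms count for more.
    k1: f32,
    /// Length normalization: 0.0 ignores document length, 1.0 normalizes fully.
    b: f32,
}

impl Default for Bm25Params {
    fn default() -> Self {
        // Defaults as documented above.
        Self { k1: 1.2, b: 0.75 }
    }
}

/// Inverse document frequency (assumed Lucene-style variant, always positive).
fn idf(total_docs: usize, docs_with_term: usize) -> f32 {
    let n = total_docs as f32;
    let df = docs_with_term as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Score contribution of a single query term for a single document.
fn term_score(p: Bm25Params, tf: f32, doc_len: f32, avg_doc_len: f32, idf: f32) -> f32 {
    // Documents longer than average get a larger normalization factor.
    let norm = 1.0 - p.b + p.b * (doc_len / avg_doc_len);
    idf * (tf * (p.k1 + 1.0)) / (tf + p.k1 * norm)
}
```

With these definitions, a document of average length containing the term once contributes exactly the term's IDF. A longer document with the same term count scores lower, and setting b to 0.0 removes the length effect entirely. As the term count grows, the contribution approaches but never reaches idf * (k1 + 1).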
Error Handling
The library uses a custom error type BM25Error for various error conditions:
- BM25Error::Generic: Index-related errors.
- BM25Error::Serialization: CBOR serialization/deserialization errors.
- BM25Error::NotFound: Error when a token is not found.
- BM25Error::AlreadyExists: When trying to add a document with an ID that already exists.
- BM25Error::TokenizeFailed: When tokenization produces no tokens for a document.
Performance Considerations
- For large documents, the library automatically uses parallel processing for tokenization.
- The search function uses parallel processing for query terms.
- For best performance with large indices, consider using SSD storage for serialized indices.
- Memory usage scales with the number of documents and unique terms.
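The memory point can be made concrete with a toy inverted index. This is a sketch under stated assumptions, not the library's internal layout: `ToyIndex` and its methods are hypothetical, and the tokenizer is a naive split on non-alphanumeric characters. It shows that storage grows with both the number of unique terms (one map entry per term) and the number of term/document pairs (one posting per pair), plus one length entry per document.

```rust
use std::collections::HashMap;

/// Toy inverted index (hypothetical): term -> (document id -> term frequency).
#[derive(Default)]
struct ToyIndex {
    postings: HashMap<String, HashMap<u64, u32>>,
    /// Token count per document, needed for BM25 length normalization.
    doc_lens: HashMap<u64, u32>,
}

impl ToyIndex {
    /// Split on non-alphanumeric characters, lowercase, and record term counts.
    fn insert(&mut self, id: u64, text: &str) {
        let mut len = 0;
        for token in text
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
        {
            *self
                .postings
                .entry(token.to_lowercase())
                .or_default()
                .entry(id)
                .or_insert(0) += 1;
            len += 1;
        }
        self.doc_lens.insert(id, len);
    }

    /// Number of distinct terms across all documents.
    fn unique_terms(&self) -> usize {
        self.postings.len()
    }

    /// Number of (term, document) pairs stored.
    fn posting_entries(&self) -> usize {
        self.postings.values().map(|docs| docs.len()).sum()
    }
}
```

Inserting a document whose words are already in the index adds postings but no new terms, while a document with new vocabulary grows both counts.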
License
Copyright © 2026 LDC Labs.
ldclabs/anda-db is licensed under the MIT License. See LICENSE for the full license text.