Anda-DB BM25 Full-Text Search Library

anda_db_tfs is a full-text search library implementing the BM25 ranking algorithm in Rust. BM25 (Best Matching 25) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It extends the TF-IDF model with term-frequency saturation and document-length normalization.

Features

High Performance: Optimized for speed with parallel processing using Rayon.
Customizable Tokenization: Support for various tokenizers including Chinese text via jieba.
BM25 Ranking: Industry-standard relevance scoring algorithm.
Document Management: Add, remove, and search documents with ease.
Serialization: Save and load indices in CBOR format with optional compression.
Thread-Safe: Designed for concurrent access with read-write locks.
Memory Efficient: Optimized data structures for reduced memory footprint.

Installation

Add this to your Cargo.toml:

[dependencies]
anda_db_tfs = "0.2"

To enable all features, including the tantivy tokenizers and jieba support:

[dependencies]
anda_db_tfs = { version = "0.2", features = ["full"] }

Quick Start

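use anda_db_tfs::{BM25Index, SimpleTokenizer};
use std::time::{SystemTime, UNIX_EPOCH};

// Timestamp in milliseconds passed to mutating calls (assumed here to be Unix time)
let now_ms = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;

// Create a new index with a simple tokenizer
let index = BM25Index::new(SimpleTokenizer::default(), None);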

// Add documents to the index
index.insert(1, "The quick brown fox jumps over the lazy dog", now_ms).unwrap();
index.insert(2, "A fast brown fox runs past the lazy dog", now_ms).unwrap();
index.insert(3, "The lazy dog sleeps all day", now_ms).unwrap();

// Search for documents containing "fox"
let results = index.search("fox", 10);
for (doc_id, score) in results {
    println!("Document {}: score {}", doc_id, score);
}

// Remove a document
index.remove(3, "The lazy dog sleeps all day", now_ms);

// Store the index to a file (the .await calls below must run inside an async context, e.g. a Tokio runtime)
let file = tokio::fs::File::create("index.cbor").await.unwrap();
index.store(file, now_ms).await.unwrap();

// Load the index from a file
let file = tokio::fs::File::open("index.cbor").await.unwrap();
let loaded_index = BM25Index::load(file, SimpleTokenizer::default()).await.unwrap();

Chinese Text Support

With the tantivy-jieba feature enabled, you can use the jieba tokenizer for Chinese text:

use anda_db_tfs::{BM25Index, jieba_tokenizer};

// Create an index with jieba tokenizer
let index = BM25Index::new(jieba_tokenizer(), None);

// Add documents with Chinese text
index.insert(1, "Rust 是一种系统编程语言", now_ms).unwrap();
index.insert(2, "Rust 快速且内存高效，安全、并发、实用", now_ms).unwrap();

// Search for documents
let results = index.search("安全", 10);

Advanced Usage

Custom Tokenizer and BM25 Parameters

use anda_db_tfs::{BM25Index, BM25Params, TokenizerChain};
use tantivy::tokenizer::{LowerCaser, RemoveLongFilter, SimpleTokenizer, Stemmer};

// Create an index with custom BM25 parameters
let params = BM25Params { k1: 1.5, b: 0.75 };
let tokenizer = TokenizerChain::builder(SimpleTokenizer::default())
  .filter(RemoveLongFilter::limit(32))
  .filter(LowerCaser)
  .filter(Stemmer::default())
  .build();
let index = BM25Index::new(tokenizer, Some(params));

Batch Document Processing

use anda_db_tfs::{BM25Index, default_tokenizer};

let index = BM25Index::new(default_tokenizer(), None);

// Prepare multiple documents
let docs = vec![
    (1, "Document one content".to_string()),
    (2, "Document two content".to_string()),
    (3, "Document three content".to_string()),
];

// Add documents in batch
let results = index.insert_batch(docs, now_ms);

API Documentation

BM25Index

The main struct for creating and managing a search index.

// Creates a new index
pub fn new(tokenizer: T, params: Option<BM25Params>) -> Self

// Gets the number of documents in the index
pub fn len(&self) -> usize

// Checks if the index is empty
pub fn is_empty(&self) -> bool

/// Returns the index update version
pub fn version(&self) -> u64

/// Returns the index metadata
pub fn metadata(&self) -> IndexMetadata

/// Gets current statistics about the index
pub fn stats(&self) -> IndexStats

// Adds a document to the index
pub fn insert(&self, id: u64, text: &str, now_ms: u64) -> Result<(), BM25Error>

// Adds multiple documents to the index
pub fn insert_batch(&self, docs: Vec<(u64, String)>, now_ms: u64) -> Vec<Result<(), BM25Error>>

// Removes a document from the index
pub fn remove(&self, id: u64, text: &str, now_ms: u64) -> bool

// Searches the index
pub fn search(&self, query: &str, top_k: usize) -> Vec<(u64, f32)>

/// Searches the index for documents matching the query expression,
/// which can include boolean operators (AND, OR, NOT).
pub fn search_advanced(&self, query: &str, top_k: usize) -> Vec<(u64, f32)>

// Stores the index without postings to a writer
pub async fn store<W: AsyncWrite>(&self, w: W, now_ms: u64) -> Result<(), BM25Error>

// Stores the index with postings to a writer
pub async fn store_all<W: AsyncWrite>(&self, w: W, now_ms: u64) -> Result<(), BM25Error>

// Loads the index from a reader
pub async fn load<R: AsyncRead>(r: R, tokenizer: T) -> Result<Self, BM25Error>

BM25Params

Parameters for the BM25 ranking algorithm.

pub struct BM25Params {
    // Controls term frequency saturation
    pub k1: f32,
    // Controls document length normalization
    pub b: f32,
}

Default values: k1 = 1.2, b = 0.75
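To make the roles of k1 and b concrete, here is a minimal sketch of the textbook BM25 score for a single query term. The function name bm25_term_score and its arguments are illustrative assumptions, and the smoothed IDF shown is one common variant; the library's internal formula may differ in detail.

```rust
/// Textbook BM25 score of one query term for one document.
/// Illustrative sketch only: anda_db_tfs may use a different IDF variant.
fn bm25_term_score(
    tf: f32,          // occurrences of the term in the document
    doc_len: f32,     // number of tokens in the document
    avg_doc_len: f32, // average number of tokens per document
    n_docs: f32,      // total number of documents in the index
    doc_freq: f32,    // number of documents containing the term
    k1: f32,
    b: f32,
) -> f32 {
    // Smoothed inverse document frequency: rarer terms weigh more.
    let idf = ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    // b scales the penalty applied to documents longer than average.
    let length_norm = 1.0 - b + b * doc_len / avg_doc_len;
    // k1 limits how much repeated occurrences of the term can add.
    idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
}
```

With b = 0 the document length has no effect on the score, and with larger k1 the score keeps growing for longer as the term frequency rises. The score of a multi-term query is the sum of the per-term scores.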

Error Handling

The library uses a custom error type BM25Error for various error conditions:

BM25Error::Db: Database-related errors.
BM25Error::Cbor: Serialization/deserialization errors.
BM25Error::AlreadyExists: When trying to add a document with an ID that already exists.
BM25Error::TokenizeFailed: When tokenization produces no tokens for a document.

Performance Considerations

For large documents, the library automatically uses parallel processing for tokenization.
The search function uses parallel processing for query terms.
For best performance with large indices, consider using SSD storage for serialized indices.
Memory usage scales with the number of documents and unique terms.
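The last point can be illustrated with a toy inverted index. This is a sketch under assumptions: ToyIndex and its fields are hypothetical and do not reflect the library's internal layout, and tokenization here is a plain lowercase whitespace split. It shows why memory grows with both the number of unique terms (one map entry each) and the number of documents (one posting per distinct term in each document).

```rust
use std::collections::HashMap;

/// Toy inverted index: term -> list of (document id, term frequency).
/// Hypothetical type for illustration; not the library's internal layout.
#[derive(Default)]
struct ToyIndex {
    postings: HashMap<String, Vec<(u64, u32)>>,
    doc_lens: HashMap<u64, u32>,
}

impl ToyIndex {
    fn insert(&mut self, id: u64, text: &str) {
        // Count occurrences of each lowercase, whitespace-separated token.
        let mut counts: HashMap<String, u32> = HashMap::new();
        let mut len = 0;
        for token in text.split_whitespace() {
            *counts.entry(token.to_lowercase()).or_insert(0) += 1;
            len += 1;
        }
        self.doc_lens.insert(id, len);
        // Append one posting per distinct term in this document.
        for (term, tf) in counts {
            self.postings.entry(term).or_default().push((id, tf));
        }
    }

    /// Number of unique terms (one map entry each).
    fn unique_terms(&self) -> usize {
        self.postings.len()
    }

    /// Total postings: one per distinct (term, document) pair.
    fn total_postings(&self) -> usize {
        self.postings.values().map(Vec::len).sum()
    }
}
```

Inserting the three Quick Start documents into this toy structure and comparing unique_terms() with total_postings() shows how shared vocabulary adds postings without adding new map entries.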

License

ldclabs/anda-db is licensed under the MIT License. See LICENSE for the full license text.

anda_db_tfs 0.2.1