anda_db_tfs: A High-Performance Full-Text Search Library in Rust
anda_db_tfs is a full-text search library implementing the BM25 ranking algorithm in Rust. BM25 (Best Matching 25) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It's an extension of the TF-IDF model.
Features
- High Performance: Optimized for speed with parallel processing using Rayon.
- Customizable Tokenization: Support for various tokenizers including Chinese text via jieba.
- BM25 Ranking: Industry-standard relevance scoring algorithm.
- Document Management: Add, remove, and search documents with ease.
- Serialization: Save and load indices in CBOR format with optional compression.
- Thread-Safe: Designed for concurrent access with read-write locks.
- Memory Efficient: Optimized data structures for reduced memory footprint.
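To make the thread-safety point concrete, here is a minimal sketch of an inverted index guarded by a read-write lock, using only the standard library. `ToyIndex` and its methods are hypothetical names invented for this illustration; they are not part of anda_db_tfs and do not reflect its internals.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

/// Toy inverted index: term -> set of document IDs, behind a read-write lock.
/// Hypothetical illustration only, not a type from anda_db_tfs.
pub struct ToyIndex {
    postings: RwLock<HashMap<String, HashSet<u64>>>,
}

impl ToyIndex {
    pub fn new() -> Self {
        ToyIndex {
            postings: RwLock::new(HashMap::new()),
        }
    }

    /// Writers take the exclusive lock while updating postings.
    pub fn add(&self, id: u64, text: &str) {
        let mut postings = self.postings.write().unwrap();
        for term in text.split_whitespace() {
            postings.entry(term.to_lowercase()).or_default().insert(id);
        }
    }

    /// Readers take the shared lock, so several lookups can run at once.
    pub fn lookup(&self, term: &str) -> Vec<u64> {
        let postings = self.postings.read().unwrap();
        let mut ids: Vec<u64> = postings
            .get(&term.to_lowercase())
            .map(|set| set.iter().copied().collect())
            .unwrap_or_default();
        ids.sort_unstable();
        ids
    }
}
```

Because all methods take `&self`, a single `ToyIndex` can be shared across threads by reference or through an `Arc`.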
Installation
Add this to your Cargo.toml:
[dependencies]
anda_db_tfs = "0.1.0"
For full features including tantivy tokenizers and jieba support:
[dependencies]
anda_db_tfs = { version = "0.1.0", features = ["full"] }
Quick Start
use anda_db_tfs::{BM25Index, /* ... */};
// Create a new index with a simple tokenizer
let index = BM25Index::new(/* ... */);
// Add documents to the index
index.add_document(/* ... */).unwrap();
index.add_document(/* ... */).unwrap();
index.add_document(/* ... */).unwrap();
// Search for documents containing "fox"
let results = index.search(/* ... */);
for /* ... */ in results { /* ... */ }
// Remove a document
index.remove_document(/* ... */);
// Save the index to a file
let file = File::create(/* ... */).unwrap();
index.save(/* ... */).unwrap();
// Load the index from a file
let file = File::open(/* ... */).unwrap();
let loaded_index = BM25Index::load(/* ... */).unwrap();
Chinese Text Support
With the tantivy-jieba feature enabled, you can use the jieba tokenizer for Chinese text:
use anda_db_tfs::{BM25Index, /* ... */};
// Create an index with jieba tokenizer
let index = BM25Index::new(/* ... */);
// Add documents with Chinese text
index.add_document(/* ... */).unwrap();
index.add_document(/* ... */).unwrap();
// Search for documents
let results = index.search(/* ... */);
Advanced Usage
Custom Tokenizer and BM25 Parameters
use ;
use ;
// Create an index with custom BM25 parameters
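let params = BM25Params { k1: /* ... */, b: /* ... */ };
let tokenizer = /* ... */::builder(/* ... */)
    .filter(/* ... */)
    .filter(/* ... */)
    .filter(/* ... */)
    .build();
let index = BM25Index::new(/* ... */).with_params(params);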
Batch Document Processing
use anda_db_tfs::{BM25Index, /* ... */};
let index = BM25Index::new(/* ... */);
// Prepare multiple documents
let docs = vec![/* ... */];
// Add documents in batch
let results = index.add_documents(docs);
API Documentation
BM25Index
The main struct for creating and managing a search index.
// Create a new index
BM25Params
Parameters for the BM25 ranking algorithm.
Default values: k1 = 1.2, b = 0.75
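To show what k1 and b control, the following is a minimal sketch of the textbook BM25 term score. It is not taken from the crate's source: the `Params` struct, the function names, and the choice of IDF variant (the common `ln(1 + (N - n + 0.5) / (n + 0.5))` form) are assumptions made for illustration.

```rust
/// Hypothetical parameter struct for this sketch. The field names follow the
/// k1 and b values quoted above, not necessarily the crate's own definition.
pub struct Params {
    pub k1: f32,
    pub b: f32,
}

/// Inverse document frequency, assuming the form ln(1 + (N - n + 0.5) / (n + 0.5)),
/// where N is the number of documents and n the number containing the term.
pub fn idf(total_docs: usize, docs_with_term: usize) -> f32 {
    let n = total_docs as f32;
    let df = docs_with_term as f32;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// Score contribution of one query term for one document.
/// `tf` is the term frequency in the document, `doc_len` its token count.
pub fn term_score(p: &Params, tf: f32, doc_len: f32, avg_doc_len: f32, idf: f32) -> f32 {
    // b scales how strongly documents longer than average are penalized.
    let length_norm = 1.0 - p.b + p.b * (doc_len / avg_doc_len);
    // k1 controls how quickly repeated occurrences of a term saturate.
    idf * (tf * (p.k1 + 1.0)) / (tf + p.k1 * length_norm)
}
```

A document's score for a query is the sum of `term_score` over the query terms. Raising k1 lets repeated terms keep adding to the score for longer, and setting b to 0 disables length normalization entirely.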
Error Handling
The library uses a custom error type BM25Error for various error conditions:
- BM25Error::Io: IO errors during read/write operations.
- BM25Error::Cbor: Serialization/deserialization errors.
- BM25Error::AlreadyExists: When trying to add a document with an ID that already exists.
- BM25Error::TokenizeFailed: When tokenization produces no tokens for a document.
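One way a caller might branch on these error conditions is sketched below. `SketchError` is a stand-in defined only for this example: its variant names mirror the ones listed above, but the payload types are assumptions, and the real BM25Error may carry different data.

```rust
use std::io;

/// Stand-in error enum for illustration. Variant names mirror the list above;
/// the payloads are assumptions, not the crate's actual definition.
#[derive(Debug)]
pub enum SketchError {
    Io(io::Error),
    Cbor(String),
    AlreadyExists(u64),
    TokenizeFailed(u64),
}

/// Human-readable description of each error condition.
pub fn describe(err: &SketchError) -> String {
    match err {
        SketchError::Io(e) => format!("I/O failure: {e}"),
        SketchError::Cbor(msg) => format!("CBOR (de)serialization failure: {msg}"),
        SketchError::AlreadyExists(id) => format!("document {id} already exists"),
        SketchError::TokenizeFailed(id) => format!("document {id} produced no tokens"),
    }
}

/// In this sketch, per-document problems are treated as skippable during a
/// batch insert, while I/O and serialization failures abort the operation.
pub fn is_recoverable(err: &SketchError) -> bool {
    matches!(
        err,
        SketchError::AlreadyExists(_) | SketchError::TokenizeFailed(_)
    )
}
```

Whether duplicates and empty documents should be skipped or reported is an application-level choice. The split shown here is only one reasonable policy.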
Performance Considerations
- For large documents, the library automatically uses parallel processing for tokenization.
- The search function uses parallel processing for query terms.
- For best performance with large indices, consider using SSD storage for serialized indices.
- Memory usage scales with the number of documents and unique terms.
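The idea behind parallel tokenization can be sketched as follows: split the text into chunks, count terms in each chunk on its own thread, then merge the partial counts. The library itself uses Rayon. This sketch substitutes standard-library scoped threads to stay dependency-free, and `parallel_term_counts` is a hypothetical helper, not a function from anda_db_tfs.

```rust
use std::collections::HashMap;
use std::thread;

/// Count lowercase term frequencies, spreading the work over up to `workers` threads.
/// Hypothetical helper for illustration; whitespace splitting stands in for a real tokenizer.
pub fn parallel_term_counts(text: &str, workers: usize) -> HashMap<String, usize> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() {
        return HashMap::new();
    }
    let workers = workers.max(1);
    // Ceiling division, so every word lands in exactly one chunk.
    let chunk_size = (words.len() + workers - 1) / workers;

    // Each scoped thread counts the terms of one chunk.
    let partials: Vec<HashMap<String, usize>> = thread::scope(|s| {
        let handles: Vec<_> = words
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut counts: HashMap<String, usize> = HashMap::new();
                    for w in chunk {
                        *counts.entry(w.to_lowercase()).or_insert(0) += 1;
                    }
                    counts
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Merge the per-chunk counts into a single map.
    let mut total: HashMap<String, usize> = HashMap::new();
    for partial in partials {
        for (term, n) in partial {
            *total.entry(term).or_insert(0) += n;
        }
    }
    total
}
```

For short inputs the cost of spawning threads outweighs the gain, which is consistent with applying parallelism only to large documents.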
License
Copyright © 2025 LDC Labs.
ldclabs/anda-db is licensed under the MIT License. See LICENSE for the full license text.