§Anda-DB BM25 Full-Text Search Library
anda_db_tfs is a thread-safe, embeddable full-text search engine based on
the Okapi BM25 ranking algorithm.
It is the text-indexing component of AndaDB and is specifically designed to
back the long-term textual memory of AI agents.
§Features
- BM25 ranking with configurable k1 and b parameters.
- Composable tokenization: plug any Tokenizer, including chains of filters, into the same index. Built-in helpers cover Latin/Cyrillic/Arabic text and Chinese via jieba.
- Boolean query language with AND, OR, NOT, and parentheses, exposed through BM25Index::search_advanced.
- Concurrent reads and writes powered by dashmap + atomic counters, so inserts, removes, and searches can run from multiple threads.
- Incremental persistence: the inverted index is sharded into buckets of bounded CBOR size; only dirty buckets are re-written on BM25Index::flush.
- Bucket compaction via BM25Index::compact_buckets to repack a fragmented index into the minimum number of buckets.
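To make the role of k1 and b concrete, here is a minimal, std-only sketch of the textbook Okapi BM25 term score. It is not the crate's code: the names Params, idf, and term_score are hypothetical, and the IDF variant (the non-negative "plus one" form) is an assumption, so the crate may compute scores differently.

```rust
/// Tuning knobs of the BM25 formula (hypothetical struct for illustration).
#[derive(Clone, Copy, Debug)]
pub struct Params {
    /// Term-frequency saturation: larger values let repeated terms count more.
    pub k1: f32,
    /// Length-normalization strength in [0, 1]; 0 disables normalization.
    pub b: f32,
}

/// Inverse document frequency in the "plus one" form, which is never negative.
/// Assumption: the real crate may use a different IDF variant.
pub fn idf(total_docs: usize, docs_with_term: usize) -> f32 {
    let n = total_docs as f32;
    let df = docs_with_term as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Score contribution of a single query term for a single document.
pub fn term_score(p: Params, tf: f32, doc_len: f32, avg_doc_len: f32, idf: f32) -> f32 {
    // Documents longer than average are penalized in proportion to `b`.
    let norm = 1.0 - p.b + p.b * doc_len / avg_doc_len;
    idf * tf * (p.k1 + 1.0) / (tf + p.k1 * norm)
}
```

A document's score for a multi-term query is the sum of term_score over the query terms. Raising k1 delays saturation for repeated terms; lowering b toward 0 makes document length matter less.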
§Quick start
use anda_db_tfs::{BM25Index, default_tokenizer};
let index = BM25Index::new("notes".to_string(), default_tokenizer(), None);
index.insert(1, "The quick brown fox jumps over the lazy dog", 0).unwrap();
index.insert(2, "A fast brown fox runs past the lazy dog", 0).unwrap();
let hits = index.search("fox", 10, None);
for (doc_id, score) in hits {
    println!("doc {doc_id}: {score}");
}

See the README and docs/anda_db_tfs.md for a full technical overview.
Structs§
- BM25Config: Top-level configuration of a BM25Index.
- BM25Index: Concurrent, bucket-sharded full-text index using BM25 scoring.
- BM25Metadata: Index metadata.
- BM25Params: Parameters controlling the BM25 scoring formula.
- BM25Stats: Index statistics.
- BoxTokenStream: Simple wrapper of Box<dyn TokenStream + 'a>.
- Token: Token
- TokenizerChain: A type-erased pipeline of a base Tokenizer followed by zero or more TokenFilters.
- TokenizerChainBuilder: Builder for TokenizerChain.
Enums§
- BM25Error: Errors that can occur when working with a BM25 index.
- QueryType: Represents different types of boolean queries that can be parsed from a query string. Supports Term, Or, And, and Not operations for building complex search expressions. Operator precedence: OR < AND < NOT.
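The precedence rule OR < AND < NOT can be illustrated with a small recursive-descent parser. This is only a sketch under stated assumptions: the Query enum and the parse function are hypothetical stand-ins, not the crate's QueryType or its parser, and the grammar (uppercase operators, whitespace-separated terms) is assumed for illustration.

```rust
/// Hypothetical query tree; NOT the crate's `QueryType`.
#[derive(Debug, PartialEq)]
pub enum Query {
    Term(String),
    And(Box<Query>, Box<Query>),
    Or(Box<Query>, Box<Query>),
    Not(Box<Query>),
}

/// Parses a query string; returns `None` on malformed input.
pub fn parse(input: &str) -> Option<Query> {
    // Pad parentheses so they split into their own tokens.
    let spaced = input.replace('(', " ( ").replace(')', " ) ");
    let tokens: Vec<&str> = spaced.split_whitespace().collect();
    let mut pos = 0;
    let query = parse_or(&tokens, &mut pos)?;
    // Reject trailing tokens such as an unmatched ")".
    if pos == tokens.len() { Some(query) } else { None }
}

// Lowest precedence: OR binds loosest, so it is parsed at the outermost level.
fn parse_or(tokens: &[&str], pos: &mut usize) -> Option<Query> {
    let mut left = parse_and(tokens, pos)?;
    while tokens.get(*pos) == Some(&"OR") {
        *pos += 1;
        let right = parse_and(tokens, pos)?;
        left = Query::Or(Box::new(left), Box::new(right));
    }
    Some(left)
}

// Middle precedence: AND groups tighter than OR.
fn parse_and(tokens: &[&str], pos: &mut usize) -> Option<Query> {
    let mut left = parse_not(tokens, pos)?;
    while tokens.get(*pos) == Some(&"AND") {
        *pos += 1;
        let right = parse_not(tokens, pos)?;
        left = Query::And(Box::new(left), Box::new(right));
    }
    Some(left)
}

// Highest precedence: NOT applies to the operand immediately after it.
fn parse_not(tokens: &[&str], pos: &mut usize) -> Option<Query> {
    if tokens.get(*pos) == Some(&"NOT") {
        *pos += 1;
        return Some(Query::Not(Box::new(parse_not(tokens, pos)?)));
    }
    parse_primary(tokens, pos)
}

// A primary is a bare term or a parenthesized sub-expression.
fn parse_primary(tokens: &[&str], pos: &mut usize) -> Option<Query> {
    let token = *tokens.get(*pos)?;
    *pos += 1;
    match token {
        "(" => {
            let inner = parse_or(tokens, pos)?;
            if tokens.get(*pos) == Some(&")") {
                *pos += 1;
                Some(inner)
            } else {
                None
            }
        }
        ")" | "AND" | "OR" | "NOT" => None,
        term => Some(Query::Term(term.to_string())),
    }
}
```

With this grammar, "a OR b AND NOT c" groups as a OR (b AND (NOT c)), and parentheses override the default grouping, as in "(a OR b) AND c".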
Traits§
- BoxableTokenizer: A boxable Tokenizer, with its TokenStream type erased.
- TokenFilter: Trait for the pluggable components of Tokenizers.
- TokenStream: TokenStream is the result of the tokenization.
- Tokenizer: Tokenizers are in charge of splitting text into a stream of tokens before indexing.
Functions§
- collect_tokens: Tokenizes text and optionally filters tokens.
- default_tokenizer: Creates a default English-friendly tokenizer chain.
- flat_full_text_search: Performs a simple full-text search by finding matching tokens in a document.
Type Aliases§
- BoxError
- PostingValue: Type alias for posting values: (bucket id, Vec<(document_id, token_frequency)>)
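The posting shape above pairs each token's postings with a bucket id, which is what makes the "only dirty buckets are re-written" behavior listed under Features possible. The sketch below is a toy model of that idea, not the crate's implementation: ToyIndex, its methods, and the concrete integer types (u32 bucket ids, u64 document ids, usize frequencies) are all assumptions made for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Assumed concrete shape of a posting value:
/// (bucket id, Vec<(document_id, token_frequency)>).
pub type Posting = (u32, Vec<(u64, usize)>);

/// Toy inverted index that tracks which buckets changed since the last flush.
#[derive(Default)]
pub struct ToyIndex {
    postings: HashMap<String, Posting>,
    dirty: BTreeSet<u32>,
}

impl ToyIndex {
    /// Records `tf` occurrences of `token` in `doc_id`.
    /// A token seen for the first time is placed in `bucket`;
    /// a known token stays in the bucket it already lives in.
    pub fn add(&mut self, token: &str, doc_id: u64, tf: usize, bucket: u32) {
        let entry = self
            .postings
            .entry(token.to_string())
            .or_insert_with(|| (bucket, Vec::new()));
        entry.1.push((doc_id, tf));
        // Only the bucket that holds this token needs to be rewritten.
        self.dirty.insert(entry.0);
    }

    /// Returns the bucket ids that would be rewritten, in ascending order,
    /// and clears the dirty set.
    pub fn flush(&mut self) -> Vec<u32> {
        std::mem::take(&mut self.dirty).into_iter().collect()
    }
}
```

After a flush, touching a single token marks only that token's bucket as dirty, so the cost of persisting a small update scales with the number of buckets touched rather than with the size of the whole index.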