Crate anda_db_tfs

§Anda-DB BM25 Full-Text Search Library

anda_db_tfs is a thread-safe, embeddable full-text search engine based on the Okapi BM25 ranking algorithm. It is the text-indexing component of AndaDB and is specifically designed to back the long-term textual memory of AI agents.

§Features

  • BM25 ranking with configurable k1 and b parameters.
  • Composable tokenization: plug any Tokenizer — including chains of filters — into the same index. Built-in helpers cover Latin/Cyrillic/Arabic text and Chinese via jieba.
  • Boolean query language with AND, OR, NOT, and parentheses, exposed through BM25Index::search_advanced.
  • Concurrent reads and writes powered by dashmap + atomic counters, so inserts, removes, and searches can run from multiple threads.
  • Incremental persistence: the inverted index is sharded into buckets of bounded CBOR size; only dirty buckets are re-written on BM25Index::flush.
  • Bucket compaction via BM25Index::compact_buckets to repack a fragmented index into the minimum number of buckets.
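
To make the role of k1 and b concrete, the following is a minimal, self-contained sketch of the textbook Okapi BM25 per-term score. It does not use this crate: the `Params` struct, the function names, and the smoothed IDF variant are assumptions chosen for illustration, and the crate's internal formula may differ in detail.

```rust
/// Hypothetical parameter struct mirroring the k1 and b knobs named above.
#[derive(Clone, Copy, Debug)]
pub struct Params {
    /// Term-frequency saturation: higher values let repeated terms count for more.
    pub k1: f64,
    /// Length normalization: 0.0 ignores document length, 1.0 normalizes fully.
    pub b: f64,
}

/// Smoothed inverse document frequency, ln(1 + (N - n + 0.5) / (n + 0.5)).
/// This variant is an assumption; it stays positive even for very common terms.
pub fn idf(total_docs: usize, docs_with_term: usize) -> f64 {
    let n = total_docs as f64;
    let df = docs_with_term as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 contribution of one query term to one document's score.
pub fn term_score(p: Params, tf: f64, doc_len: f64, avg_doc_len: f64, idf_weight: f64) -> f64 {
    // Documents longer than average are penalized in proportion to b.
    let norm = 1.0 - p.b + p.b * doc_len / avg_doc_len;
    idf_weight * tf * (p.k1 + 1.0) / (tf + p.k1 * norm)
}
```

In this sketch a document's score for a multi-term query is the sum of `term_score` over the query terms. Setting `b` to 0.0 removes length normalization, and as `tf` grows the score approaches but never reaches `idf_weight * (k1 + 1.0)`.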

§Quick start

use anda_db_tfs::{BM25Index, default_tokenizer};

let index = BM25Index::new("notes".to_string(), default_tokenizer(), None);
index.insert(1, "The quick brown fox jumps over the lazy dog", 0).unwrap();
index.insert(2, "A fast brown fox runs past the lazy dog", 0).unwrap();

let hits = index.search("fox", 10, None);
for (doc_id, score) in hits {
    println!("doc {doc_id}: {score}");
}

See the README and docs/anda_db_tfs.md for a full technical overview.

Structs§

BM25Config
Top-level configuration of a BM25Index.
BM25Index
Concurrent, bucket-sharded full-text index using BM25 scoring.
BM25Metadata
Index metadata.
BM25Params
Parameters controlling the BM25 scoring formula.
BM25Stats
Index statistics.
BoxTokenStream
Simple wrapper of Box<dyn TokenStream + 'a>.
Token
Token
TokenizerChain
A type-erased pipeline of a base Tokenizer followed by zero or more TokenFilters.
TokenizerChainBuilder
Builder for TokenizerChain.

Enums§

BM25Error
Errors that can occur when working with a BM25 index.
QueryType
Represents different types of boolean queries that can be parsed from a query string. Supports Term, Or, And, and Not operations for building complex search expressions. Operator precedence: OR < AND < NOT.
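
The stated precedence (OR binds loosest, NOT tightest) can be illustrated with a small recursive-descent parser. This is a sketch only and is not the crate's parser: the `Query` enum, `parse`, and `matches` are hypothetical names, and the grammar assumes uppercase AND, OR, and NOT keywords separated by whitespace, which the crate's real syntax may not require.

```rust
use std::collections::HashSet;

/// Hypothetical query tree; the crate's QueryType may be shaped differently.
#[derive(Debug, PartialEq)]
pub enum Query {
    Term(String),
    Not(Box<Query>),
    And(Vec<Query>),
    Or(Vec<Query>),
}

/// Parses a query string, returning None on malformed input.
pub fn parse(input: &str) -> Option<Query> {
    // Pad parentheses so they split off as their own tokens.
    let spaced = input.replace('(', " ( ").replace(')', " ) ");
    let tokens: Vec<&str> = spaced.split_whitespace().collect();
    let mut pos = 0;
    let query = parse_or(&tokens, &mut pos)?;
    if pos == tokens.len() { Some(query) } else { None }
}

// Lowest precedence: a sequence of AND-groups joined by OR.
fn parse_or(t: &[&str], pos: &mut usize) -> Option<Query> {
    let mut parts = vec![parse_and(t, pos)?];
    while t.get(*pos).copied() == Some("OR") {
        *pos += 1;
        parts.push(parse_and(t, pos)?);
    }
    Some(if parts.len() == 1 { parts.remove(0) } else { Query::Or(parts) })
}

// Middle precedence: a sequence of unary expressions joined by AND.
fn parse_and(t: &[&str], pos: &mut usize) -> Option<Query> {
    let mut parts = vec![parse_not(t, pos)?];
    while t.get(*pos).copied() == Some("AND") {
        *pos += 1;
        parts.push(parse_not(t, pos)?);
    }
    Some(if parts.len() == 1 { parts.remove(0) } else { Query::And(parts) })
}

// Highest precedence: NOT, parenthesized groups, and bare terms.
fn parse_not(t: &[&str], pos: &mut usize) -> Option<Query> {
    match t.get(*pos).copied()? {
        "NOT" => {
            *pos += 1;
            Some(Query::Not(Box::new(parse_not(t, pos)?)))
        }
        "(" => {
            *pos += 1;
            let inner = parse_or(t, pos)?;
            if t.get(*pos).copied() != Some(")") {
                return None;
            }
            *pos += 1;
            Some(inner)
        }
        ")" | "AND" | "OR" => None,
        term => {
            *pos += 1;
            Some(Query::Term(term.to_lowercase()))
        }
    }
}

/// Evaluates a query against the set of tokens found in one document.
pub fn matches(q: &Query, doc: &HashSet<&str>) -> bool {
    match q {
        Query::Term(term) => doc.contains(term.as_str()),
        Query::Not(inner) => !matches(inner, doc),
        Query::And(parts) => parts.iter().all(|p| matches(p, doc)),
        Query::Or(parts) => parts.iter().any(|p| matches(p, doc)),
    }
}
```

Under these rules `fox OR dog AND NOT lazy` groups as `fox OR (dog AND (NOT lazy))`, so parentheses are needed to apply the NOT to both alternatives, as in `(fox OR dog) AND NOT lazy`.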

Traits§

BoxableTokenizer
A boxable Tokenizer, with its TokenStream type erased.
TokenFilter
Trait for the pluggable components of Tokenizers.
TokenStream
TokenStream is the result of the tokenization.
Tokenizer
Tokenizers are in charge of splitting text into a stream of tokens before indexing.

Functions§

collect_tokens
Tokenizes text and optionally filters tokens.
default_tokenizer
Creates a default English-friendly tokenizer chain.
flat_full_text_search
Performs a simple full-text search by finding matching tokens in a document.

Type Aliases§

BoxError
PostingValue
Type alias for posting values: (bucket id, Vec<(document_id, token_frequency)>)
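
The posting shape above, together with the dirty-bucket persistence described under Features, can be sketched as follows. This is an illustrative toy, not the crate's implementation: `ToyIndex`, its methods, and the integer widths in the alias are assumptions, and the real index uses concurrent maps rather than `&mut self`.

```rust
use std::collections::{BTreeSet, HashMap};

/// Mirrors the documented shape: (bucket id, Vec<(document_id, token_frequency)>).
/// The concrete integer widths are assumptions.
type PostingValue = (u32, Vec<(u64, usize)>);

/// Hypothetical single-threaded index that tracks which buckets changed.
#[derive(Default)]
pub struct ToyIndex {
    postings: HashMap<String, PostingValue>,
    dirty: BTreeSet<u32>,
}

impl ToyIndex {
    /// Records `tf` occurrences of `token` in `doc_id`.
    /// A token seen for the first time is assigned to `bucket`;
    /// a known token stays in the bucket it already lives in.
    pub fn add(&mut self, token: &str, doc_id: u64, tf: usize, bucket: u32) {
        let entry = self
            .postings
            .entry(token.to_string())
            .or_insert_with(|| (bucket, Vec::new()));
        entry.1.push((doc_id, tf));
        // Only the bucket that owns this token needs rewriting.
        self.dirty.insert(entry.0);
    }

    /// Returns the bucket ids that would be rewritten, in ascending order,
    /// and clears the dirty set.
    pub fn flush(&mut self) -> Vec<u32> {
        std::mem::take(&mut self.dirty).into_iter().collect()
    }
}
```

Because each token carries its bucket id, an update touches exactly one bucket, which is what lets a flush skip every bucket that has not changed since the previous one.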