docbert – a local document search engine combining BM25 and ColBERT reranking.
docbert indexes collections of markdown and text files, providing fast keyword search via Tantivy with optional neural reranking via ColBERT (specifically the ColBERT-Zero model via pylate-rs).
§Architecture
The search pipeline has two stages:
1. BM25 retrieval – Tantivy indexes documents with English stemming and retrieves the top 1000 candidates for a query. Optionally includes fuzzy matching (Levenshtein distance 1).
2. ColBERT reranking – each candidate’s per-token embedding matrix is compared against the query embedding using MaxSim scoring, producing a semantic relevance score that captures meaning beyond keyword overlap.
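The MaxSim idea can be sketched in a few lines: for each query token embedding, take its best match (maximum dot product) over all document token embeddings, then sum those maxima. This is a minimal illustration with toy 2-dimensional vectors and hypothetical helper names, not the crate's reranker API; it assumes embeddings are L2-normalized so the dot product equals cosine similarity.

```rust
// Dot product of two equal-length embedding vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// MaxSim: for each query token, the best similarity over all document
// tokens, summed across query tokens.
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    // Two query tokens, three document tokens (toy 2-dim embeddings).
    let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc = vec![vec![1.0, 0.0], vec![0.7, 0.7], vec![0.0, 1.0]];
    let score = maxsim(&query, &doc);
    // Each query token finds an exact match in the document: 1.0 + 1.0.
    assert!((score - 2.0).abs() < 1e-6);
}
```

Because each query token picks its own best document token independently, MaxSim rewards documents that cover all parts of the query, even when the matching tokens are far apart.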
§Storage
All data is stored locally in three databases managed by DataDir:
- config.db (ConfigDb) – collections, contexts, document metadata, settings
- embeddings.db (EmbeddingDb) – ColBERT per-token embedding matrices
- tantivy/ (SearchIndex) – BM25 full-text search index
§Quick start
use docbert::{ConfigDb, DataDir, SearchIndex, EmbeddingDb, ModelManager};
use docbert::search::{self, SearchParams};
// Open databases
let data_dir = DataDir::resolve(None).unwrap();
let config_db = ConfigDb::open(&data_dir.config_db()).unwrap();
let search_index = SearchIndex::open(&data_dir.tantivy_dir().unwrap()).unwrap();
let embedding_db = EmbeddingDb::open(&data_dir.embeddings_db()).unwrap();
let mut model = ModelManager::new();
// Search with BM25 only (no model download required)
let params = SearchParams {
query: "rust programming".to_string(),
count: 10,
collection: None,
min_score: 0.0,
bm25_only: true,
no_fuzzy: false,
all: false,
};
let results = search::execute_search(&params, &search_index, &embedding_db, &mut model)
.unwrap();
for r in &results {
println!("{}: {}:{} (score: {:.3})", r.rank, r.collection, r.path, r.score);
}
§Indexing documents
use docbert::{ConfigDb, SearchIndex, EmbeddingDb, ModelManager, DocumentId};
use docbert::{walker, ingestion, embedding};
// Discover files in a directory (here a temporary one via the tempfile crate)
let tmp = tempfile::tempdir().unwrap();
let files = walker::discover_files(tmp.path()).unwrap();
// Index into Tantivy
let index = SearchIndex::open_in_ram().unwrap();
let mut writer = index.writer(15_000_000).unwrap();
let count = ingestion::ingest_files(&index, &mut writer, "notes", &files).unwrap();
// Optionally compute ColBERT embeddings (downloads model on first use)
let emb_db = EmbeddingDb::open(&tmp.path().join("emb.db")).unwrap();
let mut model = ModelManager::new();
// embedding::embed_and_store(&mut model, &emb_db, docs).unwrap();
Re-exports§
pub use config_db::ConfigDb;
pub use data_dir::DataDir;
pub use doc_id::DocumentId;
pub use embedding_db::EmbeddingDb;
pub use error::Error;
pub use error::Result;
pub use model_manager::ModelManager;
pub use tantivy_index::SearchIndex;
Modules§
- chunking – Chunking utilities for splitting long documents into overlapping segments.
- config_db
- data_dir
- doc_id
- embedding
- embedding_db
- error
- incremental
- ingestion
- mcp
- model_manager
- reranker
- search
- tantivy_index
- text_util
- walker
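The chunking module's job – splitting long documents into overlapping segments so each chunk fits a model's context window – can be sketched as a sliding window over tokens. This is a hypothetical illustration, not the crate's actual `chunking` API; `size` and `overlap` are assumed parameter names.

```rust
// Split a token slice into windows of `size` tokens, each overlapping
// the previous window by `overlap` tokens. The final window may be
// shorter; overlap must be strictly smaller than size so the window
// always advances.
fn chunk<'a>(tokens: &'a [&'a str], size: usize, overlap: usize) -> Vec<Vec<&'a str>> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + size).min(tokens.len());
        chunks.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    let tokens: Vec<&str> = "a b c d e f g".split(' ').collect();
    let chunks = chunk(&tokens, 4, 2);
    // Windows: [a b c d], [c d e f], [e f g]
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[1], vec!["c", "d", "e", "f"]);
}
```

Overlap ensures that a phrase falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.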