Crate docbert

docbert – a local document search engine combining BM25 and ColBERT reranking.

docbert indexes collections of Markdown and plain-text files, providing fast keyword search through Tantivy, with optional neural reranking by ColBERT (specifically the ColBERT-Zero model, loaded through pylate-rs).

§Architecture

The search pipeline has two stages:

  1. BM25 retrieval – Tantivy indexes documents with English stemming and retrieves the top 1000 candidates for a query. Optionally includes fuzzy matching (Levenshtein distance 1).

  2. ColBERT reranking – each candidate’s per-token embedding matrix is compared against the query embedding using MaxSim scoring, producing a semantic relevance score that captures meaning beyond keyword overlap.
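The MaxSim scoring in stage 2 can be illustrated in plain Rust: for each query-token embedding, take the maximum dot product against every document-token embedding, then sum those maxima. This is a minimal sketch of the idea, not docbert's actual `reranker` module:

```rust
/// MaxSim: for each query token, find its best-matching document token
/// (maximum dot product), then sum these maxima over all query tokens.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    // Two query-token embeddings, three document-token embeddings (dim 2).
    let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc = vec![vec![0.9, 0.1], vec![0.2, 0.8], vec![0.5, 0.5]];
    // Query token 1 best matches doc token 1 (0.9);
    // query token 2 best matches doc token 2 (0.8).
    let score = max_sim(&query, &doc);
    assert!((score - 1.7).abs() < 1e-5);
    println!("MaxSim score: {score:.3}");
}
```

Because each query token independently picks its best document token, the score rewards documents that cover all aspects of the query, not just the ones with the highest overall similarity.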

§Storage

All data is stored locally in three stores managed by DataDir:

  • config.db (ConfigDb) – collections, contexts, document metadata, settings
  • embeddings.db (EmbeddingDb) – ColBERT per-token embedding matrices
  • tantivy/ (SearchIndex) – BM25 full-text search index

§Quick start

use docbert::{ConfigDb, DataDir, SearchIndex, EmbeddingDb, ModelManager};
use docbert::search::{self, SearchParams};

// Open databases
let data_dir = DataDir::resolve(None).unwrap();
let config_db = ConfigDb::open(&data_dir.config_db()).unwrap();
let search_index = SearchIndex::open(&data_dir.tantivy_dir().unwrap()).unwrap();
let embedding_db = EmbeddingDb::open(&data_dir.embeddings_db()).unwrap();
let mut model = ModelManager::new();

// Search with BM25 only (no model download required)
let params = SearchParams {
    query: "rust programming".to_string(),
    count: 10,
    collection: None,
    min_score: 0.0,
    bm25_only: true,
    no_fuzzy: false,
    all: false,
};

let results = search::execute_search(&params, &search_index, &embedding_db, &mut model)
    .unwrap();
for r in &results {
    println!("{}: {}:{} (score: {:.3})", r.rank, r.collection, r.path, r.score);
}

§Indexing documents

use docbert::{SearchIndex, EmbeddingDb, ModelManager};
use docbert::{walker, ingestion, embedding};

// Discover files under a temporary directory (`tempfile` crate)
let tmp = tempfile::tempdir().unwrap();
let files = walker::discover_files(tmp.path()).unwrap();

// Index into an in-memory Tantivy index
let index = SearchIndex::open_in_ram().unwrap();
let mut writer = index.writer(15_000_000).unwrap();
let count = ingestion::ingest_files(&index, &mut writer, "notes", &files).unwrap();

// Optionally compute ColBERT embeddings (downloads the model on first use)
let emb_db = EmbeddingDb::open(&tmp.path().join("emb.db")).unwrap();
let mut model = ModelManager::new();
// embedding::embed_and_store(&mut model, &emb_db, docs).unwrap();
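The file-discovery step above can be approximated with `std::fs` alone. The following is a hypothetical stand-in for `walker::discover_files`, shown only to illustrate the extension filtering; docbert's real walker may filter and traverse differently:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect .md and .txt files under `root` (sketch of what a
/// document walker does; not docbert's actual implementation).
fn discover_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            files.extend(discover_files(&path)?);
        } else if matches!(
            path.extension().and_then(|e| e.to_str()),
            Some("md") | Some("txt")
        ) {
            files.push(path);
        }
    }
    Ok(files)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("docbert_walk_demo");
    let _ = fs::remove_dir_all(&dir); // start from a clean directory
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("note.md"), "# hello")?;
    fs::write(dir.join("data.bin"), [0u8; 4])?;
    let files = discover_files(&dir)?;
    // Only the markdown file passes the extension filter.
    assert_eq!(files.len(), 1);
    println!("found: {files:?}");
    Ok(())
}
```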

Re-exports§

pub use config_db::ConfigDb;
pub use data_dir::DataDir;
pub use doc_id::DocumentId;
pub use embedding_db::EmbeddingDb;
pub use error::Error;
pub use error::Result;
pub use model_manager::ModelManager;
pub use tantivy_index::SearchIndex;

Modules§

chunking
Chunking utilities for splitting long documents into overlapping segments.
config_db
data_dir
doc_id
embedding
embedding_db
error
incremental
ingestion
mcp
model_manager
reranker
search
tantivy_index
text_util
walker
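The overlapping-segment splitting described for the chunking module can be sketched over a token sequence: each chunk holds up to `size` tokens, and consecutive chunks share `overlap` tokens. This is an illustrative sketch, not the chunking module's actual API:

```rust
/// Split a token sequence into overlapping chunks: each chunk has up to
/// `size` tokens and consecutive chunks share `overlap` tokens.
fn chunk<'a>(tokens: &[&'a str], size: usize, overlap: usize) -> Vec<Vec<&'a str>> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + size).min(tokens.len());
        chunks.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    let tokens: Vec<&str> = "a b c d e f g".split(' ').collect();
    let chunks = chunk(&tokens, 4, 2);
    // Windows of 4 tokens, stepping by 2: [a b c d], [c d e f], [e f g]
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[1], ["c", "d", "e", "f"]);
    println!("{chunks:?}");
}
```

The overlap ensures that a passage straddling a chunk boundary still appears whole in at least one chunk, which matters for both BM25 retrieval and per-token embedding quality.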