docbert
docbert is a CLI for searching local documents. It uses BM25 to find likely matches quickly, then reranks them with ColBERT.
Point it at one or more folders, sync the index, and search across Markdown or plain text files.
What it does
- two-stage search with BM25 and ColBERT
- semantic-only full scans with docbert ssearch
- named collections for grouping directories
- incremental indexing that only reprocesses changed files
- human-readable, JSON, or file-only output
- CUDA and Metal support for faster embedding work
- fuzzy matching for typo-tolerant queries
- Markdown and plain text support
Quick start
# Add a collection of markdown notes
# Build or update the index
# Search across all collections
# Search with semantic reranking (default)
# Run a semantic-only full scan (ColBERT only)
# Skip neural reranking and use BM25 only
# Output JSON for scripts
# Print matching file paths
MCP server
docbert can also run as an MCP (Model Context Protocol) server for editors and AI tools.
Available tools:
- docbert_search: keyword + semantic search, with optional collection filters
- semantic_search: semantic-only search across all documents
- docbert_get: fetch a document by path or #doc_id
- docbert_multi_get: fetch multiple documents with a glob pattern
- docbert_status: show index health and collection summaries
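For orientation, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method tools/call. The Python sketch below only builds such a request. The helper name tool_call_request and the "query" argument key are assumptions made for illustration; the argument schema each docbert tool accepts is not listed here, so check what the server reports before relying on it.

```python
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON string (hypothetical helper)."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool_name, "arguments": arguments},
        }
    )


# "query" is an assumed argument name, not taken from docbert's tool schema.
request = tool_call_request(1, "docbert_search", {"query": "rust ownership"})
```

In practice the editor or AI tool builds these requests for you; the sketch is only meant to show which tool name ends up where.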
Claude Desktop config
File: ~/Library/Application Support/Claude/claude_desktop_config.json
Claude Code config
File: ~/.claude/settings.json
Installation
With Nix
# CPU version
# CUDA version (for NVIDIA GPUs)
Shell completions for bash, zsh, and fish are installed with the Nix package.
From source
# CPU build
# With CUDA support
# With Metal support (macOS)
Usage
Manage collections
# Add a directory as a collection
# List all collections
# Remove a collection
Search
# Basic search (top 10 results)
# More results
# Search one collection
# Return all results above a score threshold
# Disable fuzzy matching
# Semantic-only full scan (slower on large corpora)
Retrieve documents
# Get a document by collection:path
# Get by document ID
# Output with metadata
# Get multiple documents with glob patterns
Maintenance
# Show system status
# Sync changes incrementally
# Sync one collection
# Full rebuild
# Rebuild one collection
How search works
Search happens in two steps:
- Tantivy runs BM25 retrieval, optionally with fuzzy matching, and returns a candidate set.
- pylate-rs reranks those candidates with ColBERT, using lightonai/ColBERT-Zero by default.
That gives you fast keyword search without losing semantic ranking.
If you want pure semantic ranking, docbert ssearch skips BM25 and scores every stored embedding. That is slower on large collections, but it avoids BM25 and fuzzy-matching bias.
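The difference between the two modes can be illustrated with a toy Python sketch. This is not docbert's implementation (docbert uses Tantivy and pylate-rs): the function names, the candidates parameter, and the hand-written TOY_VECTORS table standing in for ColBERT token embeddings are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical 3-d token vectors standing in for real ColBERT embeddings.
TOY_VECTORS = {
    "rust": (1.0, 0.0, 0.0),
    "borrow": (0.9, 0.0, 0.0),
    "cargo": (0.5, 0.8, 0.0),
    "dependency": (0.0, 1.0, 0.0),
    "python": (0.0, 0.0, 1.0),
    "pip": (0.0, 0.6, 0.8),
}


def embed(token):
    """Look up a toy vector; unknown tokens map to the zero vector."""
    return TOY_VECTORS.get(token, (0.0, 0.0, 0.0))


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Plain BM25 over pre-tokenized documents (no fuzzy matching)."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def maxsim(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best-matching doc token."""
    total = 0.0
    for q in map(embed, query_tokens):
        total += max(sum(x * y for x, y in zip(q, embed(t))) for t in doc_tokens)
    return total


def two_stage_search(query, docs, candidates=2):
    """Stage 1: BM25 shortlist. Stage 2: rerank the shortlist with MaxSim."""
    q = query.split()
    tokenized = [d.split() for d in docs]
    bm25 = bm25_scores(q, tokenized)
    matches = [i for i, s in enumerate(bm25) if s > 0]
    shortlist = sorted(matches, key=lambda i: bm25[i], reverse=True)[:candidates]
    return sorted(shortlist, key=lambda i: maxsim(q, tokenized[i]), reverse=True)


def semantic_only_search(query, docs):
    """Skip BM25 and score every document with MaxSim."""
    q = query.split()
    tokenized = [d.split() for d in docs]
    return sorted(range(len(docs)), key=lambda i: maxsim(q, tokenized[i]), reverse=True)


DOCS = ["rust rust borrow checker", "rust cargo crates", "python pip wheels"]
print(two_stage_search("rust dependency", DOCS))      # [1, 0]
print(semantic_only_search("rust dependency", DOCS))  # [1, 0, 2]
```

In this toy corpus BM25 alone would put document 0 first because it repeats the keyword, while the reranker prefers document 1. The semantic-only scan also scores document 2, which shares no keyword with the query and so never reaches the reranker in the two-stage path.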
Configuration
By default, docbert stores its data in $XDG_DATA_HOME/docbert/, or in ~/.local/share/docbert/ when XDG_DATA_HOME is not set.
Use --data-dir to override that:
Data directory resolution order:
- --data-dir CLI flag
- DOCBERT_DATA_DIR environment variable
- XDG default: $XDG_DATA_HOME/docbert/ or ~/.local/share/docbert/
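The resolution order above can be sketched in a few lines of Python. The function name resolve_data_dir and its parameters are hypothetical and only mirror the documented precedence; docbert's own code may treat edge cases, such as empty values, differently.

```python
import os
from pathlib import Path


def resolve_data_dir(cli_data_dir=None, env=None):
    """Return the data directory following the documented precedence (sketch)."""
    env = os.environ if env is None else env

    # 1. An explicit --data-dir value wins.
    if cli_data_dir:
        return Path(cli_data_dir)

    # 2. Otherwise use DOCBERT_DATA_DIR when it is set.
    if env.get("DOCBERT_DATA_DIR"):
        return Path(env["DOCBERT_DATA_DIR"])

    # 3. Otherwise fall back to the XDG default location.
    xdg_data_home = env.get("XDG_DATA_HOME")
    base = Path(xdg_data_home) if xdg_data_home else Path.home() / ".local" / "share"
    return base / "docbert"
```

Passing the environment as a plain mapping keeps the function easy to test without touching the real process environment.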
Environment variables
- DOCBERT_DATA_DIR: override the data directory
- DOCBERT_MODEL: override the ColBERT model
- DOCBERT_LOG: set the log level, for example debug, info, or warn
Model selection
Set a default model in config.db:
Override it for a single command:
Alternative models
The default model, lightonai/ColBERT-Zero, works out of the box. If you want to use another pylate-rs-compatible model:
# or
DOCBERT_MODEL=/path/to/model
Supported file types
- Markdown (.md)
- Plain text (.txt)
Performance notes
- Use --bm25-only when keyword search is enough.
- The ColBERT model loads on the first semantic search.
- GPU support speeds up embedding generation.
- Incremental indexing only reprocesses changed files.
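To make the incremental indexing note concrete, here is a minimal Python sketch of one common way to detect changed files: comparing content hashes against the state saved by the previous sync. How docbert actually detects changes is not described here, so the hashing approach, the function name plan_sync, and the state layout are all assumptions.

```python
import hashlib
from pathlib import Path

SUPPORTED_SUFFIXES = {".md", ".txt"}


def plan_sync(root, previous):
    """Compare files under root with a previous {relative_path: sha256} mapping.

    Returns (current_state, changed_paths, removed_paths). Only changed paths
    would need re-embedding; removed paths would be dropped from the index.
    """
    root = Path(root)
    current = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in SUPPORTED_SUFFIXES:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            current[path.relative_to(root).as_posix()] = digest

    # New files and files whose digest differs both count as changed.
    changed = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    removed = sorted(p for p in previous if p not in current)
    return current, changed, removed
```

On a first run with an empty previous state every supported file counts as changed; later runs touch only files that were edited or added.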
License
MIT OR Apache-2.0