# superbit

A lightweight, in-memory vector index for approximate nearest-neighbor (ANN) search using Locality-Sensitive Hashing.
## Overview
superbit provides fast approximate nearest-neighbor search over
high-dimensional vectors without the operational overhead of a full vector
database. It implements random hyperplane LSH (SimHash), a
locality-sensitive hashing scheme that hashes similar vectors into the same
buckets with high probability. Candidate vectors retrieved from the hash
tables are then re-ranked with an exact distance computation, giving a good
balance between speed and recall.
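The core idea behind random hyperplane LSH can be sketched in a few lines. The toy implementation below is illustrative, not superbit's internal code: each of `num_bits` random hyperplanes contributes one signature bit, namely the sign of the dot product between the vector and that hyperplane's normal. A tiny LCG stands in for a proper RNG crate to keep the sketch self-contained, and uniform components stand in for the Gaussian ones a real implementation would draw.

```rust
// Tiny deterministic pseudo-random generator -> value in [-1, 1).
// Used here only to keep the sketch dependency-free.
fn lcg(state: &mut u64) -> f32 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    ((*state >> 33) as f32 / (1u64 << 30) as f32) - 1.0
}

/// Hash `v` into a `num_bits`-bit signature: one bit per random hyperplane,
/// set when the vector lies on the non-negative side of that hyperplane.
fn simhash(v: &[f32], num_bits: u32, seed: u64) -> u64 {
    let mut state = seed;
    let mut sig = 0u64;
    for bit in 0..num_bits {
        // One fresh random hyperplane per bit.
        let dot: f32 = v.iter().map(|&x| x * lcg(&mut state)).sum();
        if dot >= 0.0 {
            sig |= 1u64 << bit;
        }
    }
    sig
}

fn main() {
    let a = [1.0f32, 0.9, 0.1];
    let b = [0.9f32, 1.0, 0.2];  // small angle to `a`
    let c = [-1.0f32, 0.1, -0.9]; // large angle to `a`
    let (ha, hb, hc) = (simhash(&a, 16, 42), simhash(&b, 16, 42), simhash(&c, 16, 42));
    // Hamming distance between signatures tracks angular distance.
    println!("a vs b differing bits: {}", (ha ^ hb).count_ones());
    println!("a vs c differing bits: {}", (ha ^ hc).count_ones());
}
```

Because only the sign of each dot product matters, the signature is scale-invariant, which is why cosine similarity is the natural fit for this scheme.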
Target use cases:
- Retrieval-augmented generation (RAG) prototyping
- Recommendation system experiments
- Embedding similarity search during development
- Anywhere you need sub-linear ANN queries and want to avoid external infrastructure
## Features
- Random hyperplane LSH (SimHash) for cosine, Euclidean, and dot-product similarity
- Multi-probe querying -- probe neighboring hash buckets to improve recall without adding more tables
- Thread-safe concurrent access via `parking_lot::RwLock` (parallel reads, exclusive writes)
- Builder pattern for ergonomic index configuration
- Auto-tuning -- `suggest_params` recommends `num_hashes`, `num_tables`, and `num_probes` given a target recall and dataset size
- Runtime metrics -- lock-free atomic counters track query latency, candidate counts, and bucket hit rates
- Optional features:
  - `parallel` -- parallel bulk insert and batch query via rayon
  - `persistence` -- save/load indexes to disk with serde + bincode (or JSON)
  - `python` -- Python bindings via PyO3
## Architecture
### Module Structure
```mermaid
graph TD
    A[<b>LshIndex</b><br/>Public API] --> B[RwLock&lt;IndexInner&gt;<br/>Thread-safe wrapper]
    A --> M[MetricsCollector<br/>Atomic counters]
    B --> V[vectors<br/>HashMap&lt;id, Array1&lt;f32&gt;&gt;]
    B --> T[tables<br/>Vec&lt;HashMap&lt;u64, Vec&lt;id&gt;&gt;&gt;]
    B --> H[hashers<br/>Vec&lt;RandomProjectionHasher&gt;]
    B --> C[IndexConfig]
    H --> |"sign(dot(v, proj))"| T
    subgraph Optional Features
        P[parallel<br/>rayon batch ops]
        S[persistence<br/>serde + bincode/JSON]
        PY[python<br/>PyO3 bindings]
    end
    A -.-> P
    A -.-> S
    A -.-> PY
```
### Query Flow
```mermaid
flowchart LR
    Q[Query Vector] --> N{Normalize?}
    N -->|Cosine| NORM[L2 Normalize]
    N -->|Other| HASH
    NORM --> HASH
    HASH[Hash with L Hashers] --> PROBE[Multi-probe:<br/>flip uncertain bits]
    PROBE --> T1[Table 1<br/>base + probes]
    PROBE --> T2[Table 2<br/>base + probes]
    PROBE --> TL[Table L<br/>base + probes]
    T1 --> UNION[Candidate Union<br/>deduplicate IDs]
    T2 --> UNION
    TL --> UNION
    UNION --> RANK[Exact Re-rank<br/>compute true distance]
    RANK --> TOPK[Return Top-K]
```
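The multi-probe step can be illustrated in isolation. A simplified sketch follows: `probe_sequence` is a hypothetical helper, not part of the public API, and real multi-probe schemes may also flip pairs of bits, whereas this version flips single bits only, starting with the hyperplanes whose dot products were closest to zero.

```rust
/// Return the base bucket plus `num_probes` neighboring buckets, obtained by
/// flipping the hash bits whose hyperplane margins (|dot product|) are
/// smallest -- the bits most likely to be "wrong" for a true near neighbor.
fn probe_sequence(base: u64, margins: &[f32], num_probes: usize) -> Vec<u64> {
    // Order bit positions by ascending margin (least confident first).
    let mut order: Vec<usize> = (0..margins.len()).collect();
    order.sort_by(|&i, &j| margins[i].partial_cmp(&margins[j]).unwrap());

    let mut buckets = vec![base];
    for &bit in order.iter().take(num_probes) {
        buckets.push(base ^ (1u64 << bit));
    }
    buckets
}

fn main() {
    // Bits 3 and 1 have the smallest margins, so they are flipped first.
    let probes = probe_sequence(0b1010, &[0.9, 0.1, 0.5, 0.05], 2);
    println!("{probes:?}"); // base bucket first, then single-bit flips
}
```

Each extra probe costs one more bucket lookup per table, which is why raising `num_probes` improves recall much more cheaply than adding whole tables.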
### Insert Flow
```mermaid
flowchart LR
    I[Insert: id, vector] --> DUP{ID exists?}
    DUP -->|Yes| REM[Remove old<br/>hash entries]
    DUP -->|No| NORM
    REM --> NORM{Normalize?}
    NORM -->|Cosine| DO_NORM[L2 Normalize]
    NORM -->|Other| STORE
    DO_NORM --> STORE
    STORE[Compute L hashes] --> BUCK[Push id into<br/>L hash buckets]
    BUCK --> VEC[Store vector in<br/>central HashMap]
```
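A toy version of this insert path, using the same vectors/tables layout shown in the module diagram. `TinyIndex` and its hash functions are illustrative stand-ins for superbit's actual types (and the normalization step is omitted):

```rust
use std::collections::HashMap;

// Stand-ins for superbit's RandomProjectionHasher set.
fn h_first(v: &[f32]) -> u64 { (v[0] >= 0.0) as u64 }
fn h_sum(v: &[f32]) -> u64 { (v.iter().sum::<f32>() >= 0.0) as u64 }

struct TinyIndex {
    vectors: HashMap<u64, Vec<f32>>,     // central vector storage
    tables: Vec<HashMap<u64, Vec<u64>>>, // one bucket map per hash table
    hash_fns: Vec<fn(&[f32]) -> u64>,    // one hasher per table
}

impl TinyIndex {
    fn insert(&mut self, id: u64, v: Vec<f32>) {
        // Overwrite case: scrub the id's stale bucket entries first.
        if self.vectors.contains_key(&id) {
            for table in &mut self.tables {
                for ids in table.values_mut() {
                    ids.retain(|&x| x != id);
                }
            }
        }
        // One hash and one bucket push per table, then store the vector.
        for (table, hash) in self.tables.iter_mut().zip(&self.hash_fns) {
            table.entry(hash(&v)).or_default().push(id);
        }
        self.vectors.insert(id, v);
    }
}

fn main() {
    let mut idx = TinyIndex {
        vectors: HashMap::new(),
        tables: vec![HashMap::new(), HashMap::new()],
        hash_fns: vec![h_first, h_sum],
    };
    idx.insert(1, vec![1.0, -2.0]);
    idx.insert(1, vec![-1.0, 2.0]); // re-insert moves the id to new buckets
    println!("vectors stored: {}", idx.vectors.len()); // 1
}
```

Note that removal on overwrite is the expensive path here, which is one reason stable ids and infrequent updates suit LSH indexes well.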
## Quick Start
Add the crate to your `Cargo.toml`:

```toml
[dependencies]
superbit = "0.1"
```
Build an index, insert vectors, and query. The builder method names below are illustrative; see the API docs for the exact signatures:

```rust
use superbit::LshIndex;

// Illustrative sketch of the builder-pattern API described above.
let mut index = LshIndex::builder()
    .dimensions(4)
    .num_hashes(8)
    .num_tables(16)
    .build();

index.insert(0, vec![0.1, 0.2, 0.3, 0.4]);
index.insert(1, vec![0.9, 0.8, 0.7, 0.6]);

// Top-3 approximate nearest neighbors of the query vector.
let neighbors = index.query(&[0.1, 0.2, 0.3, 0.4], 3);
```
## Feature Flags
| Flag | Effect |
|---|---|
| `parallel` | Parallel bulk insert and batch query via rayon |
| `persistence` | Save/load index to disk (serde + bincode + JSON) |
| `python` | Python bindings via PyO3 |
| `full` | Enables `parallel` + `persistence` |
Enable features in your `Cargo.toml`:

```toml
[dependencies]
superbit = { version = "0.1", features = ["full"] }
```
## Configuration Guide
The three main knobs that control the speed/recall/memory trade-off are:
| Parameter | What it controls | Higher value means |
|---|---|---|
| `num_hashes` | Hash bits per table (1--64) | Smaller, more selective buckets; lower recall per table but fewer candidates to re-rank |
| `num_tables` | Number of independent hash tables | Better recall (more chances to find a neighbor); more memory |
| `num_probes` | Extra neighboring buckets probed per table | Better recall without adding tables; slightly more query time |
Rules of thumb:
- Start with the defaults (`num_hashes = 8`, `num_tables = 16`, `num_probes = 3`) and measure recall on a held-out set.
- If recall is too low, increase `num_tables` or `num_probes` first.
- If queries are too slow (too many candidates), increase `num_hashes` to make buckets more selective.
- For cosine similarity the index L2-normalizes vectors on insertion by default (`normalize_vectors = true`).
## Auto-Tuning
Use `suggest_params` to get a starting configuration based on your dataset size
and target recall:
```rust
use superbit::suggest_params;

// Arguments are illustrative: (dataset size, target recall).
let params = suggest_params(100_000, 0.9);
println!("{params:?}");
```
You can also estimate the recall of a specific configuration without building an index:
```rust
use superbit::estimate_recall;

// `config` is the IndexConfig to evaluate; the exact signature is illustrative.
let recall = estimate_recall(&config);
println!("estimated recall: {recall:.2}");
```
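Such an estimate typically follows the standard SimHash collision analysis: a single random hyperplane separates two vectors at angle θ with probability θ/π, so a k-bit signature matches with probability (1 − θ/π)^k, and at least one of l tables matches with probability 1 − (1 − (1 − θ/π)^k)^l. A self-contained sketch of that math (not superbit's exact formula, which would also account for multi-probe):

```rust
use std::f32::consts::PI;

/// Probability that a neighbor at angle `theta` (radians) from the query
/// shares a bucket in at least one of `l` tables, with `k` bits per table
/// and no multi-probe.
fn lsh_recall(theta: f32, k: u32, l: u32) -> f32 {
    let p_bit = 1.0 - theta / PI;        // one hyperplane separates them w.p. theta/pi
    let p_table = p_bit.powi(k as i32);  // all k bits must agree
    1.0 - (1.0 - p_table).powi(l as i32) // at least one table collides
}

fn main() {
    // More tables raise recall; more bits per table lower it (but shrink buckets).
    println!("k=8,  l=4:  {:.3}", lsh_recall(0.3, 8, 4));
    println!("k=8,  l=16: {:.3}", lsh_recall(0.3, 8, 16));
    println!("k=16, l=16: {:.3}", lsh_recall(0.3, 16, 16));
}
```

This is the quantitative reason behind the rules of thumb above: `num_tables` and `num_probes` push recall up, while `num_hashes` trades recall for selectivity.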
## Performance
LSH-based indexing provides sub-linear query time by reducing the search space to a small set of candidate vectors. In practice:
- For datasets under ~10,000 vectors, brute-force linear scan is often fast enough and gives exact results. LSH adds overhead that may not pay off at this scale.
- For datasets above ~10,000 vectors, LSH becomes increasingly beneficial. Query time grows much more slowly than dataset size.
- With well-tuned parameters you can typically achieve 80--95% recall while examining only a small fraction of the dataset.
The `parallel` feature flag enables rayon-based parallelism for bulk inserts
(`par_insert_batch`) and batch queries (`par_query_batch`), which can
significantly speed up workloads that operate on many vectors at once.
Use the built-in metrics collector (`.enable_metrics()` on the builder) to
monitor query latency, candidate counts, and bucket hit rates in production.
## Comparison with Other Rust ANN Crates
| Crate | Algorithm | Notes |
|---|---|---|
| superbit | Random hyperplane LSH | Lightweight, pure Rust, no C/C++ deps. Good for prototyping and moderate-scale workloads. |
| usearch | HNSW | High performance, C++ core with Rust bindings. Better for large-scale production. |
| hora | HNSW / IVF-PQ | Pure Rust, multiple algorithms. More complex API. |
| hnsw_rs | HNSW | Pure Rust HNSW implementation. |
superbit is intentionally simple: a single algorithm, a small API
surface, and no native dependencies. It is a good fit for prototyping,
moderate-scale applications, and situations where you want to understand and
control the indexing behavior. For very large datasets (millions of vectors) or
when you need maximum throughput, a graph-based index like HNSW will generally
outperform LSH.
## License
Licensed under either of
at your option.