# jin
Approximate nearest neighbor search in Rust: algorithms and benchmarks for vector search.

(*jin*: Chinese 近, "near")

Dual-licensed under MIT or Apache-2.0.
## Distance metrics (what `jin` actually does today)
Different index implementations in `jin` currently assume different distance semantics.
This is not yet uniform across the crate.
| Index | Distance | Notes |
|---|---|---|
| `hnsw::HNSWIndex` | cosine distance | Fast path assumes **L2-normalized** vectors |
| `ivf_pq::IVFPQIndex` | cosine distance | Uses dot-based cosine distance for IVF + PQ |
| `scann::SCANNIndex` | inner product / cosine | Uses dot products; reranking uses cosine distance |
| `hnsw::dual_branch::DualBranchHNSW` | L2 distance | Experimental implementation |
| `quantization` | Hamming-like / binary distances | See `quantization::simd_ops::hamming_distance` and ternary helpers |
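The quick-start below uses `hnsw::HNSWIndex`; the example vectors are already unit-length, matching the normalization note in the table.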
```rust
use jin::hnsw::HNSWIndex;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Vectors should be L2-normalized for cosine distance.
    let vectors: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0, 0.0, 0.0],
        vec![0.0, 1.0, 0.0, 0.0],
        vec![0.0, 0.0, 1.0, 0.0],
        vec![0.0, 0.0, 0.0, 1.0],
    ];
    let query = vec![1.0, 0.0, 0.0, 0.0];

    let mut index = HNSWIndex::new(4, 16, 32)?; // dim, m, m_max
    for (id, v) in vectors.iter().enumerate() {
        index.add_slice(id as u32, v)?;
    }
    index.build()?;

    let results = index.search(&query, 2, 50)?; // k, ef_search
    println!("{results:?}");
    Ok(())
}
```
## The Problem
Given a query vector, find the k most similar vectors from a collection. Brute force computes all N distances — O(N) per query. For 1M vectors, that's 1M distance computations per query.
**ANN algorithms trade exactness for speed.** Instead of guaranteeing the true nearest neighbors, they find neighbors that are *probably* correct, *most* of the time.
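For reference, a brute-force baseline looks like the sketch below. It is illustrative, not part of `jin`'s API: exact top-k by dot product, which equals cosine similarity when vectors are L2-normalized.

```rust
/// Exact top-k by dot product over L2-normalized vectors (so dot
/// product equals cosine similarity). Illustrative sketch only.
fn brute_force_knn(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .map(|v| v.iter().zip(query).map(|(a, b)| a * b).sum::<f32>())
        .enumerate()
        .collect();
    // Highest similarity first; the sort is cheap next to the N dot products.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored.into_iter().map(|(i, _)| i).collect()
}

fn main() {
    let vs = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    assert_eq!(brute_force_knn(&vs, &[1.0, 0.0], 1), vec![0]);
}
```

A scan like this is also how ground truth is produced when measuring an ANN index's recall.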
## The Key Insight
HNSW (Hierarchical Navigable Small World) builds a graph where:
1. Each vector is a node
2. Edges connect similar vectors
3. Multiple layers provide "highway" shortcuts
Search starts at the top layer (sparse, long-range edges), descends through layers, and greedily follows edges toward the query.
```text
Layer 2: A -------- B (sparse, fast traversal)
| |
Layer 1: A -- C -- B -- D (medium density)
| | | |
Layer 0: A-E-C-F-B-G-D-H (dense, high recall)
```
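The greedy step on a single layer can be sketched as below; `graph` is a plain adjacency list and `dist` measures distance to the query. This shows the shape of the algorithm, not `jin`'s internal representation.

```rust
/// One layer of greedy descent: repeatedly move to the closest
/// neighbor, stopping at a local minimum. Illustrative sketch.
fn greedy_step(graph: &[Vec<usize>], dist: impl Fn(usize) -> f32, mut current: usize) -> usize {
    loop {
        // Closest neighbor of the current node, by distance to the query.
        let best = graph[current]
            .iter()
            .copied()
            .min_by(|&a, &b| dist(a).partial_cmp(&dist(b)).unwrap());
        match best {
            Some(n) if dist(n) < dist(current) => current = n, // keep descending
            _ => return current,                               // local minimum
        }
    }
}
```

Full HNSW runs this from the top layer down, seeding each layer with the previous layer's result, and at layer 0 widens the single node to a frontier of `ef_search` candidates.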
## Recall vs Speed Tradeoff
The `ef_search` parameter controls how many candidates HNSW explores:
<p align="center">
<img src="doc/plots/recall_vs_ef.png" width="720" alt="Recall vs ef_search" />
</p>
Higher `ef_search` = better recall, slower queries. For most applications, `ef_search=50-100` achieves >95% recall.
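Continuing the quick-start example (same `index` and `query`, inside a function that returns `Result`), a sweep looks like this; comparing each result set against a brute-force scan yields the recall at each setting:

```rust
// Sweep ef_search on an already-built index; k = 2 as in the quick-start.
for ef in [10, 25, 50, 100, 200] {
    let results = index.search(&query, 2, ef)?;
    println!("ef_search={ef}: {results:?}");
}
```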
## Dataset Difficulty
Not all datasets are equal. Recall depends on data characteristics:
<p align="center">
<img src="doc/plots/recall_vs_ef_by_difficulty.png" width="720" alt="Recall by difficulty" />
</p>
Based on He et al. (2012) and Radovanovic et al. (2010):
- **Relative Contrast**: $C_r = \bar{D} / D_{\min}$, the mean distance divided by the nearest-neighbor distance. Lower = harder (see the sketch after this list).
- **Hubness**: Some points become neighbors to many queries. Higher = harder.
- **Distance Concentration**: In high dims, distances converge. Lower variance = harder.
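A minimal sketch of relative contrast for a single query, straight from the formula above (in practice it is averaged over many queries; not part of `jin`'s API):

```rust
/// Relative contrast C_r = mean distance / nearest-neighbor distance,
/// computed from one query's distances to every database vector.
fn relative_contrast(dists_to_query: &[f32]) -> f32 {
    let mean = dists_to_query.iter().sum::<f32>() / dists_to_query.len() as f32;
    let min = dists_to_query.iter().copied().fold(f32::INFINITY, f32::min);
    mean / min // lower C_r = less contrast = harder dataset
}
```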
<p align="center">
<img src="doc/plots/difficulty_comparison.png" width="720" alt="Difficulty metrics" />
</p>
```sh
cargo run --example 03_quick_benchmark --release # bench (medium)
JIN_DATASET=hard cargo run --example 03_quick_benchmark --release # hard (stress test)
```
## Algorithm Comparison
Different algorithms suit different constraints:
<p align="center">
<img src="doc/plots/algorithm_comparison.png" width="720" alt="Algorithm comparison" />
</p>
| Algorithm | Best for | Tradeoff |
|---|---|---|
| **Brute force** | < 10K vectors | Exact, but O(N) |
| **LSH** | Binary/sparse data | Fast, lower recall |
| **IVF-PQ** | Memory-constrained | Compressed, lower recall |
| **HNSW** | General use | Strong recall/latency tradeoff |
## Build Cost
Graph construction time scales with `M` (edges per node):
<p align="center">
<img src="doc/plots/build_time_vs_m.png" width="720" alt="Build time vs M" />
</p>
Higher `M` = better recall, but more memory and slower builds.
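In code, `M` is the second constructor argument from the quick-start (`dim, m, m_max`, with `m_max = 2 * m` in that example); the values below are illustrative:

```rust
// Inside a function returning Result, as in the quick-start.
let fast_build = HNSWIndex::new(128, 8, 16)?;   // sparser graph: quicker build, lower recall
let high_recall = HNSWIndex::new(128, 32, 64)?; // denser graph: slower build, better recall
```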
## Memory Scaling
Memory usage scales linearly with vector count:
<p align="center">
<img src="doc/plots/memory_scaling.png" width="720" alt="Memory scaling" />
</p>
For dim=128, M=16: approximately 0.5 KB per vector (vector + graph edges).
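A back-of-envelope version of that figure, assuming `f32` components, `u32` neighbor ids, and roughly `2 * M` layer-0 edges per node (upper layers and allocator overhead ignored):

```rust
/// Rough per-vector memory estimate for an HNSW index.
fn approx_bytes_per_vector(dim: usize, m: usize) -> usize {
    let vector = dim * std::mem::size_of::<f32>();  // 128 * 4 = 512 B
    let edges = 2 * m * std::mem::size_of::<u32>(); // 32 * 4  = 128 B
    vector + edges // ~640 B for dim=128, M=16: the same ballpark
}
```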
## Algorithms
| Family | Algorithms |
|---|---|
| Graph | HNSW, NSW, Vamana (DiskANN), SNG |
| Hash | LSH, MinHash, SimHash |
| Partition | IVF-PQ, ScaNN |
| Quantization | PQ, RaBitQ |
## Features
```toml
[dependencies]
jin = { version = "0.1", features = ["hnsw"] }
```
- `hnsw` — HNSW graph index (default)
- `lsh` — Locality-Sensitive Hashing
- `ivf_pq` — Inverted File with Product Quantization
- `persistence` — WAL-based durability
## Performance
Build with native CPU optimizations:
```sh
RUSTFLAGS="-C target-cpu=native" cargo build --release
```
Run benchmarks:
```sh
cargo bench
```
See [examples/](examples/) for more: semantic search, IVF-PQ, LSH, LID, and real dataset benchmarks.
For benchmarking datasets, see [doc/datasets.md](doc/datasets.md) — covers bundled data, ann-benchmarks.com datasets, and typical embedding dimensions.
## References
- Malkov & Yashunin (2018). [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/abs/1603.09320)
- Subramanya et al. (2019). [DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)
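- He et al. (2012). On the Difficulty of Nearest Neighbor Search. ICML 2012.
- Radovanović et al. (2010). Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. JMLR 11.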