# jin
Approximate nearest neighbor search in Rust.
Algorithms and benchmarks for vector search.
(jin: Chinese 近 "near")
Dual-licensed under MIT or Apache-2.0.
## Distance metrics (what jin actually does today)
Different index implementations in jin currently assume different distance semantics.
This is not yet uniform across the crate.
| Component | Metric | Notes |
|---|---|---|
| `hnsw::HNSWIndex` | cosine distance | Fast path assumes L2-normalized vectors |
| `ivf_pq::IVFPQIndex` | cosine distance | Uses dot-based cosine distance for IVF + PQ |
| `scann::SCANNIndex` | inner product / cosine | Uses dot products; reranking uses cosine distance |
| `hnsw::dual_branch::DualBranchHNSW` | L2 distance | Experimental implementation |
| `quantization` | Hamming-like / binary distances | See `quantization::simd_ops::hamming_distance` and ternary helpers |
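The fast paths above lean on a standard identity: once vectors are L2-normalized, cosine distance reduces to `1 - dot(a, b)`. A minimal sketch of that convention (plain Rust, not the crate's internal code):

```rust
/// L2-normalize a vector in place; returns false for the zero vector.
fn l2_normalize(v: &mut [f32]) -> bool {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return false;
    }
    v.iter_mut().for_each(|x| *x /= norm);
    true
}

/// Cosine distance for already-normalized vectors: 1 - dot(a, b).
fn cosine_distance_normalized(a: &[f32], b: &[f32]) -> f32 {
    1.0 - a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

fn main() {
    let mut a = vec![3.0_f32, 4.0];
    let mut b = vec![4.0_f32, 3.0];
    l2_normalize(&mut a);
    l2_normalize(&mut b);
    // dot = 0.96, so cosine distance = 0.04
    println!("{}", cosine_distance_normalized(&a, &b));
}
```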
```rust
use jin::hnsw::HNSWIndex;
```
## The Problem
Given a query vector, find the k most similar vectors from a collection. Brute force computes all N distances — O(N) per query. For 1M vectors, that's 1M distance computations per query.
ANN algorithms trade exactness for speed. Instead of guaranteeing the true nearest neighbors, they find neighbors that are probably correct, most of the time.
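For reference, the brute-force baseline fits in a few lines; this is the O(N) scan that ANN indexes try to avoid (a plain-Rust sketch, not the crate's API):

```rust
/// Brute-force k-nearest-neighbor search: O(N) distance computations per query.
/// Returns (index, squared L2 distance) for the k closest vectors.
fn brute_force_knn(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = data
        .iter()
        .enumerate()
        .map(|(i, v)| {
            let d: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            (i, d)
        })
        .collect();
    // Sorting all N candidates keeps the sketch short; a bounded heap of size k
    // is the usual optimization.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}

fn main() {
    let data = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![5.0, 5.0]];
    // Nearest two to (0.9, 1.1): index 1, then index 0.
    println!("{:?}", brute_force_knn(&data, &[0.9, 1.1], 2));
}
```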
## The Key Insight
HNSW (Hierarchical Navigable Small World) builds a graph where:
- Each vector is a node
- Edges connect similar vectors
- Multiple layers provide "highway" shortcuts
Search starts at the top layer (sparse, long-range edges), descends through layers, and greedily follows edges toward the query.
```
Layer 2: A ----------- B            (sparse, fast traversal)
         |             |
Layer 1: A --- C ----- B --- D      (medium density)
         |     |       |     |
Layer 0: A--E--C--F----B--G--D--H   (dense, high recall)
```
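A heavily simplified sketch of that descent, keeping a single candidate per layer (real HNSW keeps an `ef_search`-sized candidate frontier on the bottom layer; the layer layout and types here are illustrative, not jin's internals):

```rust
/// One adjacency list per layer: `layers[l][node]` lists the neighbors of
/// `node` on layer `l`. Layer 0 (index 0) is the densest; nodes that do not
/// appear on an upper layer simply have an empty neighbor list there.
type Layers = Vec<Vec<Vec<usize>>>;

fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Greedy descent: starting from the entry point on the top layer, hop to
/// whichever neighbor is closer to the query until none improves, then drop
/// down a layer and repeat. The node reached on layer 0 seeds the final search.
fn greedy_descent(vectors: &[Vec<f32>], layers: &Layers, entry: usize, query: &[f32]) -> usize {
    let mut current = entry;
    for layer in layers.iter().rev() {
        loop {
            let improved = layer[current]
                .iter()
                .copied()
                .filter(|&n| squared_l2(&vectors[n], query) < squared_l2(&vectors[current], query))
                .min_by(|&a, &b| {
                    squared_l2(&vectors[a], query).total_cmp(&squared_l2(&vectors[b], query))
                });
            match improved {
                Some(n) => current = n,
                None => break,
            }
        }
    }
    current
}
```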
## Recall vs Speed Tradeoff
The `ef_search` parameter controls how many candidates HNSW explores during a query. Higher `ef_search` means better recall but slower queries; for most applications, `ef_search` = 50-100 achieves >95% recall.
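A practical way to choose `ef_search` is to sweep it and measure recall against brute-force ground truth. Recall@k itself is just set overlap, independent of any particular index API:

```rust
use std::collections::HashSet;

/// Recall@k: the fraction of the true k nearest neighbors that the
/// approximate result actually recovered.
fn recall_at_k(approx: &[usize], exact: &[usize], k: usize) -> f64 {
    let truth: HashSet<usize> = exact.iter().take(k).copied().collect();
    let hits = approx.iter().take(k).filter(|id| truth.contains(*id)).count();
    hits as f64 / k as f64
}

fn main() {
    // Ground-truth ids from a brute-force pass vs. a hypothetical ANN result.
    let exact = [7, 3, 42, 9, 11];
    let approx = [7, 42, 3, 8, 11];
    println!("recall@5 = {}", recall_at_k(&approx, &exact, 5)); // 0.8
}
```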
## Dataset Difficulty
Not all datasets are equally hard; recall depends on data characteristics. The measures below follow He et al. (2012) and Radovanovic et al. (2010):
- Relative Contrast: $C_r = \bar{D} / D_{\min}$, the mean distance divided by the nearest-neighbor distance. Lower = harder (see the sketch below).
- Hubness: Some points become neighbors to many queries. Higher = harder.
- Distance Concentration: In high dims, distances converge. Lower variance = harder.
```bash
JIN_DATASET=hard
```
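As an illustration of the relative-contrast measure, a small standalone sketch (not part of jin): for each query, divide the mean distance to the dataset by the distance to its nearest point, then average over queries.

```rust
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Relative contrast C_r = (mean distance) / (nearest-neighbor distance),
/// averaged over queries. Values close to 1.0 mean the nearest neighbor barely
/// stands out from the rest of the dataset, i.e. the dataset is hard.
fn relative_contrast(data: &[Vec<f32>], queries: &[Vec<f32>]) -> f32 {
    let total: f32 = queries
        .iter()
        .map(|q| {
            let dists: Vec<f32> = data.iter().map(|v| l2(v, q)).collect();
            let mean = dists.iter().sum::<f32>() / dists.len() as f32;
            let min = dists.iter().copied().fold(f32::INFINITY, f32::min);
            mean / min.max(f32::EPSILON)
        })
        .sum();
    total / queries.len() as f32
}
```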
## Algorithm Comparison
Different algorithms suit different constraints:
| Algorithm | Best For | Tradeoff |
|---|---|---|
| Brute force | < 10K vectors | Exact, but O(N) |
| LSH | Binary/sparse data | Fast, lower recall |
| IVF-PQ | Memory-constrained | Compressed, lower recall |
| HNSW | General use | Strong recall/latency tradeoff |
## Build Cost
Graph construction time scales with M (the number of edges kept per node): higher M gives better recall, but costs more memory and slower builds.
## Memory Scaling
Memory usage scales linearly with vector count:
For dim=128, M=16: approximately 0.5 KB per vector (vector + graph edges).
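A back-of-the-envelope version of that estimate, assuming f32 components and u32 neighbor ids (many HNSW implementations keep up to 2·M edges on layer 0, which roughly doubles the edge term):

```rust
/// Rough per-vector memory for an HNSW-style index: the raw vector plus about
/// M neighbor ids on the base layer. Upper layers hold exponentially fewer
/// nodes, so their contribution is small on average.
fn approx_bytes_per_vector(dim: usize, m: usize) -> usize {
    let vector_bytes = dim * std::mem::size_of::<f32>(); // 128 * 4 = 512 bytes
    let edge_bytes = m * std::mem::size_of::<u32>(); // 16 * 4 = 64 bytes
    vector_bytes + edge_bytes
}

fn main() {
    // dim = 128, M = 16  =>  576 bytes, i.e. roughly 0.5 KB per vector.
    println!("{} bytes", approx_bytes_per_vector(128, 16));
}
```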
## Algorithms
| Type | Implementations |
|---|---|
| Graph | HNSW, NSW, Vamana (DiskANN), SNG |
| Hash | LSH, MinHash, SimHash |
| Partition | IVF-PQ, ScaNN |
| Quantization | PQ, RaBitQ |
## Features
```toml
[dependencies]
jin = { version = "0.1", features = ["hnsw"] }
```
- `hnsw` — HNSW graph index (default)
- `lsh` — Locality-Sensitive Hashing
- `ivf_pq` — Inverted File with Product Quantization
- `persistence` — WAL-based durability
## Performance
Build with native CPU optimizations:
```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release
```
Run benchmarks:
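Assuming the benches are ordinary Cargo benchmarks (the exact harness may differ):

```bash
cargo bench
```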
See `examples/` for more: semantic search, IVF-PQ, LSH, LID, and real dataset benchmarks.
For benchmarking datasets, see `doc/datasets.md` — covers bundled data, ann-benchmarks.com datasets, and typical embedding dimensions.
## References
- Malkov & Yashunin (2018). Efficient and robust approximate nearest neighbor search using HNSW graphs
- Subramanya et al. (2019). DiskANN: Fast Accurate Billion-point Nearest Neighbor Search