jin

Approximate nearest neighbor search in Rust.

Algorithms and benchmarks for vector search.

(jin: Chinese 近 "near")

Dual-licensed under MIT or Apache-2.0.

Distance metrics (what jin actually does today)

Different index implementations in jin currently assume different distance semantics. This is not yet uniform across the crate.

Component                          Metric                   Notes
hnsw::HNSWIndex                    cosine distance          Fast path assumes L2-normalized vectors
ivf_pq::IVFPQIndex                 cosine distance          Uses dot-based cosine distance for IVF + PQ
scann::SCANNIndex                  inner product / cosine   Uses dot products; reranking uses cosine distance
hnsw::dual_branch::DualBranchHNSW  L2 distance              Experimental implementation
quantization                       Hamming-like / binary    See quantization::simd_ops::hamming_distance and ternary helpers

For example, building and querying an HNSWIndex on L2-normalized vectors:
use jin::hnsw::HNSWIndex;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Vectors should be L2-normalized for cosine distance.
    let vectors: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0, 0.0, 0.0],
        vec![0.0, 1.0, 0.0, 0.0],
        vec![0.0, 0.0, 1.0, 0.0],
        vec![0.0, 0.0, 0.0, 1.0],
    ];

    let query = vec![1.0, 0.0, 0.0, 0.0];

    let mut index = HNSWIndex::new(4, 16, 32)?; // dim, m, m_max
    for (id, v) in vectors.iter().enumerate() {
        index.add_slice(id as u32, v)?;
    }
    index.build()?;

    let results = index.search(&query, 2, 50)?; // k, ef_search
    println!("{results:?}");
    Ok(())
}
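The vectors above are already unit-length. For arbitrary data, a plain helper like this (a sketch, not part of jin's API) keeps the cosine fast path valid:

/// L2-normalize a vector in place so cosine distance reduces to a dot
/// product, which is what the fast path above expects.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}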

The Problem

Given a query vector, find the k most similar vectors from a collection. Brute force computes all N distances — O(N) per query. For 1M vectors, that's 1M distance computations per query.

ANN algorithms trade exactness for speed. Instead of guaranteeing the true nearest neighbors, they find neighbors that are probably correct, most of the time.
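For reference, exact search is a one-function affair (a standalone sketch, not jin's API), and its cost is exactly the O(N) scan described above:

/// Exact top-k by scanning every vector: O(N * dim) work per query.
fn brute_force_top_k(vectors: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(id, v)| {
            // Squared L2 distance: same ranking as true L2, one sqrt cheaper.
            let d: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            (id, d)
        })
        .collect();
    // Assumes no NaNs in the data.
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored.truncate(k);
    scored
}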

The Key Insight

HNSW (Hierarchical Navigable Small World) builds a graph where:

  1. Each vector is a node
  2. Edges connect similar vectors
  3. Multiple layers provide "highway" shortcuts

Search starts at the top layer (sparse, long-range edges), descends through layers, and greedily follows edges toward the query.

Layer 2:  A ------- B           (sparse, fast traversal)
          |         |
Layer 1:  A -- C -- B -- D      (medium density)
          |    |    |    |
Layer 0:  A-E--C-F--B-G--D-H    (dense, high recall)
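The greedy hop at the heart of this traversal can be sketched in a few lines (an illustration, not jin's internal code), assuming a flat adjacency list and squared-L2 distance:

/// One greedy descent on a single graph layer: from `entry`, repeatedly
/// move to the neighbor closest to `query` until no neighbor improves.
fn greedy_step(
    graph: &[Vec<usize>],   // adjacency list: graph[node] = neighbor ids
    vectors: &[Vec<f32>],   // vectors[node] = that node's vector
    query: &[f32],
    entry: usize,
) -> usize {
    let dist = |id: usize| -> f32 {
        vectors[id].iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum()
    };
    let mut current = entry;
    let mut best = dist(current);
    loop {
        // Look one hop out for a neighbor closer to the query.
        let mut next = current;
        for &n in &graph[current] {
            let d = dist(n);
            if d < best {
                best = d;
                next = n;
            }
        }
        if next == current {
            return current; // local minimum on this layer
        }
        current = next;
    }
}

HNSW runs this descent on each sparse upper layer to find a good entry point, then widens to a beam of ef_search candidates on layer 0.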

Recall vs Speed Tradeoff

The ef_search parameter controls how many candidates HNSW keeps in its search frontier.

Higher ef_search = better recall, slower queries. For most applications, ef_search = 50-100 achieves >95% recall.
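With the quick-start index and query in scope, the tradeoff is easy to measure yourself (this assumes the search(&query, k, ef_search) signature shown above; the tiny 4-vector index is just for illustration):

// Sweep ef_search and time each query.
for ef in [10, 50, 100, 200] {
    let t = std::time::Instant::now();
    let results = index.search(&query, 2, ef)?; // same (k, ef_search) call as above
    println!("ef_search={ef}: {} results in {:?}", results.len(), t.elapsed());
}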

Dataset Difficulty

Not all datasets are equal. Recall depends on data characteristics:

Based on He et al. (2012) and Radovanovic et al. (2010):

  • Relative Contrast: $C_r = \bar{D} / D_{\min}$. Lower = harder (a rough estimator is sketched below).
  • Hubness: Some points become neighbors to many queries. Higher = harder.
  • Distance Concentration: In high dims, distances converge. Lower variance = harder.
Run the quick benchmark against datasets of different difficulty:

cargo run --example 03_quick_benchmark --release                   # bench (medium)
JIN_DATASET=hard cargo run --example 03_quick_benchmark --release  # hard (stress test)
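For intuition, relative contrast can be estimated with a few lines of plain Rust (a standalone sketch, not part of jin):

/// Estimate relative contrast C_r = mean(D) / min(D) for one query:
/// how much the average distance exceeds the nearest distance.
/// Values near 1.0 mean the nearest neighbor barely stands out (hard data).
/// Assumes the query itself is not in `vectors`, so min(D) > 0.
fn relative_contrast(vectors: &[Vec<f32>], query: &[f32]) -> f32 {
    let dists: Vec<f32> = vectors
        .iter()
        .map(|v| {
            v.iter()
                .zip(query)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>()
                .sqrt()
        })
        .collect();
    let mean = dists.iter().sum::<f32>() / dists.len() as f32;
    let min = dists.iter().cloned().fold(f32::INFINITY, f32::min);
    mean / min
}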

Algorithm Comparison

Different algorithms suit different constraints:

Algorithm    Best For            Tradeoff
Brute force  < 10K vectors       Exact, but O(N)
LSH          Binary/sparse data  Fast, lower recall
IVF-PQ       Memory-constrained  Compressed, lower recall
HNSW         General use         Strong recall/latency tradeoff

Build Cost

Graph construction time scales with M, the number of edges per node.

Higher M = better recall, but more memory and slower builds.
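Concretely, with the (dim, m, m_max) constructor from the quick-start, the knob is the middle argument (the m_max = 2×M choice here is a common convention, not a jin requirement):

// Same constructor as the quick-start: (dim, m, m_max).
// Doubling M roughly doubles edge memory and slows construction,
// in exchange for a denser, higher-recall graph.
let fast_build = HNSWIndex::new(128, 8, 16)?;    // M=8
let high_recall = HNSWIndex::new(128, 32, 64)?;  // M=32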

Memory Scaling

Memory usage scales linearly with vector count.

For dim=128, M=16: the raw f32 data alone is 128 × 4 B = 512 B (0.5 KB) per vector, plus the graph's edge lists on top.
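A back-of-envelope estimate, assuming u32 neighbor ids and m_max = 2 × M edges at layer 0 (assumptions for illustration, not measured numbers), lands near the figure above:

/// Rough bytes per vector: raw f32 data plus a layer-0 edge list.
/// Upper layers hold few nodes, so they are ignored here.
fn bytes_per_vector(dim: usize, m: usize) -> usize {
    let vector_bytes = dim * std::mem::size_of::<f32>(); // dim=128 -> 512 B
    let edge_bytes = 2 * m * std::mem::size_of::<u32>(); // m=16   -> 128 B
    vector_bytes + edge_bytes                            // -> ~640 B per vector
}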

Algorithms

Type          Implementations
Graph         HNSW, NSW, Vamana (DiskANN), SNG
Hash          LSH, MinHash, SimHash
Partition     IVF-PQ, ScaNN
Quantization  PQ, RaBitQ

Features

[dependencies]
jin = { version = "0.1", features = ["hnsw"] }

  • hnsw — HNSW graph index (default)
  • lsh — Locality-Sensitive Hashing
  • ivf_pq — Inverted File with Product Quantization
  • persistence — WAL-based durability
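Multiple features can be enabled together as with any Cargo feature flags (assuming, as is usual, that the features are additive):

[dependencies]
jin = { version = "0.1", features = ["hnsw", "ivf_pq", "persistence"] }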

Performance

Build with native CPU optimizations:

RUSTFLAGS="-C target-cpu=native" cargo build --release

Run benchmarks:

cargo bench

See examples/ for more: semantic search, IVF-PQ, LSH, LID, and real dataset benchmarks.

For benchmarking datasets, see doc/datasets.md — covers bundled data, ann-benchmarks.com datasets, and typical embedding dimensions.

References

  • He, J., Kumar, S., Chang, S.-F. On the Difficulty of Nearest Neighbor Search. ICML 2012.
  • Radovanović, M., Nanopoulos, A., Ivanović, M. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. JMLR 11 (2010).