Crate jin

jin: Approximate Nearest Neighbor Search primitives.

Provides standalone implementations of state-of-the-art ANN algorithms, selectable via Cargo features.

§Which Index Should I Use?

| Situation | Recommendation | Feature |
|---|---|---|
| General Purpose (Best Recall/Speed) | hnsw::HNSWIndex | hnsw (default) |
| Billion-Scale (Memory Constrained) | ivf_pq::IVFPQIndex | ivf_pq |
| Flat Graph (Simpler, d > 32) | nsw::NSWIndex | nsw |
| Attribute Filtering | hnsw::filtered | hnsw |
| Out-of-Core (SSD-based) | diskann | diskann (experimental) |

Default features: hnsw, innr (SIMD).

§Recommendation Logic

  1. Start with HNSW. It’s the industry standard for a reason. It offers the best trade-off between search speed and recall for datasets that fit in RAM.

  2. Use IVF-PQ if your dataset is too large for RAM (e.g., > 10M vectors on a laptop). It compresses vectors (32x-64x) but has lower recall than HNSW.

  3. Use NSW (Flat) if you specifically want to avoid the hierarchy overhead of HNSW, or if your dimensionality is high enough (d > 32) that the “small world” navigation benefit of hierarchy is diminished. Note: HNSW is generally more robust.

  4. Use DiskANN (experimental) if you have an NVMe SSD and 1B+ vectors.

Add one of these to Cargo.toml:

```toml
# Minimal (HNSW + SIMD)
jin = "0.1"

# With quantization support
jin = { version = "0.1", features = ["ivf_pq"] }
```
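
With the default hnsw feature, a typical build-and-query flow looks roughly like the sketch below. The constructor and method names used here (HNSWIndex::new, insert, search) are illustrative assumptions, not the crate's confirmed API; consult the hnsw module and the ANNIndex trait for the actual signatures and parameters.

```rust
// Hypothetical usage sketch: constructor and method names are assumptions
// for illustration, not the verified jin API.
use jin::hnsw::HNSWIndex;

fn main() {
    // Build an index over 128-dimensional vectors (parameters illustrative).
    let mut index = HNSWIndex::new(128);

    // Insert a handful of vectors, each identified by a u64 id.
    for id in 0..1_000u64 {
        let v: Vec<f32> = (0..128u64).map(|i| ((id + i) % 7) as f32).collect();
        index.insert(id, &v);
    }

    // Fetch the 10 approximate nearest neighbors of a query vector.
    let query = vec![0.5_f32; 128];
    let neighbors = index.search(&query, 10);
    println!("{neighbors:?}");
}
```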

§Critical Nuances & The HNSW Critique (2025)

§1. The HNSW Dominance & Its Cracks

HNSW is the default because it’s “good enough” for most. But research (2025-2026) highlights structural weaknesses:

  • Local Minima: HNSW’s greedy search is prone to getting stuck in local optima, especially on clustered datasets. Newer graphs such as NSG (Navigating Spreading-out Graph) and Vamana (DiskANN) use better edge selection (e.g., RNG-style pruning) to ensure angular diversity among neighbors, letting the search route around obstacles; see the pruning sketch below.
  • Memory Bloat: The hierarchical layers add 30-40% overhead. For d > 32, the hierarchy often provides negligible speedup over a well-constructed flat graph.
  • Hub Highway Hypothesis (2025): Research demonstrated that flat NSW graphs retain all benefits of HNSW on high-d data. Hub nodes form a “hub highway” that serves the same routing function as explicit hierarchy layers.
  • Construction Sensitivity: HNSW quality depends heavily on insertion order. LID-based ordering (descending) can improve recall by 12%. Vamana’s two-pass build (random graph -> refined) is more robust.

Verdict: Use HNSW for RAM-based, low-latency search. Look at Vamana/DiskANN for higher recall or SSD-resident data.
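
To make the edge-selection point concrete, here is a generic sketch of the RNG-style (alpha-)pruning rule used by Vamana-family builders. It is an illustration of the technique, not this crate's vamana or sng implementation; the alpha parameter and the helper names are assumptions.

```rust
/// Euclidean distance between two vectors.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// RNG-style (alpha-)pruning: walk the candidates closest-first and keep a
/// candidate only if no already-kept neighbor "occludes" it, i.e.
/// alpha * dist(kept, cand) <= dist(node, cand). Larger alpha keeps more
/// (and more angularly diverse) edges.
fn prune_neighbors(
    node: &[f32],
    candidates: &[Vec<f32>],
    max_degree: usize,
    alpha: f32,
) -> Vec<usize> {
    // Candidate indices sorted by distance to the node, closest first.
    let mut order: Vec<usize> = (0..candidates.len()).collect();
    order.sort_by(|&i, &j| {
        dist(node, &candidates[i])
            .partial_cmp(&dist(node, &candidates[j]))
            .unwrap()
    });

    let mut kept: Vec<usize> = Vec::new();
    for &c in &order {
        if kept.len() >= max_degree {
            break;
        }
        let d_node_c = dist(node, &candidates[c]);
        // Drop `c` if some already-kept neighbor is much closer to it
        // than the node is: that neighbor already covers this direction.
        let occluded = kept
            .iter()
            .any(|&k| alpha * dist(&candidates[k], &candidates[c]) <= d_node_c);
        if !occluded {
            kept.push(c);
        }
    }
    kept
}
```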

§2. The Hubness Problem

In high-dimensional spaces, some vectors become hubs—appearing as nearest neighbors to many other points, while antihubs rarely appear in any neighbor lists. This creates asymmetric NN relationships.

Mitigation: Local scaling, cosine over Euclidean, or dimensionality reduction.
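
A quick diagnostic for hubness (illustrative only, not part of this crate's API) is the k-occurrence of each point: how often it appears in other points' k-NN lists. A heavily right-skewed count distribution means hubs are present.

```rust
/// Count how often each point appears in other points' k-NN lists
/// (the "k-occurrence"). Hubs have counts far above k; antihubs near zero.
fn k_occurrence(points: &[Vec<f32>], k: usize) -> Vec<usize> {
    let dist2 = |a: &[f32], b: &[f32]| -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    };

    let mut counts = vec![0usize; points.len()];
    for (i, p) in points.iter().enumerate() {
        // Brute-force k-NN of point i (excluding itself).
        let mut order: Vec<usize> = (0..points.len()).filter(|&j| j != i).collect();
        order.sort_by(|&a, &b| {
            dist2(p, &points[a]).partial_cmp(&dist2(p, &points[b])).unwrap()
        });
        for &j in order.iter().take(k) {
            counts[j] += 1;
        }
    }
    counts
}
```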

§3. Curse of Dimensionality

For k-NN with a neighborhood radius of ℓ = 0.1 per dimension (data in the unit cube), each query neighborhood covers only a 10^(-d) fraction of the space, so you need n ≈ k × 10^d samples. For d > 100, this exceeds the number of atoms in the observable universe.

Why ANN works anyway: Real data lies on low-dimensional manifolds. Intrinsic dimensionality << embedding dimensionality.
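
The arithmetic behind that sample-count estimate is easy to reproduce (plain Rust, nothing crate-specific):

```rust
fn main() {
    let k = 10.0_f64; // neighbors wanted
    let ell = 0.1_f64; // neighborhood radius per dimension
    for d in [2, 8, 32, 100] {
        // Each axis-aligned neighborhood covers ell^d of the unit cube,
        // so n ~= k / ell^d = k * 10^d samples are needed.
        let n = k * (1.0 / ell).powi(d);
        println!("d = {d:>3}: n ~= {n:.2e}");
    }
    // At d = 100, n ~= 1e101, far beyond ~1e80 atoms in the observable universe.
}
```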

§4. When Exact Search Beats Approximate

  • Small datasets (< 10K vectors): Brute force is faster (SIMD is powerful).
  • Very high recall requirements (> 99.9%): ANN overhead not worth it.
  • Low intrinsic dimensionality: KD-trees can be exact and fast.
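
For the small-dataset case, exact search is only a few lines; a generic brute-force sketch follows (the crate's simd module could be used to accelerate the inner distance loop, but this version is deliberately plain):

```rust
/// Exact k-NN by brute force: returns the indices of the k closest vectors.
/// For datasets under ~10K vectors this is often faster than any ANN index.
fn exact_knn(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let dist2 = |a: &[f32], b: &[f32]| -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    };

    let mut order: Vec<usize> = (0..data.len()).collect();
    // Sort by squared distance (monotone in true distance, so ranking is exact).
    order.sort_by(|&i, &j| {
        dist2(&data[i], query).partial_cmp(&dist2(&data[j], query)).unwrap()
    });
    order.truncate(k);
    order
}
```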

§5. Quantization Trade-offs (2025-2026 Research)

| Method | Compression | Recall | Query Speedup | Use Case |
|---|---|---|---|---|
| Binary | 32x | ~96% | 24x | Extreme speed, lower accuracy OK |
| Scalar (INT8) | 4x | ~99.3% | 4x | Balance of speed and accuracy |
| Product (PQ) | 8-64x | ~95-98% | 8-32x | Memory-constrained billion-scale |
| OPQ | 8-64x | ~97-99% | 8-32x | Better than PQ via rotation |

Binary quantization uses Hamming distance (2 CPU cycles per comparison). Matryoshka embeddings enable dimension reduction before quantization for further compression (e.g., 256x total with MRL + binary).
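
To illustrate why binary quantization is so cheap (an illustrative sketch, independent of this crate's quantization module): each vector is reduced to one sign bit per dimension, and distances become Hamming distances computed with XOR and popcount over packed u64 words.

```rust
/// Pack a float vector into sign bits: bit i is set iff v[i] > 0.
fn binarize(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance between two packed codes (XOR + popcount per word).
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a: Vec<f32> = (0..128).map(|i| (i as f32 - 64.0).sin()).collect();
    let b: Vec<f32> = (0..128).map(|i| (i as f32 - 60.0).sin()).collect();
    // 128 f32s (512 bytes) become two u64 words (16 bytes): 32x compression.
    let (ca, cb) = (binarize(&a), binarize(&b));
    println!("hamming = {}", hamming(&ca, &cb));
}
```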

§6. Future Directions (2025+ Research)

Several promising research directions are not yet implemented:

| Technique | Benefit | Reference |
|---|---|---|
| GPU Indexing | 9.3x speedup, 3.75x cost reduction | OpenSearch cuVS 2025 |
| Filtered ANNS | Correctness guarantees for metadata filters | ACORN (SIGMOD 2024) |
| Probabilistic Routing | 1.6-2.5x throughput via PEOs | Theoretical ANN 2024 |
| FreshDiskANN | Real-time updates with 2-hop reconnection | Microsoft Research |
| Learned Indexes | RL/DL for routing decisions | CRINN, learned ANN |

§GPU Acceleration

NVIDIA cuVS (via OpenSearch) showed 9.3x query speedup and 3.75x cost reduction on billion-scale workloads. GPU acceleration is most beneficial for:

  • Large batch queries (amortize kernel launch)
  • High-dimensional data (d > 256)
  • Index construction (embarrassingly parallel)

§Filtered ANNS (ACORN)

Traditional approach: post-filter ANN results. Problem: may return < k results. ACORN builds predicate-aware graphs with formal correctness guarantees. Useful when: attribute filtering is common (e.g., “similar products in category X”).
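
The post-filtering shortfall is easy to see in code (a generic sketch; the filtering module addresses this differently): if only a small fraction of candidates satisfy the predicate, filtering k ANN results leaves far fewer than k matches.

```rust
/// Post-filtering: take the ANN candidates, drop those failing the predicate.
/// With a selective predicate the result can be much shorter than k, which is
/// exactly the shortfall predicate-aware graphs like ACORN are built to avoid.
fn post_filter<F>(ann_results: Vec<u64>, k: usize, predicate: F) -> Vec<u64>
where
    F: Fn(u64) -> bool,
{
    ann_results.into_iter().filter(|&id| predicate(id)).take(k).collect()
}

fn main() {
    // Pretend the ANN index returned these 10 candidate ids.
    let candidates: Vec<u64> = (0..10).collect();
    // Only ids divisible by 4 are in the requested category.
    let hits = post_filter(candidates, 10, |id| id % 4 == 0);
    println!("asked for 10, got {}", hits.len()); // asked for 10, got 3
}
```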

§Probabilistic Routing

Instead of deterministic greedy search, use learned probabilistic routing. Position-Encoded Outputs (PEOs) achieve 1.6-2.5x throughput improvement with theoretical guarantees on recall.

Re-exports§

pub use ann::traits::ANNIndex;
pub use distance::DistanceMetric;
pub use error::Result;
pub use error::RetrieveError;

Modules§

adaptive
Adaptive computation patterns for approximate nearest neighbor search.
ann
Unified Approximate Nearest Neighbor (ANN) search algorithms.
benchmark
Benchmark utilities for ANN evaluation.
classic
Classic ANN methods implementation.
compression
Lossless compression for vector IDs in ANN indexes.
diskann
DiskANN: Billion-scale ANN on a single machine with SSD.
distance
Distance metrics for dense vectors.
error
Error types for jin.
evoc
EVōC (Embedding Vector Oriented Clustering) implementation.
filtering
Filtering and lightweight faceting support for vector search.
hnsw
Hierarchical Navigable Small World (HNSW) approximate nearest neighbor search.
ivf_pq
IVF-PQ: Inverted File with Product Quantization.
lid
Local Intrinsic Dimensionality (LID) estimation.
matryoshka
Matryoshka Embedding Support
nsw
Flat Navigable Small World (NSW) graph.
partitioning
Partitioning/clustering interface for ANN methods.
persistence
Disk persistence for jin indexes.
pq_simd
SIMD-accelerated Product Quantization distance computation.
quantization
Vector quantization: compress vectors while preserving distance.
scann
ScaNN: Google’s algorithm for Maximum Inner Product Search (MIPS).
simd
Vector operations with SIMD acceleration.
sng
OPT-SNG: Auto-tuned Sparse Neighborhood Graph.
streaming
Streaming updates for vector indices.
vamana
Vamana approximate nearest neighbor search.