Crate jin

jin: Approximate Nearest Neighbor Search primitives.

Provides standalone implementations of state-of-the-art ANN algorithms, selectable via Cargo features.

§Which Index Should I Use?

| Situation | Recommendation | Feature |
|---|---|---|
| General Purpose (Best Recall/Speed) | hnsw::HNSWIndex | hnsw (default) |
| Billion-Scale (Memory Constrained) | ivf_pq::IVFPQIndex | ivf_pq |
| Flat Graph (Simpler, d > 32) | nsw::NSWIndex | nsw |
| Attribute Filtering | hnsw::filtered | hnsw |
| Out-of-Core (SSD-based) | diskann | diskann (experimental) |

Default features: hnsw, innr (SIMD).

§Recommendation Logic

  1. Start with HNSW. It’s the industry standard for a reason. It offers the best trade-off between search speed and recall for datasets that fit in RAM.

  2. Use IVF-PQ if your dataset is too large for RAM (e.g., > 10M vectors on a laptop). It compresses vectors (32x-64x) but has lower recall than HNSW.

  3. Use NSW (Flat) if you specifically want to avoid the hierarchy overhead of HNSW, or if your dimensionality is high enough (d > 32) that the “small world” navigation benefit of hierarchy is diminished. Note: HNSW is generally more robust.

  4. Use DiskANN (experimental) if you have an NVMe SSD and 1B+ vectors.

Add one of these to Cargo.toml:

```toml
# Minimal (HNSW + SIMD)
jin = "0.1"

# With quantization support
jin = { version = "0.1", features = ["ivf_pq"] }
```
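
With the default hnsw feature, a typical build-and-query flow looks roughly like the sketch below. The constructor and method names used here (HNSWIndex::new, insert, search) are illustrative assumptions, not the crate's confirmed API; consult the hnsw module and the ANNIndex trait for the actual signatures and parameters.

```rust
// Hypothetical usage sketch: constructor and method names are assumptions
// for illustration, not the verified jin API.
use jin::hnsw::HNSWIndex;

fn main() {
    // Build an index over 128-dimensional vectors (parameters illustrative).
    let mut index = HNSWIndex::new(128);

    // Insert a handful of vectors, each identified by a u64 id.
    for id in 0..1_000u64 {
        let v: Vec<f32> = (0..128u64).map(|i| ((id + i) % 7) as f32).collect();
        index.insert(id, &v);
    }

    // Fetch the 10 approximate nearest neighbors of a query vector.
    let query = vec![0.5_f32; 128];
    let neighbors = index.search(&query, 10);
    println!("{neighbors:?}");
}
```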

§Critical Nuances & The HNSW Critique (2025)

§1. The HNSW Dominance & Its Cracks

HNSW is the default because it’s “good enough” for most. But research (2025-2026) highlights structural weaknesses:

  • Local Minima: HNSW’s greedy search is prone to getting stuck in local optima, especially on clustered datasets. Newer graphs such as NSG (Navigating Spreading-out Graph) and Vamana (DiskANN) use better edge selection (e.g., RNG-style pruning) to ensure angular diversity among neighbors, letting the search route around obstacles; see the pruning sketch below.
  • Memory Bloat: The hierarchical layers add 30-40% overhead. For d > 32, the hierarchy often provides negligible speedup over a well-constructed flat graph.
  • Hub Highway Hypothesis (2025): Research demonstrated that flat NSW graphs retain all benefits of HNSW on high-d data. Hub nodes form a “hub highway” that serves the same routing function as explicit hierarchy layers.
  • Construction Sensitivity: HNSW quality depends heavily on insertion order. LID-based ordering (descending) can improve recall by 12%. Vamana’s two-pass build (random graph -> refined) is more robust.

Verdict: Use HNSW for RAM-based, low-latency search. Look at Vamana/DiskANN for higher recall or SSD-resident data.
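
To make the edge-selection point concrete, here is a generic sketch of the RNG-style (alpha-)pruning rule used by Vamana-family builders. It is an illustration of the technique, not this crate's vamana or sng implementation; the alpha parameter and the helper names are assumptions.

```rust
/// Euclidean distance between two vectors.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// RNG-style (alpha-)pruning: walk the candidates closest-first and keep a
/// candidate only if no already-kept neighbor "occludes" it, i.e.
/// alpha * dist(kept, cand) <= dist(node, cand). Larger alpha keeps more
/// (and more angularly diverse) edges.
fn prune_neighbors(
    node: &[f32],
    candidates: &[Vec<f32>],
    max_degree: usize,
    alpha: f32,
) -> Vec<usize> {
    // Candidate indices sorted by distance to the node, closest first.
    let mut order: Vec<usize> = (0..candidates.len()).collect();
    order.sort_by(|&i, &j| {
        dist(node, &candidates[i])
            .partial_cmp(&dist(node, &candidates[j]))
            .unwrap()
    });

    let mut kept: Vec<usize> = Vec::new();
    for &c in &order {
        if kept.len() >= max_degree {
            break;
        }
        let d_node_c = dist(node, &candidates[c]);
        // Drop `c` if some already-kept neighbor is much closer to it
        // than the node is: that neighbor already covers this direction.
        let occluded = kept
            .iter()
            .any(|&k| alpha * dist(&candidates[k], &candidates[c]) <= d_node_c);
        if !occluded {
            kept.push(c);
        }
    }
    kept
}
```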

§2. The Hubness Problem

In high-dimensional spaces, some vectors become hubs—appearing as nearest neighbors to many other points, while antihubs rarely appear in any neighbor lists. This creates asymmetric NN relationships.

Mitigation: Local scaling, cosine over Euclidean, or dimensionality reduction.
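
A quick diagnostic for hubness (illustrative only, not part of this crate's API) is the k-occurrence of each point: how often it appears in other points' k-NN lists. A heavily right-skewed count distribution means hubs are present.

```rust
/// Count how often each point appears in other points' k-NN lists
/// (the "k-occurrence"). Hubs have counts far above k; antihubs near zero.
fn k_occurrence(points: &[Vec<f32>], k: usize) -> Vec<usize> {
    let dist2 = |a: &[f32], b: &[f32]| -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    };

    let mut counts = vec![0usize; points.len()];
    for (i, p) in points.iter().enumerate() {
        // Brute-force k-NN of point i (excluding itself).
        let mut order: Vec<usize> = (0..points.len()).filter(|&j| j != i).collect();
        order.sort_by(|&a, &b| {
            dist2(p, &points[a]).partial_cmp(&dist2(p, &points[b])).unwrap()
        });
        for &j in order.iter().take(k) {
            counts[j] += 1;
        }
    }
    counts
}
```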

§3. Curse of Dimensionality

For k-NN with a neighborhood radius of ℓ = 0.1 per dimension (data in the unit cube), each query neighborhood covers only a 10^(-d) fraction of the space, so you need n ≈ k × 10^d samples. For d > 100, this exceeds the number of atoms in the observable universe.

Why ANN works anyway: Real data lies on low-dimensional manifolds. Intrinsic dimensionality << embedding dimensionality.
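
The arithmetic behind that sample-count estimate is easy to reproduce (plain Rust, nothing crate-specific):

```rust
fn main() {
    let k = 10.0_f64; // neighbors wanted
    let ell = 0.1_f64; // neighborhood radius per dimension
    for d in [2, 8, 32, 100] {
        // Each axis-aligned neighborhood covers ell^d of the unit cube,
        // so n ~= k / ell^d = k * 10^d samples are needed.
        let n = k * (1.0 / ell).powi(d);
        println!("d = {d:>3}: n ~= {n:.2e}");
    }
    // At d = 100, n ~= 1e101, far beyond ~1e80 atoms in the observable universe.
}
```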

§4. When Exact Search Beats Approximate

  • Small datasets (< 10K vectors): Brute force is faster (SIMD is powerful).
  • Very high recall requirements (> 99.9%): ANN overhead not worth it.
  • Low intrinsic dimensionality: KD-trees can be exact and fast.
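
For the small-dataset case, exact search is only a few lines; a generic brute-force sketch follows (the crate's simd module could be used to accelerate the inner distance loop, but this version is deliberately plain):

```rust
/// Exact k-NN by brute force: returns the indices of the k closest vectors.
/// For datasets under ~10K vectors this is often faster than any ANN index.
fn exact_knn(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let dist2 = |a: &[f32], b: &[f32]| -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
    };

    let mut order: Vec<usize> = (0..data.len()).collect();
    // Sort by squared distance (monotone in true distance, so ranking is exact).
    order.sort_by(|&i, &j| {
        dist2(&data[i], query).partial_cmp(&dist2(&data[j], query)).unwrap()
    });
    order.truncate(k);
    order
}
```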

§5. Quantization Trade-offs (2025-2026 Research)

| Method | Compression | Recall | Query Speedup | Use Case |
|---|---|---|---|---|
| Binary | 32x | ~96% | 24x | Extreme speed, lower accuracy OK |
| Scalar (INT8) | 4x | ~99.3% | 4x | Balance of speed and accuracy |
| Product (PQ) | 8-64x | ~95-98% | 8-32x | Memory-constrained billion-scale |
| OPQ | 8-64x | ~97-99% | 8-32x | Better than PQ via rotation |

Binary quantization uses Hamming distance (2 CPU cycles per comparison). Matryoshka embeddings enable dimension reduction before quantization for further compression (e.g., 256x total with MRL + binary).
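
To illustrate why binary quantization is so cheap (an illustrative sketch, independent of this crate's quantization module): each vector is reduced to one sign bit per dimension, and distances become Hamming distances computed with XOR and popcount over packed u64 words.

```rust
/// Pack a float vector into sign bits: bit i is set iff v[i] > 0.
fn binarize(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance between two packed codes (XOR + popcount per word).
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a: Vec<f32> = (0..128).map(|i| (i as f32 - 64.0).sin()).collect();
    let b: Vec<f32> = (0..128).map(|i| (i as f32 - 60.0).sin()).collect();
    // 128 f32s (512 bytes) become two u64 words (16 bytes): 32x compression.
    let (ca, cb) = (binarize(&a), binarize(&b));
    println!("hamming = {}", hamming(&ca, &cb));
}
```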

§6. Future Directions (2025+ Research)

Several promising research directions are not yet implemented:

| Technique | Benefit | Reference |
|---|---|---|
| GPU Indexing | 9.3x speedup, 3.75x cost reduction | OpenSearch cuVS 2025 |
| Filtered ANNS | Correctness guarantees for metadata filters | ACORN (SIGMOD 2024) |
| Probabilistic Routing | 1.6-2.5x throughput via PEOs | Theoretical ANN 2024 |
| FreshDiskANN | Real-time updates with 2-hop reconnection | Microsoft Research |
| Learned Indexes | RL/DL for routing decisions | CRINN, learned ANN |

§GPU Acceleration

NVIDIA cuVS (via OpenSearch) showed 9.3x query speedup and 3.75x cost reduction on billion-scale workloads. GPU acceleration is most beneficial for:

  • Large batch queries (amortize kernel launch)
  • High-dimensional data (d > 256)
  • Index construction (embarrassingly parallel)

§Filtered ANNS (ACORN)

Traditional approach: post-filter ANN results. Problem: may return < k results. ACORN builds predicate-aware graphs with formal correctness guarantees. Useful when: attribute filtering is common (e.g., “similar products in category X”).
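
The post-filtering shortfall is easy to see in code (a generic sketch; the filtering module addresses this differently): if only a small fraction of candidates satisfy the predicate, filtering k ANN results leaves far fewer than k matches.

```rust
/// Post-filtering: take the ANN candidates, drop those failing the predicate.
/// With a selective predicate the result can be much shorter than k, which is
/// exactly the shortfall predicate-aware graphs like ACORN are built to avoid.
fn post_filter<F>(ann_results: Vec<u64>, k: usize, predicate: F) -> Vec<u64>
where
    F: Fn(u64) -> bool,
{
    ann_results.into_iter().filter(|&id| predicate(id)).take(k).collect()
}

fn main() {
    // Pretend the ANN index returned these 10 candidate ids.
    let candidates: Vec<u64> = (0..10).collect();
    // Only ids divisible by 4 are in the requested category.
    let hits = post_filter(candidates, 10, |id| id % 4 == 0);
    println!("asked for 10, got {}", hits.len()); // asked for 10, got 3
}
```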

§Probabilistic Routing

Instead of deterministic greedy search, use learned probabilistic routing. Position-Encoded Outputs (PEOs) achieve 1.6-2.5x throughput improvement with theoretical guarantees on recall.

Re-exports§

pub use ann::traits::ANNIndex;
pub use distance::DistanceMetric;
pub use error::Result;
pub use error::RetrieveError;

Modules§

adaptive
Adaptive computation patterns for approximate nearest neighbor search.
ann
Unified Approximate Nearest Neighbor (ANN) search algorithms.
benchmark
Benchmark utilities for ANN evaluation.
classic
Classic ANN methods implementation.
compression
Lossless compression for vector IDs in ANN indexes.
diskann
DiskANN: Billion-scale ANN on a single machine with SSD.
distance
Distance metrics for dense vectors.
error
Error types for jin.
evoc
EVōC (Embedding Vector Oriented Clustering) implementation.
filtering
Filtering and lightweight faceting support for vector search.
hnsw
Hierarchical Navigable Small World (HNSW) approximate nearest neighbor search.
ivf_pq
IVF-PQ: Inverted File with Product Quantization.
lid
Local Intrinsic Dimensionality (LID) estimation.
matryoshka
Matryoshka Embedding Support
nsw
Flat Navigable Small World (NSW) graph.
partitioning
Partitioning/clustering interface for ANN methods.
persistence
Disk persistence for jin indexes.
pq_simd
SIMD-accelerated Product Quantization distance computation.
quantization
Vector quantization: compress vectors while preserving distance.
scann
ScaNN: Google’s algorithm for Maximum Inner Product Search (MIPS).
simd
Vector operations with SIMD acceleration.
sng
OPT-SNG: Auto-tuned Sparse Neighborhood Graph.
streaming
Streaming updates for vector indices.
vamana
Vamana approximate nearest neighbor search.