jin/lib.rs

// Crate-level lint configuration
// Dead code is allowed since this is research code with partial implementations
#![allow(dead_code)]
// Allow unsafe operations in unsafe fn without explicit unsafe blocks
// (Rust 2024 edition strictness - this is a SIMD crate where unsafe is pervasive)
#![allow(unsafe_op_in_unsafe_fn)]

//! jin: Approximate Nearest Neighbor Search primitives.
//!
//! Provides standalone implementations of state-of-the-art ANN algorithms:
//!
//! - **Graph-based**: [`hnsw`], [`nsw`], [`sng`], [`vamana`]
//! - **Hash-based**: `hash` (LSH, MinHash, SimHash) — requires `lsh` feature
//! - **Partition-based**: [`ivf_pq`], [`scann`]
//! - **Quantization**: [`quantization`] (PQ, RaBitQ)
//!
//! # Which Index Should I Use?
//!
//! | Situation | Recommendation | Feature |
//! |-----------|----------------|---------|
//! | **General Purpose** (Best Recall/Speed) | [`hnsw::HNSWIndex`] | `hnsw` (default) |
//! | **Billion-Scale** (Memory Constrained) | [`ivf_pq::IVFPQIndex`] | `ivf_pq` |
//! | **Flat Graph** (Simpler, d > 32) | [`nsw::NSWIndex`] | `nsw` |
//! | **Attribute Filtering** | [`hnsw::filtered`] | `hnsw` |
//! | **Out-of-Core** (SSD-based) | [`diskann`] | `diskann` (experimental) |
//!
//! **Default features**: `hnsw`, `innr` (SIMD).
//!
//! ## Recommendation Logic
//!
//! 1. **Start with HNSW**. It's the industry standard for a reason. It offers the best
//!    trade-off between search speed and recall for datasets that fit in RAM.
//!
//! 2. **Use IVF-PQ** if your dataset is too large for RAM (e.g., > 10M vectors on a laptop).
//!    It compresses vectors (32x-64x) but has lower recall than HNSW.
//!
//! 3. **Use NSW (Flat)** if you specifically want to avoid the hierarchy overhead of HNSW,
//!    or if your dimensionality is high enough (d > 32) that the "small world" navigation
//!    benefit of hierarchy is diminished. *Note: HNSW is generally more robust.*
//!
//! 4. **Use DiskANN** (experimental) if you have an NVMe SSD and 1B+ vectors.
//!
//! ```toml
//! # Minimal (HNSW + SIMD)
//! jin = "0.1"
//!
//! # With quantization support
//! jin = { version = "0.1", features = ["ivf_pq"] }
//! ```
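//!
//! A minimal end-to-end sketch of the default path. The constructor signature, the
//! `Cosine` variant, and the `insert`/`search` method names below are hypothetical
//! placeholders (hence the `ignore` marker); see the [`hnsw`] module docs for the
//! actual API.
//!
//! ```ignore
//! use jin::DistanceMetric;
//!
//! // `vectors: Vec<Vec<f32>>` and `query: Vec<f32>` stand in for your own data.
//! // Method names below are illustrative only.
//! let mut index = jin::hnsw::HNSWIndex::new(128, DistanceMetric::Cosine);
//! for (id, vector) in vectors.iter().enumerate() {
//!     index.insert(id, vector);             // assumed insertion method
//! }
//! let neighbors = index.search(&query, 10); // assumed top-k search method
//! ```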
//!
//! # Critical Nuances & The HNSW Critique (2025)
//!
//! ## 1. The HNSW Dominance & Its Cracks
//! HNSW is the default because it's "good enough" for most workloads. But research
//! (2025-2026) highlights structural weaknesses:
//!
//! - **Local Minima**: HNSW's greedy search is prone to getting stuck in local optima,
//!   especially in clustered datasets. Newer graphs like **NSG** (Navigating Spreading-out
//!   Graph) and **Vamana** (DiskANN) use better edge selection (e.g., relative-neighborhood-graph
//!   pruning) to ensure angular diversity, allowing "glances" around obstacles
//!   (a greedy-search sketch follows at the end of this section).
//! - **Memory Bloat**: The hierarchical layers add 30-40% overhead. For d > 32,
//!   the hierarchy often provides negligible speedup over a well-constructed flat graph.
//! - **Hub Highway Hypothesis (2025)**: Research demonstrated that flat NSW graphs
//!   retain *all* benefits of HNSW on high-d data. Hub nodes form a "hub highway"
//!   that serves the same routing function as explicit hierarchy layers.
//! - **Construction Sensitivity**: HNSW quality depends heavily on insertion order.
//!   LID-based ordering (descending) can improve recall by 12%. Vamana's two-pass
//!   build (random graph -> refined) is more robust.
//!
//! **Verdict**: Use HNSW for RAM-based, low-latency search. Look at Vamana/DiskANN
//! for higher recall or SSD-resident data.
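//!
//! A standalone sketch (not the crate's internal routine) of the greedy descent these
//! graphs share: walk to whichever neighbor is closer to the query and stop when none
//! improves, which is exactly where a local minimum can trap the search if the edges
//! lack angular diversity.
//!
//! ```
//! fn l2(a: &[f32], b: &[f32]) -> f32 {
//!     a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
//! }
//!
//! // Greedy descent: repeatedly move to the neighbor closest to the query; stop as soon
//! // as no neighbor improves, i.e., at a local (not necessarily global) minimum.
//! fn greedy_search(graph: &[Vec<usize>], vectors: &[Vec<f32>], query: &[f32], entry: usize) -> usize {
//!     let mut current = entry;
//!     let mut current_dist = l2(&vectors[current], query);
//!     loop {
//!         let best = graph[current]
//!             .iter()
//!             .map(|&n| (n, l2(&vectors[n], query)))
//!             .min_by(|a, b| a.1.total_cmp(&b.1));
//!         match best {
//!             Some((n, d)) if d < current_dist => {
//!                 current = n;
//!                 current_dist = d;
//!             }
//!             _ => return current, // no closer neighbor: stuck (or converged)
//!         }
//!     }
//! }
//!
//! // Toy path graph 0 - 1 - 2: the query is nearest to node 2, entry point is node 0.
//! let vectors: Vec<Vec<f32>> = vec![vec![0.0], vec![1.0], vec![2.0]];
//! let graph: Vec<Vec<usize>> = vec![vec![1], vec![0, 2], vec![1]];
//! assert_eq!(greedy_search(&graph, &vectors, &[2.1], 0), 2);
//! ```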
//!
//! ## 2. The Hubness Problem
//!
//! In high-dimensional spaces, some vectors become **hubs**—appearing as
//! nearest neighbors to many other points, while **antihubs** rarely appear
//! in any neighbor lists. This creates asymmetric NN relationships.
//!
//! **Mitigation**: Local scaling, cosine over Euclidean, or dimensionality reduction.
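//!
//! A standalone sketch of how hubness is usually measured: the k-occurrence count
//! `N_k`, i.e., how often each point appears in the k-NN lists of the others. A heavily
//! right-skewed `N_k` distribution is the signature of hubness.
//!
//! ```
//! fn l2(a: &[f32], b: &[f32]) -> f32 {
//!     a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
//! }
//!
//! // k-occurrence N_k: count appearances of each point in everyone else's k-NN list.
//! fn k_occurrence(vectors: &[Vec<f32>], k: usize) -> Vec<usize> {
//!     let mut counts = vec![0usize; vectors.len()];
//!     for (i, q) in vectors.iter().enumerate() {
//!         let mut dists: Vec<(usize, f32)> = vectors
//!             .iter()
//!             .enumerate()
//!             .filter(|&(j, _)| j != i)
//!             .map(|(j, v)| (j, l2(q, v)))
//!             .collect();
//!         dists.sort_by(|a, b| a.1.total_cmp(&b.1));
//!         for &(j, _) in dists.iter().take(k) {
//!             counts[j] += 1;
//!         }
//!     }
//!     counts
//! }
//!
//! let data: Vec<Vec<f32>> = vec![
//!     vec![0.0, 0.0],
//!     vec![0.1, 0.0],
//!     vec![0.0, 0.1],
//!     vec![5.0, 5.0], // isolated point
//! ];
//! let n1 = k_occurrence(&data, 1);
//! assert_eq!(n1.iter().sum::<usize>(), data.len()); // every point casts exactly one vote
//! assert_eq!(n1[3], 0); // the isolated point is an "antihub": never anyone's nearest neighbor
//! ```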
//!
//! ## 3. Curse of Dimensionality
//!
//! For k-NN with a per-axis neighborhood radius ℓ = 0.1 in the unit hypercube, the
//! neighborhood covers only 0.1^d of the volume, so you need n ≈ k × 10^d samples.
//! For d > 100, this exceeds the number of atoms in the observable universe.
//!
//! **Why ANN works anyway**: Real data lies on low-dimensional manifolds.
//! Intrinsic dimensionality << embedding dimensionality.
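//!
//! A standalone sketch (not tied to the [`lid`] module's API) of one common
//! maximum-likelihood estimator of local intrinsic dimensionality, computed from the
//! ascending distances of a point to its k nearest neighbors: LID ≈ -k / Σ ln(r_i / r_k).
//!
//! ```
//! fn lid_mle(knn_dists: &[f64]) -> f64 {
//!     let k = knn_dists.len() as f64;
//!     // r_k is the distance to the farthest of the k neighbors (distances sorted ascending).
//!     let r_k = *knn_dists.last().expect("need at least one neighbor distance");
//!     let sum_log: f64 = knn_dists.iter().map(|&r| (r / r_k).ln()).sum();
//!     -k / sum_log
//! }
//!
//! // Neighbor distances roughly consistent with points spread over a 2-d patch:
//! // the estimate (about 2.5 here) is far below a typical embedding dimensionality.
//! let dists = [0.31_f64, 0.45, 0.55, 0.63, 0.71, 0.77, 0.84, 0.89, 0.95, 1.0];
//! let lid = lid_mle(&dists);
//! assert!(lid > 1.0 && lid < 5.0);
//! ```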
//!
//! ## 4. When Exact Search Beats Approximate
//!
//! - Small datasets (< 10K vectors): Brute force is faster (SIMD is powerful); see the sketch below.
//! - Very high recall requirements (> 99.9%): the search effort ANN needs approaches brute-force cost.
//! - Low intrinsic dimensionality: KD-trees can be exact and fast.
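//!
//! A standalone brute-force baseline (the sketch referenced above); on small collections
//! this is the bar any index has to clear.
//!
//! ```
//! // Exact k-NN by brute force: score every vector, keep the k smallest distances.
//! // The inner loop is trivially SIMD / auto-vectorization friendly.
//! fn exact_knn(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
//!     let mut scored: Vec<(usize, f32)> = data
//!         .iter()
//!         .enumerate()
//!         .map(|(i, v)| {
//!             let d: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
//!             (i, d)
//!         })
//!         .collect();
//!     scored.sort_by(|a, b| a.1.total_cmp(&b.1)); // O(n log n); a bounded heap gives O(n log k)
//!     scored.truncate(k);
//!     scored
//! }
//!
//! let data: Vec<Vec<f32>> = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![0.2, 0.1]];
//! let top = exact_knn(&data, &[0.0, 0.0], 2);
//! assert_eq!(top[0].0, 0); // exact match first
//! assert_eq!(top[1].0, 2); // then the next closest point
//! ```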
//!
//! ## 5. Quantization Trade-offs (2025-2026 Research)
//!
//! | Method | Compression | Recall | Query Speedup | Use Case |
//! |--------|-------------|--------|---------------|----------|
//! | **Binary** | 32x | ~96% | 24x | Extreme speed, lower accuracy OK |
//! | **Scalar (INT8)** | 4x | ~99.3% | 4x | Balance of speed and accuracy |
//! | **Product (PQ)** | 8-64x | ~95-98% | 8-32x | Memory-constrained billion-scale |
//! | **OPQ** | 8-64x | ~97-99% | 8-32x | Better than PQ via rotation |
//!
//! Binary quantization uses Hamming distance (2 CPU cycles per comparison).
//! Matryoshka embeddings enable dimension reduction before quantization for
//! further compression (e.g., 256x total with MRL + binary).
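//!
//! A standalone sketch of binary quantization and Hamming scoring (independent of the
//! [`quantization`] module's API): one sign bit per dimension, compared with XOR + popcount.
//!
//! ```
//! // Binary quantization: one sign bit per dimension, packed into u64 words.
//! fn binarize(v: &[f32]) -> Vec<u64> {
//!     let mut words = vec![0u64; (v.len() + 63) / 64];
//!     for (i, &x) in v.iter().enumerate() {
//!         if x > 0.0 {
//!             words[i / 64] |= 1u64 << (i % 64);
//!         }
//!     }
//!     words
//! }
//!
//! // Hamming distance between packed codes: XOR then popcount, a handful of
//! // instructions per 64 dimensions on modern CPUs.
//! fn hamming(a: &[u64], b: &[u64]) -> u32 {
//!     a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
//! }
//!
//! let a = binarize(&[0.3, -1.2, 0.7, 0.1]);
//! let b = binarize(&[0.5, 0.9, -0.4, 0.2]);
//! assert_eq!(hamming(&a, &b), 2); // dimensions 1 and 2 flipped sign
//! ```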
//!
//! ## 6. Future Directions (2025+ Research)
//!
//! Several promising research directions are not yet implemented:
//!
//! | Technique | Benefit | Reference |
//! |-----------|---------|-----------|
//! | **GPU Indexing** | 9.3x speedup, 3.75x cost reduction | OpenSearch cuVS 2025 |
//! | **Filtered ANNS** | Correctness guarantees for metadata filters | ACORN (SIGMOD 2024) |
//! | **Probabilistic Routing** | 1.6-2.5x throughput via PEOs | Theoretical ANN 2024 |
//! | **FreshDiskANN** | Real-time updates with 2-hop reconnection | Microsoft Research |
//! | **Learned Indexes** | RL/DL for routing decisions | CRINN, learned ANN |
//!
//! ### GPU Acceleration
//!
//! NVIDIA cuVS (via OpenSearch) showed 9.3x query speedup and 3.75x cost reduction
//! on billion-scale workloads. GPU acceleration is most beneficial for:
//! - Large batch queries (amortize kernel launch)
//! - High-dimensional data (d > 256)
//! - Index construction (embarrassingly parallel)
//!
//! ### Filtered ANNS (ACORN)
//!
//! Traditional approach: post-filter ANN results. Problem: may return < k results.
//! ACORN builds predicate-aware graphs with formal correctness guarantees.
//! Useful when: attribute filtering is common (e.g., "similar products in category X").
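//!
//! A standalone sketch of why post-filtering under-delivers:
//!
//! ```
//! // Post-filtering: take the ANN top-k candidates, then drop those failing the metadata
//! // predicate. With a selective predicate, fewer than k results survive; that gap is what
//! // predicate-aware graphs like ACORN are designed to close.
//! fn post_filter<F: Fn(usize) -> bool>(candidates: &[usize], keep: F, k: usize) -> Vec<usize> {
//!     candidates.iter().copied().filter(|&id| keep(id)).take(k).collect()
//! }
//!
//! // 10 candidates from the ANN index, but only even ids satisfy the filter.
//! let candidates: Vec<usize> = (0..10).collect();
//! let results = post_filter(&candidates, |id| id % 2 == 0, 8);
//! assert_eq!(results.len(), 5); // asked for k = 8, got only 5
//! ```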
//!
//! ### Probabilistic Routing
//!
//! Instead of deterministic greedy search, use learned probabilistic routing.
//! Position-Encoded Outputs (PEOs) achieve 1.6-2.5x throughput improvement
//! with theoretical guarantees on recall.

pub mod ann;
pub mod classic;
pub mod diskann;
pub mod evoc;
pub mod hnsw;
pub mod ivf_pq;
pub mod nsw;
pub mod quantization;
pub mod scann;
pub mod sng;
pub mod vamana;

pub mod adaptive;
pub mod matryoshka;
pub mod partitioning;

pub mod distance;
pub mod filtering;
pub mod lid;
pub mod pq_simd;
pub mod simd;

// Hash-based methods (LSH, MinHash, SimHash)
#[cfg(feature = "lsh")]
pub mod hash;

// Re-exports
pub use ann::traits::ANNIndex;
pub use distance::DistanceMetric;
pub use error::{Result, RetrieveError};

pub mod benchmark;
pub mod compression;
pub mod error;
pub mod persistence;
pub mod streaming;