1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
//! IVF-PQ: Inverted File with Product Quantization.
//!
//! The workhorse of billion-scale similarity search. **32x compression** with ~90% recall.
//!
//! # Feature Flag
//!
//! ```toml
//! jin = { version = "0.1", features = ["ivf_pq"] }
//! ```
//!
//! # Quick Start
//!
//! ```ignore
//! use jin::ivf_pq::{IVFPQIndex, IVFPQParams};
//!
//! let params = IVFPQParams {
//! num_clusters: 1024, // sqrt(n) rule of thumb
//! num_codebooks: 8, // M subvectors
//! codebook_size: 256, // 8-bit codes
//! nprobe: 10, // cells to search
//! };
//!
//! let mut index = IVFPQIndex::new(128, params)?;
//! index.train(&training_vectors)?; // Need ~10k-100k samples
//! index.add_batch(&database)?;
//!
//! let results = index.search(&query, 10)?;
//! ```
//!
//! # Memory Calculation
//!
//! ```text
//! Compressed: n × M bytes (M = num_codebooks)
//! Codebooks: M × 256 × (d/M) × 4 bytes
//! Centroids: k × d × 4 bytes
//!
//! Example: 1B vectors, d=128, M=8, k=4096
//! Vectors: 1B × 8 = 8 GB (vs 512 GB uncompressed!)
//! Codebooks: 8 × 256 × 16 × 4 = 128 KB
//! Centroids: 4096 × 128 × 4 = 2 MB
//! Total: ~8 GB
//! ```
//!
//! # Two Key Ideas
//!
//! ## 1. IVF: Partition and Prune
//!
//! Cluster vectors into k cells. At search time, only scan `nprobe` nearest cells:
//!
//! ```text
//! Brute force: O(n) → IVF: O(nprobe × n/k)
//!
//! Query
//! |
//! +-------+-------+
//! | |
//! Cell A Cell B (probe 2 cells)
//! [vectors] [vectors] (skip other 1022 cells)
//! ```
//!
//! ## 2. PQ: Compress Vectors
//!
//! Split vector into M subvectors. Quantize each to 256 codewords (8 bits):
//!
//! ```text
//! Original: [v₁ v₂ ... v₁₂₈] (512 bytes)
//! └─┬─┘ └─┬─┘
//! ↓ ↓
//! [c₁] [c₂] ... (8 bytes for M=8)
//! ```
//!
//! Distance computed via table lookup (precompute query-to-codebook distances).
//!
//! # Parameter Recommendations
//!
//! | Dataset Size | num_clusters | num_codebooks | nprobe |
//! |--------------|--------------|---------------|--------|
//! | 100K | 256 | 8 | 4 |
//! | 1M | 1024 | 8-16 | 8 |
//! | 10M | 4096 | 16 | 16 |
//! | 100M | 16384 | 16-32 | 32 |
//! | 1B | 65536 | 32-64 | 64 |
//!
//! **Rules of thumb**:
//! - `num_clusters` ≈ 4√n (slightly more aggressive than √n)
//! - `nprobe` ≈ 1-5% of clusters for 90%+ recall
//! - `num_codebooks` = d/16 is often good (d/M ≈ 8-16 dims per subvector)
//!
//! # Trade-offs
//!
//! | ↑ Parameter | Better | Worse |
//! |-------------|--------|-------|
//! | nprobe | Recall | Search latency |
//! | num_clusters | Search speed | Training time, accuracy at edges |
//! | num_codebooks | Accuracy | Memory, training time |
//!
//! # When to Use
//!
//! - Dataset **doesn't fit in RAM** (> 10M vectors on typical hardware)
//! - Can tolerate **85-95% recall** (vs 99%+ with HNSW)
//! - Need **sub-second search at billion scale**
//!
//! # When NOT to Use
//!
//! - Dataset fits in RAM → use HNSW (better recall)
//! - Need > 95% recall → use HNSW or exact search
//! - Can't provide training data → PQ codebooks need ~10k samples
//!
//! # OPQ: Optimized Product Quantization
//!
//! PQ assumes subspaces are independent. Real data has correlations.
//! OPQ learns a rotation matrix that decorrelates dimensions first,
//! improving recall by 10-30% at same memory.
//!
//! # References
//!
//! - Jégou, Douze, Schmid (2011). "Product Quantization for Nearest Neighbor Search."
//! - Ge et al. (2014). "Optimized Product Quantization."
// IVF-PQ core implementation (always available when ivf_pq feature is enabled)
pub use ;
// NOTE: Advanced IVF-PQ components (OPQ, pure IVF) are stubs awaiting implementation.
// The core IVFPQIndex in search.rs provides the main functionality.
// TODO: Implement full OPQ (rotation matrix learning) and separate IVF index