1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
//! # Random Projection for Graph Laplacian Preprocessing
//!
//! ## Baseline Choice
//!
//! This module uses **random projection** (Gaussian or sparse Achlioptas) as the default
//! dimensionality reduction technique, not full PCA/SVD. Random projection is linear,
//! parameter-light, and preserves distances sufficiently for k-NN graph construction,
//! which is exactly what the Laplacian stage relies on in this codebase.
//!
//! The **Johnson–Lindenstrauss lemma** guarantees that a target dimension
//! \( r = \mathcal{O}(\log N_c / \varepsilon^2) \) suffices to approximately preserve
//! pairwise distances among \( N_c \) centroids. This makes the k-NN neighborhood
//! structure and weights stable while reducing computation from \( O(N_c F) \) to
//! \( O(N_c r) \) per pass before the Laplacian build.
//!
//! ## An example of why JL Preserves Distances BETWEEN Points
//!
//! The Johnson-Lindenstrauss lemma says: "For **n points**, you need r = O(log(n)/ε²) dimensions
//! to preserve **pairwise distances between those n points**."
//!
//! The crucial part: **n is the number of points you're comparing**, not the original feature dimension!
//!
//! ## Your Specific Case
//!
//! ```ignore
//! Original data: 10,000 items × 384 features
//! After clustering: 17 centroids × 384 features
//! ```
//!
//! When you compute the Laplacian, you build a k-NN graph **among the 17 centroids**. This requires
//! computing distances between all pairs of centroids. The JL lemma says:
//!
//! - To preserve the **17×17 distance matrix** (136 pairwise distances)
//! - You need r = 8×ln(17)/ε² ≈ 91 dimensions (with ε=0.3)
//!
//! This is **independent** of:
//!
//! - The original 10,000 items (already compressed to 17 centroids)
//! - The 384 features (will be reduced to 91)
//!
//! ## Why Fewer Centroids = Fewer Dimensions Needed
//!
//! **Clustering reduces the problem size dramatically**:
//!
//! | Scenario | Points to Compare | JL Dimension (ε=0.3) |
//! |----------|-------------------|----------------------|
//! | All items | 10,000 | ~737 dims |
//! | After clustering | 17 centroids | **~91 dims** |
//! | More clustering | 100 centroids | ~368 dims |
//!
//! The logarithmic scaling means: **fewer centroids → exponentially fewer dimensions needed to
//! preserve their relationships**.
//!
//! ## Why This Makes Sense for Graph Construction
//!
//! The Laplacian is built from the **centroid graph structure**. As long as:
//!
//! 1. Distances between centroids are preserved
//! 2. k-NN relationships stay correct
//! 3. Edge weights remain accurate
//!
//! Then the Laplacian (and lambdas computed from it) will be correct, regardless of whether you're
//! in 384-dimensional or 91-dimensional space.
//!
//! ## Paper Guidance Applied
//!
//! ### Laplacian Eigenmaps
//!
//! Belkin and Niyogi demonstrate the importance of locality-preserving weights and show
//! that spectral embedding quality depends on neighborhood fidelity. Projecting
//! to a dimension \( r \) that preserves distances ensures the adjacency built in reduced
//! space respects local structure with far less computational cost, matching the algorithm's
//! locality emphasis without introducing new algorithmic branches.
//!
//! ### Laplacian-based Dimensionality Reduction
//!
//! The survey on spectral dimensionality reduction methods (arXiv:2106.02154) unifies
//! these approaches around graph Laplacians. For a low-code baseline, a linear
//! projection is acceptable upstream of the Laplacian as long as neighbor relations are
//! not distorted, which Johnson-Lindenstrauss-based projections provide.
//! This avoids bespoke kernel or solver choices while maintaining efficiency.
//!
//! ### Spectral Filtering
//!
//! NeurIPS 2021 work on spectral filtering mitigates high-dimensional effects by acting
//! in graph frequency domains; however, that approach requires graph eigensolvers.
//! Random projection upstream achieves similar speed gains for graph construction with
//! minimal code and no added hyperparameters, fitting the objective to reduce Laplacian
//! build time with minimal changes.
//!
//! ### When to prefer SVD later
//!
//! If future goals include best-possible variance capture or interpretability of reduced axes,
//! switch the projection helper to randomized PCA; it needs more code (QR, small SVD) and careful
//! rank selection, but can be swapped behind the same call site thanks to the identical
//! “centroids in → reduced features out → transpose → Laplacian” interface already used here.
//!
//! ### Characteristics
//!
//! 1. **Minimal code**: ~150 lines total including helpers and integration
//! 2. **JL-guaranteed distance preservation**: Uses standard Gaussian projection with proper scaling
//! 3. **Auto-tuning**: Computes target dimension from centroid count and epsilon
//! 4. **Clean integration**: Operates between clustering and Laplacian build with no other changes
//! 5. **Speedup**: Reduces k-NN cost from O(F) to O(r) where r = O(log N_c / ε²)
//! 6. **Optional sparse variant**: Achlioptas projection for 3-10x faster projection on high dimensions
//! 7. **Comprehensive tests**: Validates correctness, structure preservation, and performance
//!
//! The speedup comes from reducing the feature dimension before the expensive k-NN graph construction in
//! `build_laplacian_matrix`, which internally uses CosinePair for neighbor queries. Each distance computation
//! drops from O(F) to O(r), multiplied across all n_centroids × k operations.
use ;
use SeedableRng;
use ChaCha8Rng;
use ;
use *;
use ;
/// Compute optimal target dimension using Johnson-Lindenstrauss bound
///
/// For n points and tolerance ε, need r = O(log(n) / ε²) dimensions
/// to preserve pairwise distances within (1±ε) with high probability.
/// Helper: Project matrix using ImplicitProjection