ferrolearn_cluster/feature_agglomeration.rs
1//! Feature Agglomeration — hierarchical clustering of features.
2//!
3//! [`FeatureAgglomeration`] transposes the data and applies agglomerative
4//! clustering to the *features* (columns) rather than the samples (rows).
5//! After fitting, features within each cluster are pooled (by mean or max)
6//! to produce a reduced-dimensionality representation.
7//!
8//! This is useful for supervised dimensionality reduction: correlated
9//! features are grouped, and the group summary (e.g. mean) replaces
10//! the original features.
11//!
12//! # Algorithm
13//!
14//! 1. **Fit**: transpose `X`, run agglomerative clustering on the columns
15//! (treated as data points) to obtain `n_clusters` feature groups.
16//! 2. **Transform**: for each feature group, apply the pooling function
17//! (mean or max) across the grouped columns. The output has
18//! `n_clusters` columns.
19//!
20//! # Examples
21//!
22//! ```
23//! use ferrolearn_cluster::{FeatureAgglomeration, PoolingFunc};
24//! use ferrolearn_core::traits::{Fit, Transform};
25//! use ndarray::Array2;
26//!
27//! let x = Array2::from_shape_vec((4, 6), vec![
28//! 1.0, 1.1, 5.0, 5.1, 9.0, 9.1,
29//! 2.0, 2.1, 6.0, 6.1, 8.0, 8.1,
30//! 3.0, 3.1, 7.0, 7.1, 7.0, 7.1,
31//! 4.0, 4.1, 8.0, 8.1, 6.0, 6.1,
32//! ]).unwrap();
33//!
34//! let fa = FeatureAgglomeration::<f64>::new(3);
35//! let fitted = fa.fit(&x, &()).unwrap();
36//! let reduced = fitted.transform(&x).unwrap();
37//! assert_eq!(reduced.ncols(), 3);
38//! ```
39//!
40//! # `## REQ status`
41//!
42//! Binary classification (R-DEFER-2): two states only — SHIPPED needs impl + a
43//! non-test production consumer + green verification + symbol-anchor + sklearn
44//! `file:line`; NOT-STARTED carries the open prereq blocker. **`FeatureAgglomeration`
45//! has NO PyO3 binding** — `grep -rln FeatureAgglomeration ferrolearn-python/` is
46//! EMPTY (no `_RsFeatureAgglomeration`, no `ferrolearn.FeatureAgglomeration`). The
47//! non-test production consumer is therefore the crate re-export at the crate root
48//! (`pub use feature_agglomeration::{AgglomerativeLinkage, FeatureAgglomeration,
49//! FittedFeatureAgglomeration, PoolingFunc}` in `ferrolearn-cluster/src/lib.rs`),
50//! exposing `fit` / `transform` / `labels()` (+ `feature_labels()` /
51//! `children()` / `distances()` / `n_leaves()` / `n_connected_components()`).
52//! **End-to-end VALUE parity is now achieved (post-#938):** the AgglomerativeClustering
53//! unit ships bit-exact `_hc_cut` `labels_` (commit 3e001cf4b), and
54//! `FeatureAgglomeration::fit` delegates to `AgglomerativeClustering::fit(X.T)`
55//! (`_agglomerative.py:1339`), so the now-correct `labels_` flow through unchanged
56//! (NO fit-side relabel: the inner `labels_` are stored verbatim into
57//! `feature_labels_`). `feature_labels_`/`labels()` is integer-EXACT vs sklearn
58//! `FeatureAgglomeration(...).fit(X).labels_` for all four linkages, k∈{2,3}; and
59//! because `fn transform` groups by ASCENDING label index (it accumulates into
60//! `result[[i, label]]`, column j = pool over features with `label == j`), the
61//! `transform` output is now VALUE-EXACT and COLUMN-ORDERED vs sklearn (mean +
62//! max), `_feature_agglomeration.py:51-64`. Fitted attrs `children_`/`distances_`/
63//! `n_leaves_`/`n_connected_components_` are delegated from the inner fit over `X.T`.
64//! Green verification = the in-tree `feature_agglom` lib tests plus the live-sklearn
65//! pins/guards: `divergence_feature_agglom_min_features_two` (#944) +
66//! `green_feature_agglom_*` (`tests/divergence_feature_agglomeration.rs`) and the new
67//! VALUE tests (`tests/divergence_feature_agglomeration_value.rs`):
68//! `value_labels_exact_all_linkages_k2_k3`, `value_transform_mean_column_ordered`,
69//! `value_transform_max_column_ordered`, `value_fitted_attrs_delegated`. Cites use
70//! symbol anchors (ferrolearn) / `file:line` (sklearn 1.5.2, commit 156ef14). Live
71//! oracle = installed sklearn 1.5.2. (REQ numbering follows
72//! `.design/cluster/feature_agglomeration.md`.)
73//!
74//! | REQ | Status | Evidence |
75//! |---|---|---|
76//! | REQ-4 (validation guards: `ensure_min_features=2`; `n_clusters >= 1`; `n_features >= n_clusters`; non-empty `X` — SHAPE/validation contract only) | SHIPPED (validation + shape + pooling-as-set contract only) | `fn fit` for `FeatureAgglomeration` rejects a 1-feature `X` with `FerroError::InvalidParameter { name: "X", reason: "Found array with {n} feature(s) while a minimum of 2 is required by FeatureAgglomeration" }`, mirroring `self._validate_data(X, ensure_min_features=2)` (`_agglomerative.py:1338`); rejects `n_clusters == 0` per `Interval(Integral, 1, None, closed="left")` (`:1281`); rejects `n_features < n_clusters` ("Cannot extract more clusters than samples", from `_hc_cut`); rejects empty `X` with `FerroError::InsufficientSamples`. The `Transform` impl (`fn transform`) returns output of SHAPE `(n_samples, n_clusters)`. Non-test consumer: crate re-export of `fit` / `transform` / `labels()` (`lib.rs`). Verified: now-PASSING pin `divergence_feature_agglom_min_features_two` (#944): `new(1).fit(X_(4,1))` returns `Err` (sklearn `FeatureAgglomeration(n_clusters=1).fit(X_1col)` raises `ValueError`); green guards `green_feature_agglom_transform_shape`, `green_feature_agglom_mean_pooling_as_set`, `green_feature_agglom_n_clusters_zero_rejected`, `green_feature_agglom_too_many_clusters_rejected`. GAP (NOT-STARTED): the sklearn `ValueError`/`InvalidParameterError` error ABI is NOT matched — ferrolearn uses `FerroError`. |
77//! | REQ-1 (`transform` mean-pooling VALUE) | SHIPPED | `fn transform` (`PoolingFunc::Mean`) computes per-cluster mean (`sum/count`) into `result[[i, label]]`, column j ordered by ASCENDING label index, mirroring the `bincount` fast path of `AgglomerationTransform.transform` (`_feature_agglomeration.py:51-57`: `nX[i] = np.bincount(labels_, X[i,:]) / size`, column = label index 0..n_clusters-1). With the now-correct delegated `labels_` (#938 shipped, commit 3e001cf4b), the output is VALUE-EXACT and COLUMN-ORDERED vs sklearn. Non-test consumer: crate re-export (`lib.rs`); also consumed by `api_proof.rs`/`conformance_wave4.rs`. Verified: `value_transform_mean_column_ordered` — ferrolearn `transform(X)` == sklearn `FeatureAgglomeration(n_clusters=3, pooling_func=np.mean).fit(X).transform(X)` (full matrix, ~1e-9) for ward+single; sklearn row0 = `[1.05, 9.05, 5.05]`, ferrolearn row0 = `[1.05, 9.05, 5.05]`. |
78//! | REQ-2 (`transform` max-pooling VALUE) | SHIPPED | `fn transform` (`PoolingFunc::Max`) takes per-cluster max into `result[[i, label]]`, ascending label-index column order, mirroring the general pooling path (`_feature_agglomeration.py:58-63`: `[pooling_func(X[:, labels_==l], axis=1) for l in np.unique(labels_)]`, columns by sorted unique label = 0..n_clusters-1). VALUE-EXACT and COLUMN-ORDERED post-#938. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_transform_max_column_ordered` — full-matrix equality vs sklearn `pooling_func=np.max` (ward+complete); sklearn row0 = `[1.1, 9.1, 5.1]`, ferrolearn row0 = `[1.1, 9.1, 5.1]`. |
79//! | REQ-3 (`labels_` feature-cluster VALUE parity) | SHIPPED | `fn fit` delegates to `AgglomerativeClustering::new(n_clusters).with_linkage(...)` on `X.T` and stores `fitted_agg.labels_` verbatim into `feature_labels_` (NO post-relabel), mirroring `super()._fit(X.T)` (`_agglomerative.py:1339`). The AgglomerativeClustering unit now ships bit-exact `_hc_cut` `labels_` (`agglomerative.rs fn hc_cut`, `:731-775`; `np.searchsorted` numbering `:1105`; commit 3e001cf4b), so `feature_labels_`/`labels()` is integer-EXACT: sklearn `[0,0,2,2,1,1]` (k=3) / `[1,1,0,0,0,0]` (k=2) == ferrolearn. Accessor `fn labels` (sklearn name) returns the same data as `fn feature_labels`. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_labels_exact_all_linkages_k2_k3` — integer-exact for {ward,complete,average,single} × {2,3}. |
80//! | REQ-5 (linkage variants ward/complete/average/single) | SHIPPED | `fn map_linkage` maps all four `AgglomerativeLinkage` variants to `Linkage`, delegated to `AgglomerativeClustering`, mirroring `_TREE_BUILDERS` (`_agglomerative.py:720-725`/`:1290`). VALUE parity (labels_ + transform) holds across all four post-#938. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_labels_exact_all_linkages_k2_k3` (all four linkages) + `value_transform_mean_column_ordered` (ward+single) + `value_transform_max_column_ordered` (ward+complete). |
81//! | REQ-6 (`n_clusters=2` default + missing params metric/memory/connectivity/compute_full_tree/distance_threshold/compute_distances) | NOT-STARTED | open prereq blocker **#941**. sklearn `__init__` (`_agglomerative.py:1296-1319`) takes 9 params with `n_clusters=2` default. `FeatureAgglomeration<F>` REQUIRES `n_clusters` (`fn new`, no default) and has only `linkage`/`pooling_func`. `distance_threshold` (cut by distance with `n_clusters=None`, `:1281`/`:1091-1092`) is materially absent. |
82//! | REQ-7 (`pooling_func` as arbitrary callable) | NOT-STARTED | open prereq blocker **#941**. sklearn `_parameter_constraints["pooling_func"] = [callable]` (`_agglomerative.py:1291`) accepts any callable (default `np.mean`, `:1305`). ferrolearn offers only the closed `PoolingFunc::{Mean, Max}` enum (`fn with_pooling_func`). |
83//! | REQ-8 (`inverse_transform`) | SHIPPED | `FittedFeatureAgglomeration::inverse_transform` broadcasts each cluster's pooled value back to its member features: `result[i, f] = xred[i, labels_[f]]`, mirroring sklearn `AgglomerationTransform.inverse_transform` (`_feature_agglomeration.py:66-92`: `unil, inverse = np.unique(labels_, return_inverse=True); return X[..., inverse]` — `inverse == labels_` for the contiguous `_hc_cut` numbering). Output shape `(n_samples, n_features)`; accepts `xred.ncols() >= n_clusters` (ignoring trailing extra columns, matching sklearn's numpy fancy-index `X[..., inverse]`, #2187) and rejects `xred.ncols() < n_clusters` with `FerroError::ShapeMismatch` (sklearn raises `IndexError`). Non-test consumer: crate re-export (`lib.rs`). Verified: `value_inverse_transform_roundtrip_and_broadcast` (full-matrix ~1e-12, all four linkages) + `divergence_inverse_transform_extra_cols_ignored` (#2187: `(2,4)` xred with n_clusters=3 ignores the 4th column, matching sklearn). |
84//! | REQ-9 (fitted attrs labels_/n_leaves_/n_connected_components_/children_/distances_) | SHIPPED | `fn fit` stores the inner `AgglomerativeClustering::fit(X.T)` attributes into `FittedFeatureAgglomeration` (`children_`, `distances_`, `n_leaves_`, `n_connected_components_`), delegating exactly as sklearn `FeatureAgglomeration._fit` → `AgglomerativeClustering._fit(X.T)` (`_agglomerative.py:1339`, sets `labels_`/`n_clusters_`/`n_leaves_`/`n_connected_components_`/`children_`/`distances_` at `:1083-1095`). Accessors: `fn labels` (sklearn name, alias of `feature_labels`), `fn children`, `fn distances` (`Option`, `Some` iff `with_compute_distances(true)` — mirrors sklearn `compute_distances`, `:1319`), `fn n_leaves` (= `n_features`), `fn n_connected_components` (= 1 for the unstructured path). `with_compute_distances` passthrough sets the inner `compute_distances`. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_fitted_attrs_delegated` — `children_` == sklearn `FeatureAgglomeration(compute_distances=True).fit(X).children_`; `distances_` == sklearn `distances_` (~1e-9); `n_leaves_ == 6 == n_features`; `n_connected_components_ == 1`. |
85//! | REQ-10 (PyO3 binding) | NOT-STARTED | open prereq blocker **#943**. `grep -rln FeatureAgglomeration ferrolearn-python/` is EMPTY — no `_RsFeatureAgglomeration`, so `import ferrolearn` cannot reach `FeatureAgglomeration`. The only non-test consumer of `fit` / `transform` / `feature_labels()` is the crate re-export (`lib.rs`). |
86//! | REQ-11 (ferray substrate) | NOT-STARTED | open prereq blocker **#943**. `feature_agglomeration.rs` imports `ndarray::{Array1, Array2}` + `num_traits::Float`; the delegated `agglomerative.rs` is likewise on `ndarray`. Not migrated to `ferray-core` / `ferray::linalg` (R-SUBSTRATE-1/2). |
87
88use ferrolearn_core::error::FerroError;
89use ferrolearn_core::traits::{Fit, Transform};
90use ndarray::{Array1, Array2};
91use num_traits::Float;
92
93use crate::agglomerative::{AgglomerativeClustering, Linkage};
94
95// ─────────────────────────────────────────────────────────────────────────────
96// Public enums
97// ─────────────────────────────────────────────────────────────────────────────
98
99/// The linkage criterion used by [`FeatureAgglomeration`].
100///
101/// Re-uses the same linkage strategies as [`AgglomerativeClustering`].
102#[derive(Debug, Clone, Copy, PartialEq, Eq)]
103pub enum AgglomerativeLinkage {
104 /// Ward linkage (minimises within-cluster variance).
105 Ward,
106 /// Complete linkage (max pairwise distance).
107 Complete,
108 /// Average linkage (mean pairwise distance).
109 Average,
110 /// Single linkage (min pairwise distance).
111 Single,
112}
113
114/// The pooling function applied to grouped features during transformation.
115#[derive(Debug, Clone, Copy, PartialEq, Eq)]
116pub enum PoolingFunc {
117 /// Take the mean of features in each cluster.
118 Mean,
119 /// Take the maximum of features in each cluster.
120 Max,
121}
122
123// ─────────────────────────────────────────────────────────────────────────────
124// FeatureAgglomeration (unfitted)
125// ─────────────────────────────────────────────────────────────────────────────
126
127/// Feature Agglomeration configuration (unfitted).
128///
129/// Call [`Fit::fit`] to cluster the features and obtain a
130/// [`FittedFeatureAgglomeration`].
131///
132/// # Type Parameters
133///
134/// - `F`: floating-point scalar type (`f32` or `f64`).
135#[derive(Debug, Clone)]
136pub struct FeatureAgglomeration<F> {
137 /// Target number of feature clusters.
138 n_clusters: usize,
139 /// Linkage strategy for the agglomerative clustering of features.
140 linkage: AgglomerativeLinkage,
141 /// Pooling function applied during transformation.
142 pooling_func: PoolingFunc,
143 /// Whether to compute and store the per-merge `distances_` of the inner
144 /// dendrogram, mirroring sklearn `compute_distances` (default `false`,
145 /// `_agglomerative.py:1319`).
146 compute_distances: bool,
147 _marker: std::marker::PhantomData<F>,
148}
149
150impl<F: Float + Send + Sync + 'static> FeatureAgglomeration<F> {
151 /// Create a new `FeatureAgglomeration` that reduces to `n_clusters` features.
152 ///
153 /// Defaults: `linkage = Ward`, `pooling_func = Mean`, `compute_distances = false`.
154 #[must_use]
155 pub fn new(n_clusters: usize) -> Self {
156 Self {
157 n_clusters,
158 linkage: AgglomerativeLinkage::Ward,
159 pooling_func: PoolingFunc::Mean,
160 compute_distances: false,
161 _marker: std::marker::PhantomData,
162 }
163 }
164
165 /// Set the linkage criterion.
166 #[must_use]
167 pub fn with_linkage(mut self, linkage: AgglomerativeLinkage) -> Self {
168 self.linkage = linkage;
169 self
170 }
171
172 /// Set the pooling function.
173 #[must_use]
174 pub fn with_pooling_func(mut self, pooling: PoolingFunc) -> Self {
175 self.pooling_func = pooling;
176 self
177 }
178
179 /// Compute and store the per-merge linkage `distances_` of the inner
180 /// dendrogram (over the transposed feature matrix), mirroring sklearn
181 /// `FeatureAgglomeration(compute_distances=True)` (`_agglomerative.py:1319`).
182 ///
183 /// When set, [`FittedFeatureAgglomeration::distances`] returns `Some`.
184 #[must_use]
185 pub fn with_compute_distances(mut self, compute_distances: bool) -> Self {
186 self.compute_distances = compute_distances;
187 self
188 }
189
190 /// Return the configured number of feature clusters.
191 #[must_use]
192 pub fn n_clusters(&self) -> usize {
193 self.n_clusters
194 }
195
196 /// Return the configured linkage.
197 #[must_use]
198 pub fn linkage(&self) -> AgglomerativeLinkage {
199 self.linkage
200 }
201
202 /// Return the configured pooling function.
203 #[must_use]
204 pub fn pooling_func(&self) -> PoolingFunc {
205 self.pooling_func
206 }
207}
208
209// ─────────────────────────────────────────────────────────────────────────────
210// FittedFeatureAgglomeration
211// ─────────────────────────────────────────────────────────────────────────────
212
213/// Fitted Feature Agglomeration model.
214///
215/// Stores the cluster label for each original feature and the pooling
216/// strategy. Implements [`Transform`] to reduce the dimensionality of
217/// new data.
218#[derive(Debug, Clone)]
219pub struct FittedFeatureAgglomeration<F> {
220 /// Cluster label for each original feature, length `n_features`.
221 ///
222 /// This is sklearn's `labels_` attribute: it is exactly the inner
223 /// [`AgglomerativeClustering`]'s `labels_` over the transposed feature
224 /// matrix `X.T` (`_agglomerative.py:1339`, `FeatureAgglomeration._fit`
225 /// → `AgglomerativeClustering._fit(X.T)`).
226 feature_labels_: Array1<usize>,
227 /// Number of feature clusters.
228 n_clusters_: usize,
229 /// Number of original features.
230 n_features_: usize,
231 /// Pooling function to aggregate features within each cluster.
232 pooling_func_: PoolingFunc,
233 /// The merge history (dendrogram) of the inner clustering over `X.T`,
234 /// length `n_features - 1`. Mirrors sklearn `children_`
235 /// (`_agglomerative.py:1083-1095`, delegated from `AgglomerativeClustering`).
236 children_: Vec<(usize, usize)>,
237 /// The per-merge linkage distances of the inner clustering over `X.T`,
238 /// in `children_` row order. `Some` when `compute_distances` was set.
239 /// Mirrors sklearn `distances_`.
240 distances_: Option<Array1<F>>,
241 /// Number of leaves in the inner hierarchical tree (`== n_features` for the
242 /// unstructured path). Mirrors sklearn `n_leaves_`.
243 n_leaves_: usize,
244 /// Estimated number of connected components in the inner graph (always `1`
245 /// for the unstructured `connectivity=None` path). Mirrors sklearn
246 /// `n_connected_components_`.
247 n_connected_components_: usize,
248 _marker: std::marker::PhantomData<F>,
249}
250
251impl<F: Float + Send + Sync + 'static> FittedFeatureAgglomeration<F> {
252 /// Return the cluster label assigned to each feature.
253 ///
254 /// This is the sklearn name for the per-feature cluster assignment
255 /// (`labels_`). Identical data to [`feature_labels`](Self::feature_labels).
256 #[must_use]
257 pub fn labels(&self) -> &Array1<usize> {
258 &self.feature_labels_
259 }
260
261 /// Return the cluster label assigned to each feature.
262 ///
263 /// Alias of [`labels`](Self::labels) retained for backward compatibility;
264 /// sklearn names this attribute `labels_`.
265 #[must_use]
266 pub fn feature_labels(&self) -> &Array1<usize> {
267 &self.feature_labels_
268 }
269
270 /// Return the number of feature clusters.
271 #[must_use]
272 pub fn n_clusters(&self) -> usize {
273 self.n_clusters_
274 }
275
276 /// Return the number of original features.
277 #[must_use]
278 pub fn n_features(&self) -> usize {
279 self.n_features_
280 }
281
282 /// Return the merge history (dendrogram) of the inner clustering over `X.T`.
283 ///
284 /// Mirrors sklearn's `children_`, length `n_features - 1`. Delegated from
285 /// the inner [`AgglomerativeClustering`] fit (`_agglomerative.py:1339`).
286 #[must_use]
287 pub fn children(&self) -> &[(usize, usize)] {
288 &self.children_
289 }
290
291 /// Return the per-merge linkage distances of the inner dendrogram (in
292 /// `children_` row order), or `None` if `compute_distances` was not set.
293 ///
294 /// Mirrors sklearn's `distances_` attribute.
295 #[must_use]
296 pub fn distances(&self) -> Option<&Array1<F>> {
297 self.distances_.as_ref()
298 }
299
300 /// Return the number of leaves in the inner hierarchical tree
301 /// (`== n_features` for the unstructured path). Mirrors sklearn `n_leaves_`.
302 #[must_use]
303 pub fn n_leaves(&self) -> usize {
304 self.n_leaves_
305 }
306
307 /// Return the estimated number of connected components in the inner graph
308 /// (always `1` for the unstructured `connectivity=None` path). Mirrors
309 /// sklearn `n_connected_components_`.
310 #[must_use]
311 pub fn n_connected_components(&self) -> usize {
312 self.n_connected_components_
313 }
314}
315
316// ─────────────────────────────────────────────────────────────────────────────
317// Map between linkage types
318// ─────────────────────────────────────────────────────────────────────────────
319
320/// Convert [`AgglomerativeLinkage`] to the internal [`Linkage`] enum.
321fn map_linkage(l: AgglomerativeLinkage) -> Linkage {
322 match l {
323 AgglomerativeLinkage::Ward => Linkage::Ward,
324 AgglomerativeLinkage::Complete => Linkage::Complete,
325 AgglomerativeLinkage::Average => Linkage::Average,
326 AgglomerativeLinkage::Single => Linkage::Single,
327 }
328}
329
330// ─────────────────────────────────────────────────────────────────────────────
331// Trait impls
332// ─────────────────────────────────────────────────────────────────────────────
333
334impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for FeatureAgglomeration<F> {
335 type Fitted = FittedFeatureAgglomeration<F>;
336 type Error = FerroError;
337
338 /// Fit feature agglomeration by clustering the features (columns).
339 ///
340 /// Transposes `x` so each column becomes a row (data point) and
341 /// runs agglomerative clustering.
342 ///
343 /// # Errors
344 ///
345 /// - [`FerroError::InvalidParameter`] if `n_clusters == 0`.
346 /// - [`FerroError::InsufficientSamples`] if `n_features < n_clusters`.
347 fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedFeatureAgglomeration<F>, FerroError> {
348 let n_features = x.ncols();
349
350 if self.n_clusters == 0 {
351 return Err(FerroError::InvalidParameter {
352 name: "n_clusters".into(),
353 reason: "must be at least 1".into(),
354 });
355 }
356 if n_features < 2 {
357 return Err(FerroError::InvalidParameter {
358 name: "X".into(),
359 reason: format!(
360 "Found array with {n_features} feature(s) while a minimum of 2 is required by FeatureAgglomeration"
361 ),
362 });
363 }
364 if n_features < self.n_clusters {
365 return Err(FerroError::InvalidParameter {
366 name: "n_clusters".into(),
367 reason: format!(
368 "n_clusters ({}) exceeds n_features ({})",
369 self.n_clusters, n_features
370 ),
371 });
372 }
373 if x.nrows() == 0 {
374 return Err(FerroError::InsufficientSamples {
375 required: 1,
376 actual: 0,
377 context: "FeatureAgglomeration::fit requires at least 1 sample".into(),
378 });
379 }
380
381 // Transpose: each feature (column) becomes a row.
382 // Use as_standard_layout() to ensure row-major (C) order, which
383 // is required for AgglomerativeClustering's internal as_slice().
384 let x_t = x.t().as_standard_layout().into_owned();
385
386 // Run agglomerative clustering on the transposed data.
387 //
388 // sklearn `FeatureAgglomeration._fit` delegates verbatim to
389 // `AgglomerativeClustering._fit(X.T)` (`_agglomerative.py:1339`), so
390 // `labels_`/`children_`/`distances_`/`n_leaves_`/`n_connected_components_`
391 // are exactly the inner estimator's attributes over `X.T`.
392 let agg = AgglomerativeClustering::<F>::new(self.n_clusters)
393 .with_linkage(map_linkage(self.linkage))
394 .with_compute_distances(self.compute_distances);
395 let fitted_agg = agg.fit(&x_t, &())?;
396
397 Ok(FittedFeatureAgglomeration {
398 feature_labels_: fitted_agg.labels_,
399 n_clusters_: self.n_clusters,
400 n_features_: n_features,
401 pooling_func_: self.pooling_func,
402 children_: fitted_agg.children_,
403 distances_: fitted_agg.distances_,
404 n_leaves_: fitted_agg.n_leaves_,
405 n_connected_components_: fitted_agg.n_connected_components_,
406 _marker: std::marker::PhantomData,
407 })
408 }
409}
410
411impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedFeatureAgglomeration<F> {
412 type Output = Array2<F>;
413 type Error = FerroError;
414
415 /// Transform data by pooling features within each cluster.
416 ///
417 /// The output has shape `(n_samples, n_clusters)`.
418 ///
419 /// # Errors
420 ///
421 /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
422 /// match the number of features seen during fitting.
423 fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
424 if x.ncols() != self.n_features_ {
425 return Err(FerroError::ShapeMismatch {
426 expected: vec![x.nrows(), self.n_features_],
427 actual: vec![x.nrows(), x.ncols()],
428 context: "FittedFeatureAgglomeration::transform".into(),
429 });
430 }
431
432 let n_samples = x.nrows();
433 let mut result = Array2::<F>::zeros((n_samples, self.n_clusters_));
434
435 match self.pooling_func_ {
436 PoolingFunc::Mean => {
437 // Count features per cluster.
438 let mut counts = vec![0usize; self.n_clusters_];
439 for &label in &self.feature_labels_ {
440 counts[label] += 1;
441 }
442
443 // Sum features per cluster.
444 for i in 0..n_samples {
445 for (j, &label) in self.feature_labels_.iter().enumerate() {
446 result[[i, label]] = result[[i, label]] + x[[i, j]];
447 }
448 }
449
450 // Divide by count to get mean.
451 for i in 0..n_samples {
452 for c in 0..self.n_clusters_ {
453 if counts[c] > 0 {
454 result[[i, c]] = result[[i, c]] / F::from(counts[c]).unwrap();
455 }
456 }
457 }
458 }
459 PoolingFunc::Max => {
460 // Initialize with negative infinity.
461 result.fill(F::neg_infinity());
462
463 for i in 0..n_samples {
464 for (j, &label) in self.feature_labels_.iter().enumerate() {
465 if x[[i, j]] > result[[i, label]] {
466 result[[i, label]] = x[[i, j]];
467 }
468 }
469 }
470 }
471 }
472
473 Ok(result)
474 }
475}
476
477impl<F: Float + Send + Sync + 'static> FittedFeatureAgglomeration<F> {
478 /// Inverse the pooling transformation: broadcast each cluster's pooled
479 /// value back to every feature in that cluster.
480 ///
481 /// Mirrors sklearn `AgglomerationTransform.inverse_transform`
482 /// (`_feature_agglomeration.py:66-92`): `unil, inverse =
483 /// np.unique(labels_, return_inverse=True); return X[..., inverse]`. Since
484 /// `labels_` is the contiguous `_hc_cut` numbering `0..n_clusters`,
485 /// `inverse == labels_`, so output column `f` is reduced column
486 /// `labels_[f]`: `result[i, f] = xred[i, labels_[f]]`.
487 ///
488 /// `xred` has shape `(n_samples, n_clusters)`; the result has shape
489 /// `(n_samples, n_features)`.
490 ///
491 /// # Errors
492 ///
493 /// Returns [`FerroError::ShapeMismatch`] if `xred.ncols() < n_clusters`.
494 /// Mirroring sklearn's numpy fancy-index `X[..., inverse]` (which reads only
495 /// columns `0..n_clusters` and ignores any trailing columns, #2187), a wider
496 /// `xred` is accepted and the extra columns are ignored; too few columns is
497 /// the only error (sklearn raises `IndexError`).
498 pub fn inverse_transform(&self, xred: &Array2<F>) -> Result<Array2<F>, FerroError> {
499 if xred.ncols() < self.n_clusters_ {
500 return Err(FerroError::ShapeMismatch {
501 expected: vec![xred.nrows(), self.n_clusters_],
502 actual: vec![xred.nrows(), xred.ncols()],
503 context: "FittedFeatureAgglomeration::inverse_transform".into(),
504 });
505 }
506
507 let n_samples = xred.nrows();
508 let mut result = Array2::<F>::zeros((n_samples, self.n_features_));
509 for i in 0..n_samples {
510 for (f, &label) in self.feature_labels_.iter().enumerate() {
511 result[[i, f]] = xred[[i, label]];
512 }
513 }
514 Ok(result)
515 }
516}
517
518// ─────────────────────────────────────────────────────────────────────────────
519// Tests
520// ─────────────────────────────────────────────────────────────────────────────
521
522#[cfg(test)]
523mod tests {
524 use super::*;
525 use approx::assert_abs_diff_eq;
526
527 fn make_correlated_features() -> Array2<f64> {
528 // 6 features: (0,1) correlated, (2,3) correlated, (4,5) correlated.
529 Array2::from_shape_vec(
530 (5, 6),
531 vec![
532 1.0, 1.1, 5.0, 5.1, 9.0, 9.1, 2.0, 2.1, 6.0, 6.1, 8.0, 8.1, 3.0, 3.1, 7.0, 7.1,
533 7.0, 7.1, 4.0, 4.1, 8.0, 8.1, 6.0, 6.1, 5.0, 5.1, 9.0, 9.1, 5.0, 5.1,
534 ],
535 )
536 .unwrap()
537 }
538
539 #[test]
540 fn test_feature_agglom_basic() {
541 let x = make_correlated_features();
542 let fa = FeatureAgglomeration::<f64>::new(3);
543 let fitted = fa.fit(&x, &()).unwrap();
544 let reduced = fitted.transform(&x).unwrap();
545 assert_eq!(reduced.dim(), (5, 3));
546 }
547
548 #[test]
549 fn test_feature_agglom_output_shape() {
550 let x = make_correlated_features();
551 let fa = FeatureAgglomeration::<f64>::new(2);
552 let fitted = fa.fit(&x, &()).unwrap();
553 let reduced = fitted.transform(&x).unwrap();
554 assert_eq!(reduced.ncols(), 2);
555 assert_eq!(reduced.nrows(), 5);
556 }
557
558 #[test]
559 fn test_feature_agglom_labels_valid_range() {
560 let x = make_correlated_features();
561 let fa = FeatureAgglomeration::<f64>::new(3);
562 let fitted = fa.fit(&x, &()).unwrap();
563 for &l in fitted.feature_labels() {
564 assert!(l < 3, "label {l} out of range");
565 }
566 }
567
568 #[test]
569 fn test_feature_agglom_correlated_grouped() {
570 // With 4 features merging to 2 clusters, pairs of nearly-identical
571 // features should end up together. Use Single linkage to guarantee
572 // nearest-neighbor pairing.
573 let x = Array2::from_shape_vec(
574 (5, 4),
575 vec![
576 // feat 0 feat 1 feat 2 feat 3
577 1.0, 1.001, 100.0, 100.001, 2.0, 2.001, 90.0, 90.001, 3.0, 3.001, 80.0, 80.001, 4.0,
578 4.001, 70.0, 70.001, 5.0, 5.001, 60.0, 60.001,
579 ],
580 )
581 .unwrap();
582 let fa = FeatureAgglomeration::<f64>::new(2).with_linkage(AgglomerativeLinkage::Single);
583 let fitted = fa.fit(&x, &()).unwrap();
584 let labels = fitted.feature_labels();
585 // Features 0 and 1 are nearly identical, should be in the same cluster.
586 assert_eq!(labels[0], labels[1]);
587 // Features 2 and 3 are nearly identical.
588 assert_eq!(labels[2], labels[3]);
589 // The two pairs should be in different clusters.
590 assert_ne!(labels[0], labels[2]);
591 }
592
593 #[test]
594 fn test_feature_agglom_mean_pooling() {
595 // Simple case: two features that should be grouped.
596 let x = Array2::from_shape_vec((3, 2), vec![2.0, 4.0, 6.0, 8.0, 10.0, 12.0]).unwrap();
597 let fa = FeatureAgglomeration::<f64>::new(1);
598 let fitted = fa.fit(&x, &()).unwrap();
599 let reduced = fitted.transform(&x).unwrap();
600 assert_eq!(reduced.ncols(), 1);
601 // Mean of (2, 4) = 3, (6, 8) = 7, (10, 12) = 11.
602 assert_abs_diff_eq!(reduced[[0, 0]], 3.0, epsilon = 1e-10);
603 assert_abs_diff_eq!(reduced[[1, 0]], 7.0, epsilon = 1e-10);
604 assert_abs_diff_eq!(reduced[[2, 0]], 11.0, epsilon = 1e-10);
605 }
606
607 #[test]
608 fn test_feature_agglom_max_pooling() {
609 let x = Array2::from_shape_vec((3, 2), vec![2.0, 4.0, 6.0, 8.0, 10.0, 12.0]).unwrap();
610 let fa = FeatureAgglomeration::<f64>::new(1).with_pooling_func(PoolingFunc::Max);
611 let fitted = fa.fit(&x, &()).unwrap();
612 let reduced = fitted.transform(&x).unwrap();
613 assert_eq!(reduced.ncols(), 1);
614 // Max of (2, 4) = 4, (6, 8) = 8, (10, 12) = 12.
615 assert_abs_diff_eq!(reduced[[0, 0]], 4.0, epsilon = 1e-10);
616 assert_abs_diff_eq!(reduced[[1, 0]], 8.0, epsilon = 1e-10);
617 assert_abs_diff_eq!(reduced[[2, 0]], 12.0, epsilon = 1e-10);
618 }
619
620 #[test]
621 fn test_feature_agglom_complete_linkage() {
622 let x = make_correlated_features();
623 let fa = FeatureAgglomeration::<f64>::new(3).with_linkage(AgglomerativeLinkage::Complete);
624 let fitted = fa.fit(&x, &()).unwrap();
625 let reduced = fitted.transform(&x).unwrap();
626 assert_eq!(reduced.ncols(), 3);
627 }
628
629 #[test]
630 fn test_feature_agglom_average_linkage() {
631 let x = make_correlated_features();
632 let fa = FeatureAgglomeration::<f64>::new(3).with_linkage(AgglomerativeLinkage::Average);
633 let fitted = fa.fit(&x, &()).unwrap();
634 let reduced = fitted.transform(&x).unwrap();
635 assert_eq!(reduced.ncols(), 3);
636 }
637
638 #[test]
639 fn test_feature_agglom_single_linkage() {
640 let x = make_correlated_features();
641 let fa = FeatureAgglomeration::<f64>::new(3).with_linkage(AgglomerativeLinkage::Single);
642 let fitted = fa.fit(&x, &()).unwrap();
643 let reduced = fitted.transform(&x).unwrap();
644 assert_eq!(reduced.ncols(), 3);
645 }
646
647 #[test]
648 fn test_feature_agglom_n_clusters_equals_n_features() {
649 let x = make_correlated_features();
650 let fa = FeatureAgglomeration::<f64>::new(6);
651 let fitted = fa.fit(&x, &()).unwrap();
652 let reduced = fitted.transform(&x).unwrap();
653 // No reduction; each feature is its own cluster.
654 assert_eq!(reduced.ncols(), 6);
655 }
656
657 #[test]
658 fn test_feature_agglom_zero_clusters_error() {
659 let x = make_correlated_features();
660 let fa = FeatureAgglomeration::<f64>::new(0);
661 assert!(fa.fit(&x, &()).is_err());
662 }
663
664 #[test]
665 fn test_feature_agglom_too_many_clusters_error() {
666 let x = make_correlated_features();
667 let fa = FeatureAgglomeration::<f64>::new(10); // only 6 features
668 assert!(fa.fit(&x, &()).is_err());
669 }
670
671 #[test]
672 fn test_feature_agglom_empty_data_error() {
673 let x = Array2::<f64>::zeros((0, 4));
674 let fa = FeatureAgglomeration::<f64>::new(2);
675 assert!(fa.fit(&x, &()).is_err());
676 }
677
678 #[test]
679 fn test_feature_agglom_transform_shape_mismatch() {
680 let x = make_correlated_features();
681 let fa = FeatureAgglomeration::<f64>::new(3);
682 let fitted = fa.fit(&x, &()).unwrap();
683 let x_bad =
684 Array2::from_shape_vec((2, 4), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]).unwrap();
685 assert!(fitted.transform(&x_bad).is_err());
686 }
687
688 #[test]
689 fn test_feature_agglom_f32() {
690 let x = Array2::<f32>::from_shape_vec(
691 (4, 4),
692 vec![
693 1.0, 1.1, 5.0, 5.1, 2.0, 2.1, 6.0, 6.1, 3.0, 3.1, 7.0, 7.1, 4.0, 4.1, 8.0, 8.1,
694 ],
695 )
696 .unwrap();
697 let fa = FeatureAgglomeration::<f32>::new(2);
698 let fitted = fa.fit(&x, &()).unwrap();
699 let reduced = fitted.transform(&x).unwrap();
700 assert_eq!(reduced.ncols(), 2);
701 }
702
703 #[test]
704 fn test_feature_agglom_getters() {
705 let fa = FeatureAgglomeration::<f64>::new(3)
706 .with_linkage(AgglomerativeLinkage::Complete)
707 .with_pooling_func(PoolingFunc::Max);
708 assert_eq!(fa.n_clusters(), 3);
709 assert_eq!(fa.linkage(), AgglomerativeLinkage::Complete);
710 assert_eq!(fa.pooling_func(), PoolingFunc::Max);
711 }
712
713 #[test]
714 fn test_feature_agglom_n_features_getter() {
715 let x = make_correlated_features();
716 let fa = FeatureAgglomeration::<f64>::new(3);
717 let fitted = fa.fit(&x, &()).unwrap();
718 assert_eq!(fitted.n_features(), 6);
719 assert_eq!(fitted.n_clusters(), 3);
720 }
721}