ferrolearn_cluster/
feature_agglomeration.rs

1//! Feature Agglomeration — hierarchical clustering of features.
2//!
3//! [`FeatureAgglomeration`] transposes the data and applies agglomerative
4//! clustering to the *features* (columns) rather than the samples (rows).
5//! After fitting, features within each cluster are pooled (by mean or max)
6//! to produce a reduced-dimensionality representation.
7//!
8//! This is useful for supervised dimensionality reduction: correlated
9//! features are grouped, and the group summary (e.g. mean) replaces
10//! the original features.
11//!
12//! # Algorithm
13//!
14//! 1. **Fit**: transpose `X`, run agglomerative clustering on the columns
15//!    (treated as data points) to obtain `n_clusters` feature groups.
16//! 2. **Transform**: for each feature group, apply the pooling function
17//!    (mean or max) across the grouped columns.  The output has
18//!    `n_clusters` columns.
19//!
20//! # Examples
21//!
22//! ```
23//! use ferrolearn_cluster::{FeatureAgglomeration, PoolingFunc};
24//! use ferrolearn_core::traits::{Fit, Transform};
25//! use ndarray::Array2;
26//!
27//! let x = Array2::from_shape_vec((4, 6), vec![
28//!     1.0, 1.1, 5.0, 5.1, 9.0, 9.1,
29//!     2.0, 2.1, 6.0, 6.1, 8.0, 8.1,
30//!     3.0, 3.1, 7.0, 7.1, 7.0, 7.1,
31//!     4.0, 4.1, 8.0, 8.1, 6.0, 6.1,
32//! ]).unwrap();
33//!
34//! let fa = FeatureAgglomeration::<f64>::new(3);
35//! let fitted = fa.fit(&x, &()).unwrap();
36//! let reduced = fitted.transform(&x).unwrap();
37//! assert_eq!(reduced.ncols(), 3);
38//! ```
39//!
40//! # `## REQ status`
41//!
42//! Binary classification (R-DEFER-2): two states only — SHIPPED needs impl + a
43//! non-test production consumer + green verification + symbol-anchor + sklearn
44//! `file:line`; NOT-STARTED carries the open prereq blocker. **`FeatureAgglomeration`
45//! has NO PyO3 binding** — `grep -rln FeatureAgglomeration ferrolearn-python/` is
46//! EMPTY (no `_RsFeatureAgglomeration`, no `ferrolearn.FeatureAgglomeration`). The
47//! non-test production consumer is therefore the crate re-export at the crate root
48//! (`pub use feature_agglomeration::{AgglomerativeLinkage, FeatureAgglomeration,
49//! FittedFeatureAgglomeration, PoolingFunc}` in `ferrolearn-cluster/src/lib.rs`),
50//! exposing `fit` / `transform` / `labels()` (+ `feature_labels()` /
51//! `children()` / `distances()` / `n_leaves()` / `n_connected_components()`).
52//! **End-to-end VALUE parity is now achieved (post-#938):** the AgglomerativeClustering
53//! unit ships bit-exact `_hc_cut` `labels_` (commit 3e001cf4b), and
54//! `FeatureAgglomeration::fit` delegates to `AgglomerativeClustering::fit(X.T)`
55//! (`_agglomerative.py:1339`), so the now-correct `labels_` flow through unchanged
56//! (NO fit-side relabel: the inner `labels_` are stored verbatim into
57//! `feature_labels_`). `feature_labels_`/`labels()` is integer-EXACT vs sklearn
58//! `FeatureAgglomeration(...).fit(X).labels_` for all four linkages, k∈{2,3}; and
59//! because `fn transform` groups by ASCENDING label index (it accumulates into
60//! `result[[i, label]]`, column j = pool over features with `label == j`), the
61//! `transform` output is now VALUE-EXACT and COLUMN-ORDERED vs sklearn (mean +
62//! max), `_feature_agglomeration.py:51-64`. Fitted attrs `children_`/`distances_`/
63//! `n_leaves_`/`n_connected_components_` are delegated from the inner fit over `X.T`.
64//! Green verification = the in-tree `feature_agglom` lib tests plus the live-sklearn
65//! pins/guards: `divergence_feature_agglom_min_features_two` (#944) +
66//! `green_feature_agglom_*` (`tests/divergence_feature_agglomeration.rs`) and the new
67//! VALUE tests (`tests/divergence_feature_agglomeration_value.rs`):
68//! `value_labels_exact_all_linkages_k2_k3`, `value_transform_mean_column_ordered`,
69//! `value_transform_max_column_ordered`, `value_fitted_attrs_delegated`. Cites use
70//! symbol anchors (ferrolearn) / `file:line` (sklearn 1.5.2, commit 156ef14). Live
71//! oracle = installed sklearn 1.5.2. (REQ numbering follows
72//! `.design/cluster/feature_agglomeration.md`.)
73//!
74//! | REQ | Status | Evidence |
75//! |---|---|---|
76//! | REQ-4 (validation guards: `ensure_min_features=2`; `n_clusters >= 1`; `n_features >= n_clusters`; non-empty `X` — SHAPE/validation contract only) | SHIPPED (validation + shape + pooling-as-set contract only) | `fn fit` for `FeatureAgglomeration` rejects a 1-feature `X` with `FerroError::InvalidParameter { name: "X", reason: "Found array with {n} feature(s) while a minimum of 2 is required by FeatureAgglomeration" }`, mirroring `self._validate_data(X, ensure_min_features=2)` (`_agglomerative.py:1338`); rejects `n_clusters == 0` per `Interval(Integral, 1, None, closed="left")` (`:1281`); rejects `n_features < n_clusters` ("Cannot extract more clusters than samples", from `_hc_cut`); rejects empty `X` with `FerroError::InsufficientSamples`. The `Transform` impl (`fn transform`) returns output of SHAPE `(n_samples, n_clusters)`. Non-test consumer: crate re-export of `fit` / `transform` / `labels()` (`lib.rs`). Verified: now-PASSING pin `divergence_feature_agglom_min_features_two` (#944): `new(1).fit(X_(4,1))` returns `Err` (sklearn `FeatureAgglomeration(n_clusters=1).fit(X_1col)` raises `ValueError`); green guards `green_feature_agglom_transform_shape`, `green_feature_agglom_mean_pooling_as_set`, `green_feature_agglom_n_clusters_zero_rejected`, `green_feature_agglom_too_many_clusters_rejected`. GAP (NOT-STARTED): the sklearn `ValueError`/`InvalidParameterError` error ABI is NOT matched — ferrolearn uses `FerroError`. |
77//! | REQ-1 (`transform` mean-pooling VALUE) | SHIPPED | `fn transform` (`PoolingFunc::Mean`) computes per-cluster mean (`sum/count`) into `result[[i, label]]`, column j ordered by ASCENDING label index, mirroring the `bincount` fast path of `AgglomerationTransform.transform` (`_feature_agglomeration.py:51-57`: `nX[i] = np.bincount(labels_, X[i,:]) / size`, column = label index 0..n_clusters-1). With the now-correct delegated `labels_` (#938 shipped, commit 3e001cf4b), the output is VALUE-EXACT and COLUMN-ORDERED vs sklearn. Non-test consumer: crate re-export (`lib.rs`); also consumed by `api_proof.rs`/`conformance_wave4.rs`. Verified: `value_transform_mean_column_ordered` — ferrolearn `transform(X)` == sklearn `FeatureAgglomeration(n_clusters=3, pooling_func=np.mean).fit(X).transform(X)` (full matrix, ~1e-9) for ward+single; sklearn row0 = `[1.05, 9.05, 5.05]`, ferrolearn row0 = `[1.05, 9.05, 5.05]`. |
78//! | REQ-2 (`transform` max-pooling VALUE) | SHIPPED | `fn transform` (`PoolingFunc::Max`) takes per-cluster max into `result[[i, label]]`, ascending label-index column order, mirroring the general pooling path (`_feature_agglomeration.py:58-63`: `[pooling_func(X[:, labels_==l], axis=1) for l in np.unique(labels_)]`, columns by sorted unique label = 0..n_clusters-1). VALUE-EXACT and COLUMN-ORDERED post-#938. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_transform_max_column_ordered` — full-matrix equality vs sklearn `pooling_func=np.max` (ward+complete); sklearn row0 = `[1.1, 9.1, 5.1]`, ferrolearn row0 = `[1.1, 9.1, 5.1]`. |
79//! | REQ-3 (`labels_` feature-cluster VALUE parity) | SHIPPED | `fn fit` delegates to `AgglomerativeClustering::new(n_clusters).with_linkage(...)` on `X.T` and stores `fitted_agg.labels_` verbatim into `feature_labels_` (NO post-relabel), mirroring `super()._fit(X.T)` (`_agglomerative.py:1339`). The AgglomerativeClustering unit now ships bit-exact `_hc_cut` `labels_` (`agglomerative.rs fn hc_cut`, `:731-775`; `np.searchsorted` numbering `:1105`; commit 3e001cf4b), so `feature_labels_`/`labels()` is integer-EXACT: sklearn `[0,0,2,2,1,1]` (k=3) / `[1,1,0,0,0,0]` (k=2) == ferrolearn. Accessor `fn labels` (sklearn name) returns the same data as `fn feature_labels`. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_labels_exact_all_linkages_k2_k3` — integer-exact for {ward,complete,average,single} × {2,3}. |
80//! | REQ-5 (linkage variants ward/complete/average/single) | SHIPPED | `fn map_linkage` maps all four `AgglomerativeLinkage` variants to `Linkage`, delegated to `AgglomerativeClustering`, mirroring `_TREE_BUILDERS` (`_agglomerative.py:720-725`/`:1290`). VALUE parity (labels_ + transform) holds across all four post-#938. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_labels_exact_all_linkages_k2_k3` (all four linkages) + `value_transform_mean_column_ordered` (ward+single) + `value_transform_max_column_ordered` (ward+complete). |
81//! | REQ-6 (`n_clusters=2` default + missing params metric/memory/connectivity/compute_full_tree/distance_threshold/compute_distances) | NOT-STARTED | open prereq blocker **#941**. sklearn `__init__` (`_agglomerative.py:1296-1319`) takes 9 params with `n_clusters=2` default. `FeatureAgglomeration<F>` REQUIRES `n_clusters` (`fn new`, no default) and has only `linkage`/`pooling_func`. `distance_threshold` (cut by distance with `n_clusters=None`, `:1281`/`:1091-1092`) is materially absent. |
82//! | REQ-7 (`pooling_func` as arbitrary callable) | NOT-STARTED | open prereq blocker **#941**. sklearn `_parameter_constraints["pooling_func"] = [callable]` (`_agglomerative.py:1291`) accepts any callable (default `np.mean`, `:1305`). ferrolearn offers only the closed `PoolingFunc::{Mean, Max}` enum (`fn with_pooling_func`). |
83//! | REQ-8 (`inverse_transform`) | SHIPPED | `FittedFeatureAgglomeration::inverse_transform` broadcasts each cluster's pooled value back to its member features: `result[i, f] = xred[i, labels_[f]]`, mirroring sklearn `AgglomerationTransform.inverse_transform` (`_feature_agglomeration.py:66-92`: `unil, inverse = np.unique(labels_, return_inverse=True); return X[..., inverse]` — `inverse == labels_` for the contiguous `_hc_cut` numbering). Output shape `(n_samples, n_features)`; accepts `xred.ncols() >= n_clusters` (ignoring trailing extra columns, matching sklearn's numpy fancy-index `X[..., inverse]`, #2187) and rejects `xred.ncols() < n_clusters` with `FerroError::ShapeMismatch` (sklearn raises `IndexError`). Non-test consumer: crate re-export (`lib.rs`). Verified: `value_inverse_transform_roundtrip_and_broadcast` (full-matrix ~1e-12, all four linkages) + `divergence_inverse_transform_extra_cols_ignored` (#2187: `(2,4)` xred with n_clusters=3 ignores the 4th column, matching sklearn). |
84//! | REQ-9 (fitted attrs labels_/n_leaves_/n_connected_components_/children_/distances_) | SHIPPED | `fn fit` stores the inner `AgglomerativeClustering::fit(X.T)` attributes into `FittedFeatureAgglomeration` (`children_`, `distances_`, `n_leaves_`, `n_connected_components_`), delegating exactly as sklearn `FeatureAgglomeration._fit` → `AgglomerativeClustering._fit(X.T)` (`_agglomerative.py:1339`, sets `labels_`/`n_clusters_`/`n_leaves_`/`n_connected_components_`/`children_`/`distances_` at `:1083-1095`). Accessors: `fn labels` (sklearn name, alias of `feature_labels`), `fn children`, `fn distances` (`Option`, `Some` iff `with_compute_distances(true)` — mirrors sklearn `compute_distances`, `:1319`), `fn n_leaves` (= `n_features`), `fn n_connected_components` (= 1 for the unstructured path). `with_compute_distances` passthrough sets the inner `compute_distances`. Non-test consumer: crate re-export (`lib.rs`). Verified: `value_fitted_attrs_delegated` — `children_` == sklearn `FeatureAgglomeration(compute_distances=True).fit(X).children_`; `distances_` == sklearn `distances_` (~1e-9); `n_leaves_ == 6 == n_features`; `n_connected_components_ == 1`. |
85//! | REQ-10 (PyO3 binding) | NOT-STARTED | open prereq blocker **#943**. `grep -rln FeatureAgglomeration ferrolearn-python/` is EMPTY — no `_RsFeatureAgglomeration`, so `import ferrolearn` cannot reach `FeatureAgglomeration`. The only non-test consumer of `fit` / `transform` / `feature_labels()` is the crate re-export (`lib.rs`). |
86//! | REQ-11 (ferray substrate) | NOT-STARTED | open prereq blocker **#943**. `feature_agglomeration.rs` imports `ndarray::{Array1, Array2}` + `num_traits::Float`; the delegated `agglomerative.rs` is likewise on `ndarray`. Not migrated to `ferray-core` / `ferray::linalg` (R-SUBSTRATE-1/2). |
87
88use ferrolearn_core::error::FerroError;
89use ferrolearn_core::traits::{Fit, Transform};
90use ndarray::{Array1, Array2};
91use num_traits::Float;
92
93use crate::agglomerative::{AgglomerativeClustering, Linkage};
94
95// ─────────────────────────────────────────────────────────────────────────────
96// Public enums
97// ─────────────────────────────────────────────────────────────────────────────
98
99/// The linkage criterion used by [`FeatureAgglomeration`].
100///
101/// Re-uses the same linkage strategies as [`AgglomerativeClustering`].
102#[derive(Debug, Clone, Copy, PartialEq, Eq)]
103pub enum AgglomerativeLinkage {
104    /// Ward linkage (minimises within-cluster variance).
105    Ward,
106    /// Complete linkage (max pairwise distance).
107    Complete,
108    /// Average linkage (mean pairwise distance).
109    Average,
110    /// Single linkage (min pairwise distance).
111    Single,
112}
113
114/// The pooling function applied to grouped features during transformation.
115#[derive(Debug, Clone, Copy, PartialEq, Eq)]
116pub enum PoolingFunc {
117    /// Take the mean of features in each cluster.
118    Mean,
119    /// Take the maximum of features in each cluster.
120    Max,
121}
122
123// ─────────────────────────────────────────────────────────────────────────────
124// FeatureAgglomeration (unfitted)
125// ─────────────────────────────────────────────────────────────────────────────
126
127/// Feature Agglomeration configuration (unfitted).
128///
129/// Call [`Fit::fit`] to cluster the features and obtain a
130/// [`FittedFeatureAgglomeration`].
131///
132/// # Type Parameters
133///
134/// - `F`: floating-point scalar type (`f32` or `f64`).
135#[derive(Debug, Clone)]
136pub struct FeatureAgglomeration<F> {
137    /// Target number of feature clusters.
138    n_clusters: usize,
139    /// Linkage strategy for the agglomerative clustering of features.
140    linkage: AgglomerativeLinkage,
141    /// Pooling function applied during transformation.
142    pooling_func: PoolingFunc,
143    /// Whether to compute and store the per-merge `distances_` of the inner
144    /// dendrogram, mirroring sklearn `compute_distances` (default `false`,
145    /// `_agglomerative.py:1319`).
146    compute_distances: bool,
147    _marker: std::marker::PhantomData<F>,
148}
149
150impl<F: Float + Send + Sync + 'static> FeatureAgglomeration<F> {
151    /// Create a new `FeatureAgglomeration` that reduces to `n_clusters` features.
152    ///
153    /// Defaults: `linkage = Ward`, `pooling_func = Mean`, `compute_distances = false`.
154    #[must_use]
155    pub fn new(n_clusters: usize) -> Self {
156        Self {
157            n_clusters,
158            linkage: AgglomerativeLinkage::Ward,
159            pooling_func: PoolingFunc::Mean,
160            compute_distances: false,
161            _marker: std::marker::PhantomData,
162        }
163    }
164
165    /// Set the linkage criterion.
166    #[must_use]
167    pub fn with_linkage(mut self, linkage: AgglomerativeLinkage) -> Self {
168        self.linkage = linkage;
169        self
170    }
171
172    /// Set the pooling function.
173    #[must_use]
174    pub fn with_pooling_func(mut self, pooling: PoolingFunc) -> Self {
175        self.pooling_func = pooling;
176        self
177    }
178
179    /// Compute and store the per-merge linkage `distances_` of the inner
180    /// dendrogram (over the transposed feature matrix), mirroring sklearn
181    /// `FeatureAgglomeration(compute_distances=True)` (`_agglomerative.py:1319`).
182    ///
183    /// When set, [`FittedFeatureAgglomeration::distances`] returns `Some`.
184    #[must_use]
185    pub fn with_compute_distances(mut self, compute_distances: bool) -> Self {
186        self.compute_distances = compute_distances;
187        self
188    }
189
190    /// Return the configured number of feature clusters.
191    #[must_use]
192    pub fn n_clusters(&self) -> usize {
193        self.n_clusters
194    }
195
196    /// Return the configured linkage.
197    #[must_use]
198    pub fn linkage(&self) -> AgglomerativeLinkage {
199        self.linkage
200    }
201
202    /// Return the configured pooling function.
203    #[must_use]
204    pub fn pooling_func(&self) -> PoolingFunc {
205        self.pooling_func
206    }
207}
208
209// ─────────────────────────────────────────────────────────────────────────────
210// FittedFeatureAgglomeration
211// ─────────────────────────────────────────────────────────────────────────────
212
213/// Fitted Feature Agglomeration model.
214///
215/// Stores the cluster label for each original feature and the pooling
216/// strategy.  Implements [`Transform`] to reduce the dimensionality of
217/// new data.
218#[derive(Debug, Clone)]
219pub struct FittedFeatureAgglomeration<F> {
220    /// Cluster label for each original feature, length `n_features`.
221    ///
222    /// This is sklearn's `labels_` attribute: it is exactly the inner
223    /// [`AgglomerativeClustering`]'s `labels_` over the transposed feature
224    /// matrix `X.T` (`_agglomerative.py:1339`, `FeatureAgglomeration._fit`
225    /// → `AgglomerativeClustering._fit(X.T)`).
226    feature_labels_: Array1<usize>,
227    /// Number of feature clusters.
228    n_clusters_: usize,
229    /// Number of original features.
230    n_features_: usize,
231    /// Pooling function to aggregate features within each cluster.
232    pooling_func_: PoolingFunc,
233    /// The merge history (dendrogram) of the inner clustering over `X.T`,
234    /// length `n_features - 1`. Mirrors sklearn `children_`
235    /// (`_agglomerative.py:1083-1095`, delegated from `AgglomerativeClustering`).
236    children_: Vec<(usize, usize)>,
237    /// The per-merge linkage distances of the inner clustering over `X.T`,
238    /// in `children_` row order. `Some` when `compute_distances` was set.
239    /// Mirrors sklearn `distances_`.
240    distances_: Option<Array1<F>>,
241    /// Number of leaves in the inner hierarchical tree (`== n_features` for the
242    /// unstructured path). Mirrors sklearn `n_leaves_`.
243    n_leaves_: usize,
244    /// Estimated number of connected components in the inner graph (always `1`
245    /// for the unstructured `connectivity=None` path). Mirrors sklearn
246    /// `n_connected_components_`.
247    n_connected_components_: usize,
248    _marker: std::marker::PhantomData<F>,
249}
250
251impl<F: Float + Send + Sync + 'static> FittedFeatureAgglomeration<F> {
252    /// Return the cluster label assigned to each feature.
253    ///
254    /// This is the sklearn name for the per-feature cluster assignment
255    /// (`labels_`). Identical data to [`feature_labels`](Self::feature_labels).
256    #[must_use]
257    pub fn labels(&self) -> &Array1<usize> {
258        &self.feature_labels_
259    }
260
261    /// Return the cluster label assigned to each feature.
262    ///
263    /// Alias of [`labels`](Self::labels) retained for backward compatibility;
264    /// sklearn names this attribute `labels_`.
265    #[must_use]
266    pub fn feature_labels(&self) -> &Array1<usize> {
267        &self.feature_labels_
268    }
269
270    /// Return the number of feature clusters.
271    #[must_use]
272    pub fn n_clusters(&self) -> usize {
273        self.n_clusters_
274    }
275
276    /// Return the number of original features.
277    #[must_use]
278    pub fn n_features(&self) -> usize {
279        self.n_features_
280    }
281
282    /// Return the merge history (dendrogram) of the inner clustering over `X.T`.
283    ///
284    /// Mirrors sklearn's `children_`, length `n_features - 1`. Delegated from
285    /// the inner [`AgglomerativeClustering`] fit (`_agglomerative.py:1339`).
286    #[must_use]
287    pub fn children(&self) -> &[(usize, usize)] {
288        &self.children_
289    }
290
291    /// Return the per-merge linkage distances of the inner dendrogram (in
292    /// `children_` row order), or `None` if `compute_distances` was not set.
293    ///
294    /// Mirrors sklearn's `distances_` attribute.
295    #[must_use]
296    pub fn distances(&self) -> Option<&Array1<F>> {
297        self.distances_.as_ref()
298    }
299
300    /// Return the number of leaves in the inner hierarchical tree
301    /// (`== n_features` for the unstructured path). Mirrors sklearn `n_leaves_`.
302    #[must_use]
303    pub fn n_leaves(&self) -> usize {
304        self.n_leaves_
305    }
306
307    /// Return the estimated number of connected components in the inner graph
308    /// (always `1` for the unstructured `connectivity=None` path). Mirrors
309    /// sklearn `n_connected_components_`.
310    #[must_use]
311    pub fn n_connected_components(&self) -> usize {
312        self.n_connected_components_
313    }
314}
315
316// ─────────────────────────────────────────────────────────────────────────────
317// Map between linkage types
318// ─────────────────────────────────────────────────────────────────────────────
319
320/// Convert [`AgglomerativeLinkage`] to the internal [`Linkage`] enum.
321fn map_linkage(l: AgglomerativeLinkage) -> Linkage {
322    match l {
323        AgglomerativeLinkage::Ward => Linkage::Ward,
324        AgglomerativeLinkage::Complete => Linkage::Complete,
325        AgglomerativeLinkage::Average => Linkage::Average,
326        AgglomerativeLinkage::Single => Linkage::Single,
327    }
328}
329
330// ─────────────────────────────────────────────────────────────────────────────
331// Trait impls
332// ─────────────────────────────────────────────────────────────────────────────
333
334impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for FeatureAgglomeration<F> {
335    type Fitted = FittedFeatureAgglomeration<F>;
336    type Error = FerroError;
337
338    /// Fit feature agglomeration by clustering the features (columns).
339    ///
340    /// Transposes `x` so each column becomes a row (data point) and
341    /// runs agglomerative clustering.
342    ///
343    /// # Errors
344    ///
345    /// - [`FerroError::InvalidParameter`] if `n_clusters == 0`.
346    /// - [`FerroError::InsufficientSamples`] if `n_features < n_clusters`.
347    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedFeatureAgglomeration<F>, FerroError> {
348        let n_features = x.ncols();
349
350        if self.n_clusters == 0 {
351            return Err(FerroError::InvalidParameter {
352                name: "n_clusters".into(),
353                reason: "must be at least 1".into(),
354            });
355        }
356        if n_features < 2 {
357            return Err(FerroError::InvalidParameter {
358                name: "X".into(),
359                reason: format!(
360                    "Found array with {n_features} feature(s) while a minimum of 2 is required by FeatureAgglomeration"
361                ),
362            });
363        }
364        if n_features < self.n_clusters {
365            return Err(FerroError::InvalidParameter {
366                name: "n_clusters".into(),
367                reason: format!(
368                    "n_clusters ({}) exceeds n_features ({})",
369                    self.n_clusters, n_features
370                ),
371            });
372        }
373        if x.nrows() == 0 {
374            return Err(FerroError::InsufficientSamples {
375                required: 1,
376                actual: 0,
377                context: "FeatureAgglomeration::fit requires at least 1 sample".into(),
378            });
379        }
380
381        // Transpose: each feature (column) becomes a row.
382        // Use as_standard_layout() to ensure row-major (C) order, which
383        // is required for AgglomerativeClustering's internal as_slice().
384        let x_t = x.t().as_standard_layout().into_owned();
385
386        // Run agglomerative clustering on the transposed data.
387        //
388        // sklearn `FeatureAgglomeration._fit` delegates verbatim to
389        // `AgglomerativeClustering._fit(X.T)` (`_agglomerative.py:1339`), so
390        // `labels_`/`children_`/`distances_`/`n_leaves_`/`n_connected_components_`
391        // are exactly the inner estimator's attributes over `X.T`.
392        let agg = AgglomerativeClustering::<F>::new(self.n_clusters)
393            .with_linkage(map_linkage(self.linkage))
394            .with_compute_distances(self.compute_distances);
395        let fitted_agg = agg.fit(&x_t, &())?;
396
397        Ok(FittedFeatureAgglomeration {
398            feature_labels_: fitted_agg.labels_,
399            n_clusters_: self.n_clusters,
400            n_features_: n_features,
401            pooling_func_: self.pooling_func,
402            children_: fitted_agg.children_,
403            distances_: fitted_agg.distances_,
404            n_leaves_: fitted_agg.n_leaves_,
405            n_connected_components_: fitted_agg.n_connected_components_,
406            _marker: std::marker::PhantomData,
407        })
408    }
409}
410
411impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedFeatureAgglomeration<F> {
412    type Output = Array2<F>;
413    type Error = FerroError;
414
415    /// Transform data by pooling features within each cluster.
416    ///
417    /// The output has shape `(n_samples, n_clusters)`.
418    ///
419    /// # Errors
420    ///
421    /// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
422    /// match the number of features seen during fitting.
423    fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
424        if x.ncols() != self.n_features_ {
425            return Err(FerroError::ShapeMismatch {
426                expected: vec![x.nrows(), self.n_features_],
427                actual: vec![x.nrows(), x.ncols()],
428                context: "FittedFeatureAgglomeration::transform".into(),
429            });
430        }
431
432        let n_samples = x.nrows();
433        let mut result = Array2::<F>::zeros((n_samples, self.n_clusters_));
434
435        match self.pooling_func_ {
436            PoolingFunc::Mean => {
437                // Count features per cluster.
438                let mut counts = vec![0usize; self.n_clusters_];
439                for &label in &self.feature_labels_ {
440                    counts[label] += 1;
441                }
442
443                // Sum features per cluster.
444                for i in 0..n_samples {
445                    for (j, &label) in self.feature_labels_.iter().enumerate() {
446                        result[[i, label]] = result[[i, label]] + x[[i, j]];
447                    }
448                }
449
450                // Divide by count to get mean.
451                for i in 0..n_samples {
452                    for c in 0..self.n_clusters_ {
453                        if counts[c] > 0 {
454                            result[[i, c]] = result[[i, c]] / F::from(counts[c]).unwrap();
455                        }
456                    }
457                }
458            }
459            PoolingFunc::Max => {
460                // Initialize with negative infinity.
461                result.fill(F::neg_infinity());
462
463                for i in 0..n_samples {
464                    for (j, &label) in self.feature_labels_.iter().enumerate() {
465                        if x[[i, j]] > result[[i, label]] {
466                            result[[i, label]] = x[[i, j]];
467                        }
468                    }
469                }
470            }
471        }
472
473        Ok(result)
474    }
475}
476
477impl<F: Float + Send + Sync + 'static> FittedFeatureAgglomeration<F> {
478    /// Inverse the pooling transformation: broadcast each cluster's pooled
479    /// value back to every feature in that cluster.
480    ///
481    /// Mirrors sklearn `AgglomerationTransform.inverse_transform`
482    /// (`_feature_agglomeration.py:66-92`): `unil, inverse =
483    /// np.unique(labels_, return_inverse=True); return X[..., inverse]`. Since
484    /// `labels_` is the contiguous `_hc_cut` numbering `0..n_clusters`,
485    /// `inverse == labels_`, so output column `f` is reduced column
486    /// `labels_[f]`: `result[i, f] = xred[i, labels_[f]]`.
487    ///
488    /// `xred` has shape `(n_samples, n_clusters)`; the result has shape
489    /// `(n_samples, n_features)`.
490    ///
491    /// # Errors
492    ///
493    /// Returns [`FerroError::ShapeMismatch`] if `xred.ncols() < n_clusters`.
494    /// Mirroring sklearn's numpy fancy-index `X[..., inverse]` (which reads only
495    /// columns `0..n_clusters` and ignores any trailing columns, #2187), a wider
496    /// `xred` is accepted and the extra columns are ignored; too few columns is
497    /// the only error (sklearn raises `IndexError`).
498    pub fn inverse_transform(&self, xred: &Array2<F>) -> Result<Array2<F>, FerroError> {
499        if xred.ncols() < self.n_clusters_ {
500            return Err(FerroError::ShapeMismatch {
501                expected: vec![xred.nrows(), self.n_clusters_],
502                actual: vec![xred.nrows(), xred.ncols()],
503                context: "FittedFeatureAgglomeration::inverse_transform".into(),
504            });
505        }
506
507        let n_samples = xred.nrows();
508        let mut result = Array2::<F>::zeros((n_samples, self.n_features_));
509        for i in 0..n_samples {
510            for (f, &label) in self.feature_labels_.iter().enumerate() {
511                result[[i, f]] = xred[[i, label]];
512            }
513        }
514        Ok(result)
515    }
516}
517
518// ─────────────────────────────────────────────────────────────────────────────
519// Tests
520// ─────────────────────────────────────────────────────────────────────────────
521
522#[cfg(test)]
523mod tests {
524    use super::*;
525    use approx::assert_abs_diff_eq;
526
527    fn make_correlated_features() -> Array2<f64> {
528        // 6 features: (0,1) correlated, (2,3) correlated, (4,5) correlated.
529        Array2::from_shape_vec(
530            (5, 6),
531            vec![
532                1.0, 1.1, 5.0, 5.1, 9.0, 9.1, 2.0, 2.1, 6.0, 6.1, 8.0, 8.1, 3.0, 3.1, 7.0, 7.1,
533                7.0, 7.1, 4.0, 4.1, 8.0, 8.1, 6.0, 6.1, 5.0, 5.1, 9.0, 9.1, 5.0, 5.1,
534            ],
535        )
536        .unwrap()
537    }
538
539    #[test]
540    fn test_feature_agglom_basic() {
541        let x = make_correlated_features();
542        let fa = FeatureAgglomeration::<f64>::new(3);
543        let fitted = fa.fit(&x, &()).unwrap();
544        let reduced = fitted.transform(&x).unwrap();
545        assert_eq!(reduced.dim(), (5, 3));
546    }
547
548    #[test]
549    fn test_feature_agglom_output_shape() {
550        let x = make_correlated_features();
551        let fa = FeatureAgglomeration::<f64>::new(2);
552        let fitted = fa.fit(&x, &()).unwrap();
553        let reduced = fitted.transform(&x).unwrap();
554        assert_eq!(reduced.ncols(), 2);
555        assert_eq!(reduced.nrows(), 5);
556    }
557
558    #[test]
559    fn test_feature_agglom_labels_valid_range() {
560        let x = make_correlated_features();
561        let fa = FeatureAgglomeration::<f64>::new(3);
562        let fitted = fa.fit(&x, &()).unwrap();
563        for &l in fitted.feature_labels() {
564            assert!(l < 3, "label {l} out of range");
565        }
566    }
567
568    #[test]
569    fn test_feature_agglom_correlated_grouped() {
570        // With 4 features merging to 2 clusters, pairs of nearly-identical
571        // features should end up together.  Use Single linkage to guarantee
572        // nearest-neighbor pairing.
573        let x = Array2::from_shape_vec(
574            (5, 4),
575            vec![
576                // feat 0  feat 1  feat 2   feat 3
577                1.0, 1.001, 100.0, 100.001, 2.0, 2.001, 90.0, 90.001, 3.0, 3.001, 80.0, 80.001, 4.0,
578                4.001, 70.0, 70.001, 5.0, 5.001, 60.0, 60.001,
579            ],
580        )
581        .unwrap();
582        let fa = FeatureAgglomeration::<f64>::new(2).with_linkage(AgglomerativeLinkage::Single);
583        let fitted = fa.fit(&x, &()).unwrap();
584        let labels = fitted.feature_labels();
585        // Features 0 and 1 are nearly identical, should be in the same cluster.
586        assert_eq!(labels[0], labels[1]);
587        // Features 2 and 3 are nearly identical.
588        assert_eq!(labels[2], labels[3]);
589        // The two pairs should be in different clusters.
590        assert_ne!(labels[0], labels[2]);
591    }
592
593    #[test]
594    fn test_feature_agglom_mean_pooling() {
595        // Simple case: two features that should be grouped.
596        let x = Array2::from_shape_vec((3, 2), vec![2.0, 4.0, 6.0, 8.0, 10.0, 12.0]).unwrap();
597        let fa = FeatureAgglomeration::<f64>::new(1);
598        let fitted = fa.fit(&x, &()).unwrap();
599        let reduced = fitted.transform(&x).unwrap();
600        assert_eq!(reduced.ncols(), 1);
601        // Mean of (2, 4) = 3, (6, 8) = 7, (10, 12) = 11.
602        assert_abs_diff_eq!(reduced[[0, 0]], 3.0, epsilon = 1e-10);
603        assert_abs_diff_eq!(reduced[[1, 0]], 7.0, epsilon = 1e-10);
604        assert_abs_diff_eq!(reduced[[2, 0]], 11.0, epsilon = 1e-10);
605    }
606
607    #[test]
608    fn test_feature_agglom_max_pooling() {
609        let x = Array2::from_shape_vec((3, 2), vec![2.0, 4.0, 6.0, 8.0, 10.0, 12.0]).unwrap();
610        let fa = FeatureAgglomeration::<f64>::new(1).with_pooling_func(PoolingFunc::Max);
611        let fitted = fa.fit(&x, &()).unwrap();
612        let reduced = fitted.transform(&x).unwrap();
613        assert_eq!(reduced.ncols(), 1);
614        // Max of (2, 4) = 4, (6, 8) = 8, (10, 12) = 12.
615        assert_abs_diff_eq!(reduced[[0, 0]], 4.0, epsilon = 1e-10);
616        assert_abs_diff_eq!(reduced[[1, 0]], 8.0, epsilon = 1e-10);
617        assert_abs_diff_eq!(reduced[[2, 0]], 12.0, epsilon = 1e-10);
618    }
619
620    #[test]
621    fn test_feature_agglom_complete_linkage() {
622        let x = make_correlated_features();
623        let fa = FeatureAgglomeration::<f64>::new(3).with_linkage(AgglomerativeLinkage::Complete);
624        let fitted = fa.fit(&x, &()).unwrap();
625        let reduced = fitted.transform(&x).unwrap();
626        assert_eq!(reduced.ncols(), 3);
627    }
628
629    #[test]
630    fn test_feature_agglom_average_linkage() {
631        let x = make_correlated_features();
632        let fa = FeatureAgglomeration::<f64>::new(3).with_linkage(AgglomerativeLinkage::Average);
633        let fitted = fa.fit(&x, &()).unwrap();
634        let reduced = fitted.transform(&x).unwrap();
635        assert_eq!(reduced.ncols(), 3);
636    }
637
638    #[test]
639    fn test_feature_agglom_single_linkage() {
640        let x = make_correlated_features();
641        let fa = FeatureAgglomeration::<f64>::new(3).with_linkage(AgglomerativeLinkage::Single);
642        let fitted = fa.fit(&x, &()).unwrap();
643        let reduced = fitted.transform(&x).unwrap();
644        assert_eq!(reduced.ncols(), 3);
645    }
646
647    #[test]
648    fn test_feature_agglom_n_clusters_equals_n_features() {
649        let x = make_correlated_features();
650        let fa = FeatureAgglomeration::<f64>::new(6);
651        let fitted = fa.fit(&x, &()).unwrap();
652        let reduced = fitted.transform(&x).unwrap();
653        // No reduction; each feature is its own cluster.
654        assert_eq!(reduced.ncols(), 6);
655    }
656
657    #[test]
658    fn test_feature_agglom_zero_clusters_error() {
659        let x = make_correlated_features();
660        let fa = FeatureAgglomeration::<f64>::new(0);
661        assert!(fa.fit(&x, &()).is_err());
662    }
663
664    #[test]
665    fn test_feature_agglom_too_many_clusters_error() {
666        let x = make_correlated_features();
667        let fa = FeatureAgglomeration::<f64>::new(10); // only 6 features
668        assert!(fa.fit(&x, &()).is_err());
669    }
670
671    #[test]
672    fn test_feature_agglom_empty_data_error() {
673        let x = Array2::<f64>::zeros((0, 4));
674        let fa = FeatureAgglomeration::<f64>::new(2);
675        assert!(fa.fit(&x, &()).is_err());
676    }
677
678    #[test]
679    fn test_feature_agglom_transform_shape_mismatch() {
680        let x = make_correlated_features();
681        let fa = FeatureAgglomeration::<f64>::new(3);
682        let fitted = fa.fit(&x, &()).unwrap();
683        let x_bad =
684            Array2::from_shape_vec((2, 4), vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]).unwrap();
685        assert!(fitted.transform(&x_bad).is_err());
686    }
687
688    #[test]
689    fn test_feature_agglom_f32() {
690        let x = Array2::<f32>::from_shape_vec(
691            (4, 4),
692            vec![
693                1.0, 1.1, 5.0, 5.1, 2.0, 2.1, 6.0, 6.1, 3.0, 3.1, 7.0, 7.1, 4.0, 4.1, 8.0, 8.1,
694            ],
695        )
696        .unwrap();
697        let fa = FeatureAgglomeration::<f32>::new(2);
698        let fitted = fa.fit(&x, &()).unwrap();
699        let reduced = fitted.transform(&x).unwrap();
700        assert_eq!(reduced.ncols(), 2);
701    }
702
703    #[test]
704    fn test_feature_agglom_getters() {
705        let fa = FeatureAgglomeration::<f64>::new(3)
706            .with_linkage(AgglomerativeLinkage::Complete)
707            .with_pooling_func(PoolingFunc::Max);
708        assert_eq!(fa.n_clusters(), 3);
709        assert_eq!(fa.linkage(), AgglomerativeLinkage::Complete);
710        assert_eq!(fa.pooling_func(), PoolingFunc::Max);
711    }
712
713    #[test]
714    fn test_feature_agglom_n_features_getter() {
715        let x = make_correlated_features();
716        let fa = FeatureAgglomeration::<f64>::new(3);
717        let fitted = fa.fit(&x, &()).unwrap();
718        assert_eq!(fitted.n_features(), 6);
719        assert_eq!(fitted.n_clusters(), 3);
720    }
721}
ferrolearn_cluster/feature_agglomeration.rs

ferrolearn_cluster/
feature_agglomeration.rs