//! Incremental Principal Component Analysis (IncrementalPCA).
//!
//! [`IncrementalPCA`] performs PCA incrementally by processing data in batches.
//! This is useful for datasets that are too large to fit in memory, or when
//! data arrives as a stream.
//!
//! # Algorithm
//!
//! Mirrors scikit-learn's `IncrementalPCA.partial_fit`. Maintains a running
//! per-feature mean and variance (Youngs-Cramer / Chan update) and performs an
//! incremental SVD update. For each batch `X_batch` (with `n` samples already
//! seen):
//! 1. Update the running `(mean, var, n_samples_seen)` via the numerically
//! stable incremental mean-and-variance combination.
//! 2. If this is the first batch, centre by the column mean: `M = X_centred`.
//! Otherwise centre the batch by its **own batch mean** and stack three
//! blocks: `M = vstack([singular_values * components, X_batch_centred,
//! mean_correction])`, where `mean_correction = sqrt((n / n_total) * n_batch)
//! * (running_mean - batch_mean)`.
//! 3. Compute a thin SVD of `M` and apply `svd_flip` (per component row, the
//! max-abs entry is made positive) for a deterministic sign.
//! 4. Extract updated `components` (rows of `V^T`), `singular_values`,
//! `explained_variance = S^2 / (n_total - 1)`, and
//! `explained_variance_ratio = S^2 / sum(var * n_total)` (fraction of total
//! feature variance).
//!
//! The [`Fit::fit`] method processes the dataset in `batch_size` chunks
//! internally. Use [`IncrementalPCA::partial_fit`] to update the model with
//! one batch at a time (for streaming use cases).
//!
//! # Examples
//!
//! ```
//! use ferrolearn_decomp::IncrementalPCA;
//! use ferrolearn_core::traits::{Fit, Transform};
//! use ndarray::array;
//!
//! let ipca = IncrementalPCA::<f64>::new(1);
//! let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
//! let fitted = ipca.fit(&x, &()).unwrap();
//! let projected = fitted.transform(&x).unwrap();
//! assert_eq!(projected.ncols(), 1);
//! ```
//!
//! ## REQ status
//!
//! Design: `.design/decomp/incremental_pca.md`. Tracking: #1584. Each REQ is
//! BINARY — SHIPPED (impl + non-test consumer + tests + green verification) or
//! NOT-STARTED (concrete open blocker). Non-test consumers: crate re-export
//! (`lib.rs:88`) + the PyO3 `_RsIncrementalPCA` binding
//! (`ferrolearn-python/src/extras.rs:1094`, registered `lib.rs:72`). Oracle = live
//! sklearn 1.5.2 (`_incremental_pca.py`), run from `/tmp` (R-CHAR-3). `partial_fit`
//! is DETERMINISTIC and now ports sklearn faithfully — full value parity (single +
//! multi-batch).
//!
//! | REQ | Scope | Status | Evidence / Blocker |
//! |---|---|---|---|
//! | REQ-1 | `svd_flip` sign + `components_`/`transform` value parity | SHIPPED | per-Vt-row max-abs-positive after `thin_svd` (= `_incremental_pca.py:357`); matches sklearn incl. sign (single 2.2e-16, multi 6.3e-15). Was #1585, fixed. Test `divergence_svd_flip_sign` |
//! | REQ-2 | Multi-batch `mean_correction` + batch-mean centering (3-block SVD stack) | SHIPPED | `partial_fit_batch` ports `_incremental_pca.py:342-354`; multi-batch components/EV/SV match sklearn ≤2.8e-14. Was #1586, fixed. Test `divergence_multibatch_mean_correction` |
//! | REQ-3 | `explained_variance_ratio_` = `S²/Σ(var·n_total)` (total feature variance) + running `var_` | SHIPPED | `= _incremental_pca.py:359`; running variance via Youngs-Cramer port (`extmath.py:1057-1180`); ratio + `var_` match sklearn (var_ exact). Was #1587, fixed. Test `divergence_explained_variance_ratio_denominator` |
//! | REQ-4 | Running mean (incremental) | SHIPPED | matches sklearn `mean_` element-wise (1e-9, multi-batch); `test_mean_is_correct`, `test_batch_size_*` |
//! | REQ-5 | `components_` shape `(n_components, n_features)` | SHIPPED | `test_fit_output_shape`/`_two_components`; matches sklearn element-wise (REQ-1) |
//! | REQ-6 | `explained_variance_` = `S²/(n_total−1)` | SHIPPED | matches sklearn (single 1.8e-15); `test_explained_variance_positive` |
//! | REQ-7 | `singular_values_` | SHIPPED | matches sklearn (single 1.3e-15) |
//! | REQ-8 | `components_` rows unit-norm | SHIPPED | `test_components_approx_unit_length` |
//! | REQ-9 | Error/parameter contracts (incl. NON-FINITE rejection) | SHIPPED (scoped) | `fit`/`partial_fit`/`transform` guards. NON-FINITE: `partial_fit_batch` (the fit/partial_fit core) + `transform` call `reject_non_finite` (`incremental_pca.rs` symbol `reject_non_finite`) BEFORE the incremental-SVD/projection math, returning the CLEAN finiteness `InvalidParameter{name:"X", reason:"Input X contains NaN or infinity."}` = sklearn `_validate_data(force_all_finite=True)` (`_incremental_pca.py:227`,`:281`,`utils/validation.py:147-154`). `tests/divergence_nonfinite_spillover.rs::divergence_incremental_pca_fit_nan`/`_transform_nan` match the live sklearn 1.5.2 oracle (#2290). FLAG: ferrolearn rejects `n_components>=n_features`; sklearn allows `n_components==n_features ≤ min(n,p)` (REQ-14 #1590) |
//! | REQ-10 | `n_samples_seen` accumulation + `partial_fit` chaining + batch chunking | SHIPPED | `test_partial_fit_chaining`, `test_n_samples_seen`, `test_batch_size_*` |
//! | REQ-12 | f32/f64 generic | SHIPPED | `test_f32_support` |
//! | REQ-15 | running `var_` fitted attr + accessor | SHIPPED | `var_` field + `var()` accessor; matches sklearn `var_` exactly (was #1591, retired into REQ-3) |
//! | REQ-11 | `PipelineTransformer` integration | NOT-STARTED | no impl (cf. `pca.rs:565`) — blocker #1588 |
//! | REQ-13 | `batch_size` auto-default = `5*n_features` | NOT-STARTED | ferrolearn defaults to full-data — blocker #1589 |
//! | REQ-14 | `n_components=None` + `n_components==n_features` acceptance | NOT-STARTED | sklearn `_incremental_pca.py:294-308` — blocker #1590 |
//! | REQ-16 | `noise_variance_` attr | NOT-STARTED | sklearn `_incremental_pca.py:369-372` — blocker #1592 |
//! | REQ-17 | `whiten` + `copy` ctor params | NOT-STARTED | sklearn `_incremental_pca.py:194-198` — blocker #1593 |
//! | REQ-18 | `n_features_in_` attr | NOT-STARTED | blocker #1594 |
//! | REQ-19 | ferray substrate | NOT-STARTED | `ndarray` + hand-rolled Jacobi — blocker #1595 |
//!
//! Count: **12 SHIPPED (REQ-1..10,12,15) / 7 NOT-STARTED (REQ-11,13,14,16,17,18,19)**.
use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::{Fit, Transform};
use ndarray::{Array1, Array2};
use num_traits::Float;
use std::any::TypeId;
/// Reject non-finite input the way sklearn's `_validate_data` does.
///
/// sklearn runs `check_array` with the default `force_all_finite=True` at the
/// top of `IncrementalPCA.fit`/`partial_fit`/`transform`
/// (`sklearn/decomposition/_incremental_pca.py:227`,`:281`), raising
/// `ValueError("Input X contains NaN.")` / `"... contains infinity ..."`
/// (`sklearn/utils/validation.py:147-154`) BEFORE any SVD math. NaN AND
/// infinity are both rejected. The message names "NaN" and "infinity" to mirror
/// sklearn's `ValueError`. Never panics (R-CODE-2).
fn reject_non_finite<F: Float>(x: &Array2<F>) -> Result<(), FerroError> {
if x.iter().any(|v| !v.is_finite()) {
return Err(FerroError::InvalidParameter {
name: "X".into(),
reason: "Input X contains NaN or infinity.".into(),
});
}
Ok(())
}
// ---------------------------------------------------------------------------
// IncrementalPCA (unfitted)
// ---------------------------------------------------------------------------
/// Incremental PCA configuration.
///
/// Holds `n_components` and an optional `batch_size`. Calling [`Fit::fit`]
/// processes the data in batches and returns a [`FittedIncrementalPCA`].
///
/// # Type Parameters
///
/// - `F`: The floating-point type (`f32` or `f64`).
#[derive(Debug, Clone)]
pub struct IncrementalPCA<F> {
/// Number of principal components to retain.
n_components: usize,
/// Number of samples per batch. If `None`, the whole dataset is processed
/// in a single batch (equivalent to standard PCA on the full data).
batch_size: Option<usize>,
_marker: std::marker::PhantomData<F>,
}
impl<F: Float + Send + Sync + 'static> IncrementalPCA<F> {
/// Create a new `IncrementalPCA` that retains `n_components` components.
///
/// If `batch_size` is not set, the whole dataset is processed at once.
#[must_use]
pub fn new(n_components: usize) -> Self {
Self {
n_components,
batch_size: None,
_marker: std::marker::PhantomData,
}
}
/// Set the batch size.
///
/// Each call to the internal loop will process this many samples.
#[must_use]
pub fn with_batch_size(mut self, batch_size: usize) -> Self {
self.batch_size = Some(batch_size);
self
}
/// Return the number of components.
#[must_use]
pub fn n_components(&self) -> usize {
self.n_components
}
/// Return the configured batch size, if any.
#[must_use]
pub fn batch_size(&self) -> Option<usize> {
self.batch_size
}
}
// ---------------------------------------------------------------------------
// FittedIncrementalPCA
// ---------------------------------------------------------------------------
/// A fitted Incremental PCA model.
///
/// Created either by calling [`Fit::fit`] on an [`IncrementalPCA`] or by
/// calling [`IncrementalPCA::partial_fit`] one batch at a time.
///
/// Implements [`Transform<Array2<F>>`] to project new data onto the
/// learned principal components.
#[derive(Debug, Clone)]
pub struct FittedIncrementalPCA<F> {
/// Principal component directions, shape `(n_components, n_features)`.
components_: Array2<F>,
/// Variance explained by each component (singular_value^2 / n_samples_seen).
explained_variance_: Array1<F>,
/// Ratio of variance explained by each component to total variance.
explained_variance_ratio_: Array1<F>,
/// Per-feature running mean.
mean_: Array1<F>,
/// Per-feature running population variance (ddof=0), tracked via the
/// Youngs-Cramer / Chan parallel-variance update (sklearn
/// `_incremental_mean_and_var`). Used as the denominator of
/// `explained_variance_ratio_`.
var_: Array1<F>,
/// Number of samples seen so far.
n_samples_seen_: usize,
/// Singular values of the current incremental SVD.
singular_values_: Array1<F>,
}
impl<F: Float + Send + Sync + 'static> FittedIncrementalPCA<F> {
/// Principal components, shape `(n_components, n_features)`.
#[must_use]
pub fn components(&self) -> &Array2<F> {
&self.components_
}
/// Explained variance per component.
#[must_use]
pub fn explained_variance(&self) -> &Array1<F> {
&self.explained_variance_
}
/// Explained variance ratio per component.
#[must_use]
pub fn explained_variance_ratio(&self) -> &Array1<F> {
&self.explained_variance_ratio_
}
/// Running per-feature mean.
#[must_use]
pub fn mean(&self) -> &Array1<F> {
&self.mean_
}
/// Running per-feature population variance (ddof=0), as tracked by the
/// incremental Youngs-Cramer update. Mirrors sklearn `IncrementalPCA.var_`.
#[must_use]
pub fn var(&self) -> &Array1<F> {
&self.var_
}
/// Number of samples seen during fitting.
#[must_use]
pub fn n_samples_seen(&self) -> usize {
self.n_samples_seen_
}
/// Singular values of the incremental SVD.
#[must_use]
pub fn singular_values(&self) -> &Array1<F> {
&self.singular_values_
}
/// Map reduced data back to the original feature space. Mirrors
/// sklearn `IncrementalPCA.inverse_transform`. Returns shape
/// `(n_samples, n_features)` via `X_reduced @ components + mean`.
///
/// # Errors
///
/// Returns [`FerroError::ShapeMismatch`] if `x_reduced.ncols()` does
/// not equal the number of components.
pub fn inverse_transform(&self, x_reduced: &Array2<F>) -> Result<Array2<F>, FerroError> {
let n_components = self.components_.nrows();
if x_reduced.ncols() != n_components {
return Err(FerroError::ShapeMismatch {
expected: vec![x_reduced.nrows(), n_components],
actual: vec![x_reduced.nrows(), x_reduced.ncols()],
context: "FittedIncrementalPCA::inverse_transform".into(),
});
}
let mut result = x_reduced.dot(&self.components_);
for mut row in result.rows_mut() {
for (v, &m) in row.iter_mut().zip(self.mean_.iter()) {
*v = *v + m;
}
}
Ok(result)
}
/// Process one additional batch, updating the model in-place.
///
/// This is the core of the incremental algorithm. See the module-level
/// documentation for the algorithm details.
///
/// # Errors
///
/// - [`FerroError::InsufficientSamples`] if the batch is empty.
/// - [`FerroError::ShapeMismatch`] if the batch has the wrong number of
/// features (after the first batch has been seen).
/// - [`FerroError::NumericalInstability`] if a numerical failure occurs.
pub fn partial_fit_batch(&mut self, x_batch: &Array2<F>) -> Result<(), FerroError> {
let (batch_n, n_features) = x_batch.dim();
if batch_n == 0 {
return Err(FerroError::InsufficientSamples {
required: 1,
actual: 0,
context: "IncrementalPCA::partial_fit_batch requires non-empty batch".into(),
});
}
// Check feature consistency if we have already seen samples.
if self.n_samples_seen_ > 0 && n_features != self.mean_.len() {
return Err(FerroError::ShapeMismatch {
expected: vec![self.mean_.len()],
actual: vec![n_features],
context: "IncrementalPCA::partial_fit_batch: feature dimension mismatch".into(),
});
}
// Reject NaN/Inf BEFORE any incremental-SVD math (sklearn
// `_validate_data(force_all_finite=True)` at `_incremental_pca.py:227`
// (fit) / `:281` (partial_fit), `utils/validation.py:147-154`).
reject_non_finite(x_batch)?;
let n_components = self.components_.nrows();
let last_count = self.n_samples_seen_;
let new_n = last_count + batch_n;
let new_n_f = F::from(new_n).unwrap_or_else(F::one);
// ----------------------------------------------------------------
// Update running per-feature mean and (population) variance via the
// Youngs-Cramer / Chan parallel-variance formula, porting sklearn
// `_incremental_mean_and_var` (`sklearn/utils/extmath.py:1057-1180`).
// `last_sample_count` is the scalar `n_samples_seen_` broadcast over
// all features (`_incremental_pca.py:329-334`), so the per-feature
// counts are equal and we can use scalars.
// ----------------------------------------------------------------
let last_count_f = F::from(last_count).unwrap_or_else(F::zero);
let new_count_f = F::from(batch_n).unwrap_or_else(F::one);
// new_sum = X.sum(axis=0); batch_mean = new_sum / new_count.
let mut new_sum = Array1::<F>::zeros(n_features);
for row in x_batch.rows() {
for (s, &v) in new_sum.iter_mut().zip(row.iter()) {
*s = *s + v;
}
}
let mut col_batch_mean = Array1::<F>::zeros(n_features);
for j in 0..n_features {
col_batch_mean[j] = new_sum[j] / new_count_f;
}
// last_sum = last_mean * last_count (per feature).
// updated_mean = (last_sum + new_sum) / updated_count.
let mut col_mean = Array1::<F>::zeros(n_features);
for j in 0..n_features {
let last_sum = self.mean_[j] * last_count_f;
col_mean[j] = (last_sum + new_sum[j]) / new_n_f;
}
// new_unnormalized_variance = sum((X - T)^2) - correction^2 / new_count
// where T = batch_mean and correction = sum(X - T) (the corrected
// 2-pass term, `extmath.py:1142-1162`).
let mut new_unnorm_var = Array1::<F>::zeros(n_features);
{
let mut correction = Array1::<F>::zeros(n_features);
for row in x_batch.rows() {
for j in 0..n_features {
let temp = row[j] - col_batch_mean[j];
correction[j] = correction[j] + temp;
new_unnorm_var[j] = new_unnorm_var[j] + temp * temp;
}
}
for j in 0..n_features {
new_unnorm_var[j] = new_unnorm_var[j] - correction[j] * correction[j] / new_count_f;
}
}
// Combine with the prior unnormalized variance (Chan update). When
// last_count == 0 the combined value is just the batch's.
let mut col_var = Array1::<F>::zeros(n_features);
for j in 0..n_features {
let updated_unnorm_var = if last_count == 0 {
new_unnorm_var[j]
} else {
let last_sum = self.mean_[j] * last_count_f;
let last_unnorm_var = self.var_[j] * last_count_f;
let last_over_new_count = last_count_f / new_count_f;
let diff = last_sum / last_over_new_count - new_sum[j];
last_unnorm_var + new_unnorm_var[j] + last_over_new_count / new_n_f * diff * diff
};
col_var[j] = updated_unnorm_var / new_n_f;
}
// ----------------------------------------------------------------
// Whitening + build the stacked matrix M (`_incremental_pca.py:337-354`).
// First step: centre by `col_mean`, M = X_centred. Otherwise centre by
// the BATCH mean and stack THREE blocks:
// [ singular_values * components ; X_batch_centred ; mean_correction ]
// with mean_correction a single length-`n_features` row
// sqrt((n_samples_seen / n_total) * n_batch) * (mean_ - col_batch_mean).
// ----------------------------------------------------------------
let m_mat: Array2<F> = if last_count == 0 || n_components == 0 {
let mut x_centred = x_batch.to_owned();
for mut row in x_centred.rows_mut() {
for (v, &m) in row.iter_mut().zip(col_mean.iter()) {
*v = *v - m;
}
}
x_centred
} else {
// Batch-mean-centred data.
let mut x_centred = x_batch.to_owned();
for mut row in x_centred.rows_mut() {
for (v, &m) in row.iter_mut().zip(col_batch_mean.iter()) {
*v = *v - m;
}
}
// Block 1: singular_values[:,None] * components_.
let mut weighted = Array2::zeros((n_components, n_features));
for k in 0..n_components {
let sv = self.singular_values_[k];
for j in 0..n_features {
weighted[[k, j]] = sv * self.components_[[k, j]];
}
}
// Block 3: mean_correction row.
let scale = (last_count_f / new_n_f * new_count_f).sqrt();
let mut mean_correction = Array2::zeros((1, n_features));
for j in 0..n_features {
mean_correction[[0, j]] = scale * (self.mean_[j] - col_batch_mean[j]);
}
// Stack: [weighted ; x_centred ; mean_correction].
let top = stack_vertical(&weighted, &x_centred);
stack_vertical(&top, &mean_correction)
};
// ----------------------------------------------------------------
// Thin SVD of M.
// ----------------------------------------------------------------
let max_rank = m_mat.nrows().min(m_mat.ncols()).min(n_components);
if max_rank == 0 {
self.mean_ = col_mean;
self.var_ = col_var;
self.n_samples_seen_ = new_n;
return Ok(());
}
let (_, sigma, vt) = thin_svd(&m_mat, max_rank)?;
// ----------------------------------------------------------------
// Update components and singular values, applying
// `svd_flip(u_based_decision=False)` (`_incremental_pca.py:357`,
// `extmath.py:897-905`): for each Vt row find the column index of the
// max ABS value (numpy `argmax` → FIRST on ties, via strict `>`); if
// that entry is negative, negate the WHOLE row so its max-abs entry is
// positive. Same convention as `pca.rs`.
// ----------------------------------------------------------------
for k in 0..n_components.min(max_rank) {
for j in 0..n_features {
self.components_[[k, j]] = vt[[k, j]];
}
self.singular_values_[k] = if k < sigma.len() { sigma[k] } else { F::zero() };
let mut j_max = 0usize;
let mut max_abs = self.components_[[k, 0]].abs();
for j in 1..n_features {
let abs_j = self.components_[[k, j]].abs();
if abs_j > max_abs {
max_abs = abs_j;
j_max = j;
}
}
if self.components_[[k, j_max]] < F::zero() {
for j in 0..n_features {
self.components_[[k, j]] = -self.components_[[k, j]];
}
}
}
// Zero out any components beyond max_rank if n_components > max_rank.
for k in max_rank..n_components {
for j in 0..n_features {
self.components_[[k, j]] = F::zero();
}
self.singular_values_[k] = F::zero();
}
// ----------------------------------------------------------------
// Explained variance / ratio (`_incremental_pca.py:358-359`):
// explained_variance = S^2 / (n_total - 1)
// explained_variance_ratio = S^2 / sum(col_var * n_total)
// The ratio denominator is the TOTAL feature variance, NOT the sum of
// the retained explained variances.
// ----------------------------------------------------------------
let denom = F::from(new_n.saturating_sub(1).max(1)).unwrap_or_else(F::one);
for k in 0..n_components {
let sv = self.singular_values_[k];
self.explained_variance_[k] = sv * sv / denom;
}
let mut total_feature_var = F::zero();
for j in 0..n_features {
total_feature_var = total_feature_var + col_var[j] * new_n_f;
}
if total_feature_var > F::zero() {
for k in 0..n_components {
let sv = self.singular_values_[k];
self.explained_variance_ratio_[k] = sv * sv / total_feature_var;
}
} else {
for k in 0..n_components {
self.explained_variance_ratio_[k] = F::zero();
}
}
// Update state.
self.mean_ = col_mean;
self.var_ = col_var;
self.n_samples_seen_ = new_n;
Ok(())
}
}
// ---------------------------------------------------------------------------
// IncrementalPCA: partial_fit (public streaming API)
// ---------------------------------------------------------------------------
impl<F: Float + Send + Sync + 'static> IncrementalPCA<F> {
/// Process a single batch and return the updated fitted model.
///
/// Calling `partial_fit` repeatedly on successive batches gives the same
/// result as calling `fit` on the concatenation of all batches (up to
/// floating-point rounding).
///
/// The first call initialises the model; subsequent calls update it.
///
/// # Errors
///
/// - [`FerroError::InvalidParameter`] if `n_components == 0` or
/// `n_components >= n_features`.
/// - [`FerroError::InsufficientSamples`] if the batch is empty.
/// - [`FerroError::ShapeMismatch`] if the batch has the wrong number of
/// features after the first batch.
pub fn partial_fit(
&self,
x_batch: &Array2<F>,
state: Option<FittedIncrementalPCA<F>>,
) -> Result<FittedIncrementalPCA<F>, FerroError> {
let n_features = x_batch.ncols();
if self.n_components == 0 {
return Err(FerroError::InvalidParameter {
name: "n_components".into(),
reason: "must be at least 1".into(),
});
}
if n_features == 0 {
return Err(FerroError::InvalidParameter {
name: "n_features".into(),
reason: "must be at least 1".into(),
});
}
if self.n_components >= n_features {
return Err(FerroError::InvalidParameter {
name: "n_components".into(),
reason: format!(
"n_components ({}) must be < n_features ({})",
self.n_components, n_features
),
});
}
let mut fitted = state.unwrap_or_else(|| FittedIncrementalPCA {
components_: Array2::zeros((self.n_components, n_features)),
explained_variance_: Array1::zeros(self.n_components),
explained_variance_ratio_: Array1::zeros(self.n_components),
mean_: Array1::zeros(n_features),
var_: Array1::zeros(n_features),
n_samples_seen_: 0,
singular_values_: Array1::zeros(self.n_components),
});
fitted.partial_fit_batch(x_batch)?;
Ok(fitted)
}
}
// ---------------------------------------------------------------------------
// Fit trait
// ---------------------------------------------------------------------------
impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for IncrementalPCA<F> {
type Fitted = FittedIncrementalPCA<F>;
type Error = FerroError;
/// Fit the model by processing the data in mini-batches.
///
/// If `batch_size` is `None`, the entire dataset is processed in one batch.
///
/// # Errors
///
/// - [`FerroError::InvalidParameter`] if `n_components == 0`,
/// `n_components >= n_features`, or `batch_size == 0`.
/// - [`FerroError::InsufficientSamples`] if there are fewer than 2 samples.
fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedIncrementalPCA<F>, FerroError> {
let (n_samples, n_features) = x.dim();
if self.n_components == 0 {
return Err(FerroError::InvalidParameter {
name: "n_components".into(),
reason: "must be at least 1".into(),
});
}
if n_features == 0 {
return Err(FerroError::InvalidParameter {
name: "n_features".into(),
reason: "must be at least 1".into(),
});
}
if self.n_components >= n_features {
return Err(FerroError::InvalidParameter {
name: "n_components".into(),
reason: format!(
"n_components ({}) must be < n_features ({})",
self.n_components, n_features
),
});
}
if n_samples < 2 {
return Err(FerroError::InsufficientSamples {
required: 2,
actual: n_samples,
context: "IncrementalPCA::fit requires at least 2 samples".into(),
});
}
let batch_size = match self.batch_size {
Some(bs) => {
if bs == 0 {
return Err(FerroError::InvalidParameter {
name: "batch_size".into(),
reason: "must be at least 1 when specified".into(),
});
}
bs
}
// sklearn: `batch_size_ = 5 * n_features` when `batch_size is None`
// (`sklearn/decomposition/_incremental_pca.py:236-237`), to balance
// approximation accuracy and memory. `5*n_features` may exceed
// `n_samples` (→ a single batch, handled by the gen_batches loop
// below — no panic, R-CODE-2).
None => 5 * n_features,
};
// Slice into batches exactly like sklearn's
// `gen_batches(n_samples, batch_size_, min_batch_size=self.n_components or 0)`
// (`sklearn/decomposition/_incremental_pca.py:241-243`,
// `sklearn/utils/_chunking.py:67-75`). The `min_batch_size = n_components`
// term MERGES a trailing remainder smaller than `n_components` into the
// previous full batch (e.g. 20 rows, batch_size 6, n_components 3 →
// batches 6+6+8, NOT 6+6+6+2), so the incremental merge sequence — and
// hence `components_`/`singular_values_`/`explained_variance_` — matches
// sklearn (#2386).
let min_batch_size = self.n_components;
let mut state: Option<FittedIncrementalPCA<F>> = None;
let mut start = 0;
for _ in 0..(n_samples / batch_size) {
let end = start + batch_size;
// Skip this full batch when the remainder after it would be a
// too-small trailing batch (`end + min_batch_size > n`): leave
// `start` unchanged so the trailing rows fold into the final slice.
if end + min_batch_size > n_samples {
continue;
}
let batch = x.slice(ndarray::s![start..end, ..]).to_owned();
state = Some(self.partial_fit(&batch, state)?);
start = end;
}
if start < n_samples {
let batch = x.slice(ndarray::s![start..n_samples, ..]).to_owned();
state = Some(self.partial_fit(&batch, state)?);
}
state.ok_or_else(|| FerroError::InsufficientSamples {
required: 1,
actual: 0,
context: "IncrementalPCA::fit: no batches processed".into(),
})
}
}
// ---------------------------------------------------------------------------
// Transform trait
// ---------------------------------------------------------------------------
impl<F: Float + Send + Sync + 'static> Transform<Array2<F>> for FittedIncrementalPCA<F> {
type Output = Array2<F>;
type Error = FerroError;
/// Project data onto the principal components: `(X - mean) @ components^T`.
///
/// # Errors
///
/// Returns [`FerroError::ShapeMismatch`] if the number of columns does not
/// match the number of features seen during fitting.
fn transform(&self, x: &Array2<F>) -> Result<Array2<F>, FerroError> {
let n_features = self.mean_.len();
if x.ncols() != n_features {
return Err(FerroError::ShapeMismatch {
expected: vec![x.nrows(), n_features],
actual: vec![x.nrows(), x.ncols()],
context: "FittedIncrementalPCA::transform".into(),
});
}
// Reject NaN/Inf BEFORE the projection (sklearn re-validates with
// `_validate_data(reset=False, force_all_finite=True)` at
// `_incremental_pca.py:281`, `utils/validation.py:147-154`).
reject_non_finite(x)?;
let mut x_centred = x.to_owned();
for mut row in x_centred.rows_mut() {
for (v, &m) in row.iter_mut().zip(self.mean_.iter()) {
*v = *v - m;
}
}
// Project: X_centred @ components^T
Ok(x_centred.dot(&self.components_.t()))
}
}
// ---------------------------------------------------------------------------
// Internal linear algebra helpers (no external SVD library needed)
// ---------------------------------------------------------------------------
/// Stack two matrices vertically: `[a; b]`.
fn stack_vertical<F: Float>(a: &Array2<F>, b: &Array2<F>) -> Array2<F> {
let na = a.nrows();
let nb = b.nrows();
let p = a.ncols();
let mut out = Array2::zeros((na + nb, p));
for i in 0..na {
for j in 0..p {
out[[i, j]] = a[[i, j]];
}
}
for i in 0..nb {
for j in 0..p {
out[[na + i, j]] = b[[i, j]];
}
}
out
}
/// `(U, sigma, Vt)` triple returned by [`thin_svd`].
type SvdTriple<F> = (Array2<F>, Array1<F>, Array2<F>);
/// Thin SVD of the merge matrix `m` via `ferray::linalg::svd_lapack` (LAPACK
/// `gesdd`) for f64/f32, with a Jacobi-eigendecomposition fallback for exotic
/// float types.
///
/// This routes the incremental-PCA merge SVD through the SAME LAPACK `gesdd`
/// driver `scipy.linalg.svd(X, full_matrices=False)` calls in sklearn's
/// `IncrementalPCA.partial_fit` (`sklearn/decomposition/_incremental_pca.py:356`),
/// exactly as `pca.rs` does for `PCA`'s `full` solver. The hand-rolled Jacobi
/// `thin_svd` failed to reproduce LAPACK's SVD of the recursively-stacked merge
/// matrix to working precision (single-batch ~1e-15; 4-batch ~1e-2 — far past
/// the R-DEV-1 ~1e-6 bar, #2386); LAPACK `gesdd` is bit-identical to scipy
/// (ferray #2116), so the multi-batch spectrum now matches sklearn.
///
/// Returns `(U, sigma, Vt)` where:
/// - `sigma` has length `max_rank`, sorted descending (non-increasing, the
/// LAPACK `gesdd` contract).
/// - `Vt` has shape `(max_rank, n_features)`, row `k` = `k`-th right singular
/// vector — exactly sklearn's `Vt[:n_components_]` (`_incremental_pca.py:362`).
/// - `U` is unused by the caller (sklearn's `svd_flip(U, Vt,
/// u_based_decision=False)` only needs `Vt`), so an empty `(nr, 0)` matrix is
/// returned to keep the SVD engine from doing wasted work.
///
/// Never panics (R-CODE-2): a `svd_lapack`/conversion failure propagates as
/// [`FerroError::NumericalInstability`].
fn thin_svd<F: Float + Send + Sync + 'static>(
m: &Array2<F>,
max_rank: usize,
) -> Result<SvdTriple<F>, FerroError> {
let (nr, nc) = m.dim();
if nr == 0 || nc == 0 || max_rank == 0 {
return Ok((
Array2::zeros((nr, 0)),
Array1::zeros(0),
Array2::zeros((0, nc)),
));
}
let rank = max_rank.min(nr).min(nc);
// Full thin SVD (`min(nr, nc)` singular values / Vt rows). LAPACK `gesdd`
// for f64/f32; Jacobi fallback for exotic F. `s_full`/`vt_full` come back
// descending, then we truncate to `rank` (sklearn truncates `Vt`/`S` to
// `n_components_`, `_incremental_pca.py:362-363`).
let (s_full, vt_full) = thin_svd_full(m)?;
let mut sigma = Array1::zeros(rank);
let mut vt = Array2::zeros((rank, nc));
for k in 0..rank {
sigma[k] = if k < s_full.len() {
s_full[k]
} else {
F::zero()
};
if k < vt_full.nrows() {
for j in 0..nc {
vt[[k, j]] = vt_full[[k, j]];
}
}
}
// `U` is unused at the call site (svd_flip with `u_based_decision=False`
// operates only on `Vt`); return an empty matrix.
Ok((Array2::zeros((nr, 0)), sigma, vt))
}
/// `(S, Vt)` full thin-SVD of `m`: `S` are the `min(nr, nc)` singular values in
/// non-increasing order, `Vt` is `(min(nr, nc), nc)`. f64/f32 dispatch to
/// `ferray::linalg::svd_lapack` (LAPACK `gesdd`); other float types use the
/// Jacobi eigendecomposition of the Gram matrix as a fallback.
fn thin_svd_full<F: Float + Send + Sync + 'static>(
m: &Array2<F>,
) -> Result<(Array1<F>, Array2<F>), FerroError> {
// SAFETY: each branch checks `TypeId` at runtime and only reinterprets the
// ndarray buffers when the concrete type matches (`F == f64` / `F == f32`),
// so every transmute is between identical types. Same pattern as
// `pca.rs::svd_dispatch`. `forget` prevents a double free of the moved-out
// f64/f32 arrays.
if TypeId::of::<F>() == TypeId::of::<f64>() {
let m_f64: &Array2<f64> = unsafe { &*(std::ptr::from_ref(m).cast::<Array2<f64>>()) };
let (s, vt) = ferray_svd_lapack_f64(m_f64)?;
let s_f: Array1<F> = unsafe { std::mem::transmute_copy::<Array1<f64>, Array1<F>>(&s) };
let vt_f: Array2<F> = unsafe { std::mem::transmute_copy::<Array2<f64>, Array2<F>>(&vt) };
std::mem::forget(s);
std::mem::forget(vt);
Ok((s_f, vt_f))
} else if TypeId::of::<F>() == TypeId::of::<f32>() {
let m_f32: &Array2<f32> = unsafe { &*(std::ptr::from_ref(m).cast::<Array2<f32>>()) };
let (s, vt) = ferray_svd_lapack_f32(m_f32)?;
let s_f: Array1<F> = unsafe { std::mem::transmute_copy::<Array1<f32>, Array1<F>>(&s) };
let vt_f: Array2<F> = unsafe { std::mem::transmute_copy::<Array2<f32>, Array2<F>>(&vt) };
std::mem::forget(s);
std::mem::forget(vt);
Ok((s_f, vt_f))
} else {
// Exotic-F fallback: eigendecompose the Gram matrix C = MᵀM
// (nc × nc). Its eigenvectors are the right singular vectors V;
// singular values are sqrt(max(eigval, 0)), sorted descending.
let nc = m.ncols();
let mtm = m.t().dot(m);
let max_iter = nc * nc * 100 + 1000;
let (eigenvalues, eigenvectors) = jacobi_eigen_symmetric(&mtm, max_iter)?;
let mut indices: Vec<usize> = (0..nc).collect();
indices.sort_by(|&a, &b| {
eigenvalues[b]
.partial_cmp(&eigenvalues[a])
.unwrap_or(std::cmp::Ordering::Equal)
});
let size = m.nrows().min(nc);
let mut s = Array1::<F>::zeros(size);
let mut vt = Array2::<F>::zeros((size, nc));
for (k, &idx) in indices.iter().take(size).enumerate() {
let ev = eigenvalues[idx];
s[k] = if ev > F::zero() { ev.sqrt() } else { F::zero() };
for j in 0..nc {
vt[[k, j]] = eigenvectors[[j, idx]];
}
}
Ok((s, vt))
}
}
/// Thin SVD of an f64 matrix via `ferray::linalg::svd_lapack` (LAPACK `gesdd`),
/// mirroring `scipy.linalg.svd(X, full_matrices=False)` of sklearn's
/// `IncrementalPCA.partial_fit` (`sklearn/decomposition/_incremental_pca.py:356`).
/// Returns `(s, vt)`: `s` the `min(m, n)` singular values descending, `vt` the
/// `(min(m, n), n)` right singular vectors. `U` is not needed (svd_flip uses
/// only `Vt`). Copies the `pca.rs::ferray_svd_lapack_f64` bridge (R-SUBSTRATE-4).
fn ferray_svd_lapack_f64(a: &Array2<f64>) -> Result<(Array1<f64>, Array2<f64>), FerroError> {
let (m, n) = a.dim();
let data: Vec<f64> = a.iter().copied().collect();
let fa = ferray::Array::<f64, ferray::Ix2>::from_vec(ferray::Ix2::new([m, n]), data).map_err(
|e| FerroError::NumericalInstability {
message: format!("ferray array construction failed: {e}"),
},
)?;
let (_u, s, vt) =
ferray::linalg::svd_lapack(&fa, false).map_err(|e| FerroError::NumericalInstability {
message: format!("ferray svd_lapack (gesdd) failed: {e}"),
})?;
let s_nd: Array1<f64> = s.into_ndarray();
let vt_nd: Array2<f64> = vt.into_ndarray();
Ok((s_nd, vt_nd))
}
/// Thin SVD of an f32 matrix via `ferray::linalg::svd_lapack` (LAPACK `gesdd`).
/// See [`ferray_svd_lapack_f64`].
fn ferray_svd_lapack_f32(a: &Array2<f32>) -> Result<(Array1<f32>, Array2<f32>), FerroError> {
let (m, n) = a.dim();
let data: Vec<f32> = a.iter().copied().collect();
let fa = ferray::Array::<f32, ferray::Ix2>::from_vec(ferray::Ix2::new([m, n]), data).map_err(
|e| FerroError::NumericalInstability {
message: format!("ferray array construction failed: {e}"),
},
)?;
let (_u, s, vt) =
ferray::linalg::svd_lapack(&fa, false).map_err(|e| FerroError::NumericalInstability {
message: format!("ferray svd_lapack (gesdd) failed: {e}"),
})?;
let s_nd: Array1<f32> = s.into_ndarray();
let vt_nd: Array2<f32> = vt.into_ndarray();
Ok((s_nd, vt_nd))
}
/// Jacobi eigendecomposition for symmetric matrices.
///
/// Returns `(eigenvalues, eigenvectors)` where column `i` of `eigenvectors`
/// corresponds to `eigenvalues[i]`. The output is **not** sorted.
fn jacobi_eigen_symmetric<F: Float + Send + Sync + 'static>(
a: &Array2<F>,
max_iter: usize,
) -> Result<(Array1<F>, Array2<F>), FerroError> {
let n = a.nrows();
if n == 0 {
return Ok((Array1::zeros(0), Array2::zeros((0, 0))));
}
if n == 1 {
return Ok((
Array1::from_vec(vec![a[[0, 0]]]),
Array2::from_shape_vec((1, 1), vec![F::one()]).unwrap(),
));
}
let mut mat = a.to_owned();
let mut v = Array2::<F>::zeros((n, n));
for i in 0..n {
v[[i, i]] = F::one();
}
let tol = F::from(1e-12).unwrap_or_else(F::epsilon);
for _iteration in 0..max_iter {
// Find the off-diagonal element with the largest absolute value.
let mut max_off = F::zero();
let mut p = 0;
let mut q = 1;
for i in 0..n {
for j in (i + 1)..n {
let val = mat[[i, j]].abs();
if val > max_off {
max_off = val;
p = i;
q = j;
}
}
}
if max_off < tol {
let eigenvalues = Array1::from_shape_fn(n, |i| mat[[i, i]]);
return Ok((eigenvalues, v));
}
let app = mat[[p, p]];
let aqq = mat[[q, q]];
let apq = mat[[p, q]];
let theta = if (app - aqq).abs() < tol {
F::from(std::f64::consts::FRAC_PI_4).unwrap_or_else(F::one)
} else {
let tau = (aqq - app) / (F::from(2.0).unwrap() * apq);
let t = if tau >= F::zero() {
F::one() / (tau.abs() + (F::one() + tau * tau).sqrt())
} else {
-F::one() / (tau.abs() + (F::one() + tau * tau).sqrt())
};
t.atan()
};
let c = theta.cos();
let s = theta.sin();
let mut new_mat = mat.clone();
for i in 0..n {
if i != p && i != q {
let mip = mat[[i, p]];
let miq = mat[[i, q]];
new_mat[[i, p]] = c * mip - s * miq;
new_mat[[p, i]] = new_mat[[i, p]];
new_mat[[i, q]] = s * mip + c * miq;
new_mat[[q, i]] = new_mat[[i, q]];
}
}
new_mat[[p, p]] = c * c * app - F::from(2.0).unwrap() * s * c * apq + s * s * aqq;
new_mat[[q, q]] = s * s * app + F::from(2.0).unwrap() * s * c * apq + c * c * aqq;
new_mat[[p, q]] = F::zero();
new_mat[[q, p]] = F::zero();
mat = new_mat;
for i in 0..n {
let vip = v[[i, p]];
let viq = v[[i, q]];
v[[i, p]] = c * vip - s * viq;
v[[i, q]] = s * vip + c * viq;
}
}
Err(FerroError::ConvergenceFailure {
iterations: max_iter,
message: "Jacobi eigendecomposition did not converge in IncrementalPCA".into(),
})
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
use approx::assert_abs_diff_eq;
use ndarray::array;
// -----------------------------------------------------------------------
// Basic shape and structure tests
// -----------------------------------------------------------------------
#[test]
fn test_fit_output_shape() {
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = ipca.fit(&x, &()).unwrap();
assert_eq!(fitted.components().dim(), (1, 2));
assert_eq!(fitted.explained_variance().len(), 1);
assert_eq!(fitted.explained_variance_ratio().len(), 1);
assert_eq!(fitted.mean().len(), 2);
assert_eq!(fitted.n_samples_seen(), 4);
}
#[test]
fn test_transform_output_shape() {
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = ipca.fit(&x, &()).unwrap();
let projected = fitted.transform(&x).unwrap();
assert_eq!(projected.dim(), (4, 1));
}
#[test]
fn test_fit_two_components() {
let ipca = IncrementalPCA::<f64>::new(2);
let x = array![
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[10.0, 11.0, 12.0],
];
let fitted = ipca.fit(&x, &()).unwrap();
assert_eq!(fitted.components().dim(), (2, 3));
let projected = fitted.transform(&x).unwrap();
assert_eq!(projected.dim(), (4, 2));
}
#[test]
fn test_mean_is_correct() {
// Column means for [[0,0],[2,4]] should be [1, 2].
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[0.0, 0.0], [2.0, 4.0]];
let fitted = ipca.fit(&x, &()).unwrap();
assert_abs_diff_eq!(fitted.mean()[0], 1.0, epsilon = 1e-10);
assert_abs_diff_eq!(fitted.mean()[1], 2.0, epsilon = 1e-10);
}
#[test]
fn test_explained_variance_positive() {
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = ipca.fit(&x, &()).unwrap();
for &v in fitted.explained_variance() {
assert!(v >= 0.0, "explained variance should be non-negative: {v}");
}
}
#[test]
fn test_explained_variance_ratio_in_unit_interval() {
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = ipca.fit(&x, &()).unwrap();
let ratio_sum: f64 = fitted.explained_variance_ratio().iter().sum();
assert!(
(0.0..=1.0 + 1e-10).contains(&ratio_sum),
"ratio sum {ratio_sum} not in [0,1]"
);
}
#[test]
fn test_batch_size_single_batch() {
// batch_size == n_samples should give the same result as no batch_size.
let ipca_no_bs = IncrementalPCA::<f64>::new(1);
let ipca_bs = IncrementalPCA::<f64>::new(1).with_batch_size(4);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted_no_bs = ipca_no_bs.fit(&x, &()).unwrap();
let fitted_bs = ipca_bs.fit(&x, &()).unwrap();
// Means should be identical.
for (a, b) in fitted_no_bs.mean().iter().zip(fitted_bs.mean().iter()) {
assert_abs_diff_eq!(a, b, epsilon = 1e-10);
}
assert_eq!(fitted_no_bs.n_samples_seen(), fitted_bs.n_samples_seen());
}
#[test]
fn test_batch_size_two_batches() {
let ipca = IncrementalPCA::<f64>::new(1).with_batch_size(2);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = ipca.fit(&x, &()).unwrap();
assert_eq!(fitted.n_samples_seen(), 4);
assert_eq!(fitted.components().dim(), (1, 2));
}
#[test]
fn test_partial_fit_chaining() {
// Fit in two batches using partial_fit.
let ipca = IncrementalPCA::<f64>::new(1);
let b1 = array![[1.0, 2.0], [3.0, 4.0]];
let b2 = array![[5.0, 6.0], [7.0, 8.0]];
let state1 = ipca.partial_fit(&b1, None).unwrap();
assert_eq!(state1.n_samples_seen(), 2);
let state2 = ipca.partial_fit(&b2, Some(state1)).unwrap();
assert_eq!(state2.n_samples_seen(), 4);
}
#[test]
fn test_transform_shape_mismatch_error() {
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = ipca.fit(&x, &()).unwrap();
let x_bad = array![[1.0, 2.0, 3.0]];
assert!(fitted.transform(&x_bad).is_err());
}
#[test]
fn test_invalid_n_components_zero_error() {
let ipca = IncrementalPCA::<f64>::new(0);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
assert!(ipca.fit(&x, &()).is_err());
}
#[test]
fn test_invalid_n_components_ge_n_features_error() {
let ipca = IncrementalPCA::<f64>::new(2);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
assert!(ipca.fit(&x, &()).is_err());
}
#[test]
fn test_insufficient_samples_error() {
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[1.0, 2.0]]; // only 1 sample
assert!(ipca.fit(&x, &()).is_err());
}
#[test]
fn test_zero_batch_size_error() {
let ipca = IncrementalPCA::<f64>::new(1).with_batch_size(0);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
assert!(ipca.fit(&x, &()).is_err());
}
#[test]
fn test_f32_support() {
let ipca = IncrementalPCA::<f32>::new(1);
let x: Array2<f32> = array![[1.0f32, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]];
let fitted = ipca.fit(&x, &()).unwrap();
let projected = fitted.transform(&x).unwrap();
assert_eq!(projected.ncols(), 1);
}
#[test]
fn test_components_approx_unit_length() {
let ipca = IncrementalPCA::<f64>::new(1);
let x = array![[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],];
let fitted = ipca.fit(&x, &()).unwrap();
let c = fitted.components();
for i in 0..c.nrows() {
let norm: f64 = c.row(i).iter().map(|v| v * v).sum::<f64>().sqrt();
assert_abs_diff_eq!(norm, 1.0, epsilon = 1e-6);
}
}
#[test]
fn test_n_samples_seen() {
let ipca = IncrementalPCA::<f64>::new(1).with_batch_size(3);
let x = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]];
let fitted = ipca.fit(&x, &()).unwrap();
assert_eq!(fitted.n_samples_seen(), 5);
}
#[test]
fn test_getters() {
let ipca = IncrementalPCA::<f64>::new(2).with_batch_size(50);
assert_eq!(ipca.n_components(), 2);
assert_eq!(ipca.batch_size(), Some(50));
let ipca2 = IncrementalPCA::<f64>::new(1);
assert!(ipca2.batch_size().is_none());
}
}