Struct PCA

pub struct PCA { /* private fields */ }

Principal component analysis (PCA) structure.

This struct holds the results of a PCA (mean, scale, and rotation matrix) and can be used to transform data into the principal component space. It supports both exact PCA computation and a faster, approximate randomized PCA. Models can also be saved to and loaded from files.

Implementations§

impl PCA

pub fn new() -> Self

Creates a new, empty PCA struct.

The PCA model is not fitted and needs to be computed using fit or rfit, or loaded using load_model or with_model.

§Examples
use efficient_pca::PCA; // Assuming efficient_pca is your crate name
let pca = PCA::new();

pub fn with_model(rotation: Array2<f64>, mean: Array1<f64>, raw_standard_deviations: Array1<f64>) -> Result<Self, Box<dyn Error>>

Creates a new PCA instance from a pre-computed model.

This is useful for loading a PCA model whose components (rotation matrix, mean, and original standard deviations) were computed externally or previously. The library will sanitize the provided standard deviations for consistent scaling.

  • rotation - The rotation matrix (principal components), shape (d_features, k_components).
  • mean - The mean vector of the original data used to compute the PCA, shape (d_features).
  • raw_standard_deviations - The raw standard deviation vector of the original data, shape (d_features). Finite values that are not strictly positive (s <= 1e-9, which covers zero and negative values) are sanitized to 1.0 before being stored. Non-finite values are rejected with an error before sanitization runs (see Errors). If the original PCA did not involve scaling (e.g., data was already standardized, or only centering was desired), pass a vector of ones.
§Errors

Returns an error if feature dimensions are inconsistent or if raw_standard_deviations contains non-finite values (this check is performed before sanitization).
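
The check-then-sanitize order described above can be sketched with plain slices. This is a minimal illustration, not the crate's code: the helper name sanitize_std_devs and the String error type are hypothetical, and only the 1e-9 threshold and the 1.0 replacement value come from the text.

```rust
/// Threshold at or below which a standard deviation is replaced (from the docs).
const MIN_STD_DEV: f64 = 1e-9;

/// Hypothetical helper: reject non-finite inputs first, then replace
/// every value that is not strictly above the threshold with 1.0.
fn sanitize_std_devs(raw: &[f64]) -> Result<Vec<f64>, String> {
    // The finiteness check runs before any sanitization.
    if let Some(pos) = raw.iter().position(|s| !s.is_finite()) {
        return Err(format!("non-finite standard deviation at index {}", pos));
    }
    Ok(raw
        .iter()
        .map(|&s| if s > MIN_STD_DEV { s } else { 1.0 })
        .collect())
}
```

Replacing a near-zero deviation with 1.0 means a constant feature is centered but left unscaled, which avoids dividing by zero during the transform.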

pub fn rotation(&self) -> Option<&Array2<f64>>

Returns a reference to the rotation matrix (principal components), if computed.

The rotation matrix has dimensions (n_features, k_components). Returns None if the PCA model has not been fitted, or if the rotation matrix is not available (e.g., if fitting resulted in zero components).

pub fn explained_variance(&self) -> Option<&Array1<f64>>

Returns a reference to the explained variance for each principal component.

These are the eigenvalues of the covariance matrix of the scaled data, ordered from largest to smallest. Returns None if the PCA model has not been fitted or if variances are not available.
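
A common follow-up is to turn these eigenvalues into fractions. The sketch below is a hypothetical helper, not part of the documented API, and it works on a plain slice rather than an Array1. It assumes the values are non-negative and divides each by their sum.

```rust
/// Hypothetical helper (not from the crate): converts per-component
/// explained variances into fractions of their own sum.
fn explained_variance_ratio(explained_variance: &[f64]) -> Vec<f64> {
    let total: f64 = explained_variance.iter().sum();
    if total <= 0.0 {
        // Nothing to normalize by; report zero for every component.
        return vec![0.0; explained_variance.len()];
    }
    explained_variance.iter().map(|v| v / total).collect()
}
```

The fractions are relative to the retained components only. If components were dropped by a tolerance or a component limit, they do not sum to the share of the total variance in the original data.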

pub fn fit(&mut self, data_matrix: Array2<f64>, tolerance: Option<f64>) -> Result<(), Box<dyn Error>>

Fits the PCA model to the data using an exact covariance/Gram matrix approach.

This method computes the mean, (sanitized) scaling factors, and principal axes (rotation) via an eigen-decomposition of the covariance matrix (if n_features <= n_samples) or the Gram matrix (if n_features > n_samples, the “Gram trick”). The resulting principal components (columns of the rotation matrix) are normalized to unit length.

Note: For very large datasets, rfit is generally recommended for better performance.

  • data_matrix - Input data as a 2D array, shape (n_samples, n_features).
  • tolerance - Optional: Tolerance for excluding low-variance components (fraction of the largest eigenvalue). If None, all components up to the effective rank of the matrix are kept.
§Errors

Returns an error if the input matrix has zero dimensions, fewer than 2 samples, or if matrix operations (like eigen-decomposition) fail.
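
The choice between the covariance and Gram paths, along with the dimension checks listed under Errors, can be sketched as follows. The enum ExactPath and the function choose_exact_path are illustrative names, not items from the crate. The point is that the matrix being eigen-decomposed always has side length min(n_samples, n_features).

```rust
/// Which symmetric matrix an exact fit would decompose (illustrative type).
#[derive(Debug, PartialEq)]
enum ExactPath {
    /// Covariance matrix of shape n_features x n_features.
    Covariance { side: usize },
    /// Gram matrix of shape n_samples x n_samples (the "Gram trick").
    Gram { side: usize },
}

/// Hypothetical helper mirroring the documented branching and input checks.
fn choose_exact_path(n_samples: usize, n_features: usize) -> Result<ExactPath, String> {
    if n_samples == 0 || n_features == 0 {
        return Err("input matrix has a zero dimension".to_string());
    }
    if n_samples < 2 {
        return Err("at least 2 samples are required".to_string());
    }
    if n_features <= n_samples {
        Ok(ExactPath::Covariance { side: n_features })
    } else {
        Ok(ExactPath::Gram { side: n_samples })
    }
}
```

For 20 samples with 5000 features, the Gram path decomposes a 20 x 20 matrix instead of a 5000 x 5000 one.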

pub fn rfit(&mut self, x_input_data: Array2<f64>, n_components_requested: usize, n_oversamples: usize, seed: Option<u64>, tol: Option<f64>) -> Result<Array2<f64>, Box<dyn Error>>

Fits the PCA model using a memory-efficient randomized SVD approach and returns the transformed principal component scores.

This method computes the mean of the input data, (sanitized) feature-wise scaling factors (standard deviations), and an approximate rotation matrix (principal components). It is specifically designed for computational efficiency and reduced memory footprint when working with large datasets, particularly those with a very large number of features, as it avoids forming the full covariance matrix.

The core of this method is a randomized SVD algorithm (based on Halko, Martinsson, Tropp, 2011) that constructs a low-rank approximation of the input data. It adaptively chooses its sketching strategy based on the dimensions of the input data matrix A (n_samples × n_features, after centering and scaling):

  • If n_features <= n_samples (data matrix is tall or square; D <= N, where D = n_features and N = n_samples): The algorithm directly sketches the input matrix A by forming Y = A @ Omega', where Omega' is a random Gaussian matrix of shape (n_features × l_sketch_components). An orthonormal basis Q_basis_prime for the range of Y is found (n_samples × l_sketch_components). The data is then projected onto this basis: B_projected_prime = Q_basis_prime.T @ A (l_sketch_components × n_features).

  • If n_features > n_samples (data matrix is wide, D > N): To handle a large number of features efficiently, the algorithm sketches the transpose A.T. It computes Y = A.T @ Omega (n_features × l_sketch_components), where Omega is a random Gaussian matrix (n_samples × l_sketch_components). An orthonormal basis Q_basis for the range of Y is found (n_features × l_sketch_components). The data is then projected: B_projected = Q_basis.T @ A.T = (A @ Q_basis).T (l_sketch_components × n_samples).

In both cases, a few power iterations are used to refine the orthonormal basis (Q_basis_prime or Q_basis) for improved accuracy by better capturing the dominant singular vectors of A.

An SVD is then performed on the smaller projected matrix (B_projected_prime or B_projected). The principal components (columns of the rotation matrix, stored in self.rotation) are derived from this SVD (from V.T in the D <= N case, or Q_basis @ U in the D > N case) and are normalized to unit length. The number of components kept can be influenced by the tol (tolerance) parameter, up to n_components_requested.
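
The matrix shapes in the two sketching strategies can be traced with a small bookkeeping function. This is only a shape calculator under the dimensions stated above. It performs no linear algebra, and the names SketchShapes and sketch_shapes are hypothetical.

```rust
/// Shapes (rows, cols) of the intermediate matrices in the sketching step.
#[derive(Debug, PartialEq)]
struct SketchShapes {
    omega: (usize, usize),
    y: (usize, usize),
    q_basis: (usize, usize),
    b_projected: (usize, usize),
}

/// Hypothetical helper: `l` is the sketch size (l_sketch_components).
fn sketch_shapes(n_samples: usize, n_features: usize, l: usize) -> SketchShapes {
    if n_features <= n_samples {
        // Tall or square: sketch A directly, Y = A @ Omega'.
        SketchShapes {
            omega: (n_features, l),
            y: (n_samples, l),
            q_basis: (n_samples, l),
            b_projected: (l, n_features),
        }
    } else {
        // Wide: sketch the transpose, Y = A.T @ Omega.
        SketchShapes {
            omega: (n_samples, l),
            y: (n_features, l),
            q_basis: (n_features, l),
            b_projected: (l, n_samples),
        }
    }
}
```

In both branches the matrix passed to the final SVD has l rows and min(n_samples, n_features) columns, so the cost of that SVD depends on the smaller dimension.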

The method stores the computed mean, scale (sanitized standard deviations), rotation matrix, and explained_variance within the PCA struct instance.

  • x_input_data - Input data as a 2D array with shape (n_samples, n_features). This matrix will be consumed and its data modified in place for mean centering and scaling.
  • n_components_requested - The target number of principal components to compute and keep. The actual number of components kept may be less if the data’s effective rank is lower or if tol filters out components.
  • n_oversamples - Number of additional random dimensions (p) to sample during the sketching phase, forming a sketch of size l = k + p (where k is n_components_requested). This helps improve the accuracy of the randomized SVD.
    • If 0, an adaptive default for p is used (typically 10% of n_components_requested, clamped between 5 and 20).
    • If positive, this value is used for p, but an internal minimum (e.g., 4) is enforced for robustness. Recommended values when specifying explicitly: 5-20.
  • seed - Optional u64 seed for the random number generator used in sketching, allowing for reproducible results. If None, a random seed is used.
  • tol - Optional tolerance, a float in (0.0, 1.0) when used for filtering. If Some(t_val), a component is kept only if its singular value s_i from the internal SVD of the projected sketch satisfies s_i > t_val * s_max, where s_max is the largest singular value from that SVD. The number of components kept is still at most n_components_requested. If None, no singular-value-ratio filtering is applied, and up to n_components_requested components (or the effective rank of the sketch, if lower) are kept.
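
The oversampling rule for n_oversamples can be written out as a short sketch. The function names are hypothetical, and the docs only say "typically 10%" and "e.g., 4", so two details are assumptions here: rounding the 10% figure down, and using exactly 4 as the minimum. Any cap on l from the matrix dimensions is not modeled.

```rust
/// Hypothetical reconstruction of the oversampling count p.
/// Assumptions: 10% is rounded down; the enforced minimum is exactly 4.
fn oversampling(n_components_requested: usize, n_oversamples: usize) -> usize {
    if n_oversamples == 0 {
        // Adaptive default: 10% of k, clamped to the range 5..=20.
        (n_components_requested / 10).clamp(5, 20)
    } else {
        // Explicit value, raised to the internal minimum if too small.
        n_oversamples.max(4)
    }
}

/// Sketch size l = k + p.
fn sketch_size(n_components_requested: usize, n_oversamples: usize) -> usize {
    n_components_requested + oversampling(n_components_requested, n_oversamples)
}
```

Under these assumptions, requesting 100 components with n_oversamples = 0 gives a sketch of 110 columns, while requesting 10 gives 15.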
§Returns

A Result containing:

  • Ok(Array2<f64>): The transformed data (principal component scores) of shape (n_samples, k_components_kept), where k_components_kept is the actual number of principal components retained after all filtering and rank considerations.
  • Err(Box<dyn Error>): If an error occurs during the process.
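
How k_components_kept follows from tol and n_components_requested can be sketched on a plain slice of singular values. The helper components_kept is hypothetical and assumes the values are sorted from largest to smallest. It models only the tolerance rule and the requested-count cap, not effective-rank detection.

```rust
/// Hypothetical helper: counts components that survive the documented filter
/// s_i > tol * s_max, capped at n_components_requested.
/// Assumes `singular_values` is sorted in descending order.
fn components_kept(
    singular_values: &[f64],
    n_components_requested: usize,
    tol: Option<f64>,
) -> usize {
    let available = singular_values.len().min(n_components_requested);
    match tol {
        // No tolerance: keep everything up to the requested count.
        None => available,
        Some(t) => {
            let s_max = singular_values.iter().cloned().fold(0.0_f64, f64::max);
            singular_values
                .iter()
                .take(available)
                .filter(|&&s| s > t * s_max)
                .count()
        }
    }
}
```

With singular values 10, 5, 0.5, 0.01 and tol = 0.1, the threshold is 1.0, so only the first two components are kept even if three were requested.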
§Errors

Returns an error if:

  • The input matrix x_input_data has zero samples or zero features.
  • The number of samples n_samples is less than 2.
  • n_components_requested is 0.
  • Internal matrix operations (like QR decomposition or SVD) fail.
  • Random number generation fails.

pub fn transform(&self, x: Array2<f64>) -> Result<Array2<f64>, Box<dyn Error>>

Applies the PCA transformation to the given data.

The data is centered and scaled using the mean and scale factors learned during fitting (or loaded into the model), and then projected onto the principal components.

  • x - Input data to transform, shape (m_samples, d_features). Can be a single sample (1 row) or multiple samples. This matrix is modified in place.
§Errors

Returns an error if the PCA model is not fitted/loaded (i.e., missing mean, scale, or rotation components), or if the input data’s feature dimension does not match the model’s feature dimension.
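
The center, scale, and project steps can be sketched for a single sample using plain vectors instead of ndarray types. The function transform_row is a hypothetical illustration, not the crate's implementation. It takes the rotation as d_features rows of k_components values each and assumes every row has the same length.

```rust
/// Hypothetical single-sample version of the transform:
/// scores[j] = sum_i ((x[i] - mean[i]) / scale[i]) * rotation[i][j].
fn transform_row(
    x: &[f64],
    mean: &[f64],
    scale: &[f64],
    rotation: &[Vec<f64>],
) -> Result<Vec<f64>, String> {
    let d = mean.len();
    if x.len() != d || scale.len() != d || rotation.len() != d {
        return Err("feature dimension mismatch".to_string());
    }
    // Number of components is the width of the rotation matrix.
    let k = rotation.first().map_or(0, |row| row.len());
    let mut scores = vec![0.0; k];
    for i in 0..d {
        // Center and scale feature i, then accumulate its contribution.
        let z = (x[i] - mean[i]) / scale[i];
        for j in 0..k {
            scores[j] += z * rotation[i][j];
        }
    }
    Ok(scores)
}
```

Because the stored scale factors are sanitized to be strictly positive, the division is always well defined for a fitted or loaded model.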

pub fn save_model<P: AsRef<Path>>(&self, path: P) -> Result<(), Box<dyn Error>>

Saves the current PCA model to a file using bincode.

The model must contain rotation, mean, and scale components for saving. The explained_variance field can be None (e.g., if the model was created via with_model and eigenvalues were not supplied).

  • path - The file path to save the model to.
§Errors

Returns an error if essential model components (rotation, mean, scale) are missing, or if file I/O or serialization fails.

pub fn load_model<P: AsRef<Path>>(path: P) -> Result<Self, Box<dyn Error>>

Loads a PCA model from a file previously saved with save_model.

  • path - The file path to load the model from.
§Errors

Returns an error if file I/O or deserialization fails, or if the loaded model is found to be incomplete, internally inconsistent (e.g., mismatched dimensions), or contains non-positive scale factors.
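
The consistency checks mentioned above might look like the following. This is a sketch under assumptions, not the crate's validation code: the helper validate_model is hypothetical, it looks only at lengths and scale values, and its rejection of non-finite scale factors is an added assumption beyond the documented "non-positive" rule.

```rust
/// Hypothetical post-load validation: feature dimensions must agree and
/// every scale factor must be positive (finiteness check is an assumption).
fn validate_model(rotation_rows: usize, mean_len: usize, scale: &[f64]) -> Result<(), String> {
    if rotation_rows != mean_len || scale.len() != mean_len {
        return Err("mismatched feature dimensions".to_string());
    }
    if scale.iter().any(|&s| !s.is_finite() || s <= 0.0) {
        return Err("scale factors must be positive".to_string());
    }
    Ok(())
}
```

Validating at load time means a corrupted or hand-edited file is rejected up front rather than producing wrong results in a later transform.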

Trait Implementations§

impl Debug for PCA

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for PCA

fn default() -> Self

Returns the “default value” for a type.

impl<'de> Deserialize<'de> for PCA

fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where __D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer.

impl Serialize for PCA

fn serialize<__S>(&self, __serializer: __S) -> Result<__S::Ok, __S::Error>
where __S: Serializer,

Serialize this value into the given Serde serializer.

Auto Trait Implementations§

impl Freeze for PCA

impl RefUnwindSafe for PCA

impl Send for PCA

impl Sync for PCA

impl Unpin for PCA

impl UnwindSafe for PCA

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,