pub struct PCA { /* private fields */ }
Principal component analysis (PCA) structure.
This struct holds the results of a PCA (mean, scale, and rotation matrix) and can be used to transform data into the principal component space. It supports both exact PCA computation and a faster, approximate randomized PCA. Models can also be loaded from/saved to files.
Implementations§
impl PCA
pub fn new() -> Self
Creates a new, empty PCA struct.
The PCA model is not fitted and needs to be computed using fit or rfit, or loaded using load_model or with_model.
§Examples
use efficient_pca::PCA; // Assuming efficient_pca is your crate name
let pca = PCA::new();
pub fn with_model( rotation: Array2<f64>, mean: Array1<f64>, raw_standard_deviations: Array1<f64>, ) -> Result<Self, Box<dyn Error>>
Creates a new PCA instance from a pre-computed model.
This is useful for loading a PCA model whose components (rotation matrix, mean, and original standard deviations) were computed externally or previously. The library will sanitize the provided standard deviations for consistent scaling.
- rotation: The rotation matrix (principal components), shape (d_features, k_components).
- mean: The mean vector of the original data used to compute the PCA, shape (d_features).
- raw_standard_deviations: The raw standard deviation vector of the original data, shape (d_features). Values that are not strictly positive (i.e., s <= 1e-9, including zero and negative values), or are non-finite, will be sanitized to 1.0 before being stored. Note that with_model itself rejects non-finite values before sanitization (see Errors). If the original PCA did not involve scaling (e.g., data was already standardized, or only centering was desired), pass a vector of ones.
§Errors
Returns an error if feature dimensions are inconsistent or if raw_standard_deviations contains non-finite values (this check is performed before sanitization).
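The validation and sanitization rules described above can be sketched in plain Rust. This is only an illustration under stated assumptions, not the crate's actual code: the helper name sanitize_std_devs and the String error type are hypothetical, and slices stand in for Array1<f64>.

```rust
/// Hypothetical helper (not part of efficient_pca): validate first, then sanitize.
fn sanitize_std_devs(raw: &[f64]) -> Result<Vec<f64>, String> {
    // Reject NaN and infinities up front, mirroring the documented error check
    // that runs before sanitization.
    if let Some(pos) = raw.iter().position(|s| !s.is_finite()) {
        return Err(format!("non-finite standard deviation at index {}", pos));
    }
    // Any value at or below the documented 1e-9 threshold (zero and negatives
    // included) is replaced by 1.0, so later division by the scale is safe.
    Ok(raw
        .iter()
        .map(|&s| if s > 1e-9 { s } else { 1.0 })
        .collect())
}
```

With this sketch, an input of [2.0, 0.0, -1.0, 1e-12] keeps only the first value and maps the rest to 1.0, while any NaN or infinite entry produces an error.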
pub fn rotation(&self) -> Option<&Array2<f64>>
Returns a reference to the rotation matrix (principal components), if computed.
The rotation matrix has dimensions (n_features, k_components).
Returns None if the PCA model has not been fitted, or if the rotation matrix is not available (e.g., if fitting resulted in zero components).
pub fn explained_variance(&self) -> Option<&Array1<f64>>
Returns a reference to the explained variance for each principal component.
These are the eigenvalues of the covariance matrix of the scaled data,
ordered from largest to smallest.
Returns None if the PCA model has not been fitted or if variances are not available.
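A common follow-up is to turn these eigenvalues into per-component proportions. The sketch below is a hypothetical helper, not something the crate is documented to provide, and it takes a plain slice rather than the Array1<f64> returned by explained_variance. The ratios are relative to the retained components only, so they overstate the share of total variance whenever fitting dropped components.

```rust
/// Hypothetical helper: share of the retained variance carried by each component.
fn explained_variance_ratio(eigenvalues: &[f64]) -> Vec<f64> {
    let total: f64 = eigenvalues.iter().sum();
    // Guard against an empty or all-zero spectrum to avoid dividing by zero.
    if total <= 0.0 {
        return vec![0.0; eigenvalues.len()];
    }
    eigenvalues.iter().map(|&ev| ev / total).collect()
}
```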
pub fn fit( &mut self, data_matrix: Array2<f64>, tolerance: Option<f64>, ) -> Result<(), Box<dyn Error>>
Fits the PCA model to the data using an exact covariance/Gram matrix approach.
This method computes the mean, (sanitized) scaling factors, and principal axes (rotation) via an eigen-decomposition of the covariance matrix (if n_features <= n_samples) or the Gram matrix (if n_features > n_samples, the “Gram trick”). The resulting principal components (columns of the rotation matrix) are normalized to unit length.
Note: For very large datasets, rfit is generally recommended for better performance.
- data_matrix: Input data as a 2D array, shape (n_samples, n_features).
- tolerance: Optional tolerance for excluding low-variance components (fraction of the largest eigenvalue). If None, all components up to the effective rank of the matrix are kept.
§Errors
Returns an error if the input matrix has zero dimensions, fewer than 2 samples, or if matrix operations (like eigen-decomposition) fail.
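To make the Gram trick concrete: if u is a unit eigenvector of the n_samples × n_samples Gram matrix A·Aᵀ with a nonzero eigenvalue, then Aᵀu, normalized to unit length, is an eigenvector of the n_features × n_features matrix AᵀA with the same eigenvalue. The sketch below shows only that mapping step. It assumes A is already centered and scaled and that u comes from some eigen-solver; the function name and the Vec-based matrix layout are illustrative, not the crate's internals.

```rust
/// Hypothetical helper: map a Gram-matrix eigenvector `u` (length n_samples)
/// to a unit-length principal axis (length n_features).
fn gram_eigvec_to_axis(a: &[Vec<f64>], u: &[f64]) -> Vec<f64> {
    let n_features = a.first().map_or(0, |row| row.len());
    // v = Aᵀ u, accumulated one sample row at a time.
    let mut v = vec![0.0; n_features];
    for (row, &weight) in a.iter().zip(u) {
        for (v_j, &a_ij) in v.iter_mut().zip(row) {
            *v_j += weight * a_ij;
        }
    }
    // Normalize to unit length; a zero vector is returned unchanged.
    let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm > 0.0 {
        for v_j in v.iter_mut() {
            *v_j /= norm;
        }
    }
    v
}
```

For two centered samples (3, 0, 4) and (-3, 0, -4), the Gram eigenvector (1, -1)/√2 maps to the axis (0.6, 0, 0.8), which is the direction along which the two samples differ.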
pub fn rfit( &mut self, x_input_data: Array2<f64>, n_components_requested: usize, n_oversamples: usize, seed: Option<u64>, tol: Option<f64>, ) -> Result<Array2<f64>, Box<dyn Error>>
Fits the PCA model using a memory-efficient randomized SVD approach and returns the transformed principal component scores.
This method computes the mean of the input data, (sanitized) feature-wise scaling factors (standard deviations), and an approximate rotation matrix (principal components). It is specifically designed for computational efficiency and reduced memory footprint when working with large datasets, particularly those with a very large number of features, as it avoids forming the full covariance matrix.
The core of this method is a randomized SVD algorithm (based on Halko, Martinsson, Tropp, 2011)
that constructs a low-rank approximation of the input data. It adaptively chooses its
sketching strategy based on the dimensions of the input data matrix A
(n_samples × n_features, after centering and scaling):
- If n_features <= n_samples (data matrix is tall or square, D <= N): The algorithm directly sketches the input matrix A by forming Y = A @ Omega', where Omega' is a random Gaussian matrix of shape (n_features × l_sketch_components). An orthonormal basis Q_basis_prime for the range of Y is found (N × l_sketch_components). The data is then projected onto this basis: B_projected_prime = Q_basis_prime.T @ A (l_sketch_components × n_features).
- If n_features > n_samples (data matrix is wide, D > N): To handle a large number of features efficiently, the algorithm sketches the transpose A.T. It computes Y = A.T @ Omega (n_features × l_sketch_components), where Omega is a random Gaussian matrix (n_samples × l_sketch_components). An orthonormal basis Q_basis for the range of Y is found (n_features × l_sketch_components). The data is then projected: B_projected = Q_basis.T @ A.T = (A @ Q_basis).T (l_sketch_components × n_samples).
In both cases, a few power iterations are used to refine the orthonormal basis (Q_basis_prime or Q_basis) for improved accuracy by better capturing the dominant singular vectors of A.
An SVD is then performed on the smaller projected matrix (B_projected_prime or B_projected). The principal components (columns of the rotation matrix, stored in self.rotation) are derived from this SVD (from V.T in the D <= N case, or Q_basis @ U in the D > N case) and are normalized to unit length.
The number of components kept can be influenced by the tol (tolerance) parameter, up to n_components_requested.
The method stores the computed mean, scale (sanitized standard deviations), rotation matrix, and explained_variance within the PCA struct instance.
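The reason the principal components come from V.T in one case and from Q_basis @ U in the other follows from the standard randomized-SVD identities. Below, Q' and B' abbreviate Q_basis_prime and B_projected_prime, Q and B abbreviate Q_basis and B_projected, and W (the right singular vectors of B in the wide case) is a symbol introduced here for illustration only.

```latex
\begin{aligned}
D \le N:&\quad A \approx Q' Q'^{\top} A = Q' B', \qquad B' = U' \Sigma V^{\top}
  \;\Longrightarrow\; A \approx (Q' U')\, \Sigma\, V^{\top}, \\
D > N:&\quad A^{\top} \approx Q Q^{\top} A^{\top} = Q B, \qquad B = U \Sigma W^{\top}
  \;\Longrightarrow\; A^{\top} \approx (Q U)\, \Sigma\, W^{\top}.
\end{aligned}
```

In the first line the approximate right singular vectors of A are the columns of V, that is, the rows of V.T. In the second line, transposing gives A ≈ W Σ (Q U)ᵀ, so the approximate right singular vectors of A are the columns of Q U.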
- x_input_data: Input data as a 2D array with shape (n_samples, n_features). This matrix will be consumed and its data modified in place for mean centering and scaling.
- n_components_requested: The target number of principal components to compute and keep. The actual number of components kept may be less if the data’s effective rank is lower or if tol filters out components.
- n_oversamples: Number of additional random dimensions (p) to sample during the sketching phase, forming a sketch of size l = k + p (where k is n_components_requested). This helps improve the accuracy of the randomized SVD.
  - If 0, an adaptive default for p is used (typically 10% of n_components_requested, clamped between 5 and 20).
  - If positive, this value is used for p, but an internal minimum (e.g., 4) is enforced for robustness. Recommended values when specifying explicitly: 5-20.
- seed: Optional u64 seed for the random number generator used in sketching, allowing for reproducible results. If None, a random seed is used.
- tol: Optional tolerance (a float between 0.0 and 1.0, exclusive of 0.0 if used for filtering). If Some(t_val), components are kept if their corresponding singular value s_i from the internal SVD of the projected sketch satisfies s_i > t_val * s_max, where s_max is the largest singular value from that SVD. The number of components kept will be at most n_components_requested. If None, tolerance-based filtering based on singular value ratios is skipped, and up to n_components_requested components (or the effective rank of the sketch) are kept.
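The n_oversamples and tol rules above can be sketched as two small functions. Both are hypothetical helpers written for illustration. The rounding used for "10%" and the exact minimum of 4 are assumptions taken from the hedged wording above ("typically", "e.g."), and the singular values are assumed to be sorted from largest to smallest.

```rust
/// Hypothetical helper: resolve the oversampling parameter p for k components.
fn resolve_oversamples(k: usize, n_oversamples: usize) -> usize {
    if n_oversamples == 0 {
        // Roughly 10% of k (rounded up here, an assumption), clamped to [5, 20].
        ((k + 9) / 10).clamp(5, 20)
    } else {
        // An explicit value is honored, subject to an assumed minimum of 4.
        n_oversamples.max(4)
    }
}

/// Hypothetical helper: how many components survive the cap and the tolerance.
/// Expects `singular_values` in descending order.
fn components_kept(singular_values: &[f64], k: usize, tol: Option<f64>) -> usize {
    // Never keep more than requested, nor more than the sketch provides.
    let cap = k.min(singular_values.len());
    match tol {
        None => cap,
        Some(t) => {
            let s_max = singular_values.first().copied().unwrap_or(0.0);
            singular_values
                .iter()
                .take(cap)
                .filter(|&&s| s > t * s_max)
                .count()
        }
    }
}
```

For example, with singular values [10.0, 5.0, 0.5, 0.01], k = 3 and tol = Some(0.1), the threshold is 1.0 and two components are kept; with tol = None, three are kept.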
§Returns
A Result containing:
- Ok(Array2<f64>): The transformed data (principal component scores) of shape (n_samples, k_components_kept), where k_components_kept is the actual number of principal components retained after all filtering and rank considerations.
- Err(Box<dyn Error>): If an error occurs during the process.
§Errors
Returns an error if:
- The input matrix x_input_data has zero samples or zero features.
- The number of samples n_samples is less than 2.
- n_components_requested is 0.
- Internal matrix operations (like QR decomposition or SVD) fail.
- Random number generation fails.
pub fn transform(&self, x: Array2<f64>) -> Result<Array2<f64>, Box<dyn Error>>
Applies the PCA transformation to the given data.
The data is centered and scaled using the mean and scale factors learned during fitting (or loaded into the model), and then projected onto the principal components.
- x: Input data to transform, shape (m_samples, d_features). Can be a single sample (1 row) or multiple samples. This matrix is modified in place.
§Errors
Returns an error if the PCA model is not fitted/loaded (i.e., missing mean, scale, or rotation components), or if the input data’s feature dimension does not match the model’s feature dimension.
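The center, scale, and project steps described for transform can be written out without any linear-algebra crate. The following is a minimal sketch under stated assumptions, not the crate's implementation: the function name transform_rows, the Vec<Vec<f64>> layout, and the String error type are all hypothetical, and rotation is stored as d_features rows of k_components values.

```rust
/// Hypothetical helper: center, scale, and project each row of `x`.
fn transform_rows(
    x: &[Vec<f64>],
    mean: &[f64],
    scale: &[f64],
    rotation: &[Vec<f64>], // shape (d_features, k_components)
) -> Result<Vec<Vec<f64>>, String> {
    let d = mean.len();
    if scale.len() != d || rotation.len() != d {
        return Err("model components have inconsistent dimensions".to_string());
    }
    let k = rotation.first().map_or(0, |r| r.len());
    let mut scores = Vec::with_capacity(x.len());
    for row in x {
        // The feature dimension must match the model, as the Errors section notes.
        if row.len() != d {
            return Err(format!("expected {} features, got {}", d, row.len()));
        }
        let mut out = vec![0.0; k];
        for j in 0..d {
            // Standardize feature j, then add its contribution to every score.
            let z = (row[j] - mean[j]) / scale[j];
            for (o, &r) in out.iter_mut().zip(&rotation[j]) {
                *o += z * r;
            }
        }
        scores.push(out);
    }
    Ok(scores)
}
```

With mean (1, 2), scale (2, 1), and a rotation that swaps the two axes, the sample (3, 5) standardizes to (1, 3) and projects to (3, 1).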
pub fn save_model<P: AsRef<Path>>(&self, path: P) -> Result<(), Box<dyn Error>>
Saves the current PCA model to a file using bincode.
The model must contain rotation, mean, and scale components for saving.
The explained_variance field can be None (e.g., if the model was created via with_model and eigenvalues were not supplied).
- path: The file path to save the model to.
§Errors
Returns an error if essential model components (rotation, mean, scale) are missing, or if file I/O or serialization fails.
pub fn load_model<P: AsRef<Path>>(path: P) -> Result<Self, Box<dyn Error>>
Loads a PCA model from a file previously saved with save_model.
- path: The file path to load the model from.
§Errors
Returns an error if file I/O or deserialization fails, or if the loaded model is found to be incomplete, internally inconsistent (e.g., mismatched dimensions), or contains non-positive scale factors.