# Principal component analysis (PCA)
Forked from https://github.com/ekg/pca, Erik Garrison's original implementation.
Consider using this library if you have more features than samples.
A Rust library providing **Principal Component Analysis (PCA)** functionality using either:
1. A **covariance-based** eigen-decomposition (classical PCA): typically faster, but less memory-efficient.
2. A **randomized SVD** approach for large-scale or high-dimensional data: can be slower, but more memory-efficient.
This library supports:
- Mean-centering and scaling of input data.
- Automatic selection of PCA components via a user-defined tolerance or a fixed count.
- Both "standard" PCA for moderate dimensions and a "randomized SVD" routine for very large matrices.
[The PCA is obtained via SVD](https://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca).
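Briefly: for a centered (and optionally scaled) data matrix `X` with `n` samples, the SVD yields the principal axes and component scores directly:

```latex
X = U \Sigma V^{\top}
% Columns of V are the principal axes (the rotation matrix),
% the PCA scores are X V = U \Sigma, and the eigenvalues of the
% covariance matrix X^{\top} X / (n - 1) are \lambda_i = \sigma_i^2 / (n - 1).
```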
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
efficient_pca = "0.1.3"
```
Or just `cargo add efficient_pca` to get the latest version.
## Features
- **Standard PCA** via eigen-decomposition of the covariance matrix:
- Suitable for data where the number of features is not prohibitively large compared to samples.
- Automatically handles scaling and centering of your data.
- Allows a tolerance-based cutoff for the number of principal components.
- **Randomized PCA** (`rfit`) via randomized SVD:
- Efficient for high-dimensional or very large datasets.
- Randomized SVD approximates the leading principal components without forming the full covariance matrix, keeping memory usage manageable even when the dimensionality is very large.
- Allows specifying the number of components and an oversampling parameter to improve accuracy.
- **Flexible Tolerances**:
- You can specify a fraction of the largest eigenvalue/singular value as a threshold to keep or reject components (see the sketch after this list).
- **Easy Transformation**:
- Once fitted, the same PCA instance can be used to transform new data into the principal-component space.
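As a minimal illustration of the tolerance rule (a hypothetical standalone helper, not this crate's API):

```rust
// Hypothetical helper illustrating tolerance-based selection:
// keep components whose eigenvalue is at least `tol` times the largest.
fn components_to_keep(eigenvalues: &[f64], tol: f64) -> usize {
    let max = eigenvalues.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    eigenvalues.iter().filter(|&&ev| ev >= tol * max).count()
}

fn main() {
    let eigenvalues = [4.0, 1.0, 0.002];
    // With tol = 0.01 the cutoff is 0.01 * 4.0 = 0.04, so the
    // 0.002 component is dropped and two components are kept.
    assert_eq!(components_to_keep(&eigenvalues, 0.01), 2);
}
```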
## Usage
### Classical PCA (`fit`)
```rust
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    // Suppose we have some 2D dataset (n_samples x n_features)
    let data = array![
        [1.0, 2.0],
        [3.0, 4.0],
        [5.0, 6.0]
    ];

    // Create a new PCA instance
    let mut pca = PCA::new();

    // Fit the model to the data using the classical covariance-based approach.
    // (Optionally provide a tolerance, e.g., Some(0.01) to drop small components.)
    pca.fit(data.clone(), None).unwrap();

    // Transform the data into the new PCA space
    let transformed = pca.transform(data).unwrap();
    println!("PCA-transformed data:\n{:?}", transformed);
}
```
### Randomized SVD PCA (`rfit`)
```rust
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    // Larger data matrix example (here just 2x2 for illustration)
    let data = array![
        [1.0, 2.0],
        [3.0, 4.0]
    ];

    // Set up PCA
    let mut pca = PCA::new();

    // Use `rfit` to perform randomized SVD:
    // - n_components: number of principal components you want
    // - n_oversamples: oversampling dimension
    // - seed: optional, for reproducibility
    // - tol: optional variance-based cutoff
    pca.rfit(data.clone(), 2, 10, Some(42), None).unwrap();

    // Transform into PCA space
    let transformed = pca.transform(data).unwrap();
    println!("Randomized PCA result:\n{:?}", transformed);
}
```
### Transforming Data
Once the PCA is fitted (by either `fit` or `rfit`), you can transform new incoming data.
The PCA object internally stores the mean, scaling, and rotation matrix.
```rust
use ndarray::array;
use efficient_pca::PCA;

fn main() {
    let data_train = array![
        [1.0, 2.0],
        [3.0, 4.0],
    ];
    let data_test = array![
        [2.0, 3.0],
        [4.0, 5.0],
    ];

    let mut pca = PCA::new();
    pca.fit(data_train.clone(), Some(1e-3)).unwrap();

    // Transform both training data and new data
    let train_pcs = pca.transform(data_train).unwrap();
    let test_pcs = pca.transform(data_test).unwrap();
    println!("Train set in PCA space:\n{:?}", train_pcs);
    println!("Test set in PCA space:\n{:?}", test_pcs);
}
```
## API Overview
### `PCA::new()`
Creates a new, empty `PCA` struct. You must fit it with either `fit` or `rfit` before calling `transform`.
```rust
use efficient_pca::PCA;

let pca = PCA::new();
```
### `PCA::fit(...)`
Fits PCA using the **covariance eigen-decomposition** approach.
- **Parameters**:
- `data_matrix`: Your data, shape (n_samples, n_features).
- `tolerance`: If `Some(tol)`, discard all components whose eigenvalue is below `tol * max_eigenvalue`; otherwise, keep all components.
- **Returns**: `Result<(), Box<dyn Error>>`; on success, the fitted rotation, mean, and scale are stored internally.
```rust
use ndarray::array;
use efficient_pca::PCA;

let data = array![[1.0, 2.0],
                  [3.0, 4.0]];
let mut pca = PCA::new();
pca.fit(data, Some(0.01)).unwrap();
```
### `PCA::rfit(...)`
Fits PCA using **randomized SVD**.
- **Parameters**:
- `x`: The input data, shape (n_samples, n_features).
- `n_components`: Number of components to keep (upper bound).
- `n_oversamples`: Oversampling dimension for randomized SVD.
- `seed`: Optional RNG seed for reproducibility.
- `tol`: Optional fraction of the largest singular value used to drop components.
- **Returns**: Same as `fit`, but uses a different internal approach optimized for large dimensions.
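For intuition, here is a schematic of the standard randomized SVD recipe (after Halko, Martinsson & Tropp); the library's exact internals may differ:

```latex
% Given X (n x p), target rank k, oversampling l:
\Omega \sim \mathcal{N}(0, 1)^{p \times (k + l)}        % random test matrix
Y = X \Omega, \quad Y = Q R                             % thin QR: orthonormal basis Q for the range of Y
B = Q^{\top} X, \quad B = \tilde{U} \Sigma V^{\top}     % small SVD of the (k + l) x p matrix B
U \approx Q \tilde{U}                                   % approximate left singular vectors of X
```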
```rust
use ndarray::array;
use efficient_pca::PCA;

let data = array![[1.0, 2.0],
                  [3.0, 4.0]];
let mut pca = PCA::new();
pca.rfit(data, 10, 5, Some(42_u64), Some(0.01)).unwrap();
```
### `PCA::transform(...)`
Transforms data using the previously fitted PCA’s rotation, mean, and scale.
- **Parameters**:
- `x`: A data matrix with the same number of features as the training data.
- **Returns**: The matrix of shape (n_samples, n_components) in principal-component space.
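Conceptually, writing `mu` for the stored per-feature means, `s` for the per-feature scales, and `R` for the rotation matrix, the transformation is (with the division applied elementwise per feature):

```latex
\mathrm{transform}(X) = \left( (X - \mu) / s \right) R
```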
```rust
use ndarray::array;
use efficient_pca::PCA;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = array![[1.0, 2.0],
                      [3.0, 4.0]];
    let mut pca = PCA::new();
    pca.fit(data.clone(), None).unwrap();
    let projected = pca.transform(data)?;
    println!("{:?}", projected);
    Ok(())
}
```
## Performance Considerations
**Standard PCA** (`fit`):
- Computes the covariance matrix `(p x p)` if `p <= n`, else a Gram matrix `(n x n)` (see the note after this list).
- Faster for smaller feature counts but can be expensive for extremely large `p`.
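The Gram-matrix fallback works because `X Xᵀ` (n x n) and `Xᵀ X` (p x p) share their nonzero eigenvalues:

```latex
X X^{\top} u = \lambda u
\;\Longrightarrow\;
X^{\top} X \, (X^{\top} u) = \lambda \, (X^{\top} u)
% so an (n x n) eigen-decomposition recovers the principal axes when p > n.
```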
**Randomized SVD** (`rfit`):
- Best for very high dimensional datasets.
- You can tune `n_oversamples` to reduce approximation errors (at the cost of more computation).
- Internally, it computes a smaller SVD on a projected matrix, offering a large speed-up for big matrices, whether tall-and-skinny or short-and-wide.
**Memory Usage**:
- The library copies data in some places to center and scale, and it creates temporary matrices for covariance or Gram decompositions.
- **Use `rfit`** for very large, high-dimensional (wide) datasets where memory efficiency is crucial, even if that sometimes means a trade-off in runtime.
## Authors
- Erik Garrison <erik.garrison@gmail.com>. See original repository: https://github.com/ekg/pca
- SauersML
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.