efficient_pca 0.1.8

# Efficient Principal Component Analysis in Rust

This Rust library provides Principal Component Analysis (PCA), both exact and fast approximate methods. It is a modified version of the original work by Erik Garrison. Forked from https://github.com/ekg/pca.

---
## Core Features

* ✨ **Exact PCA (`fit`)**: Computes principal components via eigen-decomposition of the covariance matrix. For datasets where the number of features is greater than the number of samples (`n_features > n_samples`), it uses the Gram matrix method (the "Gram trick"). Allows for component selection based on an eigenvalue tolerance.
* ⚡ **Randomized PCA (`rfit`)**: Employs a memory-efficient randomized SVD algorithm to approximate principal components.
* 🛡️ **Data Handling**:
    * Input data is automatically mean-centered.
    * Feature scaling is applied using standard deviations. Scale factors are sanitized to always be positive.
    * Computed principal component vectors are normalized to unit length.
* 💾 **Model Persistence**: Fitted PCA models (including mean, scale factors, and rotation matrix) can be saved to and loaded from files using `bincode` serialization.
* 🔄 **Data Transformation**: Once a PCA model is fitted or loaded, it can be used to transform new data into the principal component space. This transformation also applies the learned centering and scaling.

---
## Installation

Add `efficient_pca` to your `Cargo.toml` dependencies.

```sh
cargo add efficient_pca
```

### Linear Algebra Backend Selection

This crate supports multiple linear algebra backends for optimal performance across different platforms:

#### OpenBLAS Backends
- **`backend_openblas`** (default): Uses statically linked OpenBLAS
- **`backend_openblas_system`**: Uses system/dynamically linked OpenBLAS
  - **Recommended for macOS** where static linking can be problematic
  - Requires OpenBLAS to be installed on your system

#### Intel MKL Backends  
- **`backend_mkl`**: Uses statically linked Intel MKL
- **`backend_mkl_system`**: Uses system/dynamically linked Intel MKL

#### Alternative Backend
- **`backend_faer`**: Uses the pure Rust `faer` linear algebra library

#### Usage Examples

For macOS users or when experiencing static linking issues:
```toml
[dependencies]
efficient_pca = { version = "*", default-features = false, features = ["backend_openblas_system"] }
```

For Intel systems with MKL installed:
```toml
[dependencies]
efficient_pca = { version = "*", default-features = false, features = ["backend_mkl_system"] }
```

For pure Rust environments:
```toml
[dependencies]
efficient_pca = { version = "*", default-features = false, features = ["backend_faer"] }
```

---
## API Overview

### `PCA::new()`
Creates a new, empty `PCA` struct. The model is not fitted and needs to be computed using `fit` or `rfit`, or loaded.

### `PCA::with_model(rotation, mean, raw_standard_deviations)`
Creates a `PCA` instance from pre-computed components (rotation matrix, mean vector, and raw standard deviations).
* `raw_standard_deviations`: Input standard deviations for each feature. Values `s` that are non-finite or where `s <= 1e-9` are sanitized to `1.0` before being stored. This makes sure the internal scale factors are always positive and finite. An error is returned if input `raw_standard_deviations` initially contains non-finite values.

### `PCA::fit(&mut self, data_matrix, tolerance)`
Computes the PCA model using an exact method.
* `data_matrix`: The input data (`n_samples` x `n_features`).
* `tolerance`: Optional `f64`. If `Some(tol_val)`, principal components corresponding to eigenvalues less than `tol_val * max_eigenvalue` are discarded. `tol_val` is clamped to `[0.0, 1.0]`. If `None`, all components up to the matrix rank are retained.

### `PCA::rfit(&mut self, x_input_data, n_components_requested, n_oversamples, seed, tol)`
Computes an approximate PCA model using a memory-efficient randomized SVD algorithm and returns the transformed principal component scores for `x_input_data`.
* `x_input_data`: The input data (`n_samples` x `n_features`). This matrix is modified in place for centering and scaling.
* `n_components_requested`: The target number of principal components to compute and keep.
* `n_oversamples`: Number of additional random dimensions (`p`) to sample for the sketch (`l = k + p`).
    * If `0`, an adaptive default for `p` is used (typically 10% of `n_components_requested`, clamped between 5 and 20).
    * If positive, this value is used, but an internal minimum is enforced for robustness. Recommended explicit values: 5-20.
* `seed`: Optional `u64` for the random number generator.
* `tol`: Optional `f64` (typically between 0.0 and 1.0, exclusive). If `Some(t_val)` where `0.0 < t_val < 1.0`, components are further filtered if their corresponding singular value `s_i` from the internal SVD of the projected sketch satisfies `s_i <= t_val * s_max`.

### `PCA::transform(&self, x)`
Applies the learned PCA transformation (centering, scaling, and rotation) to new data `x`.
* `x`: Input data to transform (`m_samples` x `d_features`). This matrix is modified in place during centering and scaling.

### `PCA::rotation(&self)`
Returns an `Option<&Array2<f64>>` to the rotation matrix (principal components), if computed. Shape: (`n_features`, `k_components`).

### `PCA::explained_variance(&self)`
Returns an `Option<&Array1<f64>>` to the explained variance for each principal component, if computed.

### `PCA::save_model(&self, path)`
Saves the current PCA model (rotation, mean, scale, and optionally explained_variance) to the specified file path using `bincode` serialization.

### `PCA::load_model(path)`
Loads a PCA model from a file previously saved with `save_model`. The loaded model is validated for completeness and internal consistency (e.g., matching dimensions, positive scale factors).

---
## Performance Considerations

* **`fit()`**: Provides exact PCA. It's generally suitable for datasets where the smaller dimension (either samples or features) is not excessively large, allowing for direct eigen-decomposition. It automatically uses the Gram matrix optimization if `n_features > n_samples`.
* **`rfit()`**: A significant speed-up and reduced memory footprint for very large or high-dimensional datasets where an approximation of PCA is acceptable. The accuracy is typically good.

---
## Authors and Acknowledgements

* This library is a fork and modification of the original `pca` crate by Erik Garrison (original repository: <https://github.com/ekg/pca>).
* Extended by SauersML.

---
## EigenSNP: Large-Scale Genomic PCA

EigenSNP is a sophisticated Principal Component Analysis algorithm specifically designed for large-scale genomic datasets, such as those found in biobanks or population-scale studies. It efficiently handles datasets where the number of SNPs (features) significantly exceeds the number of samples.

### Key Features

* 🧬 **Genomic-Optimized**: Designed for SNP data with linkage disequilibrium (LD) block structure
* ⚡ **Scalable**: Uses randomized SVD (RSVD) and memory-efficient f32 precision for large datasets  
* 🔧 **Highly Configurable**: Extensive tuning parameters for different dataset characteristics
* 📊 **Multi-Stage Algorithm**: Local eigenSNP basis learning → condensed features → global PCA → refinement
* 🔍 **Diagnostics**: Optional detailed diagnostics for algorithm analysis (with feature flag)
* 💾 **Output Options**: Can save local PC loadings per LD block to TSV files

### Algorithm Overview

EigenSNP processes genomic data through several stages:

1. **Local Basis Learning**: For each LD block, learns local eigenSNP basis vectors using RSVD on a subset of samples
2. **Condensed Feature Construction**: Projects all samples onto local bases to create a condensed feature matrix
3. **Feature Standardization**: Standardizes the condensed features (zero mean, unit variance)
4. **Global PCA**: Applies RSVD to the standardized condensed features to get initial PC scores
5. **Refinement**: Iteratively refines SNP loadings and sample scores through orthogonalization and SVD

### Core Types

#### `EigenSNPCoreAlgorithm`
The main orchestrator that executes the EigenSNP workflow.

```rust
use efficient_pca::eigensnp::{EigenSNPCoreAlgorithm, EigenSNPCoreAlgorithmConfig};

let config = EigenSNPCoreAlgorithmConfig {
    target_num_global_pcs: 10,
    components_per_ld_block: 5,
    ..Default::default()
};
let algorithm = EigenSNPCoreAlgorithm::new(config);
```

#### `EigenSNPCoreAlgorithmConfig`
Configuration parameters for fine-tuning the algorithm:

```rust
use efficient_pca::eigensnp::EigenSNPCoreAlgorithmConfig;

let config = EigenSNPCoreAlgorithmConfig {
    // Global PCA parameters
    target_num_global_pcs: 15,                    // Number of final PCs to compute
    global_pca_sketch_oversampling: 10,           // Extra dimensions for RSVD sketching  
    global_pca_num_power_iterations: 2,           // Power iterations for global RSVD
    
    // Local basis learning parameters  
    subset_factor_for_local_basis_learning: 0.1,  // Fraction of samples for local learning
    min_subset_size_for_local_basis_learning: 20_000,
    max_subset_size_for_local_basis_learning: 60_000,
    components_per_ld_block: 7,                   // Local PCs per LD block
    
    // RSVD parameters for local learning
    local_rsvd_sketch_oversampling: 4,
    local_rsvd_num_power_iterations: 2,
    
    // Processing parameters
    snp_processing_strip_size: 2000,              // SNPs per processing chunk
    refine_pass_count: 1,                         // Number of refinement iterations
    random_seed: 2025,
    
    // Optional outputs
    collect_diagnostics: false,                   // Requires "enable-eigensnp-diagnostics" feature
    local_pcs_output_dir: None,                   // Directory to save local PC loadings
    
    ..Default::default()
};
```

#### `PcaReadyGenotypeAccessor`
Trait for providing access to standardized genotype data:

```rust
use efficient_pca::eigensnp::{PcaReadyGenotypeAccessor, PcaSnpId, QcSampleId, ThreadSafeStdError};
use ndarray::Array2;

struct MyGenotypeData {
    standardized_data: Array2<f32>, // SNPs x Samples, already standardized
}

impl PcaReadyGenotypeAccessor for MyGenotypeData {
    fn get_standardized_snp_sample_block(
        &self,
        snp_ids: &[PcaSnpId],
        sample_ids: &[QcSampleId],
    ) -> Result<Array2<f32>, ThreadSafeStdError> {
        // Return requested SNP x Sample block from your standardized data
        // Implementation depends on your data storage format
        unimplemented!()
    }
    
    fn num_pca_snps(&self) -> usize {
        self.standardized_data.nrows()
    }
    
    fn num_qc_samples(&self) -> usize {
        self.standardized_data.ncols()
    }
}
```

#### `LdBlockSpecification`
Defines linkage disequilibrium blocks:

```rust
use efficient_pca::eigensnp::{LdBlockSpecification, PcaSnpId};

let ld_blocks = vec![
    LdBlockSpecification {
        user_defined_block_tag: "chr1_block1".to_string(),
        pca_snp_ids_in_block: (0..1000).map(PcaSnpId).collect(),
    },
    LdBlockSpecification {
        user_defined_block_tag: "chr1_block2".to_string(), 
        pca_snp_ids_in_block: (1000..2000).map(PcaSnpId).collect(),
    },
    // ... more blocks
];
```

#### `EigenSNPCoreOutput`
Final results containing PCA components:

```rust,ignore
// After running compute_pca
let (output, diagnostics) = algorithm.compute_pca(&genotype_data, &ld_blocks, &snp_metadata)?;

// Access results
let loadings = &output.final_snp_principal_component_loadings;  // SNPs x PCs
let scores = &output.final_sample_principal_component_scores;   // Samples x PCs
let eigenvalues = &output.final_principal_component_eigenvalues; // PC eigenvalues
let num_pcs = output.num_principal_components_computed;
```

### Usage Example

```rust
use efficient_pca::eigensnp::{
    EigenSNPCoreAlgorithm, EigenSNPCoreAlgorithmConfig,
    LdBlockSpecification, PcaReadyGenotypeAccessor, PcaSnpId,
    PcaSnpMetadata, QcSampleId, ThreadSafeStdError,
};
use ndarray::{array, Array2};
use std::sync::Arc;

struct InMemoryAccessor {
    data: Array2<f32>,
}

impl PcaReadyGenotypeAccessor for InMemoryAccessor {
    fn get_standardized_snp_sample_block(
        &self,
        snp_ids: &[PcaSnpId],
        sample_ids: &[QcSampleId],
    ) -> Result<Array2<f32>, ThreadSafeStdError> {
        let mut block = Array2::zeros((snp_ids.len(), sample_ids.len()));
        for (row_idx, PcaSnpId(snp_idx)) in snp_ids.iter().enumerate() {
            for (col_idx, QcSampleId(sample_idx)) in sample_ids.iter().enumerate() {
                block[(row_idx, col_idx)] = self.data[(*snp_idx, *sample_idx)];
            }
        }
        Ok(block)
    }

    fn num_pca_snps(&self) -> usize {
        self.data.nrows()
    }

    fn num_qc_samples(&self) -> usize {
        self.data.ncols()
    }
}

# fn main() -> Result<(), ThreadSafeStdError> {
let config = EigenSNPCoreAlgorithmConfig {
    target_num_global_pcs: 2,
    components_per_ld_block: 2,
    subset_factor_for_local_basis_learning: 1.0,
    min_subset_size_for_local_basis_learning: 1,
    max_subset_size_for_local_basis_learning: 10,
    ..Default::default()
};

let algorithm = EigenSNPCoreAlgorithm::new(config);

let ld_blocks = vec![LdBlockSpecification {
    user_defined_block_tag: "block1".to_string(),
    pca_snp_ids_in_block: vec![PcaSnpId(0), PcaSnpId(1)],
}];

let snp_metadata = vec![
    PcaSnpMetadata {
        id: Arc::new("snp1".to_string()),
        chr: Arc::new("1".to_string()),
        pos: 100,
    },
    PcaSnpMetadata {
        id: Arc::new("snp2".to_string()),
        chr: Arc::new("1".to_string()),
        pos: 200,
    },
];

let genotype_data = InMemoryAccessor {
    data: array![
        [0.5, -0.5, 1.0],
        [-0.5, 0.5, -1.0],
    ],
};

let (output, _diagnostics) = algorithm.compute_pca(
    &genotype_data,
    &ld_blocks,
    &snp_metadata,
)?;

println!(
    "Computed {} principal components",
    output.num_principal_components_computed
);
println!(
    "SNP loadings shape: {:?}",
    output.final_snp_principal_component_loadings.dim()
);
println!(
    "Sample scores shape: {:?}",
    output.final_sample_principal_component_scores.dim()
);
# Ok(())
# }
```

### Performance Considerations

* **Memory Efficiency**: Uses `f32` precision with `f64` accumulation for critical operations
* **Large Datasets**: Designed for datasets with millions of SNPs and thousands to hundreds of thousands of samples
* **LD Block Size**: Optimal LD block sizes depend on your data; typically 100-5000 SNPs per block
* **Subset Size**: For local basis learning, uses 10-50% of samples by default (configurable)
* **RSVD Parameters**: Increase oversampling and power iterations for higher accuracy at computational cost

### Diagnostics and Debugging

Enable detailed diagnostics with the `enable-eigensnp-diagnostics` feature:

```toml
[dependencies]
efficient_pca = { version = "*", features = ["enable-eigensnp-diagnostics"] }
```

```rust
use efficient_pca::eigensnp::EigenSNPCoreAlgorithmConfig;

let _config = EigenSNPCoreAlgorithmConfig {
    collect_diagnostics: true,
    local_pcs_output_dir: Some("./local_pcs_output".to_string()),
    ..Default::default()
};
```

This provides detailed algorithm metrics and saves local PC loadings to TSV files for inspection.

### Index Types

EigenSNP uses strongly-typed indices for safety:

* **`PcaSnpId(usize)`**: Identifies SNPs in the final PCA-ready dataset
* **`QcSampleId(usize)`**: Identifies quality-controlled samples  
* **`LdBlockListId(usize)`**: Identifies LD blocks in the specification list
* **`CondensedFeatureId(usize)`**: Identifies rows in the condensed feature matrix
* **`PrincipalComponentId(usize)`**: Identifies computed principal components

---
## License

This project is licensed under the MIT License.