# scirs2-transform
Data transformation, dimensionality reduction, and feature engineering library for machine learning in Rust, part of the SciRS2 scientific computing ecosystem.
## Overview
scirs2-transform provides comprehensive data preprocessing and transformation utilities following scikit-learn's fit / transform / fit_transform API pattern. v0.3.1 significantly extends the library with UMAP, Barnes-Hut t-SNE, persistent homology / TDA, metric learning, kernel methods, optimal transport, and advanced NMF variants.
## Features
### Normalization and Scaling
- Min-Max scaling to [0, 1] or custom ranges
- Z-score standardization (zero mean, unit variance)
- Robust scaling (median and IQR; outlier-resistant)
- Max-absolute scaling
- L1 / L2 vector normalization
- Quantile normalization
- Reusable `Normalizer` with `fit` / `transform` / `inverse_transform`
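The two most common scalers reduce to one-line formulas. A dependency-free sketch of min-max scaling and z-score standardization (helper names here are illustrative, not the crate's `Normalizer` API):

```rust
/// Min-max scaling: x' = (x - min) / (max - min), mapping values into [0, 1].
fn minmax_scale(x: &[f64]) -> Vec<f64> {
    let min = x.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = x.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    x.iter().map(|v| (v - min) / (max - min)).collect()
}

/// Z-score standardization: x' = (x - mean) / std (population std).
fn zscore(x: &[f64]) -> Vec<f64> {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    x.iter().map(|v| (v - mean) / var.sqrt()).collect()
}
```

After z-scoring, the data has zero mean and unit variance, which is what most downstream estimators assume.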
### Feature Engineering
- Polynomial features (degree 2+, with/without interaction-only mode)
- Box-Cox and Yeo-Johnson power transformations with optimal lambda estimation
- Equal-width and equal-frequency discretization (binning)
- Binarization with configurable thresholds
- Log transformations with epsilon handling
- Interaction terms, custom function transformers
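Polynomial feature expansion simply enumerates monomials up to the target degree. A minimal degree-2 sketch (hypothetical helper, not the crate's API):

```rust
/// Degree-2 polynomial expansion of one feature vector:
/// [x1, x2] -> [1, x1, x2, x1^2, x1*x2, x2^2].
fn poly2(x: &[f64]) -> Vec<f64> {
    let mut out = vec![1.0];          // bias term
    out.extend_from_slice(x);         // degree-1 terms
    for i in 0..x.len() {
        for j in i..x.len() {
            out.push(x[i] * x[j]);    // degree-2 terms incl. interactions
        }
    }
    out
}
```

An interaction-only mode would keep just the `i != j` products and drop the squares.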
### Dimensionality Reduction
- PCA (Principal Component Analysis) with centering/scaling, explained variance ratio
- Truncated SVD (memory-efficient for sparse data)
- Linear Discriminant Analysis (LDA) for supervised reduction
- t-SNE with Barnes-Hut approximation (O(n log n), multicore)
- UMAP (Uniform Manifold Approximation and Projection)
- Isomap (geodesic-distance manifold learning)
- Locally Linear Embedding (LLE)
- Kernel PCA (RBF, polynomial, sigmoid kernels)
- Probabilistic PCA (PPCA)
- Factor analysis
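To illustrate what PCA computes, here is a std-only sketch that finds the first principal component of 2-D data by running the power method on the covariance matrix (an illustration of the math, not the crate's implementation):

```rust
/// First principal component of 2-D data: mean-center, form the 2x2
/// (population) covariance matrix, then power-iterate to its top eigenvector.
fn first_pc(data: &[[f64; 2]]) -> [f64; 2] {
    let n = data.len() as f64;
    let mx = data.iter().map(|p| p[0]).sum::<f64>() / n;
    let my = data.iter().map(|p| p[1]).sum::<f64>() / n;
    let (mut cxx, mut cxy, mut cyy) = (0.0, 0.0, 0.0);
    for p in data {
        let (dx, dy) = (p[0] - mx, p[1] - my);
        cxx += dx * dx; cxy += dx * dy; cyy += dy * dy;
    }
    cxx /= n; cxy /= n; cyy /= n;
    // v <- C v / |C v| converges to the dominant eigenvector of C
    let mut v = [1.0, 1.0];
    for _ in 0..100 {
        let w = [cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1]];
        let norm = (w[0] * w[0] + w[1] * w[1]).sqrt();
        v = [w[0] / norm, w[1] / norm];
    }
    v
}
```

Projecting the centered data onto this direction gives the first PCA coordinate; the ratio of the top eigenvalue to the covariance trace is the explained variance ratio.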
### Independent Component Analysis
- FastICA (fixed-point iteration)
- Spatial ICA
- Infomax ICA
### Non-Negative Matrix Factorization (NMF) Variants
- Standard NMF (multiplicative update rules)
- Sparse NMF
- Convex NMF
- Semi-NMF
- Online NMF for streaming data
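The multiplicative update rule behind standard NMF can be shown in miniature with a rank-1 factorization V ≈ w hᵀ (illustrative sketch, not the crate's API; the updates stay non-negative because they only multiply by non-negative ratios):

```rust
/// Rank-1 NMF by multiplicative updates: V (non-negative) ≈ w h^T, with
/// h_j *= (V^T w)_j / ((w^T w) h_j) and the symmetric update for w.
fn nmf_rank1(v: &[Vec<f64>], iters: usize) -> (Vec<f64>, Vec<f64>) {
    let (m, n) = (v.len(), v[0].len());
    let mut w = vec![1.0; m];
    let mut h = vec![1.0; n];
    for _ in 0..iters {
        let ww: f64 = w.iter().map(|x| x * x).sum();
        for j in 0..n {
            let num: f64 = (0..m).map(|i| w[i] * v[i][j]).sum();
            h[j] *= num / (ww * h[j] + 1e-12); // epsilon guards against 0/0
        }
        let hh: f64 = h.iter().map(|x| x * x).sum();
        for i in 0..m {
            let num: f64 = (0..n).map(|j| v[i][j] * h[j]).sum();
            w[i] *= num / (hh * w[i] + 1e-12);
        }
    }
    (w, h)
}
```

For a matrix that is exactly rank 1, the reconstruction w hᵀ recovers V up to the usual scale ambiguity between w and h.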
### Sparse PCA and Dictionary Learning
- Sparse PCA via LASSO-based encoding
- Dictionary learning (K-SVD style)
- Sparse coding and reconstruction
### Metric Learning
- Mahalanobis distance learning
- LMNN (Large Margin Nearest Neighbor)
- NCA (Neighborhood Components Analysis)
- Metric learning extensions (`metric_learning_ext/`)
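All of these methods ultimately learn a positive semi-definite matrix M that defines a Mahalanobis distance. A sketch of the distance itself in 2-D (illustrative helper, not the crate's API):

```rust
/// Mahalanobis distance under a learned 2x2 PSD matrix M:
/// d(x, y) = sqrt((x - y)^T M (x - y)); M = I recovers Euclidean distance.
fn mahalanobis2(x: [f64; 2], y: [f64; 2], m: [[f64; 2]; 2]) -> f64 {
    let d = [x[0] - y[0], x[1] - y[1]];
    let md = [m[0][0] * d[0] + m[0][1] * d[1],
              m[1][0] * d[0] + m[1][1] * d[1]];
    (d[0] * md[0] + d[1] * md[1]).sqrt()
}
```

LMNN and NCA differ only in the loss used to choose M; the distance evaluated at inference time is the same quadratic form.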
### Kernel Methods
- Kernel PCA with multiple kernels
- Deep kernel learning
- Random Fourier Features (RFF) for large-scale kernel approximation
- Orthogonal Random Features (ORF)
- Nyström approximation
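Random Fourier Features have a compact closed form: z(x) = sqrt(2/D) · [cos(wᵢ·x + bᵢ)] with wᵢ ~ N(0, I) and bᵢ ~ U[0, 2π], so that z(x)·z(y) approximates the RBF kernel exp(-‖x−y‖²/2). A self-contained sketch (with a toy deterministic RNG to avoid dependencies; not the crate's implementation):

```rust
/// RFF embedding for the RBF kernel. Calling this twice with the same seed
/// and input dimension reuses the same (w_i, b_i), as RFF requires.
fn rff_features(x: &[f64], d: usize, seed: u64) -> Vec<f64> {
    let mut state = seed;
    // minimal LCG, high 53 bits as a uniform in [0, 1)
    let mut uniform = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (state >> 11) as f64 / (1u64 << 53) as f64
    };
    let mut z = Vec::with_capacity(d);
    for _ in 0..d {
        let mut dot = 0.0;
        for &xi in x {
            // one N(0, 1) weight per input dimension via Box-Muller
            let u1 = uniform().max(1e-12);
            let u2 = uniform();
            let g = (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos();
            dot += g * xi;
        }
        let b = 2.0 * std::f64::consts::PI * uniform();
        z.push((dot + b).cos() * (2.0 / d as f64).sqrt());
    }
    z
}
```

The approximation error shrinks roughly as 1/√D, which is why a few thousand features usually suffice.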
### Optimal Transport
- Wasserstein distance computation
- Sinkhorn-Knopp regularized OT
- Sliced Wasserstein distance
- OT-based domain adaptation
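Sinkhorn-Knopp alternates row and column rescalings of the Gibbs kernel K = exp(-C/ε) until the transport plan matches both marginals. A dependency-free sketch of the iteration (illustrative, not the crate's API):

```rust
/// Entropy-regularized OT: returns the plan P with P_ij = u_i K_ij v_j,
/// where u, v are fixed by alternately matching the marginals a and b.
fn sinkhorn(cost: &[Vec<f64>], a: &[f64], b: &[f64], eps: f64, iters: usize) -> Vec<Vec<f64>> {
    let (n, m) = (a.len(), b.len());
    let k: Vec<Vec<f64>> = cost.iter()
        .map(|row| row.iter().map(|&c| (-c / eps).exp()).collect())
        .collect();
    let mut u = vec![1.0; n];
    let mut v = vec![1.0; m];
    for _ in 0..iters {
        for i in 0..n {
            let s: f64 = (0..m).map(|j| k[i][j] * v[j]).sum();
            u[i] = a[i] / s; // match row marginals
        }
        for j in 0..m {
            let s: f64 = (0..n).map(|i| u[i] * k[i][j]).sum();
            v[j] = b[j] / s; // match column marginals
        }
    }
    (0..n).map(|i| (0..m).map(|j| u[i] * k[i][j] * v[j]).collect()).collect()
}
```

The regularized transport cost is then Σᵢⱼ Pᵢⱼ Cᵢⱼ, an ε-smoothed stand-in for the Wasserstein distance.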
### Topological Data Analysis (TDA)
- Vietoris-Rips complex construction
- Persistent homology computation (Betti numbers, persistence diagrams)
- Persistence landscape features
- Topological feature vectorization
- Persistence diagram analysis
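The simplest persistent invariant, Betti-0, counts connected components of the Vietoris-Rips complex at a given scale: connect every pair of points within distance ε, then count components with union-find. A std-only sketch of that one slice of the pipeline (illustrative, not the crate's API):

```rust
/// Betti-0 of the Vietoris-Rips complex at scale `eps` over 2-D points:
/// link points within distance eps, then count components via union-find.
fn betti0(points: &[[f64; 2]], eps: f64) -> usize {
    let n = points.len();
    let mut parent: Vec<usize> = (0..n).collect();
    fn find(parent: &mut [usize], i: usize) -> usize {
        let p = parent[i];
        if p == i { return i; }
        let r = find(parent, p);
        parent[i] = r; // path compression
        r
    }
    for i in 0..n {
        for j in (i + 1)..n {
            let (dx, dy) = (points[i][0] - points[j][0], points[i][1] - points[j][1]);
            if (dx * dx + dy * dy).sqrt() <= eps {
                let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
                if ri != rj { parent[ri] = rj; }
            }
        }
    }
    (0..n).filter(|&i| find(&mut parent, i) == i).count()
}
```

Sweeping ε and recording when components merge gives the birth/death pairs of the 0-dimensional persistence diagram.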
### Archetypal Analysis
- Archetypal analysis (convex hull vertex finding)
- Simplex-based data representation
### Autoencoder-Based Reduction
- Linear autoencoder as PCA surrogate
- Nonlinear autoencoder reduction
### Categorical Encoding
- One-hot encoding (sparse and dense)
- Ordinal / label encoding
- Target encoding with regularization
- Binary encoding for high-cardinality features
- Unknown category handling strategies
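Dense one-hot encoding is just a lookup from category to column index. A minimal sketch (illustrative helper, not the crate's encoder):

```rust
use std::collections::BTreeMap;

/// Dense one-hot encoding: assign each distinct category a column index
/// (in sorted order), then emit one indicator row per input value.
fn one_hot(values: &[&str]) -> Vec<Vec<u8>> {
    let cats: BTreeMap<&str, usize> = {
        let mut uniq: Vec<&str> = values.to_vec();
        uniq.sort();
        uniq.dedup();
        uniq.into_iter().enumerate().map(|(i, c)| (c, i)).collect()
    };
    values.iter().map(|v| {
        let mut row = vec![0u8; cats.len()];
        row[cats[v]] = 1;
        row
    }).collect()
}
```

A real encoder additionally persists the fitted category map so that transform-time data with unseen categories can be ignored, errored on, or mapped to a dedicated column.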
### Missing Value Imputation
- Simple imputation: mean, median, mode, constant
- KNN imputation with multiple distance metrics
- Iterative imputation (MICE algorithm)
- Missing indicator tracking
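The simplest strategy, mean imputation, replaces each missing entry with the mean of the observed entries in the same column. A sketch using NaN as the missing marker (illustrative, not the crate's API):

```rust
/// Mean imputation for one column: fill NaN entries with the mean of the
/// observed (non-NaN) entries.
fn impute_mean(col: &mut [f64]) {
    let observed: Vec<f64> = col.iter().copied().filter(|v| !v.is_nan()).collect();
    let mean = observed.iter().sum::<f64>() / observed.len() as f64;
    for v in col.iter_mut() {
        if v.is_nan() { *v = mean; }
    }
}
```

KNN and MICE imputation refine this by conditioning the fill value on the other features of the same row.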
### Feature Selection
- Variance threshold filter
- Recursive Feature Elimination (RFE)
- Mutual information-based selection
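The variance threshold filter keeps only columns whose variance exceeds a cutoff, dropping near-constant features. A sketch (illustrative helper, not the crate's API):

```rust
/// Variance-threshold selection: return the indices of columns whose
/// (population) variance is strictly greater than `threshold`.
fn variance_filter(cols: &[Vec<f64>], threshold: f64) -> Vec<usize> {
    cols.iter().enumerate().filter_map(|(i, c)| {
        let n = c.len() as f64;
        let mean = c.iter().sum::<f64>() / n;
        let var = c.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
        (var > threshold).then_some(i)
    }).collect()
}
```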
### Pipeline API
- Sequential transformation chains
- `ColumnTransformer` for per-column transforms
- `fit` / `transform` / `fit_transform` / `inverse_transform` throughout
### Signal Transforms (Integrated)
- Discrete Wavelet Transform (DWT): Haar, Daubechies, Symlet, Coiflet; multi-level
- 2D DWT for image decomposition (LL, LH, HL, HH subbands)
- Continuous Wavelet Transform (CWT): Morlet, Mexican Hat, Gaussian
- Wavelet Packet Transform (WPT) with best-basis selection
- Short-Time Fourier Transform (STFT) with multiple window functions
- Spectrograms (power, magnitude, dB scale)
- MFCC with mel filterbank, delta and delta-delta features
- Constant-Q Transform (CQT) for musical analysis
- Chromagram (12-bin pitch class profiles)
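The Haar wavelet is the simplest DWT: one level produces pairwise averages (approximation) and pairwise differences (detail), scaled by 1/√2 so the signal's energy is preserved. A sketch for even-length signals (illustrative, not the crate's API):

```rust
/// One level of the Haar DWT: (approximation, detail) coefficient vectors.
/// Assumes an even-length signal.
fn haar_dwt(signal: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let s = std::f64::consts::SQRT_2;
    let approx: Vec<f64> = signal.chunks(2).map(|p| (p[0] + p[1]) / s).collect();
    let detail: Vec<f64> = signal.chunks(2).map(|p| (p[0] - p[1]) / s).collect();
    (approx, detail)
}
```

Multi-level decomposition recurses on the approximation coefficients; Daubechies, Symlet, and Coiflet wavelets replace the 2-tap averaging/differencing filters with longer filter pairs.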
### Multi-View Learning
- Multi-view PCA and CCA
- Consensus multi-view embedding
### Online / Incremental Learning
- Incremental PCA (chunk-by-chunk update)
- Online NMF
- Online t-SNE approximations
### Out-of-Core Processing
- Chunked array reader/writer for datasets larger than RAM
- Streaming normalizer with partial-fit
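A streaming normalizer with partial-fit boils down to maintaining running statistics. A sketch using Welford's online algorithm, which updates mean and variance one chunk at a time without revisiting earlier data (illustrative struct, not the crate's API):

```rust
/// Running mean/variance via Welford's algorithm; each `partial_fit`
/// folds in a chunk, so the data never needs to fit in RAM at once.
struct StreamingScaler { n: u64, mean: f64, m2: f64 }

impl StreamingScaler {
    fn new() -> Self { Self { n: 0, mean: 0.0, m2: 0.0 } }

    fn partial_fit(&mut self, chunk: &[f64]) {
        for &x in chunk {
            self.n += 1;
            let d = x - self.mean;
            self.mean += d / self.n as f64;
            self.m2 += d * (x - self.mean); // numerically stable update
        }
    }

    /// Population variance over everything seen so far.
    fn variance(&self) -> f64 { self.m2 / self.n as f64 }
}
```

After a pass over all chunks, the fitted mean and variance can standardize subsequent chunks exactly as a batch z-score would.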
### Structure Learning
- Covariance structure estimation
- Graphical LASSO for sparse inverse covariance
## Quick Start
Add to your `Cargo.toml`:

```toml
[dependencies]
scirs2-transform = "0.3.1"
```

With SIMD and parallel features:

```toml
[dependencies]
scirs2-transform = { version = "0.3.1", features = ["parallel", "simd"] }
```
### Normalization

A minimal sketch; the module path and constructor below are assumptions (only the `fit` / `transform` / `fit_transform` pattern is documented), so check the crate docs for exact names:

```rust
use scirs2_transform::Normalizer; // module path assumed
use ndarray::Array2;

let data = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let scaled = Normalizer::default().fit_transform(&data); // constructor assumed
```
### PCA

Sketch only; the import path and constructor are assumptions:

```rust
use scirs2_transform::PCA; // module path assumed
use ndarray::Array2;

let data = Array2::from_shape_vec((4, 3), (0..12).map(f64::from).collect()).unwrap();
let mut pca = PCA::new(2); // constructor assumed: keep 2 components
let reduced = pca.fit_transform(&data);
```
### UMAP

Sketch only; the import path and constructor are assumptions:

```rust
use scirs2_transform::UMAP; // module path assumed
use ndarray::Array2;

// Given `data: Array2<f64>` as in the PCA example:
let embedding = UMAP::new(2).fit_transform(&data); // constructor assumed
```
### Persistent Homology (TDA)

Build a Vietoris-Rips complex over a point cloud (an `Array2<f64>` of row vectors), then compute persistence diagrams and Betti numbers from it; see the crate's TDA module for the exact type names.
### Optimal Transport

Form a cost matrix between two empirical distributions, then compute the Sinkhorn-regularized transport plan or a sliced Wasserstein distance between them; see the crate's OT module for the exact type names.
## Feature Flags
| Flag | Description |
|---|---|
| `parallel` | Enable Rayon-based multi-threaded transforms |
| `simd` | SIMD-accelerated normalization and distance operations |
## Related Crates
- `scirs2-cluster` - Clustering algorithms
- `scirs2-linalg` - Linear algebra (SVD, eigendecomposition)
- `scirs2-fft` - FFT operations (used by signal transforms)
- SciRS2 project
## License
Licensed under the Apache License, Version 2.0. See LICENSE for details.