scirs2-transform 0.3.0

Data transformation module for SciRS2 (scirs2-transform)
Documentation
# scirs2-transform TODO

## Status: v0.3.0 Released (February 26, 2026)

## v0.3.0 Completed

### Normalization and Scaling
- Min-Max, Z-score, Robust (IQR), Max-absolute, L1/L2, Quantile
- Reusable `Normalizer` with fit/transform/inverse_transform
- Custom range normalization
- SIMD-accelerated normalize path (`normalize_simd.rs`)

### Feature Engineering
- Polynomial features with degree and interaction-only options
- Box-Cox and Yeo-Johnson power transformations with optimal lambda
- Equal-width and equal-frequency discretization (binning)
- Binarization with configurable thresholds
- Log transformations with epsilon
- SIMD-accelerated feature ops (`features_simd.rs`)

### Dimensionality Reduction
- PCA with centering/scaling, explained variance ratio, inverse transform
- Truncated SVD for sparse and large matrices
- Linear Discriminant Analysis (LDA) with SVD solver
- Barnes-Hut t-SNE (O(n log n), multicore, trustworthiness metric)
- UMAP (uniform manifold approximation and projection) (`umap.rs`)
- Isomap (geodesic distance manifold learning)
- Locally Linear Embedding (LLE)
- Kernel PCA (RBF, polynomial, sigmoid)
- Probabilistic PCA (`reduction/ppca.rs`)
- Factor analysis (`reduction/fastica.rs`)

### Independent Component Analysis
- FastICA (fixed-point iteration, `reduction/fastica.rs`)
- Spatial ICA (`spatial_ica.rs`)

### NMF Variants
- Standard NMF with multiplicative update rules
- Sparse NMF, Convex NMF, Semi-NMF
- Online NMF for streaming
- NMF variants module (`nmf_variants.rs`)

### Sparse PCA and Dictionary Learning
- Sparse PCA via LASSO encoding (`sparse_pca.rs`)
- Sparse coding transform (`sparse_coding_transform/`)

### Metric Learning
- Mahalanobis distance learning
- LMNN (Large Margin Nearest Neighbor)
- NCA (Neighborhood Components Analysis)
- Advanced metric learning extensions (`metric_learning_ext/`)

### Kernel Methods
- Kernel PCA
- Deep kernel learning (`deep_kernel.rs`)
- Random Fourier Features (RFF) and Orthogonal RF (`random_features.rs`)

### Optimal Transport
- Wasserstein distance
- Sinkhorn-Knopp regularized OT
- Sliced Wasserstein distance
- OT-based domain adaptation (`optimal_transport.rs`)

### Topological Data Analysis (TDA)
- Vietoris-Rips complex
- Persistent homology: Betti numbers, persistence diagrams
- Persistence landscape feature vectors
- Topological feature vectorization (`tda.rs`, `tda_ext.rs`)
- Persistent diagram analysis (`persistent_diagram.rs`)
- TopoMap / topological layout (`topomap.rs`)

### Archetypal Analysis
- Archetypal analysis via simplex vertex finding (`archetypal.rs`)

### Autoencoder-Based Reduction
- Linear autoencoder (`linear_ae.rs`)
- Nonlinear autoencoder for reduction (`autoencoder_reduction.rs`)

### Encoder Models
- Configurable encoder model wrapper (`encoder_models.rs`)

### Categorical Encoding
- One-hot encoding (sparse and dense), drop-first option
- Ordinal encoding
- Target encoding with regularization
- Binary encoding for high-cardinality
- Unknown category handling

### Missing Value Imputation
- Simple imputation: mean, median, mode, constant
- KNN imputation with multiple metrics
- Iterative imputation / MICE
- Missing indicator

### Feature Selection
- Variance threshold
- Recursive Feature Elimination (RFE)
- Mutual information-based selection (`feature_selection/`)

### Pipeline API
- Sequential transformation chains (`pipeline.rs`)
- ColumnTransformer for column-wise transforms
- Consistent fit/transform/inverse_transform API

### Signal Transforms
- DWT 1D and 2D: Haar, Daubechies, Symlet, Coiflet; multi-level decompose/reconstruct
- CWT: Morlet, Mexican Hat, Gaussian; scalogram
- Wavelet Packet Transform (WPT) with best-basis selection (Shannon entropy)
- STFT with multiple window functions; inverse STFT with perfect reconstruction
- Spectrograms: power, magnitude, dB scaling
- MFCC: mel filterbank, DCT-II, liftering, delta/delta-delta
- Constant-Q Transform (CQT)
- Chromagram (12-bin pitch class profiles)

### Multi-View Learning
- Multi-view PCA and CCA (`multiview/`)

### Online / Incremental Learning
- Incremental PCA (`online/`)
- Online NMF
- Streaming normalizer with partial-fit (`streaming.rs`)

### Out-of-Core Processing
- Chunked array reader/writer (`out_of_core.rs`)
- Bulk buffer read/write with pre-allocated pools

### Structure Learning
- Covariance structure estimation (`structure_learning.rs`)

### Latent Variable Models
- Latent variable model framework (`latent/`)

### Nonlinear Methods
- Nonlinear reduction wrappers (`nonlinear/`)

### Projection Methods
- Random projection, Johnson-Lindenstrauss (`projection/`)

### Factorization
- Tensor factorization utilities (`factorization/`)

## v0.4.0 Roadmap

### GPU-Accelerated Dimensionality Reduction
- GPU-accelerated UMAP (cuML-compatible path)
- GPU PCA via batched SVD on GPU
- GPU t-SNE attraction/repulsion kernels

### Online PCA and Incremental SVD
- Truly streaming (O(1) memory) online PCA
- Incremental SVD with rank-1 updates

### Multimodal Alignment
- Cross-modal embedding alignment (e.g. vision-language)
- Procrustes alignment for embedding spaces

### Advanced Optimal Transport
- Unbalanced OT (for distributions of different total mass)
- Gromov-Wasserstein distance
- Multi-marginal OT

### Improved TDA
- Alpha complex (faster than Vietoris-Rips for low-d)
- Cubical complex for image data
- Zigzag persistence

### Production Monitoring
- Drift detection: KS test, PSI, Wasserstein, MMD
- Data quality alerts and degradation tracking

## Known Issues / Technical Debt

- t-SNE convergence can be slow for very large n; KD-tree acceleration has a known approximation error that may affect cluster separation in some cases
- UMAP random seed behavior is not fully deterministic across platforms due to floating-point ordering
- Some signal transform modules currently alias `scirs2-fft` operations directly; a cleaner abstraction boundary is planned
- `topomap.rs` TDA layout is approximate; exact persistent homology needs optimization for high-dimensional inputs