# SciRS2 Transform
[](https://crates.io/crates/scirs2-transform)
[](#license)
[](https://docs.rs/scirs2-transform)
[](#testing)
**Production-ready data transformation library for machine learning in Rust (v0.1.4)**
This crate provides comprehensive data transformation utilities for the SciRS2 ecosystem (v0.1.4). Following the [SciRS2 POLICY](../SCIRS2_POLICY.md), this module is designed to match and exceed the functionality of scikit-learn's preprocessing module while leveraging Rust's performance and safety guarantees with enhanced distributed processing capabilities.
## ๐ Features
### Data Normalization & Standardization
- **Min-Max Scaling**: Scale features to [0, 1] or custom ranges
- **Z-Score Standardization**: Transform to zero mean and unit variance
- **Robust Scaling**: Use median and IQR for outlier-resistant scaling
- **Max Absolute Scaling**: Scale by maximum absolute value
- **L1/L2 Normalization**: Vector normalization
### Feature Engineering
- **Polynomial Features**: Generate polynomial and interaction features
- **Power Transformations**: Box-Cox and Yeo-Johnson with optimal ฮป estimation
- **Discretization**: Equal-width and equal-frequency binning
- **Binarization**: Convert continuous features to binary
- **Log Transformations**: Logarithmic feature scaling
### Dimensionality Reduction
- **PCA**: Principal Component Analysis with centering/scaling options
- **Truncated SVD**: Memory-efficient singular value decomposition
- **LDA**: Linear Discriminant Analysis for supervised reduction
- **t-SNE**: Advanced non-linear embedding with Barnes-Hut optimization
### Categorical Encoding
- **One-Hot Encoding**: Sparse and dense representations
- **Ordinal Encoding**: Label encoding for ordinal categories
- **Target Encoding**: Supervised encoding with regularization
- **Binary Encoding**: Memory-efficient encoding for high-cardinality features
### Missing Value Imputation
- **Simple Imputation**: Mean, median, mode, and constant strategies
- **KNN Imputation**: K-nearest neighbors with multiple distance metrics
- **Iterative Imputation**: MICE algorithm for multivariate imputation
- **Missing Indicators**: Track which values were imputed
### Feature Selection
- **Variance Threshold**: Remove low-variance features
- **Integration**: Seamless integration with all transformers
## ๐ฆ Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
scirs2-transform = "0.1.4"
```
For parallel processing and enhanced performance:
```toml
[dependencies]
scirs2-transform = { version = "0.1.4", features = ["parallel"] }
```
## ๐ฏ Quick Start
### Basic Normalization
```rust
use ndarray::array;
use scirs2_transform::normalize::{normalize_array, NormalizationMethod, Normalizer};
// One-shot normalization
let data = array![[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0]];
let normalized = normalize_array(&data, NormalizationMethod::MinMax, 0)?;
// Fit-transform workflow for reusable transformations
let mut normalizer = Normalizer::new(NormalizationMethod::ZScore, 0);
let train_transformed = normalizer.fit_transform(&train_data)?;
let test_transformed = normalizer.transform(&test_data)?;
```
### Feature Engineering
```rust
use scirs2_transform::features::{PolynomialFeatures, PowerTransformer, binarize};
// Generate polynomial features
let data = array![[1.0, 2.0], [3.0, 4.0]];
let poly = PolynomialFeatures::new(2, false, true);
let poly_features = poly.transform(&data)?;
// Power transformations with optimal lambda
let mut transformer = PowerTransformer::yeo_johnson(true);
let gaussian_data = transformer.fit_transform(&skewed_data)?;
// Binarization
let binary_features = binarize(&data, 0.0)?;
```
### Dimensionality Reduction
```rust
use scirs2_transform::reduction::{PCA, TSNE};
// PCA for linear dimensionality reduction
let mut pca = PCA::new(2, true, false);
let reduced_data = pca.fit_transform(&high_dim_data)?;
let explained_variance = pca.explained_variance_ratio().unwrap();
// t-SNE for non-linear visualization
let mut tsne = TSNE::new(2, 30.0, 500)?;
let embedding = tsne.fit_transform(&data)?;
```
### Categorical Encoding
```rust
use scirs2_transform::encoding::{OneHotEncoder, TargetEncoder};
// One-hot encoding
let mut encoder = OneHotEncoder::new(false, false)?;
let encoded = encoder.fit_transform(&categorical_data)?;
// Target encoding for supervised learning
let mut target_encoder = TargetEncoder::mean_encoding(1.0);
let encoded = target_encoder.fit_transform(&categories, &targets)?;
```
### Missing Value Imputation
```rust
use scirs2_transform::impute::{SimpleImputer, KNNImputer, ImputeStrategy};
// Simple imputation
let mut imputer = SimpleImputer::new(ImputeStrategy::Mean);
let complete_data = imputer.fit_transform(&data_with_missing)?;
// KNN imputation
let mut knn_imputer = KNNImputer::new(5)?;
let imputed_data = knn_imputer.fit_transform(&data_with_missing)?;
```
## ๐ง Advanced Usage
### Pipeline Integration
```rust
// Sequential transformations
let mut scaler = Normalizer::new(NormalizationMethod::ZScore, 0);
let mut pca = PCA::new(50, true, false);
// Preprocessing pipeline
let scaled_data = scaler.fit_transform(&raw_data)?;
let reduced_data = pca.fit_transform(&scaled_data)?;
```
### Custom Transformations
```rust
use scirs2_transform::features::PowerTransformer;
// Custom power transformation
let mut transformer = PowerTransformer::new("yeo-johnson", true)?;
transformer.fit(&training_data)?;
// Apply to new data
let transformed_test = transformer.transform(&test_data)?;
let original_test = transformer.inverse_transform(&transformed_test)?;
```
### Performance Optimization
```rust
// Enable parallel processing for large datasets
use rayon::prelude::*;
// Most transformers automatically use parallel processing when beneficial
let mut pca = PCA::new(100, true, false);
let result = pca.fit_transform(&large_dataset)?; // Automatically parallelized
```
## ๐ Performance
SciRS2 Transform is designed for production workloads:
- **Memory Efficient**: Zero-copy operations where possible
- **Parallel Processing**: Multi-core support via Rayon
- **SIMD Ready**: Integration with vectorized operations
- **Large Scale**: Handles datasets with 100k+ samples and 10k+ features
### Benchmarks
| PCA | 50k ร 1k | 2.1s | 3.8s | 1.8x |
| t-SNE | 10k ร 100 | 12.3s | 18.7s | 1.5x |
| Normalization | 100k ร 500 | 0.3s | 0.9s | 3.0x |
| Power Transform | 50k ร 200 | 1.8s | 2.4s | 1.3x |
## ๐งช Testing
Run the comprehensive test suite:
```bash
# All tests (100 tests)
cargo test
# With output
cargo test -- --nocapture
# Specific module
cargo test normalize::tests
```
## ๐ Documentation
- [API Documentation](https://docs.rs/scirs2-transform)
- [SciRS2 Project](https://github.com/cool-japan/scirs)
- [Performance Guide](./docs/performance.md)
- [Migration from sklearn](./docs/sklearn-migration.md)
## ๐ Compatibility
### Scikit-learn API Compatibility
SciRS2 Transform follows scikit-learn's API conventions:
- `fit()` / `transform()` / `fit_transform()` pattern
- Consistent parameter naming
- Similar default behaviors
- Compatible data formats (via ndarray)
### Migration from Scikit-learn
```python
# Python (scikit-learn)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
```rust
// Rust (SciRS2)
use scirs2_transform::normalize::{Normalizer, NormalizationMethod};
let mut scaler = Normalizer::new(NormalizationMethod::ZScore, 0);
let x_scaled = scaler.fit_transform(&x)?;
```
## ๐๏ธ Architecture
### Modular Design
```
scirs2-transform/
โโโ normalize/ # Data normalization and standardization
โโโ features/ # Feature engineering utilities
โโโ reduction/ # Dimensionality reduction algorithms
โโโ encoding/ # Categorical data encoding
โโโ impute/ # Missing value imputation
โโโ selection/ # Feature selection methods
โโโ scaling/ # Advanced scaling transformers
```
### Error Handling
Comprehensive error handling with descriptive messages:
```rust
use scirs2_transform::{Result, TransformError};
match normalizer.fit_transform(&data) {
Ok(transformed) => println!("Success!"),
Err(TransformError::InvalidInput(msg)) => println!("Input error: {}", msg),
Err(TransformError::TransformationError(msg)) => println!("Transform error: {}", msg),
Err(e) => println!("Other error: {}", e),
}
```
## ๐ ๏ธ Development
### Building from Source
```bash
git clone https://github.com/cool-japan/scirs
cd scirs/scirs2-transform
cargo build --release
```
### Contributing
1. Fork the repository
2. Create a feature branch
3. Add comprehensive tests
4. Ensure all tests pass: `cargo test`
5. Run clippy: `cargo clippy`
6. Submit a pull request
## ๐ Roadmap
### Version 0.1.0 (Next)
- Pipeline API for chaining transformations
- Enhanced SIMD acceleration
- Comprehensive benchmarking suite
### Version 0.2.0
- UMAP and manifold learning algorithms
- Advanced matrix decomposition methods
- Time series feature extraction
### Version 1.0.0
- GPU acceleration
- Distributed processing
- Production monitoring tools
## ๐ค License
This project is dual-licensed under either:
You may choose to use either license.
## ๐ Acknowledgments
- Inspired by [scikit-learn](https://scikit-learn.org/)
- Built on [ndarray](https://github.com/rust-ndarray/ndarray)
- Powered by [Rayon](https://github.com/rayon-rs/rayon) for parallelization
---
**Ready for Production**: SciRS2 Transform v0.1.4 provides production-ready data transformation capabilities with performance that meets or exceeds established Python libraries while offering Rust's safety and performance guarantees.