SciRS2 Datasets
A production-ready collection of dataset utilities for the SciRS2 scientific computing library (v0.1.0). Following the SciRS2 POLICY, this module loads, generates, and manipulates datasets commonly used in scientific computing, machine learning, and statistical analysis, and includes enhanced random generation capabilities.
🚀 Production Status - stable (0.1.0)
This release implements the SciRS2 POLICY. All core functionality is implemented, tested (117+ tests), and production-ready. The API is stable and follows Rust best practices, with zero-warning builds and consistency across the SciRS2 ecosystem.
✨ Features
- 🎯 Toy Datasets: Classic datasets (Iris, Boston Housing, Breast Cancer, Digits, Wine, Diabetes)
- 🔧 Data Generators: Comprehensive synthetic dataset creation for classification, regression, clustering, and time series
- 📊 Dataset Utilities: Cross-validation, train/test splitting, sampling, and data balancing
- ⚡ Performance: Memory-efficient loading with robust caching and batch operations
- 🛡️ Reliability: SHA256 verification, comprehensive error handling, and platform-specific optimizations
- 📚 Well-Documented: Complete API documentation with examples for all public functions
Installation
Add to your Cargo.toml:
[dependencies]
scirs2-datasets = "0.1.0"
To enable remote dataset downloading:
[dependencies]
scirs2-datasets = { version = "0.1.0", features = ["download"] }
Quick Start
Load Classic Datasets
use scirs2_datasets::{load_boston, load_iris};
// Load the Iris dataset
let iris = load_iris()?;
println!;
// Load the Boston housing dataset
let boston = load_boston()?;
println!;
Generate Synthetic Data
use scirs2_datasets::{make_blobs, make_classification, make_spirals};
// Classification dataset
let dataset = make_classification?;
println!;
// Non-linear patterns
let spirals = make_spirals?;
let blobs = make_blobs?;
Cross-Validation and Splitting
use scirs2_datasets::{k_fold_split, load_iris};
let iris = load_iris()?;
// K-fold cross-validation
let folds = k_fold_split?;
// Stratified splitting with targets
if let Some = &iris.target
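To make the idea behind K-fold cross-validation concrete, here is a minimal standard-library sketch of how sample indices can be partitioned into folds. It is not the crate's `k_fold_split` implementation: the function name `k_fold_indices` and its signature are hypothetical, and it assigns contiguous index ranges without shuffling.

```rust
/// Hypothetical helper (not part of scirs2-datasets): split the indices
/// `0..n_samples` into `k` (train, validation) pairs. The first
/// `n_samples % k` folds get one extra sample, so fold sizes differ by
/// at most one. No shuffling is performed.
fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n_samples, "need 2 <= k <= n_samples");
    let base = n_samples / k;
    let remainder = n_samples % k;
    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + usize::from(fold < remainder);
        let end = start + size;
        // The validation set is one contiguous block of indices.
        let validation: Vec<usize> = (start..end).collect();
        // The training set is everything outside that block.
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
        folds.push((train, validation));
        start = end;
    }
    folds
}

fn main() {
    // 150 samples (the size of Iris) split into 5 folds.
    for (i, (train, validation)) in k_fold_indices(150, 5).iter().enumerate() {
        println!("fold {}: {} train, {} validation", i, train.len(), validation.len());
    }
}
```

Each sample appears in exactly one validation set, which is the property that lets every sample be used for both training and evaluation across the folds.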
Core Components
🎯 Toy Datasets
Pre-loaded classic datasets for immediate use:
use scirs2_datasets::{load_boston, load_breast_cancer, load_diabetes, load_digits, load_iris, load_wine};
// All loaders return a Dataset<f64> with a consistent API
let iris = load_iris()?; // 150 samples, 4 features, 3 classes
let digits = load_digits()?; // 1797 samples, 64 features, 10 classes
let wine = load_wine()?; // 178 samples, 13 features, 3 classes
let cancer = load_breast_cancer()?; // 569 samples, 30 features, 2 classes
let diabetes = load_diabetes()?; // 442 samples, 10 features, regression
let boston = load_boston()?; // 506 samples, 13 features, regression
🔧 Data Generators
Comprehensive synthetic dataset creation:
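use scirs2_datasets::{make_circles, make_classification, make_moons, make_regression, make_spirals, make_swiss_roll, make_time_series};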
// Linear and non-linear patterns
let classification = make_classification?;
let regression = make_regression?;
let circles = make_circles?;
let moons = make_moons?;
// Complex patterns
let spirals = make_spirals?;
let swiss_roll = make_swiss_roll?;
// Time series
let ts = make_time_series?;
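As an illustration of what a non-linear pattern generator produces, the following standard-library sketch builds a noise-free "two concentric circles" dataset. It is only an approximation of the idea behind a generator such as `make_circles`. The function name `two_circles`, its parameters, and its return type are assumptions made for this example and do not reflect the crate's actual API.

```rust
use std::f64::consts::PI;

/// Hypothetical generator (not part of scirs2-datasets): place
/// `n_per_class` evenly spaced points on each of two concentric circles.
/// Label 0.0 marks the outer circle (radius 1.0) and label 1.0 marks the
/// inner circle (radius `factor`). No noise is added.
fn two_circles(n_per_class: usize, factor: f64) -> (Vec<[f64; 2]>, Vec<f64>) {
    let mut points = Vec::with_capacity(2 * n_per_class);
    let mut labels = Vec::with_capacity(2 * n_per_class);
    for (label, radius) in [(0.0, 1.0), (1.0, factor)] {
        for i in 0..n_per_class {
            let angle = 2.0 * PI * i as f64 / n_per_class as f64;
            points.push([radius * angle.cos(), radius * angle.sin()]);
            labels.push(label);
        }
    }
    (points, labels)
}

fn main() {
    let (points, labels) = two_circles(100, 0.5);
    println!("{} points, {} labels", points.len(), labels.len());
}
```

The two classes cannot be separated by a straight line, which is why datasets of this shape are commonly used to exercise non-linear classifiers.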
📊 Dataset Utilities
Complete toolkit for dataset manipulation:
use ;
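One of the utilities listed under Features is train/test splitting. The sketch below shows, using only the standard library, what a stratified split does: each class contributes roughly the same fraction of its samples to the test set. The function `stratified_split` and its signature are hypothetical and are not taken from the crate. The sketch also does not shuffle, which a real splitter would normally do.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper (not part of scirs2-datasets): split sample
/// indices into (train, test) so that each class sends about
/// `test_fraction` of its samples to the test set.
fn stratified_split(labels: &[u32], test_fraction: f64) -> (Vec<usize>, Vec<usize>) {
    assert!((0.0..1.0).contains(&test_fraction), "test_fraction must be in [0, 1)");
    // Group sample indices by class. BTreeMap keeps class order deterministic.
    let mut by_class: BTreeMap<u32, Vec<usize>> = BTreeMap::new();
    for (index, &label) in labels.iter().enumerate() {
        by_class.entry(label).or_default().push(index);
    }
    let (mut train, mut test) = (Vec::new(), Vec::new());
    for indices in by_class.values() {
        let n_test = (indices.len() as f64 * test_fraction).round() as usize;
        // The last `n_test` indices of each class go to the test set.
        let (head, tail) = indices.split_at(indices.len() - n_test);
        train.extend_from_slice(head);
        test.extend_from_slice(tail);
    }
    (train, test)
}

fn main() {
    // An imbalanced toy label vector: 4 samples of class 0, 8 of class 1.
    let labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1];
    let (train, test) = stratified_split(&labels, 0.25);
    println!("train = {:?}", train);
    println!("test  = {:?}", test);
}
```

Splitting per class keeps the class proportions of the test set close to those of the full dataset, which matters most when classes are imbalanced.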
⚡ Caching System
Efficient dataset management with automatic caching:
use ;
let cache = new?;
let stats = cache.get_statistics?;
println!;
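The general idea of caching with statistics can be sketched with a small in-memory cache that loads each dataset at most once and counts hits and misses. This is an illustration only: the type `DatasetCache`, the method `get_or_load`, and the counter fields are hypothetical and say nothing about how the crate's cache is implemented or what its statistics contain.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory cache (not the scirs2-datasets cache): stores
/// each dataset under its name and counts hits and misses.
struct DatasetCache {
    entries: HashMap<String, Vec<f64>>,
    hits: u64,
    misses: u64,
}

impl DatasetCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Return the cached data for `name`, calling `load` only on a miss.
    fn get_or_load<F>(&mut self, name: &str, load: F) -> &[f64]
    where
        F: FnOnce() -> Vec<f64>,
    {
        if self.entries.contains_key(name) {
            self.hits += 1;
        } else {
            self.misses += 1;
            self.entries.insert(name.to_string(), load());
        }
        &self.entries[name]
    }
}

fn main() {
    let mut cache = DatasetCache::new();
    // The first access runs the loader; the second is served from memory.
    cache.get_or_load("iris", || vec![5.1, 3.5, 1.4, 0.2]);
    cache.get_or_load("iris", || unreachable!("loader must not run on a hit"));
    println!("hits: {}, misses: {}", cache.hits, cache.misses);
}
```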
Performance & Reliability
- Memory Efficient: Lazy loading and memory-mapped access for large datasets
- Fast: Optimized algorithms with optional SIMD acceleration
- Reliable: SHA256 integrity verification and comprehensive error handling
- Cross-Platform: Consistent behavior across Windows, macOS, and Linux
- Well-Tested: 117+ unit tests with 100% API coverage
API Stability
The API is stable and production-ready. All public functions are thoroughly documented with examples. Breaking changes will only occur in major version updates (1.0.0+).
Integration
Seamlessly integrates with other SciRS2 modules:
use ;
// Use with scirs2-stats, scirs2-linalg, etc.
Contributing
See the project CONTRIBUTING.md for guidelines. Focus areas for contributions:
- Performance optimization and benchmarking
- Additional real-world datasets
- Advanced data generation algorithms
- Integration examples and tutorials
License
Dual-licensed under MIT or Apache License 2.0.