SciRS2 Datasets
A dataset loading and generation library for the SciRS2 scientific computing ecosystem. It provides classic toy datasets, synthetic data generators, time series benchmarks, graph datasets, image datasets, anomaly detection benchmarks, financial data, synthetic medical imaging, recommendation datasets, and more, all behind a consistent API inspired by scikit-learn's sklearn.datasets module.
Features
Classic Toy Datasets
- Iris: 150 samples, 4 features, 3 classes (Fisher's classic)
- Boston Housing: 506 samples, 13 features, regression (housing prices)
- Breast Cancer: 569 samples, 30 features, binary classification
- Wine: 178 samples, 13 features, 3 classes
- Digits: 1797 samples, 64 features (8x8 pixel images), 10 classes
- Diabetes: 442 samples, 10 features, regression
Synthetic Data Generators
- Classification: Linear and non-linear, configurable clusters, noise, redundant features
- Regression: Multi-output regression with configurable informative features
- Clustering: make_blobs (Gaussian), hierarchical cluster structures
- Non-linear patterns: make_spirals, make_moons, make_circles, make_swiss_roll
- Time series: AR/MA/ARIMA processes, seasonal, trend, noise-configurable generators
- Imbalanced datasets: Configurable class imbalance ratios
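To make the clustering generator concrete, here is a minimal, std-only sketch of how Gaussian blobs can be produced. It is not the crate's `make_blobs` implementation: the function name `gaussian_blobs`, its parameters, and the small `XorShift64` generator are all assumptions made for illustration.

```rust
use std::f64::consts::PI;

/// Tiny xorshift64* generator so the sketch needs no external crates (hypothetical helper).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would never change, so substitute a fixed non-zero constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let u1 = 1.0 - self.next_f64(); // in (0, 1], keeps ln() finite
        let u2 = self.next_f64();
        (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
    }
}

/// Hypothetical 2-D blob generator: samples are assigned to centers round-robin
/// and perturbed with isotropic Gaussian noise of standard deviation `std_dev`.
/// Returns (points, labels), where labels[i] indexes into `centers`.
fn gaussian_blobs(
    n_samples: usize,
    centers: &[[f64; 2]],
    std_dev: f64,
    seed: u64,
) -> (Vec<[f64; 2]>, Vec<usize>) {
    assert!(!centers.is_empty(), "at least one center is required");
    let mut rng = XorShift64::new(seed);
    let mut points = Vec::with_capacity(n_samples);
    let mut labels = Vec::with_capacity(n_samples);
    for i in 0..n_samples {
        let label = i % centers.len();
        let [cx, cy] = centers[label];
        points.push([
            cx + std_dev * rng.next_gaussian(),
            cy + std_dev * rng.next_gaussian(),
        ]);
        labels.push(label);
    }
    (points, labels)
}
```

Calling `gaussian_blobs(300, &[[0.0, 0.0], [5.0, 5.0]], 1.0, 42)` would give 300 labelled points around two centers; the same seed always reproduces the same points.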
Specialized Benchmark Datasets
- Graph: Cora (citation network), Citeseer, PROTEINS, benchmark graphs for GNN evaluation
- Image: MNIST-like, Fashion-MNIST-like, CIFAR-10 format (synthetic)
- Text: 20 Newsgroups (topics), IMDB (sentiment), NER datasets, QA datasets
- Anomaly Detection: KDD Cup, benchmark anomaly detection datasets
- Financial Time Series: Synthetic stock prices, volatility, portfolio data
- Medical Imaging: Synthetic MRI/CT-like volumes for algorithm testing
- Recommendation Systems: MovieLens-like interaction matrices, collaborative filtering benchmarks
- Physics Simulations: N-body dynamics, fluid simulation snapshots, wave equations
- Knowledge Graphs: Entity-relation triples for link prediction benchmarks
Dataset Utilities
- Cross-Validation: K-fold, stratified K-fold, time series split, group K-fold
- Train/Test Splitting: Random and stratified splits
- Sampling: Random, stratified, bootstrap, importance sampling
- Data Balancing: Random oversampling, random undersampling, SMOTE-like
- Feature Engineering: Polynomial features, binning, statistical feature extraction
- Scaling and Normalization: Min-max, robust scaling, standard scaling, L1/L2 normalization
- Caching: Platform-specific disk caching with SHA256 integrity verification
- Streaming Generators: Infinite generators for online learning benchmarks
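As an illustration of the scaling utilities listed above, the following std-only sketch shows what min-max and standard scaling compute for a single feature column. The function names and slice-based signatures are assumptions for this example, not the crate's API.

```rust
/// Min-max scaling: maps the column linearly onto [0, 1].
/// A constant column has no range, so it is mapped to all zeros.
fn min_max_scale(values: &[f64]) -> Vec<f64> {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    values
        .iter()
        .map(|&v| if range > 0.0 { (v - min) / range } else { 0.0 })
        .collect()
}

/// Standard scaling: subtracts the mean and divides by the population
/// standard deviation, giving a column with mean 0 and variance 1.
fn standard_scale(values: &[f64]) -> Vec<f64> {
    if values.is_empty() {
        return Vec::new();
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    values
        .iter()
        .map(|&v| if std_dev > 0.0 { (v - mean) / std_dev } else { 0.0 })
        .collect()
}
```

For the column `[2.0, 4.0, 6.0]`, `min_max_scale` returns `[0.0, 0.5, 1.0]`. To avoid leaking test-set statistics, the minimum, maximum, mean, and standard deviation are normally computed on the training split only and then reused for the test split; this sketch omits that step.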
Installation
[dependencies]
scirs2-datasets = "0.3.4"
With remote dataset download support:
[dependencies]
scirs2-datasets = { version = "0.3.4", features = ["download"] }
Quick Start
Classic Datasets
use ;
let iris = load_iris?;
let boston = load_boston?;
let digits = load_digits?;
let wine = load_wine?;
let cancer = load_breast_cancer?;
let diabetes = load_diabetes?;
println!;
Synthetic Data
use ;
// Classification dataset: 1000 samples, 10 features, 3 classes
let clf_data = make_classification?;
// Regression dataset: 500 samples, 5 features, 3 informative
let reg_data = make_regression?;
// Clustering: 300 samples, 4 Gaussian clusters
let blobs = make_blobs?;
// Non-linear patterns
let spirals = make_spirals?;
let moons = make_moons?;
let circles = make_circles?;
let swiss_roll = make_swiss_roll?;
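For readers unfamiliar with the non-linear patterns above, here is a rough sketch of the geometry behind a two-moons dataset, in the style of scikit-learn's generator of the same name. The name `two_moons`, its signature, and the uniform jitter (used here in place of Gaussian noise to keep the code short) are assumptions, not the crate's `make_moons`.

```rust
use std::f64::consts::PI;

/// Tiny xorshift64* generator so the sketch needs no external crates (hypothetical helper).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would never change, so substitute a fixed non-zero constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Two interleaving half circles. Class 0 is the upper half of the unit circle;
/// class 1 is a lower half circle shifted right by 1 and up by 0.5.
/// `noise` is the half-width of a uniform jitter added to every coordinate.
fn two_moons(n_samples: usize, noise: f64, seed: u64) -> (Vec<[f64; 2]>, Vec<u8>) {
    let mut rng = XorShift64::new(seed);
    let n_outer = n_samples / 2;
    let n_inner = n_samples - n_outer;
    let mut points = Vec::with_capacity(n_samples);
    let mut labels = Vec::with_capacity(n_samples);
    for i in 0..n_outer {
        // Angles are spaced evenly over [0, pi].
        let t = PI * i as f64 / (n_outer.max(2) - 1) as f64;
        points.push([t.cos(), t.sin()]);
        labels.push(0u8);
    }
    for i in 0..n_inner {
        let t = PI * i as f64 / (n_inner.max(2) - 1) as f64;
        points.push([1.0 - t.cos(), 0.5 - t.sin()]);
        labels.push(1u8);
    }
    for p in points.iter_mut() {
        p[0] += noise * (2.0 * rng.next_f64() - 1.0);
        p[1] += noise * (2.0 * rng.next_f64() - 1.0);
    }
    (points, labels)
}
```

The two classes cannot be separated by a straight line, which is why this shape is a common smoke test for non-linear classifiers and clustering methods.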
Time Series
use ;
// Generic time series with trend and seasonality
let ts = make_time_series?;
// Specific ARIMA process
let arima_ts = make_arima_series?;
// Load standard benchmark datasets
let m4_daily = load?;
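The simplest member of the AR/MA/ARIMA family is the first-order autoregressive process, x_t = phi * x_(t-1) + e_t with Gaussian noise e_t. The sketch below generates such a series using only the standard library; `ar1_series` and its parameters are hypothetical names and do not describe the signature of `make_arima_series`.

```rust
use std::f64::consts::PI;

/// Tiny xorshift64* generator so the sketch needs no external crates (hypothetical helper).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would never change, so substitute a fixed non-zero constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let u1 = 1.0 - self.next_f64(); // in (0, 1], keeps ln() finite
        let u2 = self.next_f64();
        (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
    }
}

/// Simulates x_t = phi * x_(t-1) + sigma * z_t, with z_t standard normal
/// and the value before the first sample taken to be 0.
/// The process is stationary when |phi| < 1.
fn ar1_series(n: usize, phi: f64, sigma: f64, seed: u64) -> Vec<f64> {
    let mut rng = XorShift64::new(seed);
    let mut series = Vec::with_capacity(n);
    let mut prev = 0.0;
    for _ in 0..n {
        let x = phi * prev + sigma * rng.next_gaussian();
        series.push(x);
        prev = x;
    }
    series
}
```

Values of `phi` close to 1 give slowly wandering series, while values near 0 give something close to white noise.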
Graph Datasets
use ;
let cora = load_cora?; // 2708 nodes, 5429 edges, 7 classes
let citeseer = load_citeseer?; // 3327 nodes, 4732 edges, 6 classes
let proteins = load_proteins?; // 1113 graphs, graph classification
println!;
Anomaly Detection Benchmarks
use ;
// KDD Cup 99 subset
let kdd = load_anomaly_benchmark?;
println!;
// Synthetic anomaly injection
use make_anomaly_dataset;
let = make_anomaly_dataset?;
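Synthetic anomaly injection can be pictured as picking a random subset of samples, perturbing them, and keeping the ground-truth labels. The following is a minimal sketch under that assumption; `inject_anomalies`, its parameters, and the fixed-magnitude shift are illustrative choices rather than the behavior of `make_anomaly_dataset`.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates (hypothetical helper).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would never change, so substitute a fixed non-zero constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Shifts a `contamination` fraction of the samples by +/- `magnitude`
/// and returns one label per sample (`true` marks an injected anomaly).
fn inject_anomalies(data: &mut [f64], contamination: f64, magnitude: f64, seed: u64) -> Vec<bool> {
    let n = data.len();
    let fraction = contamination.clamp(0.0, 1.0);
    let n_anomalies = (n as f64 * fraction).round() as usize;
    let mut rng = XorShift64::new(seed);
    let mut labels = vec![false; n];
    // Partial Fisher-Yates shuffle: the first `n_anomalies` slots end up holding
    // distinct random indices. The modulo introduces a negligible bias here.
    let mut indices: Vec<usize> = (0..n).collect();
    for i in 0..n_anomalies {
        let j = i + (rng.next_u64() % (n - i) as u64) as usize;
        indices.swap(i, j);
        let idx = indices[i];
        let sign = if rng.next_f64() < 0.5 { -1.0 } else { 1.0 };
        data[idx] += sign * magnitude;
        labels[idx] = true;
    }
    labels
}
```

Returning the labels alongside the modified data is what makes the result usable as a benchmark: detector scores can be compared against the known anomaly positions.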
Text Datasets
use ;
let newsgroups = load_newsgroups?;
let imdb = load_imdb?; // 5000 reviews, balanced
let ner_data = load_ner_dataset?;
Financial Data
use ;
let config = FinancialDataConfig ;
let prices = make_financial_series?;
println!;
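A common way to produce synthetic stock prices is geometric Brownian motion, where each step multiplies the price by exp((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * z) for a standard normal z. Whether `make_financial_series` uses this model is not stated here, so the sketch below is an assumption; `gbm_prices` and its parameters are hypothetical.

```rust
use std::f64::consts::PI;

/// Tiny xorshift64* generator so the sketch needs no external crates (hypothetical helper).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would never change, so substitute a fixed non-zero constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let u1 = 1.0 - self.next_f64(); // in (0, 1], keeps ln() finite
        let u2 = self.next_f64();
        (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
    }
}

/// Simulates a geometric Brownian motion price path.
/// `mu` is the annualized drift, `sigma` the annualized volatility and `dt`
/// the step length in years (1.0 / 252.0 for one trading day).
/// The returned vector has `n_steps + 1` entries and starts at `s0`.
fn gbm_prices(n_steps: usize, s0: f64, mu: f64, sigma: f64, dt: f64, seed: u64) -> Vec<f64> {
    let mut rng = XorShift64::new(seed);
    let drift = (mu - 0.5 * sigma * sigma) * dt;
    let diffusion = sigma * dt.sqrt();
    let mut prices = Vec::with_capacity(n_steps + 1);
    let mut price = s0;
    prices.push(price);
    for _ in 0..n_steps {
        // Multiplicative update keeps the price strictly positive.
        price *= (drift + diffusion * rng.next_gaussian()).exp();
        prices.push(price);
    }
    prices
}
```

For example, `gbm_prices(252, 100.0, 0.05, 0.2, 1.0 / 252.0, 11)` simulates roughly one trading year of daily prices starting at 100 with 5% drift and 20% volatility.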
Cross-Validation
use ;
let iris = load_iris?;
// Standard K-fold
let folds = k_fold_split?;
for in folds.iter.enumerate
// Stratified split
if let Some = &iris.target
// Time series split (no data leakage)
let ts_folds = time_series_split?;
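The difference between standard K-fold and a time series split is easiest to see at the index level. The sketch below computes train/test index sets for both, using only the standard library; the function names and return types are assumptions and may differ from `k_fold_split` and `time_series_split`.

```rust
/// Splits indices 0..n into `k` contiguous folds and returns one
/// (train_indices, test_indices) pair per fold. No shuffling is done.
fn k_fold_indices(n: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n, "need 2 <= k <= n");
    let base = n / k;
    let extra = n % k; // the first `extra` folds receive one additional sample
    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + if fold < extra { 1 } else { 0 };
        let end = start + size;
        let test: Vec<usize> = (start..end).collect();
        let train: Vec<usize> = (0..start).chain(end..n).collect();
        folds.push((train, test));
        start = end;
    }
    folds
}

/// Expanding-window split for ordered data: every test block lies strictly
/// after its training window, so no future observation leaks into training.
fn time_series_indices(n: usize, n_splits: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(n_splits >= 1 && n_splits < n, "need 1 <= n_splits < n");
    let test_size = n / (n_splits + 1);
    let mut splits = Vec::with_capacity(n_splits);
    for i in 0..n_splits {
        let train_end = n - (n_splits - i) * test_size;
        let train: Vec<usize> = (0..train_end).collect();
        let test: Vec<usize> = (train_end..train_end + test_size).collect();
        splits.push((train, test));
    }
    splits
}
```

With 10 samples, `k_fold_indices(10, 3)` yields test folds of sizes 4, 3, and 3, whereas `time_series_indices(10, 3)` yields the test blocks `[4, 5]`, `[6, 7]`, and `[8, 9]`, each preceded by a growing training window. In K-fold, training indices may come after the test fold, which is exactly the leakage a time series split avoids.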
Caching System
use ;
let cache = new?;
let stats = cache.get_statistics?;
println!;
// Clear specific dataset from cache
cache.remove?;
// Clear all cached datasets
cache.clear?;
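The cache lives in a platform-specific directory (XDG on Linux, Application Support on macOS, AppData on Windows). The sketch below shows one way such a lookup could work with the standard library alone. The function `cache_dir`, the `scirs2-datasets` subdirectory name, and the exact environment variables consulted are assumptions, not the crate's actual resolution logic.

```rust
use std::path::PathBuf;

/// Resolves a per-user cache directory for the given OS name.
/// `env` looks up an environment variable; passing it in keeps the function
/// pure and easy to test. Returns `None` if the needed variable is unset
/// or the OS is not recognized.
fn cache_dir(os: &str, env: &dyn Fn(&str) -> Option<String>) -> Option<PathBuf> {
    let base = match os {
        // XDG Base Directory convention, falling back to ~/.cache.
        "linux" => env("XDG_CACHE_HOME")
            .map(PathBuf::from)
            .or_else(|| env("HOME").map(|home| PathBuf::from(home).join(".cache"))),
        "macos" => env("HOME")
            .map(|home| PathBuf::from(home).join("Library").join("Application Support")),
        "windows" => env("LOCALAPPDATA").map(PathBuf::from),
        _ => None,
    }?;
    // Hypothetical subdirectory name used only for this example.
    Some(base.join("scirs2-datasets"))
}
```

In a real program the call would look like `cache_dir(std::env::consts::OS, &|key: &str| std::env::var(key).ok())`. SHA256 verification of downloaded files is not shown, since the standard library has no SHA256 implementation.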
Dataset API
All datasets implement the Dataset<F> trait.
Module Map
| Module | Contents |
|---|---|
| toy_datasets | Iris, Boston, Digits, Wine, Breast Cancer, Diabetes |
| generators | make_classification, make_regression, make_blobs, non-linear patterns |
| time_series_benchmarks | ARIMA generators, M4/M5 format, seasonal decomposition datasets |
| graph_datasets | Cora, Citeseer, PROTEINS, synthetic graph generators |
| image_datasets | MNIST-like, CIFAR-10 format, synthetic images |
| text_datasets | 20 Newsgroups, IMDB, NER, QA |
| anomaly_benchmarks | KDD Cup, synthetic anomaly injection, detection benchmarks |
| financial | Synthetic asset prices, volatility, portfolio construction |
| medical_datasets | Synthetic MRI/CT volumes, segmentation masks |
| recommendation_datasets | MovieLens-like, collaborative filtering matrices |
| graph_benchmarks | GNN benchmark suites |
| regression_benchmarks | Regression performance benchmarks |
| imbalanced | Imbalanced classification datasets and SMOTE-like |
| synthetic_signals | Synthetic signal datasets for DSP algorithms |
| physics | N-body, fluid simulation, wave equation snapshots |
| knowledge_graph_datasets | Entity-relation triples for KG tasks |
| benchmark | Comprehensive ML algorithm benchmarks |
| utils | Cross-validation, train/test split, sampling, scaling |
| cache | Disk caching with SHA256 verification |
Performance
- Memory-efficient loading: Lazy loading and memory-mapped access via scirs2-io
- Fast generators: Vectorized synthetic data generation using the scirs2-core RNG
- Integrity verified: SHA256 checksums on all cached downloads
- Cross-platform caching: Platform-specific cache directories (XDG on Linux, Application Support on macOS, AppData on Windows)
- Test coverage: 117+ unit tests, 100% public API coverage
Integration
Works seamlessly with other SciRS2 crates:
use load_iris;
use normal;
use pca;
use accuracy_score;
let iris = load_iris?;
// Feed directly into scirs2-linalg, scirs2-stats, scirs2-metrics, etc.
License
Licensed under the Apache License 2.0. See LICENSE for details.
Authors
COOLJAPAN OU (Team KitaSan)