§SciRS2 Datasets - Dataset Loading and Generation
scirs2-datasets provides dataset utilities modeled after scikit-learn’s datasets module,
offering toy datasets (Iris, Boston, MNIST), synthetic data generators, cross-validation splitters,
and data preprocessing utilities for machine learning workflows.
§🎯 Key Features
- Toy Datasets: Classic datasets (Iris, Boston Housing, Breast Cancer, Digits)
- Data Generators: Synthetic data for classification, regression, clustering
- Cross-Validation: K-fold, stratified, time series CV splitters
- Preprocessing: Train/test split, normalization, feature scaling
- Caching: Efficient disk caching for downloaded datasets
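As a rough illustration of what the preprocessing utilities above do conceptually, the following std-only sketch shuffles sample indices into a train/test partition and rescales one feature column to [0, 1]. The helpers `train_test_indices`, `min_max_scale_column`, and the tiny `Lcg` generator are hypothetical names written for this example; they are not the crate's API, whose `train_test_split` and `min_max_scale` signatures are not shown on this page.

```rust
/// Minimal linear congruential generator so the sketch needs no external crates.
/// Hypothetical helper for illustration only; not suitable for statistical work.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Drop the low bits, which have short periods in an LCG.
        self.0 >> 33
    }
}

/// Shuffle indices 0..n with Fisher-Yates, then split off `test_fraction` of them.
/// Returns (train_indices, test_indices). Assumes 0.0 <= test_fraction <= 1.0.
fn train_test_indices(n: usize, test_fraction: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..n).collect();
    let mut rng = Lcg(seed);
    for i in (1..n).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    let n_test = ((n as f64) * test_fraction).round() as usize;
    let test = idx.split_off(n - n_test);
    (idx, test)
}

/// Rescale one feature column to [0, 1]; a constant column maps to all zeros.
fn min_max_scale_column(col: &[f64]) -> Vec<f64> {
    let min = col.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = col.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    col.iter()
        .map(|&x| if range > 0.0 { (x - min) / range } else { 0.0 })
        .collect()
}

fn main() {
    let (train, test) = train_test_indices(10, 0.3, 42);
    println!("train = {:?}, test = {:?}", train, test);

    let scaled = min_max_scale_column(&[2.0, 4.0, 6.0]);
    println!("scaled = {:?}", scaled);
}
```

In real use, the scaling minimum and maximum would normally be computed on the training rows only and then applied to the test rows, so that no information from the test set leaks into preprocessing.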
§📦 Module Overview
| SciRS2 Function | scikit-learn Equivalent | Description |
|---|---|---|
| load_iris | sklearn.datasets.load_iris | Classic Iris classification dataset |
| load_boston | sklearn.datasets.load_boston | Boston housing regression dataset |
| make_classification | sklearn.datasets.make_classification | Synthetic classification data |
| make_regression | sklearn.datasets.make_regression | Synthetic regression data |
| make_blobs | sklearn.datasets.make_blobs | Synthetic clustering data |
| k_fold_split | sklearn.model_selection.KFold | K-fold cross-validation |
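To make the K-fold row of the table concrete, here is a minimal std-only sketch of how K-fold index generation can work. The function `k_fold_indices` is a hypothetical stand-in written for this example, not the crate's `k_fold_split`; it omits the shuffle flag and seed and assumes the fold-size convention that scikit-learn documents for `KFold`.

```rust
/// Split indices 0..n_samples into `n_folds` (train, validation) pairs.
/// The first `n_samples % n_folds` folds receive one extra validation sample,
/// which is the convention scikit-learn documents for KFold.
/// Hypothetical helper for illustration; no shuffling is performed.
fn k_fold_indices(n_samples: usize, n_folds: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(
        n_folds >= 2 && n_folds <= n_samples,
        "need 2 <= n_folds <= n_samples"
    );
    let base = n_samples / n_folds;
    let extra = n_samples % n_folds;
    let mut folds = Vec::with_capacity(n_folds);
    let mut start = 0;
    for fold in 0..n_folds {
        let size = base + if fold < extra { 1 } else { 0 };
        let end = start + size;
        // Validation indices are one contiguous block ...
        let validation: Vec<usize> = (start..end).collect();
        // ... and the training indices are everything outside that block.
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
        folds.push((train, validation));
        start = end;
    }
    folds
}

fn main() {
    for (i, (train, validation)) in k_fold_indices(7, 3).iter().enumerate() {
        println!("fold {}: train={:?} validation={:?}", i, train, validation);
    }
}
```

Every sample appears in exactly one validation block, so each fold's training and validation sets together cover the whole dataset.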
§🚀 Quick Start
[dependencies]
scirs2-datasets = "0.1.5"

use scirs2_datasets::{load_iris, make_classification};
// Load the classic Iris dataset
let iris = load_iris().expect("Operation failed");
println!("{} samples, {} features", iris.n_samples(), iris.n_features());
// Generate synthetic classification data
let data = make_classification(100, 5, 3, 2, 4, Some(42)).expect("Operation failed");
§🔒 Version: 0.2.0 (February 8, 2026)
§v0.2.0 New Features
- Lazy Loading: Memory-mapped datasets with zero-copy views
- Data Augmentation: GPU-accelerated augmentation pipeline
- Parallel Preprocessing: Multi-threaded preprocessing with work-stealing
- Distributed Loading: Shard-aware loading for distributed training
- Format Support: Parquet, Arrow, HDF5 integration via scirs2-io
- Benchmarks: Comprehensive comparison with PyTorch DataLoader
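The idea behind shard-aware loading can be sketched without any crate-specific types: each worker derives, from its rank and the total worker count, the slice of sample indices it is responsible for. The function `shard_range` below is a hypothetical helper written for this example and makes no claim about how the `distributed_loading` module partitions data internally.

```rust
use std::ops::Range;

/// Contiguous range of sample indices owned by worker `rank` out of `world_size`.
/// Remainder samples go to the lowest-ranked workers, so shard sizes differ by
/// at most one. Hypothetical helper for illustration only.
fn shard_range(n_samples: usize, world_size: usize, rank: usize) -> Range<usize> {
    assert!(
        world_size > 0 && rank < world_size,
        "rank must be in 0..world_size"
    );
    let base = n_samples / world_size;
    let extra = n_samples % world_size;
    // Each earlier rank contributes `base` samples, plus one more if it is
    // among the first `extra` ranks.
    let start = rank * base + rank.min(extra);
    let len = base + if rank < extra { 1 } else { 0 };
    start..start + len
}

fn main() {
    let n_samples = 10;
    let world_size = 4;
    for rank in 0..world_size {
        println!(
            "rank {} loads samples {:?}",
            rank,
            shard_range(n_samples, world_size, rank)
        );
    }
}
```

Because the ranges are disjoint and cover 0..n_samples, each worker can read only its own slice and no sample is loaded twice.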
§Examples
§Loading toy datasets
use scirs2_datasets::{load_iris, load_boston};
// Load the classic Iris dataset
let iris = load_iris().expect("Operation failed");
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
// Load the Boston housing dataset
let boston = load_boston().expect("Operation failed");
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());
§Generating synthetic datasets
use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};
// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).expect("Operation failed");
println!("Classification dataset: {} samples, {} features, {} classes",
    classification.n_samples(), classification.n_features(), 3);
// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).expect("Operation failed");
println!("Regression dataset: {} samples, {} features",
    regression.n_samples(), regression.n_features());
// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).expect("Operation failed");
println!("Blobs dataset: {} samples, {} features, {} clusters",
    blobs.n_samples(), blobs.n_features(), 4);
// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).expect("Operation failed");
let moons = make_moons(150, 0.05, Some(42)).expect("Operation failed");
§Cross-validation
use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};
let iris = load_iris().expect("Operation failed");
// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).expect("Operation failed");
println!("Created {} folds for K-fold CV", k_folds.len());
// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).expect("Operation failed");
    println!("Created {} stratified folds", stratified_folds.len());
}
§Dataset manipulation
use scirs2_datasets::{load_iris, Dataset};
let iris = load_iris().expect("Operation failed");
// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(featurenames) = iris.featurenames() {
    println!("Features: {:?}", featurenames);
}
Re-exports§
pub use adaptive_streaming_engine::create_adaptive_engine;
pub use adaptive_streaming_engine::create_adaptive_engine_with_config;
pub use adaptive_streaming_engine::AdaptiveStreamConfig;
pub use adaptive_streaming_engine::AdaptiveStreamingEngine;
pub use adaptive_streaming_engine::AlertSeverity;
pub use adaptive_streaming_engine::AlertType;
pub use adaptive_streaming_engine::ChunkMetadata;
pub use adaptive_streaming_engine::DataCharacteristics;
pub use adaptive_streaming_engine::MemoryStrategy;
pub use adaptive_streaming_engine::PatternType;
pub use adaptive_streaming_engine::PerformanceMetrics;
pub use adaptive_streaming_engine::QualityAlert;
pub use adaptive_streaming_engine::QualityMetrics;
pub use adaptive_streaming_engine::StatisticalMoments;
pub use adaptive_streaming_engine::StreamChunk;
pub use adaptive_streaming_engine::TrendDirection;
pub use adaptive_streaming_engine::TrendIndicators;
pub use advanced_generators::make_adversarial_examples;
pub use advanced_generators::make_anomaly_dataset;
pub use advanced_generators::make_continual_learning_dataset;
pub use advanced_generators::make_domain_adaptation_dataset;
pub use advanced_generators::make_few_shot_dataset;
pub use advanced_generators::make_multitask_dataset;
pub use advanced_generators::AdversarialConfig;
pub use advanced_generators::AnomalyConfig;
pub use advanced_generators::AnomalyType;
pub use advanced_generators::AttackMethod;
pub use advanced_generators::ContinualLearningDataset;
pub use advanced_generators::DomainAdaptationConfig;
pub use advanced_generators::DomainAdaptationDataset;
pub use advanced_generators::FewShotDataset;
pub use advanced_generators::MultiTaskConfig;
pub use advanced_generators::MultiTaskDataset;
pub use advanced_generators::TaskType;
pub use benchmarks::BenchmarkResult;
pub use benchmarks::BenchmarkRunner;
pub use benchmarks::BenchmarkSuite;
pub use benchmarks::PerformanceComparison;
pub use cloud::presets::azure_client;
pub use cloud::presets::gcs_client;
pub use cloud::presets::public_s3_client;
pub use cloud::presets::s3_client;
pub use cloud::presets::s3_compatible_client;
pub use cloud::public_datasets::AWSOpenData;
pub use cloud::public_datasets::AzureOpenData;
pub use cloud::public_datasets::GCPPublicData;
pub use cloud::CloudClient;
pub use cloud::CloudConfig;
pub use cloud::CloudCredentials;
pub use cloud::CloudProvider;
pub use distributed::DistributedConfig;
pub use distributed::DistributedProcessor;
pub use distributed::ScalingMethod;
pub use distributed::ScalingParameters;
pub use domain_specific::astronomy::StellarDatasets;
pub use domain_specific::climate::ClimateDatasets;
pub use domain_specific::convenience::list_domain_datasets;
pub use domain_specific::convenience::load_atmospheric_chemistry;
pub use domain_specific::convenience::load_climate_data;
pub use domain_specific::convenience::load_exoplanets;
pub use domain_specific::convenience::load_gene_expression;
pub use domain_specific::convenience::load_stellar_classification;
pub use domain_specific::genomics::GenomicsDatasets;
pub use domain_specific::DomainConfig;
pub use domain_specific::QualityFilters;
pub use explore::convenience::explore;
pub use explore::convenience::export_summary;
pub use explore::convenience::info;
pub use explore::convenience::quick_summary;
pub use explore::DatasetExplorer;
pub use explore::DatasetSummary;
pub use explore::ExploreConfig;
pub use explore::FeatureStatistics;
pub use explore::InferredDataType;
pub use explore::OutputFormat;
pub use explore::QualityAssessment;
pub use external::convenience::list_uci_datasets;
pub use external::convenience::load_from_url_sync;
pub use external::repositories::GitHubRepository;
pub use external::repositories::KaggleRepository;
pub use external::repositories::UCIRepository;
pub use external::ExternalClient;
pub use external::ExternalConfig;
pub use external::ProgressCallback;
pub use ml_integration::convenience::create_experiment;
pub use ml_integration::convenience::cv_split;
pub use ml_integration::convenience::prepare_for_ml;
pub use ml_integration::convenience::train_test_split;
pub use ml_integration::CrossValidationResults;
pub use ml_integration::DataSplit;
pub use ml_integration::MLExperiment;
pub use ml_integration::MLPipeline;
pub use ml_integration::MLPipelineConfig;
pub use ml_integration::ScalingMethod as MLScalingMethod;
pub use cache::get_cachedir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use external::convenience::load_from_url;
pub use external::convenience::load_github_dataset;
pub use external::convenience::load_uci_dataset;
pub use generators::add_time_series_noise;
pub use generators::benchmark_gpu_vs_cpu;
pub use generators::get_gpu_info;
pub use generators::gpu_is_available;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_blobs_gpu;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_classification_gpu;
pub use generators::make_corrupted_dataset;
pub use generators::make_helix;
pub use generators::make_hierarchical_clusters;
pub use generators::make_intersecting_manifolds;
pub use generators::make_manifold;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_regression_gpu;
pub use generators::make_s_curve;
pub use generators::make_severed_sphere;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_swiss_roll_advanced;
pub use generators::make_time_series;
pub use generators::make_torus;
pub use generators::make_twin_peaks;
pub use generators::ManifoldConfig;
pub use generators::ManifoldType;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use generators::time_series::make_ar_process;
pub use generators::time_series::make_random_walk;
pub use generators::time_series::make_seasonal;
pub use generators::time_series::make_sine_wave;
pub use generators::graph::make_barabasi_albert;
pub use generators::graph::make_karate_club;
pub use generators::graph::make_random_graph;
pub use generators::graph::make_watts_strogatz;
pub use generators::sparse::make_sparse_banded;
pub use generators::sparse::make_sparse_laplacian;
pub use generators::sparse::make_sparse_spd;
pub use generators::classification::make_classification_enhanced;
pub use generators::classification::make_hastie_10_2;
pub use generators::classification::make_multilabel_classification;
pub use generators::classification::ClassificationConfig;
pub use generators::classification::MultilabelConfig;
pub use generators::classification::MultilabelDataset;
pub use generators::regression::make_friedman1;
pub use generators::regression::make_friedman2;
pub use generators::regression::make_friedman3;
pub use generators::regression::make_low_rank_matrix;
pub use generators::structured::make_biclusters;
pub use generators::structured::make_checkerboard;
pub use generators::structured::make_sparse_coded_signal;
pub use generators::structured::make_sparse_spd_matrix;
pub use generators::structured::make_spd_matrix;
pub use gpu::get_optimal_gpu_config;
pub use gpu::is_cuda_available;
pub use gpu::is_opencl_available;
pub use gpu::list_gpu_devices;
pub use gpu::make_blobs_auto_gpu;
pub use gpu::make_classification_auto_gpu;
pub use gpu::make_regression_auto_gpu;
pub use gpu::GpuBackend;
pub use gpu::GpuBenchmark;
pub use gpu::GpuBenchmarkResults;
pub use gpu::GpuConfig;
pub use gpu::GpuContext;
pub use gpu::GpuDeviceInfo;
pub use gpu::GpuMemoryConfig;
pub use gpu_optimization::benchmark_advanced_performance;
pub use gpu_optimization::generate_advanced_matrix;
pub use gpu_optimization::AdvancedGpuOptimizer;
pub use gpu_optimization::AdvancedKernelConfig;
pub use gpu_optimization::BenchmarkResult as AdvancedBenchmarkResult;
pub use gpu_optimization::DataLayout;
pub use gpu_optimization::LoadBalancingMethod;
pub use gpu_optimization::MemoryAccessPattern;
pub use gpu_optimization::PerformanceBenchmarkResults;
pub use gpu_optimization::SpecializationLevel;
pub use gpu_optimization::VectorizationStrategy;
pub use loaders::load_csv;
pub use loaders::load_csv_legacy;
pub use loaders::load_csv_parallel;
pub use loaders::load_csv_streaming;
pub use loaders::load_json;
pub use loaders::load_raw;
pub use loaders::save_json;
pub use loaders::CsvConfig;
pub use loaders::DatasetChunkIterator;
pub use loaders::StreamingConfig;
pub use neuromorphic_data_processor::create_neuromorphic_processor;
pub use neuromorphic_data_processor::create_neuromorphic_processor_with_topology;
pub use neuromorphic_data_processor::NetworkTopology;
pub use neuromorphic_data_processor::NeuromorphicProcessor;
pub use neuromorphic_data_processor::NeuromorphicTransform;
pub use neuromorphic_data_processor::SynapticPlasticity;
pub use quantum_enhanced_generators::make_quantum_blobs;
pub use quantum_enhanced_generators::make_quantum_classification;
pub use quantum_enhanced_generators::make_quantum_regression;
pub use quantum_enhanced_generators::QuantumDatasetGenerator;
pub use quantum_neuromorphic_fusion::create_fusion_with_params;
pub use quantum_neuromorphic_fusion::create_quantum_neuromorphic_fusion;
pub use quantum_neuromorphic_fusion::QuantumBioFusionResult;
pub use quantum_neuromorphic_fusion::QuantumInterference;
pub use quantum_neuromorphic_fusion::QuantumNeuromorphicFusion;
pub use real_world::list_real_world_datasets;
pub use real_world::load_adult;
pub use real_world::load_california_housing;
pub use real_world::load_heart_disease;
pub use real_world::load_red_wine_quality;
pub use real_world::load_titanic;
pub use real_world::RealWorldConfig;
pub use real_world::RealWorldDatasets;
pub use registry::get_registry;
pub use registry::load_dataset_byname;
pub use registry::DatasetMetadata;
pub use registry::DatasetRegistry;
pub use standard::load_boston as load_boston_full;
pub use standard::load_breast_cancer as load_breast_cancer_full;
pub use standard::load_digits as load_digits_full;
pub use standard::load_iris as load_iris_full;
pub use standard::load_wine;
pub use standard::DatasetResult;
pub use streaming::stream_classification;
pub use streaming::stream_csv;
pub use streaming::stream_regression;
pub use streaming::DataChunk;
pub use streaming::StreamConfig;
pub use streaming::StreamProcessor;
pub use streaming::StreamStats;
pub use streaming::StreamTransformer;
pub use streaming::StreamingIterator;
pub use utils::analyze_dataset_advanced;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::quick_quality_assessment;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::AdvancedDatasetAnalyzer;
pub use utils::AdvancedQualityMetrics;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CorrelationInsights;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use utils::NormalityAssessment;
pub use lazy_loading::from_binary as lazy_from_binary;
pub use lazy_loading::from_binary_with_config as lazy_from_binary_with_config;
pub use lazy_loading::LazyChunkIterator;
pub use lazy_loading::LazyDataset;
pub use lazy_loading::LazyLoadConfig;
pub use lazy_loading::MmapDataset;
pub use augmentation::standard_image_augmentation;
pub use augmentation::standard_tabular_augmentation;
pub use augmentation::AugmentationPipeline;
pub use augmentation::Brightness;
pub use augmentation::Contrast;
pub use augmentation::GaussianNoise;
pub use augmentation::HorizontalFlip;
pub use augmentation::Mixup;
pub use augmentation::RandomFeatureScale;
pub use augmentation::RandomRotation90;
pub use augmentation::Transform;
pub use augmentation::VerticalFlip;
pub use parallel_preprocessing::create_pipeline;
pub use parallel_preprocessing::create_pipeline_with_config;
pub use parallel_preprocessing::ParallelConfig;
pub use parallel_preprocessing::ParallelPipeline;
pub use parallel_preprocessing::PreprocessFn;
pub use distributed_loading::create_loader;
pub use distributed_loading::create_loader_with_config;
pub use distributed_loading::DistributedCache;
pub use distributed_loading::DistributedConfig as DistributedLoadingConfig;
pub use distributed_loading::DistributedLoader;
pub use distributed_loading::Shard;
pub use formats::CompressionCodec;
pub use formats::FormatConfig;
pub use formats::FormatType;
pub use formats::read_auto;
pub use formats::read_hdf5;
pub use formats::read_parquet;
pub use formats::write_hdf5;
pub use formats::write_parquet;
pub use formats::FormatConverter;
pub use formats::Hdf5Reader;
pub use formats::Hdf5Writer;
pub use formats::ParquetReader;
pub use formats::ParquetWriter;
pub use sample::*;
pub use toy::*;
Modules§
- adaptive_streaming_engine
- Auto-generated module structure
- advanced_generators
- Advanced synthetic data generators
- augmentation
- Data augmentation pipeline with GPU support
- benchmarks
- Performance benchmarking utilities
- cache
- Dataset caching functionality
- cloud
- Cloud storage integration for datasets
- distributed
- Distributed dataset processing capabilities
- distributed_loading
- Distributed dataset loading
- domain_specific
- Domain-specific datasets for scientific research
- error
- Error types for the datasets module
- explore
- Interactive dataset exploration and analysis tools
- external
- External data sources integration
- formats
- Format support (Parquet, Arrow, HDF5)
- generators
- Dataset generators
- gpu
- GPU acceleration for dataset operations
- gpu_optimization
- Advanced GPU Optimization Engine
- lazy_loading
- Lazy loading and memory-mapped datasets
- loaders
- Data loading utilities
- ml_integration
- Machine learning pipeline integration
- neuromorphic_data_processor
- Neuromorphic Data Processing Engine
- parallel_preprocessing
- Parallel data preprocessing
- platform_dirs
- Pure Rust platform directory detection (replaces the dirs crate for COOLJAPAN Pure Rust policy)
- quantum_enhanced_generators
- Quantum-Enhanced Data Generation Engine
- quantum_neuromorphic_fusion
- Quantum-Neuromorphic Fusion Engine
- real_world
- Auto-generated module structure
- registry
- Dataset registry system for managing dataset metadata and locations
- sample
- Sample datasets for testing and demonstration
- stability
- API stability guarantees and compatibility documentation
- standard
- Standard benchmark datasets (fully embedded, no download required)
- streaming
- Streaming support for large datasets
- time_series
- Time series datasets.
- toy
- Toy datasets for testing and examples
- utils
- Core utilities for working with datasets
Macros§
- api_stability
- Macro to easily annotate APIs with stability information