Crate scirs2_datasets

§SciRS2 Datasets - Dataset Loading and Generation

scirs2-datasets provides dataset utilities modeled after scikit-learn’s datasets module. It offers toy datasets (Iris, Boston Housing, Breast Cancer, Digits), synthetic data generators, cross-validation splitters, and data preprocessing utilities for machine learning workflows.

§🎯 Key Features

  • Toy Datasets: Classic datasets (Iris, Boston Housing, Breast Cancer, Digits)
  • Data Generators: Synthetic data for classification, regression, clustering
  • Cross-Validation: K-fold, stratified, time series CV splitters
  • Preprocessing: Train/test split, normalization, feature scaling
  • Caching: Efficient disk caching for downloaded datasets
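
To make the preprocessing bullet concrete, here is a minimal sketch of min-max feature scaling written against the standard library only. The function name `scale_columns_to_unit_range` and the `Vec<Vec<f64>>` row layout are assumptions made for illustration; this is not the crate's `min_max_scale` and may differ from it in signature and behavior.

```rust
/// Rescale each column of a row-major table to the range [0, 1].
/// Hypothetical helper for illustration; not part of scirs2-datasets.
/// Assumes every row has the same number of features as the first row.
fn scale_columns_to_unit_range(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    if rows.is_empty() {
        return Vec::new();
    }
    let n_features = rows[0].len();

    // Find the per-column minimum and maximum in one pass.
    let mut mins = vec![f64::INFINITY; n_features];
    let mut maxs = vec![f64::NEG_INFINITY; n_features];
    for row in rows {
        for (j, &value) in row.iter().enumerate() {
            mins[j] = mins[j].min(value);
            maxs[j] = maxs[j].max(value);
        }
    }

    rows.iter()
        .map(|row| {
            row.iter()
                .enumerate()
                .map(|(j, &value)| {
                    let range = maxs[j] - mins[j];
                    // A constant column has zero range; map it to 0.0
                    // instead of dividing by zero.
                    if range == 0.0 {
                        0.0
                    } else {
                        (value - mins[j]) / range
                    }
                })
                .collect()
        })
        .collect()
}
```

With rows `[1, 10]`, `[3, 10]`, `[2, 10]`, the first column maps to 0, 1, and 0.5, and the constant second column maps to 0 throughout.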

§📦 Module Overview

| SciRS2 Function | scikit-learn Equivalent | Description |
|---|---|---|
| load_iris | sklearn.datasets.load_iris | Classic Iris classification dataset |
| load_boston | sklearn.datasets.load_boston | Boston housing regression dataset |
| make_classification | sklearn.datasets.make_classification | Synthetic classification data |
| make_regression | sklearn.datasets.make_regression | Synthetic regression data |
| make_blobs | sklearn.datasets.make_blobs | Synthetic clustering data |
| k_fold_split | sklearn.model_selection.KFold | K-fold cross-validation |

§🚀 Quick Start

# Cargo.toml
[dependencies]
scirs2-datasets = "0.1.5"

use scirs2_datasets::{load_iris, make_classification};

// Load classic Iris dataset
let iris = load_iris().expect("Operation failed");
println!("{} samples, {} features", iris.n_samples(), iris.n_features());

// Generate synthetic classification data
let data = make_classification(100, 5, 3, 2, 4, Some(42)).expect("Operation failed");

§🔒 Version: 0.2.0 (February 8, 2026)

§v0.2.0 New Features

  • Lazy Loading: Memory-mapped datasets with zero-copy views
  • Data Augmentation: GPU-accelerated augmentation pipeline
  • Parallel Preprocessing: Multi-threaded preprocessing with work-stealing
  • Distributed Loading: Shard-aware loading for distributed training
  • Format Support: Parquet, Arrow, HDF5 integration via scirs2-io
  • Benchmarks: Comprehensive comparison with PyTorch DataLoader
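
As a rough illustration of what shard-aware loading involves, the sketch below computes which contiguous block of samples a given worker would read. The helper `shard_range` and its parameters are hypothetical and use only the standard library; the crate's `DistributedLoader` and `Shard` types may partition data differently.

```rust
use std::ops::Range;

/// Contiguous sample range owned by `shard_index` when `n_samples` are
/// split across `n_shards` workers.
/// Hypothetical helper for illustration; not the crate's API.
fn shard_range(n_samples: usize, n_shards: usize, shard_index: usize) -> Range<usize> {
    assert!(n_shards > 0, "need at least one shard");
    assert!(shard_index < n_shards, "shard index out of range");

    let base = n_samples / n_shards;
    let remainder = n_samples % n_shards;

    // The first `remainder` shards each take one extra sample, so a shard's
    // start is shifted by the number of extra samples handed out before it.
    let start = shard_index * base + shard_index.min(remainder);
    let len = base + usize::from(shard_index < remainder);
    start..start + len
}
```

For 10 samples and 3 shards this gives `0..4`, `4..7`, and `7..10`, so the shards cover every sample exactly once and differ in size by at most one.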

§Examples

§Loading toy datasets

use scirs2_datasets::{load_iris, load_boston};

// Load the classic Iris dataset
let iris = load_iris().expect("Operation failed");
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());

// Load the Boston housing dataset
let boston = load_boston().expect("Operation failed");
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());

§Generating synthetic datasets

use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};

// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).expect("Operation failed");
println!("Classification dataset: {} samples, {} features, {} classes",
         classification.n_samples(), classification.n_features(), 3);

// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).expect("Operation failed");
println!("Regression dataset: {} samples, {} features",
         regression.n_samples(), regression.n_features());

// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).expect("Operation failed");
println!("Blobs dataset: {} samples, {} features, {} clusters",
         blobs.n_samples(), blobs.n_features(), 4);

// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).expect("Operation failed");
let moons = make_moons(150, 0.05, Some(42)).expect("Operation failed");
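
For intuition about the non-linear patterns, the following sketch builds a noise-free two-moons layout using the same geometry as scikit-learn's `make_moons`. The function `two_moons` is a hypothetical stand-alone helper; whether the crate's `make_moons` uses exactly this construction is an assumption, and the noise and seed arguments are left out.

```rust
use std::f64::consts::PI;

/// Noise-free two-moons points as (x, y, label) triples.
/// Illustrative only; follows the scikit-learn construction and is not
/// taken from scirs2-datasets.
fn two_moons(n_per_moon: usize) -> Vec<(f64, f64, u8)> {
    let mut points = Vec::with_capacity(2 * n_per_moon);
    for i in 0..n_per_moon {
        // Spread the angle evenly over [0, pi]; a single point sits at angle 0.
        let t = if n_per_moon > 1 {
            PI * i as f64 / (n_per_moon - 1) as f64
        } else {
            0.0
        };
        // Upper moon (label 0): the upper half of the unit circle.
        points.push((t.cos(), t.sin(), 0));
        // Lower moon (label 1): the same arc flipped, shifted right by 1
        // and up by 0.5 so the two arcs interleave.
        points.push((1.0 - t.cos(), 0.5 - t.sin(), 1));
    }
    points
}
```

The two arcs interleave without touching, which is why a linear classifier cannot separate the labels.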

§Cross-validation

use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};

let iris = load_iris().expect("Operation failed");

// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).expect("Operation failed");
println!("Created {} folds for K-fold CV", k_folds.len());

// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).expect("Operation failed");
    println!("Created {} stratified folds", stratified_folds.len());
}
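
The difference between plain and stratified K-fold is easiest to see in a small stand-alone sketch. The helper below, `stratified_fold_assignment`, is hypothetical and uses integer labels for simplicity. It deals the samples of each class out to the folds in round-robin order, which is one simple way to keep class proportions similar across folds, and it is not necessarily how `stratified_k_fold_split` works internally.

```rust
use std::collections::BTreeMap;

/// Assign each sample to one of `k` folds so that every class is spread
/// evenly. Returns a vector where entry `i` is the fold index of sample `i`.
/// Hypothetical helper for illustration; not the crate's API.
fn stratified_fold_assignment(labels: &[u32], k: usize) -> Vec<usize> {
    assert!(k >= 2, "need at least two folds");

    // Per-class counter of how many samples have been placed so far.
    let mut seen: BTreeMap<u32, usize> = BTreeMap::new();

    labels
        .iter()
        .map(|&label| {
            let count = seen.entry(label).or_insert(0);
            // Deal samples of the same class to folds 0, 1, ..., k-1, 0, ...
            let fold = *count % k;
            *count += 1;
            fold
        })
        .collect()
}
```

For labels `[0, 0, 0, 0, 1, 1]` and `k = 2` the assignment is `[0, 1, 0, 1, 0, 1]`, so each fold receives two samples of class 0 and one of class 1.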

§Dataset manipulation

use scirs2_datasets::{load_iris, Dataset};

let iris = load_iris().expect("Operation failed");

// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(featurenames) = iris.featurenames() {
    println!("Features: {:?}", featurenames);
}

Re-exports§

pub use adaptive_streaming_engine::create_adaptive_engine;
pub use adaptive_streaming_engine::create_adaptive_engine_with_config;
pub use adaptive_streaming_engine::AdaptiveStreamConfig;
pub use adaptive_streaming_engine::AdaptiveStreamingEngine;
pub use adaptive_streaming_engine::AlertSeverity;
pub use adaptive_streaming_engine::AlertType;
pub use adaptive_streaming_engine::ChunkMetadata;
pub use adaptive_streaming_engine::DataCharacteristics;
pub use adaptive_streaming_engine::MemoryStrategy;
pub use adaptive_streaming_engine::PatternType;
pub use adaptive_streaming_engine::PerformanceMetrics;
pub use adaptive_streaming_engine::QualityAlert;
pub use adaptive_streaming_engine::QualityMetrics;
pub use adaptive_streaming_engine::StatisticalMoments;
pub use adaptive_streaming_engine::StreamChunk;
pub use adaptive_streaming_engine::TrendDirection;
pub use adaptive_streaming_engine::TrendIndicators;
pub use advanced_generators::make_adversarial_examples;
pub use advanced_generators::make_anomaly_dataset;
pub use advanced_generators::make_continual_learning_dataset;
pub use advanced_generators::make_domain_adaptation_dataset;
pub use advanced_generators::make_few_shot_dataset;
pub use advanced_generators::make_multitask_dataset;
pub use advanced_generators::AdversarialConfig;
pub use advanced_generators::AnomalyConfig;
pub use advanced_generators::AnomalyType;
pub use advanced_generators::AttackMethod;
pub use advanced_generators::ContinualLearningDataset;
pub use advanced_generators::DomainAdaptationConfig;
pub use advanced_generators::DomainAdaptationDataset;
pub use advanced_generators::FewShotDataset;
pub use advanced_generators::MultiTaskConfig;
pub use advanced_generators::MultiTaskDataset;
pub use advanced_generators::TaskType;
pub use benchmarks::BenchmarkResult;
pub use benchmarks::BenchmarkRunner;
pub use benchmarks::BenchmarkSuite;
pub use benchmarks::PerformanceComparison;
pub use cloud::presets::azure_client;
pub use cloud::presets::gcs_client;
pub use cloud::presets::public_s3_client;
pub use cloud::presets::s3_client;
pub use cloud::presets::s3_compatible_client;
pub use cloud::public_datasets::AWSOpenData;
pub use cloud::public_datasets::AzureOpenData;
pub use cloud::public_datasets::GCPPublicData;
pub use cloud::CloudClient;
pub use cloud::CloudConfig;
pub use cloud::CloudCredentials;
pub use cloud::CloudProvider;
pub use distributed::DistributedConfig;
pub use distributed::DistributedProcessor;
pub use distributed::ScalingMethod;
pub use distributed::ScalingParameters;
pub use domain_specific::astronomy::StellarDatasets;
pub use domain_specific::climate::ClimateDatasets;
pub use domain_specific::convenience::list_domain_datasets;
pub use domain_specific::convenience::load_atmospheric_chemistry;
pub use domain_specific::convenience::load_climate_data;
pub use domain_specific::convenience::load_exoplanets;
pub use domain_specific::convenience::load_gene_expression;
pub use domain_specific::convenience::load_stellar_classification;
pub use domain_specific::genomics::GenomicsDatasets;
pub use domain_specific::DomainConfig;
pub use domain_specific::QualityFilters;
pub use explore::convenience::explore;
pub use explore::convenience::export_summary;
pub use explore::convenience::info;
pub use explore::convenience::quick_summary;
pub use explore::DatasetExplorer;
pub use explore::DatasetSummary;
pub use explore::ExploreConfig;
pub use explore::FeatureStatistics;
pub use explore::InferredDataType;
pub use explore::OutputFormat;
pub use explore::QualityAssessment;
pub use external::convenience::list_uci_datasets;
pub use external::convenience::load_from_url_sync;
pub use external::repositories::GitHubRepository;
pub use external::repositories::KaggleRepository;
pub use external::repositories::UCIRepository;
pub use external::ExternalClient;
pub use external::ExternalConfig;
pub use external::ProgressCallback;
pub use ml_integration::convenience::create_experiment;
pub use ml_integration::convenience::cv_split;
pub use ml_integration::convenience::prepare_for_ml;
pub use ml_integration::convenience::train_test_split;
pub use ml_integration::CrossValidationResults;
pub use ml_integration::DataSplit;
pub use ml_integration::MLExperiment;
pub use ml_integration::MLPipeline;
pub use ml_integration::MLPipelineConfig;
pub use ml_integration::ScalingMethod as MLScalingMethod;
pub use cache::get_cachedir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use external::convenience::load_from_url;
pub use external::convenience::load_github_dataset;
pub use external::convenience::load_uci_dataset;
pub use generators::add_time_series_noise;
pub use generators::benchmark_gpu_vs_cpu;
pub use generators::get_gpu_info;
pub use generators::gpu_is_available;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_blobs_gpu;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_classification_gpu;
pub use generators::make_corrupted_dataset;
pub use generators::make_helix;
pub use generators::make_hierarchical_clusters;
pub use generators::make_intersecting_manifolds;
pub use generators::make_manifold;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_regression_gpu;
pub use generators::make_s_curve;
pub use generators::make_severed_sphere;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_swiss_roll_advanced;
pub use generators::make_time_series;
pub use generators::make_torus;
pub use generators::make_twin_peaks;
pub use generators::ManifoldConfig;
pub use generators::ManifoldType;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use generators::time_series::make_ar_process;
pub use generators::time_series::make_random_walk;
pub use generators::time_series::make_seasonal;
pub use generators::time_series::make_sine_wave;
pub use generators::graph::make_barabasi_albert;
pub use generators::graph::make_karate_club;
pub use generators::graph::make_random_graph;
pub use generators::graph::make_watts_strogatz;
pub use generators::sparse::make_sparse_banded;
pub use generators::sparse::make_sparse_laplacian;
pub use generators::sparse::make_sparse_spd;
pub use generators::classification::make_classification_enhanced;
pub use generators::classification::make_hastie_10_2;
pub use generators::classification::make_multilabel_classification;
pub use generators::classification::ClassificationConfig;
pub use generators::classification::MultilabelConfig;
pub use generators::classification::MultilabelDataset;
pub use generators::regression::make_friedman1;
pub use generators::regression::make_friedman2;
pub use generators::regression::make_friedman3;
pub use generators::regression::make_low_rank_matrix;
pub use generators::regression::make_sparse_uncorrelated;
pub use generators::structured::make_biclusters;
pub use generators::structured::make_checkerboard;
pub use generators::structured::make_sparse_coded_signal;
pub use generators::structured::make_sparse_spd_matrix;
pub use generators::structured::make_spd_matrix;
pub use gpu::get_optimal_gpu_config;
pub use gpu::is_cuda_available;
pub use gpu::is_opencl_available;
pub use gpu::list_gpu_devices;
pub use gpu::make_blobs_auto_gpu;
pub use gpu::make_classification_auto_gpu;
pub use gpu::make_regression_auto_gpu;
pub use gpu::GpuBackend;
pub use gpu::GpuBenchmark;
pub use gpu::GpuBenchmarkResults;
pub use gpu::GpuConfig;
pub use gpu::GpuContext;
pub use gpu::GpuDeviceInfo;
pub use gpu::GpuMemoryConfig;
pub use gpu_optimization::benchmark_advanced_performance;
pub use gpu_optimization::generate_advanced_matrix;
pub use gpu_optimization::AdvancedGpuOptimizer;
pub use gpu_optimization::AdvancedKernelConfig;
pub use gpu_optimization::BenchmarkResult as AdvancedBenchmarkResult;
pub use gpu_optimization::DataLayout;
pub use gpu_optimization::LoadBalancingMethod;
pub use gpu_optimization::MemoryAccessPattern;
pub use gpu_optimization::PerformanceBenchmarkResults;
pub use gpu_optimization::SpecializationLevel;
pub use gpu_optimization::VectorizationStrategy;
pub use loaders::load_csv;
pub use loaders::load_csv_legacy;
pub use loaders::load_csv_parallel;
pub use loaders::load_csv_streaming;
pub use loaders::load_json;
pub use loaders::load_raw;
pub use loaders::save_json;
pub use loaders::CsvConfig;
pub use loaders::DatasetChunkIterator;
pub use loaders::StreamingConfig;
pub use neuromorphic_data_processor::create_neuromorphic_processor;
pub use neuromorphic_data_processor::create_neuromorphic_processor_with_topology;
pub use neuromorphic_data_processor::NetworkTopology;
pub use neuromorphic_data_processor::NeuromorphicProcessor;
pub use neuromorphic_data_processor::NeuromorphicTransform;
pub use neuromorphic_data_processor::SynapticPlasticity;
pub use quantum_enhanced_generators::make_quantum_blobs;
pub use quantum_enhanced_generators::make_quantum_classification;
pub use quantum_enhanced_generators::make_quantum_regression;
pub use quantum_enhanced_generators::QuantumDatasetGenerator;
pub use quantum_neuromorphic_fusion::create_fusion_with_params;
pub use quantum_neuromorphic_fusion::create_quantum_neuromorphic_fusion;
pub use quantum_neuromorphic_fusion::QuantumBioFusionResult;
pub use quantum_neuromorphic_fusion::QuantumInterference;
pub use quantum_neuromorphic_fusion::QuantumNeuromorphicFusion;
pub use real_world::list_real_world_datasets;
pub use real_world::load_adult;
pub use real_world::load_california_housing;
pub use real_world::load_heart_disease;
pub use real_world::load_red_wine_quality;
pub use real_world::load_titanic;
pub use real_world::RealWorldConfig;
pub use real_world::RealWorldDatasets;
pub use registry::get_registry;
pub use registry::load_dataset_byname;
pub use registry::DatasetMetadata;
pub use registry::DatasetRegistry;
pub use standard::load_boston as load_boston_full;
pub use standard::load_breast_cancer as load_breast_cancer_full;
pub use standard::load_digits as load_digits_full;
pub use standard::load_iris as load_iris_full;
pub use standard::load_wine;
pub use standard::DatasetResult;
pub use streaming::stream_classification;
pub use streaming::stream_csv;
pub use streaming::stream_regression;
pub use streaming::DataChunk;
pub use streaming::StreamConfig;
pub use streaming::StreamProcessor;
pub use streaming::StreamStats;
pub use streaming::StreamTransformer;
pub use streaming::StreamingIterator;
pub use utils::analyze_dataset_advanced;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::quick_quality_assessment;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::AdvancedDatasetAnalyzer;
pub use utils::AdvancedQualityMetrics;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CorrelationInsights;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use utils::NormalityAssessment;
pub use lazy_loading::from_binary as lazy_from_binary;
pub use lazy_loading::from_binary_with_config as lazy_from_binary_with_config;
pub use lazy_loading::LazyChunkIterator;
pub use lazy_loading::LazyDataset;
pub use lazy_loading::LazyLoadConfig;
pub use lazy_loading::MmapDataset;
pub use augmentation::standard_image_augmentation;
pub use augmentation::standard_tabular_augmentation;
pub use augmentation::AugmentationPipeline;
pub use augmentation::Brightness;
pub use augmentation::Contrast;
pub use augmentation::GaussianNoise;
pub use augmentation::HorizontalFlip;
pub use augmentation::Mixup;
pub use augmentation::RandomFeatureScale;
pub use augmentation::RandomRotation90;
pub use augmentation::Transform;
pub use augmentation::VerticalFlip;
pub use parallel_preprocessing::create_pipeline;
pub use parallel_preprocessing::create_pipeline_with_config;
pub use parallel_preprocessing::ParallelConfig;
pub use parallel_preprocessing::ParallelPipeline;
pub use parallel_preprocessing::PreprocessFn;
pub use distributed_loading::create_loader;
pub use distributed_loading::create_loader_with_config;
pub use distributed_loading::DistributedCache;
pub use distributed_loading::DistributedConfig as DistributedLoadingConfig;
pub use distributed_loading::DistributedLoader;
pub use distributed_loading::Shard;
pub use formats::CompressionCodec;
pub use formats::FormatConfig;
pub use formats::FormatType;
pub use formats::read_auto;
pub use formats::read_hdf5;
pub use formats::read_parquet;
pub use formats::write_hdf5;
pub use formats::write_parquet;
pub use formats::FormatConverter;
pub use formats::Hdf5Reader;
pub use formats::Hdf5Writer;
pub use formats::ParquetReader;
pub use formats::ParquetWriter;
pub use sample::*;
pub use toy::*;

Modules§

adaptive_streaming_engine
Auto-generated module structure
advanced_generators
Advanced synthetic data generators
augmentation
Data augmentation pipeline with GPU support
benchmarks
Performance benchmarking utilities
cache
Dataset caching functionality
cloud
Cloud storage integration for datasets
distributed
Distributed dataset processing capabilities
distributed_loading
Distributed dataset loading
domain_specific
Domain-specific datasets for scientific research
error
Error types for the datasets module
explore
Interactive dataset exploration and analysis tools
external
External data sources integration
formats
Format support (Parquet, Arrow, HDF5)
generators
Dataset generators
gpu
GPU acceleration for dataset operations
gpu_optimization
Advanced GPU Optimization Engine
lazy_loading
Lazy loading and memory-mapped datasets
loaders
Data loading utilities
ml_integration
Machine learning pipeline integration
neuromorphic_data_processor
Neuromorphic Data Processing Engine
parallel_preprocessing
Parallel data preprocessing
platform_dirs
Pure Rust platform directory detection (replaces dirs crate for COOLJAPAN Pure Rust policy)
quantum_enhanced_generators
Quantum-Enhanced Data Generation Engine
quantum_neuromorphic_fusion
Quantum-Neuromorphic Fusion Engine
real_world
Auto-generated module structure
registry
Dataset registry system for managing dataset metadata and locations
sample
Sample datasets for testing and demonstration
stability
API stability guarantees and compatibility documentation
standard
Standard benchmark datasets (fully embedded, no download required)
streaming
Streaming support for large datasets
time_series
Time series datasets.
toy
Toy datasets for testing and examples
utils
Core utilities for working with datasets

Macros§

api_stability
Macro to easily annotate APIs with stability information