§SciRS2 Datasets - Dataset Loading and Generation
scirs2-datasets provides dataset utilities modeled after scikit-learn’s datasets module,
offering toy datasets (Iris, Boston, MNIST), synthetic data generators, cross-validation splitters,
and data preprocessing utilities for machine learning workflows.
§🎯 Key Features
- Toy Datasets: Classic datasets (Iris, Boston Housing, Breast Cancer, Digits)
- Data Generators: Synthetic data for classification, regression, clustering
- Cross-Validation: K-fold, stratified, time series CV splitters
- Preprocessing: Train/test split, normalization, feature scaling
- Caching: Efficient disk caching for downloaded datasets
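To make the "feature scaling" item concrete, here is a minimal standalone sketch of min-max scaling for a single feature column. It uses only the standard library and does not call scirs2-datasets. The helper name `min_max_scale_sketch` and its signature are assumptions made for illustration; the crate's own scaling functions may differ in both signature and behavior.

```rust
/// Rescale `values` linearly so the minimum maps to `lo` and the maximum to `hi`.
/// Hypothetical helper for illustration only; not part of scirs2-datasets.
fn min_max_scale_sketch(values: &[f64], lo: f64, hi: f64) -> Vec<f64> {
    // Find the column minimum and maximum in one pass each.
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    values
        .iter()
        .map(|&v| {
            if range == 0.0 {
                // Constant column: map every value to the lower bound.
                lo
            } else {
                lo + (v - min) / range * (hi - lo)
            }
        })
        .collect()
}

fn main() {
    let column = [2.0, 4.0, 6.0, 10.0];
    let scaled = min_max_scale_sketch(&column, 0.0, 1.0);
    println!("{:?}", scaled); // prints [0.0, 0.25, 0.5, 1.0]
}
```

Mapping a constant column to the lower bound is one possible convention, chosen here only to avoid dividing by zero.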
§📦 Module Overview
| SciRS2 Function | scikit-learn Equivalent | Description |
|---|---|---|
| load_iris | sklearn.datasets.load_iris | Classic Iris classification dataset |
| load_boston | sklearn.datasets.load_boston | Boston housing regression dataset |
| make_classification | sklearn.datasets.make_classification | Synthetic classification data |
| make_regression | sklearn.datasets.make_regression | Synthetic regression data |
| make_blobs | sklearn.datasets.make_blobs | Synthetic clustering data |
| k_fold_split | sklearn.model_selection.KFold | K-fold cross-validation |
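The K-fold entry in the table can be illustrated with a small standalone sketch of how K-fold index splitting works in general. It uses only the standard library, does no shuffling, and is not the implementation behind `k_fold_split`. The function name `k_fold_indices_sketch` and its return type are assumptions chosen for this example. When the sample count is not divisible by the fold count, the earlier folds each receive one extra sample, which is the convention scikit-learn's `KFold` follows.

```rust
/// Split the indices `0..n_samples` into `n_folds` contiguous test folds.
/// Returns one `(train_indices, test_indices)` pair per fold.
/// Hypothetical helper for illustration only; not part of scirs2-datasets.
fn k_fold_indices_sketch(n_samples: usize, n_folds: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(
        n_folds >= 2 && n_folds <= n_samples,
        "need 2 <= n_folds <= n_samples"
    );
    let base = n_samples / n_folds; // minimum size of each test fold
    let extra = n_samples % n_folds; // the first `extra` folds get one more sample
    let mut folds = Vec::with_capacity(n_folds);
    let mut start = 0;
    for fold in 0..n_folds {
        let size = base + if fold < extra { 1 } else { 0 };
        let end = start + size;
        // Test indices are the current contiguous block.
        let test: Vec<usize> = (start..end).collect();
        // Training indices are everything outside that block.
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
        folds.push((train, test));
        start = end;
    }
    folds
}

fn main() {
    // 7 samples in 3 folds gives test folds of sizes 3, 2, and 2.
    for (i, (train, test)) in k_fold_indices_sketch(7, 3).iter().enumerate() {
        println!("fold {}: train={:?} test={:?}", i, train, test);
    }
}
```

Every index appears in exactly one test fold, so each sample is held out once across the full set of folds.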
§🚀 Quick Start
[dependencies]
scirs2-datasets = "0.1.0-rc.2"
use scirs2_datasets::{load_iris, make_classification};
// Load classic Iris dataset
let iris = load_iris().unwrap();
println!("{} samples, {} features", iris.n_samples(), iris.n_features());
// Generate synthetic classification data
let data = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
§🔒 Version: 0.1.0-rc.2 (October 03, 2025)
§Examples
§Loading toy datasets
use scirs2_datasets::{load_iris, load_boston};
// Load the classic Iris dataset
let iris = load_iris().unwrap();
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
// Load the Boston housing dataset
let boston = load_boston().unwrap();
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());
§Generating synthetic datasets
use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};
// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
println!("Classification dataset: {} samples, {} features, {} classes",
    classification.n_samples(), classification.n_features(), 3);
// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).unwrap();
println!("Regression dataset: {} samples, {} features",
    regression.n_samples(), regression.n_features());
// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).unwrap();
println!("Blobs dataset: {} samples, {} features, {} clusters",
    blobs.n_samples(), blobs.n_features(), 4);
// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).unwrap();
let moons = make_moons(150, 0.05, Some(42)).unwrap();
§Cross-validation
use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};
let iris = load_iris().unwrap();
// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).unwrap();
println!("Created {} folds for K-fold CV", k_folds.len());
// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).unwrap();
    println!("Created {} stratified folds", stratified_folds.len());
}
§Dataset manipulation
use scirs2_datasets::{load_iris, Dataset};
let iris = load_iris().unwrap();
// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(featurenames) = iris.featurenames() {
    println!("Features: {:?}", featurenames);
}
Re-exports§
pub use adaptive_streaming_engine::create_adaptive_engine;
pub use adaptive_streaming_engine::create_adaptive_engine_with_config;
pub use adaptive_streaming_engine::AdaptiveStreamConfig;
pub use adaptive_streaming_engine::AdaptiveStreamingEngine;
pub use adaptive_streaming_engine::AlertSeverity;
pub use adaptive_streaming_engine::AlertType;
pub use adaptive_streaming_engine::ChunkMetadata;
pub use adaptive_streaming_engine::DataCharacteristics;
pub use adaptive_streaming_engine::MemoryStrategy;
pub use adaptive_streaming_engine::PatternType;
pub use adaptive_streaming_engine::PerformanceMetrics;
pub use adaptive_streaming_engine::QualityAlert;
pub use adaptive_streaming_engine::QualityMetrics;
pub use adaptive_streaming_engine::StatisticalMoments;
pub use adaptive_streaming_engine::StreamChunk;
pub use adaptive_streaming_engine::TrendDirection;
pub use adaptive_streaming_engine::TrendIndicators;
pub use advanced_generators::make_adversarial_examples;
pub use advanced_generators::make_anomaly_dataset;
pub use advanced_generators::make_continual_learning_dataset;
pub use advanced_generators::make_domain_adaptation_dataset;
pub use advanced_generators::make_few_shot_dataset;
pub use advanced_generators::make_multitask_dataset;
pub use advanced_generators::AdversarialConfig;
pub use advanced_generators::AnomalyConfig;
pub use advanced_generators::AnomalyType;
pub use advanced_generators::AttackMethod;
pub use advanced_generators::ContinualLearningDataset;
pub use advanced_generators::DomainAdaptationConfig;
pub use advanced_generators::DomainAdaptationDataset;
pub use advanced_generators::FewShotDataset;
pub use advanced_generators::MultiTaskConfig;
pub use advanced_generators::MultiTaskDataset;
pub use advanced_generators::TaskType;
pub use benchmarks::BenchmarkResult;
pub use benchmarks::BenchmarkRunner;
pub use benchmarks::BenchmarkSuite;
pub use benchmarks::PerformanceComparison;
pub use cloud::presets::azure_client;
pub use cloud::presets::gcs_client;
pub use cloud::presets::public_s3_client;
pub use cloud::presets::s3_client;
pub use cloud::presets::s3_compatible_client;
pub use cloud::public_datasets::AWSOpenData;
pub use cloud::public_datasets::AzureOpenData;
pub use cloud::public_datasets::GCPPublicData;
pub use cloud::CloudClient;
pub use cloud::CloudConfig;
pub use cloud::CloudCredentials;
pub use cloud::CloudProvider;
pub use distributed::DistributedConfig;
pub use distributed::DistributedProcessor;
pub use distributed::ScalingMethod;
pub use distributed::ScalingParameters;
pub use domain_specific::astronomy::StellarDatasets;
pub use domain_specific::climate::ClimateDatasets;
pub use domain_specific::convenience::list_domain_datasets;
pub use domain_specific::convenience::load_atmospheric_chemistry;
pub use domain_specific::convenience::load_climate_data;
pub use domain_specific::convenience::load_exoplanets;
pub use domain_specific::convenience::load_gene_expression;
pub use domain_specific::convenience::load_stellar_classification;
pub use domain_specific::genomics::GenomicsDatasets;
pub use domain_specific::DomainConfig;
pub use domain_specific::QualityFilters;
pub use explore::convenience::explore;
pub use explore::convenience::export_summary;
pub use explore::convenience::info;
pub use explore::convenience::quick_summary;
pub use explore::DatasetExplorer;
pub use explore::DatasetSummary;
pub use explore::ExploreConfig;
pub use explore::FeatureStatistics;
pub use explore::InferredDataType;
pub use explore::OutputFormat;
pub use explore::QualityAssessment;
pub use external::convenience::load_github_dataset_sync;
pub use external::convenience::load_uci_dataset_sync;
pub use external::convenience::list_uci_datasets;
pub use external::convenience::load_from_url_sync;
pub use external::repositories::GitHubRepository;
pub use external::repositories::KaggleRepository;
pub use external::repositories::UCIRepository;
pub use external::ExternalClient;
pub use external::ExternalConfig;
pub use external::ProgressCallback;
pub use ml_integration::convenience::create_experiment;
pub use ml_integration::convenience::cv_split;
pub use ml_integration::convenience::prepare_for_ml;
pub use ml_integration::convenience::train_test_split;
pub use ml_integration::CrossValidationResults;
pub use ml_integration::DataSplit;
pub use ml_integration::MLExperiment;
pub use ml_integration::MLPipeline;
pub use ml_integration::MLPipelineConfig;
pub use ml_integration::ScalingMethod as MLScalingMethod;
pub use cache::get_cachedir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use generators::add_time_series_noise;
pub use generators::benchmark_gpu_vs_cpu;
pub use generators::get_gpu_info;
pub use generators::gpu_is_available;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_blobs_gpu;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_classification_gpu;
pub use generators::make_corrupted_dataset;
pub use generators::make_helix;
pub use generators::make_hierarchical_clusters;
pub use generators::make_intersecting_manifolds;
pub use generators::make_manifold;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_regression_gpu;
pub use generators::make_s_curve;
pub use generators::make_severed_sphere;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_swiss_roll_advanced;
pub use generators::make_time_series;
pub use generators::make_torus;
pub use generators::make_twin_peaks;
pub use generators::ManifoldConfig;
pub use generators::ManifoldType;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use gpu::get_optimal_gpu_config;
pub use gpu::is_cuda_available;
pub use gpu::is_opencl_available;
pub use gpu::list_gpu_devices;
pub use gpu::make_blobs_auto_gpu;
pub use gpu::make_classification_auto_gpu;
pub use gpu::make_regression_auto_gpu;
pub use gpu::GpuBackend;
pub use gpu::GpuBenchmark;
pub use gpu::GpuBenchmarkResults;
pub use gpu::GpuConfig;
pub use gpu::GpuContext;
pub use gpu::GpuDeviceInfo;
pub use gpu::GpuMemoryConfig;
pub use gpu_optimization::benchmark_advanced_performance;
pub use gpu_optimization::generate_advanced_matrix;
pub use gpu_optimization::AdvancedGpuOptimizer;
pub use gpu_optimization::AdvancedKernelConfig;
pub use gpu_optimization::BenchmarkResult as AdvancedBenchmarkResult;
pub use gpu_optimization::DataLayout;
pub use gpu_optimization::LoadBalancingMethod;
pub use gpu_optimization::MemoryAccessPattern;
pub use gpu_optimization::PerformanceBenchmarkResults;
pub use gpu_optimization::SpecializationLevel;
pub use gpu_optimization::VectorizationStrategy;
pub use loaders::load_csv;
pub use loaders::load_csv_legacy;
pub use loaders::load_csv_parallel;
pub use loaders::load_csv_streaming;
pub use loaders::load_json;
pub use loaders::load_raw;
pub use loaders::save_json;
pub use loaders::CsvConfig;
pub use loaders::DatasetChunkIterator;
pub use loaders::StreamingConfig;
pub use neuromorphic_data_processor::create_neuromorphic_processor;
pub use neuromorphic_data_processor::create_neuromorphic_processor_with_topology;
pub use neuromorphic_data_processor::NetworkTopology;
pub use neuromorphic_data_processor::NeuromorphicProcessor;
pub use neuromorphic_data_processor::NeuromorphicTransform;
pub use neuromorphic_data_processor::SynapticPlasticity;
pub use quantum_enhanced_generators::make_quantum_blobs;
pub use quantum_enhanced_generators::make_quantum_classification;
pub use quantum_enhanced_generators::make_quantum_regression;
pub use quantum_enhanced_generators::QuantumDatasetGenerator;
pub use quantum_neuromorphic_fusion::create_fusion_with_params;
pub use quantum_neuromorphic_fusion::create_quantum_neuromorphic_fusion;
pub use quantum_neuromorphic_fusion::QuantumBioFusionResult;
pub use quantum_neuromorphic_fusion::QuantumInterference;
pub use quantum_neuromorphic_fusion::QuantumNeuromorphicFusion;
pub use real_world::list_real_world_datasets;
pub use real_world::load_adult;
pub use real_world::load_california_housing;
pub use real_world::load_heart_disease;
pub use real_world::load_red_wine_quality;
pub use real_world::load_titanic;
pub use real_world::RealWorldConfig;
pub use real_world::RealWorldDatasets;
pub use registry::get_registry;
pub use registry::load_dataset_byname;
pub use registry::DatasetMetadata;
pub use registry::DatasetRegistry;
pub use streaming::stream_classification;
pub use streaming::stream_csv;
pub use streaming::stream_regression;
pub use streaming::DataChunk;
pub use streaming::StreamConfig;
pub use streaming::StreamProcessor;
pub use streaming::StreamStats;
pub use streaming::StreamTransformer;
pub use streaming::StreamingIterator;
pub use utils::analyze_dataset_advanced;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::quick_quality_assessment;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::AdvancedDatasetAnalyzer;
pub use utils::AdvancedQualityMetrics;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CorrelationInsights;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use utils::NormalityAssessment;
pub use sample::*;
pub use toy::*;
Modules§
- adaptive_streaming_engine - Adaptive Streaming Data Processing Engine
- advanced_generators - Advanced synthetic data generators
- benchmarks - Performance benchmarking utilities
- cache - Dataset caching functionality
- cloud - Cloud storage integration for datasets
- distributed - Distributed dataset processing capabilities
- domain_specific - Domain-specific datasets for scientific research
- error - Error types for the datasets module
- explore - Interactive dataset exploration and analysis tools
- external - External data sources integration
- generators - Dataset generators
- gpu - GPU acceleration for dataset operations
- gpu_optimization - Advanced GPU Optimization Engine
- loaders - Data loading utilities
- ml_integration - Machine learning pipeline integration
- neuromorphic_data_processor - Neuromorphic Data Processing Engine
- quantum_enhanced_generators - Quantum-Enhanced Data Generation Engine
- quantum_neuromorphic_fusion - Quantum-Neuromorphic Fusion Engine
- real_world - Real-world dataset collection
- registry - Dataset registry system for managing dataset metadata and locations
- sample - Sample datasets for testing and demonstration
- stability - API stability guarantees and compatibility documentation
- streaming - Streaming support for large datasets
- time_series - Time series datasets
- toy - Toy datasets for testing and examples
- utils - Core utilities for working with datasets
Macros§
- api_stability - Macro to easily annotate APIs with stability information