§SciRS2 Datasets - Dataset Loading and Generation
scirs2-datasets provides dataset utilities modeled after scikit-learn’s datasets module, offering toy datasets (Iris, Boston, MNIST), synthetic data generators, cross-validation splitters, and data preprocessing utilities for machine learning workflows.
§🎯 Key Features
- Toy Datasets: Classic datasets (Iris, Boston Housing, Breast Cancer, Digits)
- Data Generators: Synthetic data for classification, regression, clustering
- Cross-Validation: K-fold, stratified, time series CV splitters
- Preprocessing: Train/test split, normalization, feature scaling
- Caching: Efficient disk caching for downloaded datasets
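To make the normalization item under Preprocessing concrete, here is a minimal std-only sketch of column-wise min-max scaling. The function name `min_max_scale_columns` and the row-major `Vec<Vec<f64>>` layout are assumptions made for illustration; this is not the crate's `min_max_scale`, whose signature is not shown on this page.

```rust
/// Rescale every column of a row-major table into the range [0, 1].
///
/// Hypothetical helper for illustration only (not part of scirs2-datasets).
/// Assumes every row has the same number of columns as the first row.
/// Constant columns are mapped to 0.0 to avoid dividing by zero.
fn min_max_scale_columns(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    if rows.is_empty() {
        return Vec::new();
    }
    let n_features = rows[0].len();

    // First pass: record the minimum and maximum of each column.
    let mut mins = vec![f64::INFINITY; n_features];
    let mut maxs = vec![f64::NEG_INFINITY; n_features];
    for row in rows {
        for (j, &v) in row.iter().enumerate() {
            mins[j] = mins[j].min(v);
            maxs[j] = maxs[j].max(v);
        }
    }

    // Second pass: map each value to (v - min) / (max - min).
    rows.iter()
        .map(|row| {
            row.iter()
                .enumerate()
                .map(|(j, &v)| {
                    let range = maxs[j] - mins[j];
                    if range > 0.0 {
                        (v - mins[j]) / range
                    } else {
                        0.0
                    }
                })
                .collect()
        })
        .collect()
}
```

Calling it on three rows whose first column is 1.0, 2.0, 3.0 maps that column to 0.0, 0.5, 1.0. In practice the minimum and maximum would be computed on the training split only and reused on the test split.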
§📦 Module Overview
SciRS2 Function | scikit-learn Equivalent | Description |
---|---|---|
load_iris | sklearn.datasets.load_iris | Classic Iris classification dataset |
load_boston | sklearn.datasets.load_boston | Boston housing regression dataset |
make_classification | sklearn.datasets.make_classification | Synthetic classification data |
make_regression | sklearn.datasets.make_regression | Synthetic regression data |
make_blobs | sklearn.datasets.make_blobs | Synthetic clustering data |
k_fold_split | sklearn.model_selection.KFold | K-fold cross-validation |
§🚀 Quick Start
[dependencies]
scirs2-datasets = "0.1.0-rc.2"
use scirs2_datasets::{load_iris, make_classification};
// Load classic Iris dataset
let iris = load_iris().unwrap();
println!("{} samples, {} features", iris.n_samples(), iris.n_features());
// Generate synthetic classification data
let data = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
§🔒 Version: 0.1.0-rc.2 (October 03, 2025)
§Examples
§Loading toy datasets
use scirs2_datasets::{load_iris, load_boston};
// Load the classic Iris dataset
let iris = load_iris().unwrap();
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
// Load the Boston housing dataset
let boston = load_boston().unwrap();
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());
§Generating synthetic datasets
use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};
// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
println!("Classification dataset: {} samples, {} features, {} classes",
    classification.n_samples(), classification.n_features(), 3);
// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).unwrap();
println!("Regression dataset: {} samples, {} features",
    regression.n_samples(), regression.n_features());
// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).unwrap();
println!("Blobs dataset: {} samples, {} features, {} clusters",
    blobs.n_samples(), blobs.n_features(), 4);
// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).unwrap();
let moons = make_moons(150, 0.05, Some(42)).unwrap();
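For intuition about what a "two moons" pattern looks like, the following std-only sketch builds the noise-free geometry using the same parameterization as scikit-learn's `make_moons`. Whether this crate's `make_moons` uses exactly these formulas is an assumption; the function name `two_moons` is hypothetical, and the noise term is left out because the standard library has no random number generator.

```rust
use std::f64::consts::PI;

/// Build two interleaving half-circles without noise.
///
/// Hypothetical helper for illustration only (not part of scirs2-datasets).
/// Returns the 2-D points and their labels: 0 for the upper moon,
/// 1 for the lower moon.
fn two_moons(n_per_moon: usize) -> (Vec<[f64; 2]>, Vec<usize>) {
    let mut points = Vec::with_capacity(2 * n_per_moon);
    let mut labels = Vec::with_capacity(2 * n_per_moon);

    // Evenly spaced angles in [0, pi]; a single point sits at angle 0.
    let angle = |i: usize| -> f64 {
        if n_per_moon > 1 {
            PI * i as f64 / (n_per_moon - 1) as f64
        } else {
            0.0
        }
    };

    // Upper moon: half of the unit circle centred at the origin.
    for i in 0..n_per_moon {
        let t = angle(i);
        points.push([t.cos(), t.sin()]);
        labels.push(0);
    }

    // Lower moon: the same arc, flipped and shifted so the two interleave.
    for i in 0..n_per_moon {
        let t = angle(i);
        points.push([1.0 - t.cos(), 0.5 - t.sin()]);
        labels.push(1);
    }

    (points, labels)
}
```

The two classes cannot be separated by a straight line, which is why this shape is a common test case for non-linear classifiers and clustering methods.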
§Cross-validation
use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};
let iris = load_iris().unwrap();
// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).unwrap();
println!("Created {} folds for K-fold CV", k_folds.len());
// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).unwrap();
    println!("Created {} stratified folds", stratified_folds.len());
}
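The idea behind K-fold splitting can be sketched with the standard library alone. The version below is unshuffled and uses contiguous folds, unlike the shuffled, seeded calls above; the name `k_fold_indices` and the `(train, validation)` return shape are assumptions for illustration, not the crate's `k_fold_split` implementation.

```rust
/// Partition the indices 0..n_samples into k contiguous folds.
///
/// Hypothetical helper for illustration only (not part of scirs2-datasets).
/// Each returned pair is (train_indices, validation_indices); every index
/// appears in exactly one validation set.
fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n_samples, "need 2 <= k <= n_samples");

    let base = n_samples / k;
    // When n_samples is not divisible by k, the first `extra` folds
    // receive one additional sample each.
    let extra = n_samples % k;

    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + usize::from(fold < extra);
        let end = start + size;

        // The current block is held out for validation...
        let validation: Vec<usize> = (start..end).collect();
        // ...and everything before and after it is used for training.
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();

        folds.push((train, validation));
        start = end;
    }
    folds
}
```

With 10 samples and 3 folds, the validation sets have sizes 4, 3 and 3. A shuffled variant would permute the indices once with a seeded generator before slicing them into folds; a stratified variant would do the same slicing separately within each class.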
§Dataset manipulation
use scirs2_datasets::{load_iris, Dataset};
let iris = load_iris().unwrap();
// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(featurenames) = iris.featurenames() {
    println!("Features: {:?}", featurenames);
}
Re-exports§
pub use adaptive_streaming_engine::create_adaptive_engine;
pub use adaptive_streaming_engine::create_adaptive_engine_with_config;
pub use adaptive_streaming_engine::AdaptiveStreamConfig;
pub use adaptive_streaming_engine::AdaptiveStreamingEngine;
pub use adaptive_streaming_engine::AlertSeverity;
pub use adaptive_streaming_engine::AlertType;
pub use adaptive_streaming_engine::ChunkMetadata;
pub use adaptive_streaming_engine::DataCharacteristics;
pub use adaptive_streaming_engine::MemoryStrategy;
pub use adaptive_streaming_engine::PatternType;
pub use adaptive_streaming_engine::PerformanceMetrics;
pub use adaptive_streaming_engine::QualityAlert;
pub use adaptive_streaming_engine::QualityMetrics;
pub use adaptive_streaming_engine::StatisticalMoments;
pub use adaptive_streaming_engine::StreamChunk;
pub use adaptive_streaming_engine::TrendDirection;
pub use adaptive_streaming_engine::TrendIndicators;
pub use advanced_generators::make_adversarial_examples;
pub use advanced_generators::make_anomaly_dataset;
pub use advanced_generators::make_continual_learning_dataset;
pub use advanced_generators::make_domain_adaptation_dataset;
pub use advanced_generators::make_few_shot_dataset;
pub use advanced_generators::make_multitask_dataset;
pub use advanced_generators::AdversarialConfig;
pub use advanced_generators::AnomalyConfig;
pub use advanced_generators::AnomalyType;
pub use advanced_generators::AttackMethod;
pub use advanced_generators::ContinualLearningDataset;
pub use advanced_generators::DomainAdaptationConfig;
pub use advanced_generators::DomainAdaptationDataset;
pub use advanced_generators::FewShotDataset;
pub use advanced_generators::MultiTaskConfig;
pub use advanced_generators::MultiTaskDataset;
pub use advanced_generators::TaskType;
pub use benchmarks::BenchmarkResult;
pub use benchmarks::BenchmarkRunner;
pub use benchmarks::BenchmarkSuite;
pub use benchmarks::PerformanceComparison;
pub use cloud::presets::azure_client;
pub use cloud::presets::gcs_client;
pub use cloud::presets::public_s3_client;
pub use cloud::presets::s3_client;
pub use cloud::presets::s3_compatible_client;
pub use cloud::public_datasets::AWSOpenData;
pub use cloud::public_datasets::AzureOpenData;
pub use cloud::public_datasets::GCPPublicData;
pub use cloud::CloudClient;
pub use cloud::CloudConfig;
pub use cloud::CloudCredentials;
pub use cloud::CloudProvider;
pub use distributed::DistributedConfig;
pub use distributed::DistributedProcessor;
pub use distributed::ScalingMethod;
pub use distributed::ScalingParameters;
pub use domain_specific::astronomy::StellarDatasets;
pub use domain_specific::climate::ClimateDatasets;
pub use domain_specific::convenience::list_domain_datasets;
pub use domain_specific::convenience::load_atmospheric_chemistry;
pub use domain_specific::convenience::load_climate_data;
pub use domain_specific::convenience::load_exoplanets;
pub use domain_specific::convenience::load_gene_expression;
pub use domain_specific::convenience::load_stellar_classification;
pub use domain_specific::genomics::GenomicsDatasets;
pub use domain_specific::DomainConfig;
pub use domain_specific::QualityFilters;
pub use explore::convenience::explore;
pub use explore::convenience::export_summary;
pub use explore::convenience::info;
pub use explore::convenience::quick_summary;
pub use explore::DatasetExplorer;
pub use explore::DatasetSummary;
pub use explore::ExploreConfig;
pub use explore::FeatureStatistics;
pub use explore::InferredDataType;
pub use explore::OutputFormat;
pub use explore::QualityAssessment;
pub use external::convenience::load_github_dataset_sync;
pub use external::convenience::load_uci_dataset_sync;
pub use external::convenience::list_uci_datasets;
pub use external::convenience::load_from_url_sync;
pub use external::repositories::GitHubRepository;
pub use external::repositories::KaggleRepository;
pub use external::repositories::UCIRepository;
pub use external::ExternalClient;
pub use external::ExternalConfig;
pub use external::ProgressCallback;
pub use ml_integration::convenience::create_experiment;
pub use ml_integration::convenience::cv_split;
pub use ml_integration::convenience::prepare_for_ml;
pub use ml_integration::convenience::train_test_split;
pub use ml_integration::CrossValidationResults;
pub use ml_integration::DataSplit;
pub use ml_integration::MLExperiment;
pub use ml_integration::MLPipeline;
pub use ml_integration::MLPipelineConfig;
pub use ml_integration::ScalingMethod as MLScalingMethod;
pub use cache::get_cachedir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use generators::add_time_series_noise;
pub use generators::benchmark_gpu_vs_cpu;
pub use generators::get_gpu_info;
pub use generators::gpu_is_available;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_blobs_gpu;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_classification_gpu;
pub use generators::make_corrupted_dataset;
pub use generators::make_helix;
pub use generators::make_hierarchical_clusters;
pub use generators::make_intersecting_manifolds;
pub use generators::make_manifold;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_regression_gpu;
pub use generators::make_s_curve;
pub use generators::make_severed_sphere;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_swiss_roll_advanced;
pub use generators::make_time_series;
pub use generators::make_torus;
pub use generators::make_twin_peaks;
pub use generators::ManifoldConfig;
pub use generators::ManifoldType;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use gpu::get_optimal_gpu_config;
pub use gpu::is_cuda_available;
pub use gpu::is_opencl_available;
pub use gpu::list_gpu_devices;
pub use gpu::make_blobs_auto_gpu;
pub use gpu::make_classification_auto_gpu;
pub use gpu::make_regression_auto_gpu;
pub use gpu::GpuBackend;
pub use gpu::GpuBenchmark;
pub use gpu::GpuBenchmarkResults;
pub use gpu::GpuConfig;
pub use gpu::GpuContext;
pub use gpu::GpuDeviceInfo;
pub use gpu::GpuMemoryConfig;
pub use gpu_optimization::benchmark_advanced_performance;
pub use gpu_optimization::generate_advanced_matrix;
pub use gpu_optimization::AdvancedGpuOptimizer;
pub use gpu_optimization::AdvancedKernelConfig;
pub use gpu_optimization::BenchmarkResult as AdvancedBenchmarkResult;
pub use gpu_optimization::DataLayout;
pub use gpu_optimization::LoadBalancingMethod;
pub use gpu_optimization::MemoryAccessPattern;
pub use gpu_optimization::PerformanceBenchmarkResults;
pub use gpu_optimization::SpecializationLevel;
pub use gpu_optimization::VectorizationStrategy;
pub use loaders::load_csv;
pub use loaders::load_csv_legacy;
pub use loaders::load_csv_parallel;
pub use loaders::load_csv_streaming;
pub use loaders::load_json;
pub use loaders::load_raw;
pub use loaders::save_json;
pub use loaders::CsvConfig;
pub use loaders::DatasetChunkIterator;
pub use loaders::StreamingConfig;
pub use neuromorphic_data_processor::create_neuromorphic_processor;
pub use neuromorphic_data_processor::create_neuromorphic_processor_with_topology;
pub use neuromorphic_data_processor::NetworkTopology;
pub use neuromorphic_data_processor::NeuromorphicProcessor;
pub use neuromorphic_data_processor::NeuromorphicTransform;
pub use neuromorphic_data_processor::SynapticPlasticity;
pub use quantum_enhanced_generators::make_quantum_blobs;
pub use quantum_enhanced_generators::make_quantum_classification;
pub use quantum_enhanced_generators::make_quantum_regression;
pub use quantum_enhanced_generators::QuantumDatasetGenerator;
pub use quantum_neuromorphic_fusion::create_fusion_with_params;
pub use quantum_neuromorphic_fusion::create_quantum_neuromorphic_fusion;
pub use quantum_neuromorphic_fusion::QuantumBioFusionResult;
pub use quantum_neuromorphic_fusion::QuantumInterference;
pub use quantum_neuromorphic_fusion::QuantumNeuromorphicFusion;
pub use real_world::list_real_world_datasets;
pub use real_world::load_adult;
pub use real_world::load_california_housing;
pub use real_world::load_heart_disease;
pub use real_world::load_red_wine_quality;
pub use real_world::load_titanic;
pub use real_world::RealWorldConfig;
pub use real_world::RealWorldDatasets;
pub use registry::get_registry;
pub use registry::load_dataset_byname;
pub use registry::DatasetMetadata;
pub use registry::DatasetRegistry;
pub use streaming::stream_classification;
pub use streaming::stream_csv;
pub use streaming::stream_regression;
pub use streaming::DataChunk;
pub use streaming::StreamConfig;
pub use streaming::StreamProcessor;
pub use streaming::StreamStats;
pub use streaming::StreamTransformer;
pub use streaming::StreamingIterator;
pub use utils::analyze_dataset_advanced;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::quick_quality_assessment;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::AdvancedDatasetAnalyzer;
pub use utils::AdvancedQualityMetrics;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CorrelationInsights;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use utils::NormalityAssessment;
pub use sample::*;
pub use toy::*;
Modules§
- adaptive_streaming_engine: Adaptive Streaming Data Processing Engine
- advanced_generators: Advanced synthetic data generators
- benchmarks: Performance benchmarking utilities
- cache: Dataset caching functionality
- cloud: Cloud storage integration for datasets
- distributed: Distributed dataset processing capabilities
- domain_specific: Domain-specific datasets for scientific research
- error: Error types for the datasets module
- explore: Interactive dataset exploration and analysis tools
- external: External data sources integration
- generators: Dataset generators
- gpu: GPU acceleration for dataset operations
- gpu_optimization: Advanced GPU Optimization Engine
- loaders: Data loading utilities
- ml_integration: Machine learning pipeline integration
- neuromorphic_data_processor: Neuromorphic Data Processing Engine
- quantum_enhanced_generators: Quantum-Enhanced Data Generation Engine
- quantum_neuromorphic_fusion: Quantum-Neuromorphic Fusion Engine
- real_world: Real-world dataset collection
- registry: Dataset registry system for managing dataset metadata and locations
- sample: Sample datasets for testing and demonstration
- stability: API stability guarantees and compatibility documentation
- streaming: Streaming support for large datasets
- time_series: Time series datasets
- toy: Toy datasets for testing and examples
- utils: Core utilities for working with datasets
Macros§
- api_stability: Macro to easily annotate APIs with stability information