Crate scirs2_datasets


§SciRS2 Datasets - Dataset Loading and Generation

scirs2-datasets provides dataset utilities modeled after scikit-learn’s datasets module. It offers toy datasets (Iris, Boston, MNIST), synthetic data generators, cross-validation splitters, and data preprocessing utilities for machine learning workflows.

§🎯 Key Features

  • Toy Datasets: Classic datasets (Iris, Boston Housing, Breast Cancer, Digits)
  • Data Generators: Synthetic data for classification, regression, clustering
  • Cross-Validation: K-fold, stratified, time series CV splitters
  • Preprocessing: Train/test split, normalization, feature scaling
  • Caching: Efficient disk caching for downloaded datasets
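
To make the normalization and feature scaling item above concrete, here is a minimal, standard-library-only sketch of min-max scaling. It is an illustration of the technique, not the crate's implementation: the function name `min_max_scale_columns` and the `Vec<Vec<f64>>` row-major layout are assumptions made for this example, and the crate's own `min_max_scale` may use a different signature and array type.

```rust
/// Rescale every column of a row-major matrix to the range [0, 1].
/// Hypothetical helper for illustration; assumes all rows have the same length.
fn min_max_scale_columns(rows: &mut [Vec<f64>]) {
    if rows.is_empty() {
        return;
    }
    let n_cols = rows[0].len();
    for col in 0..n_cols {
        // First pass: find the column minimum and maximum.
        let mut min = f64::INFINITY;
        let mut max = f64::NEG_INFINITY;
        for row in rows.iter() {
            min = min.min(row[col]);
            max = max.max(row[col]);
        }
        let range = max - min;
        // Second pass: map each value to (x - min) / (max - min).
        for row in rows.iter_mut() {
            // A constant column has zero range; map it to 0.0 to avoid dividing by zero.
            row[col] = if range > 0.0 { (row[col] - min) / range } else { 0.0 };
        }
    }
}
```

Scaling is done in place and per column, so features measured in different units end up on a comparable scale. Mapping constant columns to 0.0 is a design choice of this sketch; other conventions are possible.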

§📦 Module Overview

| SciRS2 Function     | scikit-learn Equivalent              | Description                         |
|---------------------|--------------------------------------|-------------------------------------|
| load_iris           | sklearn.datasets.load_iris           | Classic Iris classification dataset |
| load_boston         | sklearn.datasets.load_boston         | Boston housing regression dataset   |
| make_classification | sklearn.datasets.make_classification | Synthetic classification data       |
| make_regression     | sklearn.datasets.make_regression     | Synthetic regression data           |
| make_blobs          | sklearn.datasets.make_blobs          | Synthetic clustering data           |
| k_fold_split        | sklearn.model_selection.KFold        | K-fold cross-validation             |

§🚀 Quick Start

[dependencies]
scirs2-datasets = "0.1.0-rc.2"

use scirs2_datasets::{load_iris, make_classification};

// Load classic Iris dataset
let iris = load_iris().unwrap();
println!("{} samples, {} features", iris.n_samples(), iris.n_features());

// Generate synthetic classification data
let data = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();

§🔒 Version: 0.1.0-rc.2 (October 03, 2025)

§Examples

§Loading toy datasets

use scirs2_datasets::{load_iris, load_boston};

// Load the classic Iris dataset
let iris = load_iris().unwrap();
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());

// Load the Boston housing dataset
let boston = load_boston().unwrap();
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());

§Generating synthetic datasets

use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};

// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
println!("Classification dataset: {} samples, {} features, {} classes",
         classification.n_samples(), classification.n_features(), 3);

// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).unwrap();
println!("Regression dataset: {} samples, {} features",
         regression.n_samples(), regression.n_features());

// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).unwrap();
println!("Blobs dataset: {} samples, {} features, {} clusters",
         blobs.n_samples(), blobs.n_features(), 4);

// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).unwrap();
let moons = make_moons(150, 0.05, Some(42)).unwrap();

§Cross-validation

use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};

let iris = load_iris().unwrap();

// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).unwrap();
println!("Created {} folds for K-fold CV", k_folds.len());

// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).unwrap();
    println!("Created {} stratified folds", stratified_folds.len());
}
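
For readers who want to see what a K-fold split computes, the following is a minimal sketch that uses only the standard library. It is not the crate's implementation: the name `k_fold_indices`, the `(train, test)` tuple return type, and the `String` error are assumptions for illustration, and unlike the `k_fold_split` call above it does not shuffle or take a seed.

```rust
/// Split the indices `0..n_samples` into `k` contiguous folds.
/// Returns one `(train_indices, test_indices)` pair per fold.
/// Hypothetical helper for illustration; no shuffling is performed.
fn k_fold_indices(
    n_samples: usize,
    k: usize,
) -> Result<Vec<(Vec<usize>, Vec<usize>)>, String> {
    if k < 2 {
        return Err("k must be at least 2".to_string());
    }
    if k > n_samples {
        return Err("k cannot exceed the number of samples".to_string());
    }
    let base = n_samples / k;
    let remainder = n_samples % k;
    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        // The first `remainder` folds receive one extra sample each,
        // so fold sizes differ by at most one.
        let size = base + if fold < remainder { 1 } else { 0 };
        let end = start + size;
        let test: Vec<usize> = (start..end).collect();
        // Everything outside the test range is used for training.
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
        folds.push((train, test));
        start = end;
    }
    Ok(folds)
}
```

Every index appears in exactly one test fold, which is the property that makes the K evaluation rounds cover the whole dataset. A shuffled variant would permute the index list with a seeded random number generator before slicing it into folds.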

§Dataset manipulation

use scirs2_datasets::{load_iris, Dataset};

let iris = load_iris().unwrap();

// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(featurenames) = iris.featurenames() {
    println!("Features: {:?}", featurenames);
}

Re-exports§

pub use adaptive_streaming_engine::create_adaptive_engine;
pub use adaptive_streaming_engine::create_adaptive_engine_with_config;
pub use adaptive_streaming_engine::AdaptiveStreamConfig;
pub use adaptive_streaming_engine::AdaptiveStreamingEngine;
pub use adaptive_streaming_engine::AlertSeverity;
pub use adaptive_streaming_engine::AlertType;
pub use adaptive_streaming_engine::ChunkMetadata;
pub use adaptive_streaming_engine::DataCharacteristics;
pub use adaptive_streaming_engine::MemoryStrategy;
pub use adaptive_streaming_engine::PatternType;
pub use adaptive_streaming_engine::PerformanceMetrics;
pub use adaptive_streaming_engine::QualityAlert;
pub use adaptive_streaming_engine::QualityMetrics;
pub use adaptive_streaming_engine::StatisticalMoments;
pub use adaptive_streaming_engine::StreamChunk;
pub use adaptive_streaming_engine::TrendDirection;
pub use adaptive_streaming_engine::TrendIndicators;
pub use advanced_generators::make_adversarial_examples;
pub use advanced_generators::make_anomaly_dataset;
pub use advanced_generators::make_continual_learning_dataset;
pub use advanced_generators::make_domain_adaptation_dataset;
pub use advanced_generators::make_few_shot_dataset;
pub use advanced_generators::make_multitask_dataset;
pub use advanced_generators::AdversarialConfig;
pub use advanced_generators::AnomalyConfig;
pub use advanced_generators::AnomalyType;
pub use advanced_generators::AttackMethod;
pub use advanced_generators::ContinualLearningDataset;
pub use advanced_generators::DomainAdaptationConfig;
pub use advanced_generators::DomainAdaptationDataset;
pub use advanced_generators::FewShotDataset;
pub use advanced_generators::MultiTaskConfig;
pub use advanced_generators::MultiTaskDataset;
pub use advanced_generators::TaskType;
pub use benchmarks::BenchmarkResult;
pub use benchmarks::BenchmarkRunner;
pub use benchmarks::BenchmarkSuite;
pub use benchmarks::PerformanceComparison;
pub use cloud::presets::azure_client;
pub use cloud::presets::gcs_client;
pub use cloud::presets::public_s3_client;
pub use cloud::presets::s3_client;
pub use cloud::presets::s3_compatible_client;
pub use cloud::public_datasets::AWSOpenData;
pub use cloud::public_datasets::AzureOpenData;
pub use cloud::public_datasets::GCPPublicData;
pub use cloud::CloudClient;
pub use cloud::CloudConfig;
pub use cloud::CloudCredentials;
pub use cloud::CloudProvider;
pub use distributed::DistributedConfig;
pub use distributed::DistributedProcessor;
pub use distributed::ScalingMethod;
pub use distributed::ScalingParameters;
pub use domain_specific::astronomy::StellarDatasets;
pub use domain_specific::climate::ClimateDatasets;
pub use domain_specific::convenience::list_domain_datasets;
pub use domain_specific::convenience::load_atmospheric_chemistry;
pub use domain_specific::convenience::load_climate_data;
pub use domain_specific::convenience::load_exoplanets;
pub use domain_specific::convenience::load_gene_expression;
pub use domain_specific::convenience::load_stellar_classification;
pub use domain_specific::genomics::GenomicsDatasets;
pub use domain_specific::DomainConfig;
pub use domain_specific::QualityFilters;
pub use explore::convenience::explore;
pub use explore::convenience::export_summary;
pub use explore::convenience::info;
pub use explore::convenience::quick_summary;
pub use explore::DatasetExplorer;
pub use explore::DatasetSummary;
pub use explore::ExploreConfig;
pub use explore::FeatureStatistics;
pub use explore::InferredDataType;
pub use explore::OutputFormat;
pub use explore::QualityAssessment;
pub use external::convenience::load_github_dataset_sync;
pub use external::convenience::load_uci_dataset_sync;
pub use external::convenience::list_uci_datasets;
pub use external::convenience::load_from_url_sync;
pub use external::repositories::GitHubRepository;
pub use external::repositories::KaggleRepository;
pub use external::repositories::UCIRepository;
pub use external::ExternalClient;
pub use external::ExternalConfig;
pub use external::ProgressCallback;
pub use ml_integration::convenience::create_experiment;
pub use ml_integration::convenience::cv_split;
pub use ml_integration::convenience::prepare_for_ml;
pub use ml_integration::convenience::train_test_split;
pub use ml_integration::CrossValidationResults;
pub use ml_integration::DataSplit;
pub use ml_integration::MLExperiment;
pub use ml_integration::MLPipeline;
pub use ml_integration::MLPipelineConfig;
pub use ml_integration::ScalingMethod as MLScalingMethod;
pub use cache::get_cachedir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use generators::add_time_series_noise;
pub use generators::benchmark_gpu_vs_cpu;
pub use generators::get_gpu_info;
pub use generators::gpu_is_available;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_blobs_gpu;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_classification_gpu;
pub use generators::make_corrupted_dataset;
pub use generators::make_helix;
pub use generators::make_hierarchical_clusters;
pub use generators::make_intersecting_manifolds;
pub use generators::make_manifold;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_regression_gpu;
pub use generators::make_s_curve;
pub use generators::make_severed_sphere;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_swiss_roll_advanced;
pub use generators::make_time_series;
pub use generators::make_torus;
pub use generators::make_twin_peaks;
pub use generators::ManifoldConfig;
pub use generators::ManifoldType;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use gpu::get_optimal_gpu_config;
pub use gpu::is_cuda_available;
pub use gpu::is_opencl_available;
pub use gpu::list_gpu_devices;
pub use gpu::make_blobs_auto_gpu;
pub use gpu::make_classification_auto_gpu;
pub use gpu::make_regression_auto_gpu;
pub use gpu::GpuBackend;
pub use gpu::GpuBenchmark;
pub use gpu::GpuBenchmarkResults;
pub use gpu::GpuConfig;
pub use gpu::GpuContext;
pub use gpu::GpuDeviceInfo;
pub use gpu::GpuMemoryConfig;
pub use gpu_optimization::benchmark_advanced_performance;
pub use gpu_optimization::generate_advanced_matrix;
pub use gpu_optimization::AdvancedGpuOptimizer;
pub use gpu_optimization::AdvancedKernelConfig;
pub use gpu_optimization::BenchmarkResult as AdvancedBenchmarkResult;
pub use gpu_optimization::DataLayout;
pub use gpu_optimization::LoadBalancingMethod;
pub use gpu_optimization::MemoryAccessPattern;
pub use gpu_optimization::PerformanceBenchmarkResults;
pub use gpu_optimization::SpecializationLevel;
pub use gpu_optimization::VectorizationStrategy;
pub use loaders::load_csv;
pub use loaders::load_csv_legacy;
pub use loaders::load_csv_parallel;
pub use loaders::load_csv_streaming;
pub use loaders::load_json;
pub use loaders::load_raw;
pub use loaders::save_json;
pub use loaders::CsvConfig;
pub use loaders::DatasetChunkIterator;
pub use loaders::StreamingConfig;
pub use neuromorphic_data_processor::create_neuromorphic_processor;
pub use neuromorphic_data_processor::create_neuromorphic_processor_with_topology;
pub use neuromorphic_data_processor::NetworkTopology;
pub use neuromorphic_data_processor::NeuromorphicProcessor;
pub use neuromorphic_data_processor::NeuromorphicTransform;
pub use neuromorphic_data_processor::SynapticPlasticity;
pub use quantum_enhanced_generators::make_quantum_blobs;
pub use quantum_enhanced_generators::make_quantum_classification;
pub use quantum_enhanced_generators::make_quantum_regression;
pub use quantum_enhanced_generators::QuantumDatasetGenerator;
pub use quantum_neuromorphic_fusion::create_fusion_with_params;
pub use quantum_neuromorphic_fusion::create_quantum_neuromorphic_fusion;
pub use quantum_neuromorphic_fusion::QuantumBioFusionResult;
pub use quantum_neuromorphic_fusion::QuantumInterference;
pub use quantum_neuromorphic_fusion::QuantumNeuromorphicFusion;
pub use real_world::list_real_world_datasets;
pub use real_world::load_adult;
pub use real_world::load_california_housing;
pub use real_world::load_heart_disease;
pub use real_world::load_red_wine_quality;
pub use real_world::load_titanic;
pub use real_world::RealWorldConfig;
pub use real_world::RealWorldDatasets;
pub use registry::get_registry;
pub use registry::load_dataset_byname;
pub use registry::DatasetMetadata;
pub use registry::DatasetRegistry;
pub use streaming::stream_classification;
pub use streaming::stream_csv;
pub use streaming::stream_regression;
pub use streaming::DataChunk;
pub use streaming::StreamConfig;
pub use streaming::StreamProcessor;
pub use streaming::StreamStats;
pub use streaming::StreamTransformer;
pub use streaming::StreamingIterator;
pub use utils::analyze_dataset_advanced;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::quick_quality_assessment;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::AdvancedDatasetAnalyzer;
pub use utils::AdvancedQualityMetrics;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CorrelationInsights;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use utils::NormalityAssessment;
pub use sample::*;
pub use toy::*;

Modules§

adaptive_streaming_engine
Adaptive Streaming Data Processing Engine
advanced_generators
Advanced synthetic data generators
benchmarks
Performance benchmarking utilities
cache
Dataset caching functionality
cloud
Cloud storage integration for datasets
distributed
Distributed dataset processing capabilities
domain_specific
Domain-specific datasets for scientific research
error
Error types for the datasets module
explore
Interactive dataset exploration and analysis tools
external
External data sources integration
generators
Dataset generators
gpu
GPU acceleration for dataset operations
gpu_optimization
Advanced GPU Optimization Engine
loaders
Data loading utilities
ml_integration
Machine learning pipeline integration
neuromorphic_data_processor
Neuromorphic Data Processing Engine
quantum_enhanced_generators
Quantum-Enhanced Data Generation Engine
quantum_neuromorphic_fusion
Quantum-Neuromorphic Fusion Engine
real_world
Real-world dataset collection
registry
Dataset registry system for managing dataset metadata and locations
sample
Sample datasets for testing and demonstration
stability
API stability guarantees and compatibility documentation
streaming
Streaming support for large datasets
time_series
Time series datasets
toy
Toy datasets for testing and examples
utils
Core utilities for working with datasets

Macros§

api_stability
Macro to easily annotate APIs with stability information