Crate scirs2_datasets

Datasets module for SciRS2

This module provides dataset loading utilities similar to scikit-learn’s datasets module. It includes toy datasets, sample datasets, time series datasets, data generators, and utilities for loading and processing datasets.

§Features

  • Toy datasets: Classic datasets like Iris, Boston Housing, Breast Cancer, and Digits
  • Data generators: Create synthetic datasets for classification, regression, clustering, and time series
  • Cross-validation utilities: K-fold, stratified, and time series cross-validation
  • Dataset utilities: Train/test splitting, normalization, and metadata handling
  • Caching: Efficient caching system for downloaded datasets
  • Registry: Centralized registry for dataset metadata and locations
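To make the normalization item above concrete, here is a minimal, standard-library-only sketch of column-wise min-max scaling. The function name `min_max_scale_columns`, its signature, and the row-major `Vec<Vec<f64>>` layout are assumptions made for illustration; they are not the signature of this crate's `min_max_scale`, which may differ.

```rust
/// Hypothetical helper (not part of scirs2_datasets): rescale every column of a
/// row-major matrix into the range [lo, hi].
/// Constant columns (max == min) are mapped to `lo` to avoid dividing by zero.
fn min_max_scale_columns(rows: &mut [Vec<f64>], lo: f64, hi: f64) {
    if rows.is_empty() {
        return;
    }
    let n_features = rows[0].len();
    for j in 0..n_features {
        // First pass: find the column's minimum and maximum.
        let mut min = f64::INFINITY;
        let mut max = f64::NEG_INFINITY;
        for row in rows.iter() {
            min = min.min(row[j]);
            max = max.max(row[j]);
        }
        // Second pass: map [min, max] linearly onto [lo, hi].
        let range = max - min;
        for row in rows.iter_mut() {
            row[j] = if range > 0.0 {
                lo + (row[j] - min) / range * (hi - lo)
            } else {
                lo
            };
        }
    }
}
```

Calling `min_max_scale_columns(&mut data, 0.0, 1.0)` on rows `[1.0, 10.0]`, `[2.0, 10.0]`, `[3.0, 10.0]` rescales the first column to `0.0`, `0.5`, `1.0` and maps the constant second column to `0.0`.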

§Examples

§Loading toy datasets

use scirs2_datasets::{load_iris, load_boston};

// Load the classic Iris dataset
let iris = load_iris().unwrap();
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());

// Load the Boston housing dataset
let boston = load_boston().unwrap();
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());

§Generating synthetic datasets

use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};

// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
println!("Classification dataset: {} samples, {} features, {} classes",
         classification.n_samples(), classification.n_features(), 3);

// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).unwrap();
println!("Regression dataset: {} samples, {} features",
         regression.n_samples(), regression.n_features());

// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).unwrap();
println!("Blobs dataset: {} samples, {} features, {} clusters",
         blobs.n_samples(), blobs.n_features(), 4);

// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).unwrap();
let moons = make_moons(150, 0.05, Some(42)).unwrap();
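For readers curious what a "moons" pattern looks like geometrically, the sketch below builds a noise-free version using only the standard library. It follows the well-known two-half-circles construction (as used by scikit-learn's `make_moons`); whether this crate's `make_moons` uses exactly the same layout is an assumption, and the noise and seed parameters are deliberately left out. The name `two_moons` is hypothetical.

```rust
use std::f64::consts::PI;

/// Hypothetical helper (not part of scirs2_datasets): noise-free two-moons data.
/// Returns `(points, labels)` with `n_per_moon` points on each half circle.
fn two_moons(n_per_moon: usize) -> (Vec<[f64; 2]>, Vec<usize>) {
    let mut points = Vec::with_capacity(2 * n_per_moon);
    let mut labels = Vec::with_capacity(2 * n_per_moon);
    // Spread angles evenly over [0, pi]; a single point sits at angle 0.
    let step = if n_per_moon > 1 {
        PI / (n_per_moon - 1) as f64
    } else {
        0.0
    };
    // Upper moon: the top half of the unit circle, label 0.
    for i in 0..n_per_moon {
        let t = step * i as f64;
        points.push([t.cos(), t.sin()]);
        labels.push(0);
    }
    // Lower moon: the same arc flipped and shifted right and down, label 1.
    for i in 0..n_per_moon {
        let t = step * i as f64;
        points.push([1.0 - t.cos(), 0.5 - t.sin()]);
        labels.push(1);
    }
    (points, labels)
}
```

The two arcs interlock without being linearly separable, which is why this shape is a common test case for non-linear classifiers and clustering methods.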

§Cross-validation

use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};

let iris = load_iris().unwrap();

// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).unwrap();
println!("Created {} folds for K-fold CV", k_folds.len());

// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).unwrap();
    println!("Created {} stratified folds", stratified_folds.len());
}
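
As a rough illustration of what the two splitting strategies above compute, the following standard-library-only sketch partitions sample indices into folds. Both function names and signatures are hypothetical and are not this crate's API; shuffling and seeding, which `k_fold_split` exposes through its last two arguments, are omitted to keep the example deterministic.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: partition indices `0..n_samples` into `n_folds`
/// `(train, validation)` pairs. The first `n_samples % n_folds` folds receive
/// one extra validation sample so that fold sizes differ by at most one.
fn k_fold_indices(n_samples: usize, n_folds: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(
        n_folds >= 2 && n_folds <= n_samples,
        "need 2 <= n_folds <= n_samples"
    );
    let base = n_samples / n_folds;
    let extra = n_samples % n_folds;
    let mut folds = Vec::with_capacity(n_folds);
    let mut start = 0;
    for fold in 0..n_folds {
        let size = base + if fold < extra { 1 } else { 0 };
        let end = start + size;
        // Validation is one contiguous block; training is everything else.
        let validation: Vec<usize> = (start..end).collect();
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
        folds.push((train, validation));
        start = end;
    }
    folds
}

/// Hypothetical helper: give each sample a fold id, dealing the samples of
/// every class round-robin across folds so class proportions stay balanced.
fn stratified_fold_ids(labels: &[usize], n_folds: usize) -> Vec<usize> {
    // Per-class counter of how many samples have been assigned so far.
    let mut seen: BTreeMap<usize, usize> = BTreeMap::new();
    labels
        .iter()
        .map(|&label| {
            let counter = seen.entry(label).or_insert(0);
            let fold = *counter % n_folds;
            *counter += 1;
            fold
        })
        .collect()
}
```

With 10 samples and 3 folds, `k_fold_indices` yields validation blocks of sizes 4, 3, and 3. The stratified variant matters when classes are imbalanced: plain contiguous folds could leave a fold with no samples of a rare class, whereas the round-robin assignment spreads each class evenly.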

§Dataset manipulation

use scirs2_datasets::{load_iris, Dataset};

let iris = load_iris().unwrap();

// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(featurenames) = iris.featurenames() {
    println!("Features: {:?}", featurenames);
}

Re-exports§

pub use adaptive_streaming_engine::create_adaptive_engine;
pub use adaptive_streaming_engine::create_adaptive_engine_with_config;
pub use adaptive_streaming_engine::AdaptiveStreamConfig;
pub use adaptive_streaming_engine::AdaptiveStreamingEngine;
pub use adaptive_streaming_engine::AlertSeverity;
pub use adaptive_streaming_engine::AlertType;
pub use adaptive_streaming_engine::ChunkMetadata;
pub use adaptive_streaming_engine::DataCharacteristics;
pub use adaptive_streaming_engine::MemoryStrategy;
pub use adaptive_streaming_engine::PatternType;
pub use adaptive_streaming_engine::PerformanceMetrics;
pub use adaptive_streaming_engine::QualityAlert;
pub use adaptive_streaming_engine::QualityMetrics;
pub use adaptive_streaming_engine::StatisticalMoments;
pub use adaptive_streaming_engine::StreamChunk;
pub use adaptive_streaming_engine::TrendDirection;
pub use adaptive_streaming_engine::TrendIndicators;
pub use advanced_generators::make_adversarial_examples;
pub use advanced_generators::make_anomaly_dataset;
pub use advanced_generators::make_continual_learning_dataset;
pub use advanced_generators::make_domain_adaptation_dataset;
pub use advanced_generators::make_few_shot_dataset;
pub use advanced_generators::make_multitask_dataset;
pub use advanced_generators::AdversarialConfig;
pub use advanced_generators::AnomalyConfig;
pub use advanced_generators::AnomalyType;
pub use advanced_generators::AttackMethod;
pub use advanced_generators::ContinualLearningDataset;
pub use advanced_generators::DomainAdaptationConfig;
pub use advanced_generators::DomainAdaptationDataset;
pub use advanced_generators::FewShotDataset;
pub use advanced_generators::MultiTaskConfig;
pub use advanced_generators::MultiTaskDataset;
pub use advanced_generators::TaskType;
pub use benchmarks::BenchmarkResult;
pub use benchmarks::BenchmarkRunner;
pub use benchmarks::BenchmarkSuite;
pub use benchmarks::PerformanceComparison;
pub use cloud::presets::azure_client;
pub use cloud::presets::gcs_client;
pub use cloud::presets::public_s3_client;
pub use cloud::presets::s3_client;
pub use cloud::presets::s3_compatible_client;
pub use cloud::public_datasets::AWSOpenData;
pub use cloud::public_datasets::AzureOpenData;
pub use cloud::public_datasets::GCPPublicData;
pub use cloud::CloudClient;
pub use cloud::CloudConfig;
pub use cloud::CloudCredentials;
pub use cloud::CloudProvider;
pub use distributed::DistributedConfig;
pub use distributed::DistributedProcessor;
pub use distributed::ScalingMethod;
pub use distributed::ScalingParameters;
pub use domain_specific::astronomy::StellarDatasets;
pub use domain_specific::climate::ClimateDatasets;
pub use domain_specific::convenience::list_domain_datasets;
pub use domain_specific::convenience::load_atmospheric_chemistry;
pub use domain_specific::convenience::load_climate_data;
pub use domain_specific::convenience::load_exoplanets;
pub use domain_specific::convenience::load_gene_expression;
pub use domain_specific::convenience::load_stellar_classification;
pub use domain_specific::genomics::GenomicsDatasets;
pub use domain_specific::DomainConfig;
pub use domain_specific::QualityFilters;
pub use explore::convenience::explore;
pub use explore::convenience::export_summary;
pub use explore::convenience::info;
pub use explore::convenience::quick_summary;
pub use explore::DatasetExplorer;
pub use explore::DatasetSummary;
pub use explore::ExploreConfig;
pub use explore::FeatureStatistics;
pub use explore::InferredDataType;
pub use explore::OutputFormat;
pub use explore::QualityAssessment;
pub use external::convenience::load_github_dataset_sync;
pub use external::convenience::load_uci_dataset_sync;
pub use external::convenience::list_uci_datasets;
pub use external::convenience::load_from_url_sync;
pub use external::repositories::GitHubRepository;
pub use external::repositories::KaggleRepository;
pub use external::repositories::UCIRepository;
pub use external::ExternalClient;
pub use external::ExternalConfig;
pub use external::ProgressCallback;
pub use ml_integration::convenience::create_experiment;
pub use ml_integration::convenience::cv_split;
pub use ml_integration::convenience::prepare_for_ml;
pub use ml_integration::convenience::train_test_split;
pub use ml_integration::CrossValidationResults;
pub use ml_integration::DataSplit;
pub use ml_integration::MLExperiment;
pub use ml_integration::MLPipeline;
pub use ml_integration::MLPipelineConfig;
pub use ml_integration::ScalingMethod as MLScalingMethod;
pub use cache::get_cachedir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use generators::add_time_series_noise;
pub use generators::benchmark_gpu_vs_cpu;
pub use generators::get_gpu_info;
pub use generators::gpu_is_available;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_blobs_gpu;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_classification_gpu;
pub use generators::make_corrupted_dataset;
pub use generators::make_helix;
pub use generators::make_hierarchical_clusters;
pub use generators::make_intersecting_manifolds;
pub use generators::make_manifold;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_regression_gpu;
pub use generators::make_s_curve;
pub use generators::make_severed_sphere;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_swiss_roll_advanced;
pub use generators::make_time_series;
pub use generators::make_torus;
pub use generators::make_twin_peaks;
pub use generators::ManifoldConfig;
pub use generators::ManifoldType;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use gpu::get_optimal_gpu_config;
pub use gpu::is_cuda_available;
pub use gpu::is_opencl_available;
pub use gpu::list_gpu_devices;
pub use gpu::make_blobs_auto_gpu;
pub use gpu::make_classification_auto_gpu;
pub use gpu::make_regression_auto_gpu;
pub use gpu::GpuBackend;
pub use gpu::GpuBenchmark;
pub use gpu::GpuBenchmarkResults;
pub use gpu::GpuConfig;
pub use gpu::GpuContext;
pub use gpu::GpuDeviceInfo;
pub use gpu::GpuMemoryConfig;
pub use gpu_optimization::benchmark_advanced_performance;
pub use gpu_optimization::generate_advanced_matrix;
pub use gpu_optimization::AdvancedGpuOptimizer;
pub use gpu_optimization::AdvancedKernelConfig;
pub use gpu_optimization::BenchmarkResult as AdvancedBenchmarkResult;
pub use gpu_optimization::DataLayout;
pub use gpu_optimization::LoadBalancingMethod;
pub use gpu_optimization::MemoryAccessPattern;
pub use gpu_optimization::PerformanceBenchmarkResults;
pub use gpu_optimization::SpecializationLevel;
pub use gpu_optimization::VectorizationStrategy;
pub use loaders::load_csv;
pub use loaders::load_csv_legacy;
pub use loaders::load_csv_parallel;
pub use loaders::load_csv_streaming;
pub use loaders::load_json;
pub use loaders::load_raw;
pub use loaders::save_json;
pub use loaders::CsvConfig;
pub use loaders::DatasetChunkIterator;
pub use loaders::StreamingConfig;
pub use neuromorphic_data_processor::create_neuromorphic_processor;
pub use neuromorphic_data_processor::create_neuromorphic_processor_with_topology;
pub use neuromorphic_data_processor::NetworkTopology;
pub use neuromorphic_data_processor::NeuromorphicProcessor;
pub use neuromorphic_data_processor::NeuromorphicTransform;
pub use neuromorphic_data_processor::SynapticPlasticity;
pub use quantum_enhanced_generators::make_quantum_blobs;
pub use quantum_enhanced_generators::make_quantum_classification;
pub use quantum_enhanced_generators::make_quantum_regression;
pub use quantum_enhanced_generators::QuantumDatasetGenerator;
pub use quantum_neuromorphic_fusion::create_fusion_with_params;
pub use quantum_neuromorphic_fusion::create_quantum_neuromorphic_fusion;
pub use quantum_neuromorphic_fusion::QuantumBioFusionResult;
pub use quantum_neuromorphic_fusion::QuantumInterference;
pub use quantum_neuromorphic_fusion::QuantumNeuromorphicFusion;
pub use real_world::list_real_world_datasets;
pub use real_world::load_adult;
pub use real_world::load_california_housing;
pub use real_world::load_heart_disease;
pub use real_world::load_red_wine_quality;
pub use real_world::load_titanic;
pub use real_world::RealWorldConfig;
pub use real_world::RealWorldDatasets;
pub use registry::get_registry;
pub use registry::load_dataset_byname;
pub use registry::DatasetMetadata;
pub use registry::DatasetRegistry;
pub use streaming::stream_classification;
pub use streaming::stream_csv;
pub use streaming::stream_regression;
pub use streaming::DataChunk;
pub use streaming::StreamConfig;
pub use streaming::StreamProcessor;
pub use streaming::StreamStats;
pub use streaming::StreamTransformer;
pub use streaming::StreamingIterator;
pub use utils::analyze_dataset_advanced;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::quick_quality_assessment;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::AdvancedDatasetAnalyzer;
pub use utils::AdvancedQualityMetrics;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CorrelationInsights;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use utils::NormalityAssessment;
pub use sample::*;
pub use toy::*;

Modules§

adaptive_streaming_engine
Adaptive Streaming Data Processing Engine
advanced_generators
Advanced synthetic data generators
benchmarks
Performance benchmarking utilities
cache
Dataset caching functionality
cloud
Cloud storage integration for datasets
distributed
Distributed dataset processing capabilities
domain_specific
Domain-specific datasets for scientific research
error
Error types for the datasets module
explore
Interactive dataset exploration and analysis tools
external
External data sources integration
generators
Dataset generators
gpu
GPU acceleration for dataset operations
gpu_optimization
Advanced GPU Optimization Engine
loaders
Data loading utilities
ml_integration
Machine learning pipeline integration
neuromorphic_data_processor
Neuromorphic Data Processing Engine
quantum_enhanced_generators
Quantum-Enhanced Data Generation Engine
quantum_neuromorphic_fusion
Quantum-Neuromorphic Fusion Engine
real_world
Real-world dataset collection
registry
Dataset registry system for managing dataset metadata and locations
sample
Sample datasets for testing and demonstration
stability
API stability guarantees and compatibility documentation
streaming
Streaming support for large datasets
time_series
Time series datasets
toy
Toy datasets for testing and examples
utils
Core utilities for working with datasets

Macros§

api_stability
Macro to easily annotate APIs with stability information