Crate scirs2_datasets

Datasets module for SciRS2

This module provides dataset loading utilities similar to scikit-learn’s datasets module. It includes toy datasets, sample datasets, time series datasets, data generators, and utilities for loading and processing datasets.

§Features

  • Toy datasets: Classic datasets like Iris, Boston Housing, Breast Cancer, and Digits
  • Data generators: Create synthetic datasets for classification, regression, clustering, and time series
  • Cross-validation utilities: K-fold, stratified K-fold, and time series splits
  • Dataset utilities: Train/test splitting, normalization, and metadata handling
  • Caching: Efficient caching system for downloaded datasets
  • Registry: Centralized registry for dataset metadata and locations
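
As an illustration of the train/test splitting mentioned under dataset utilities, the following is a minimal standalone sketch of a seeded shuffle-and-split over sample indices. It does not use this crate's API: `Lcg` and `train_test_indices` are hypothetical names introduced only for this example, and the small generator stands in for a real random number source so that the sketch needs no external dependencies.

```rust
/// Minimal linear congruential generator, used only so the sketch needs no
/// external crates. It is not this crate's RNG and is not statistically strong.
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    /// Returns a pseudo-random value in `0..bound` (`bound` must be > 0).
    pub fn next_below(&mut self, bound: usize) -> usize {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the high bits, which are better distributed than the low ones.
        ((self.0 >> 33) as usize) % bound
    }
}

/// Shuffles `0..n_samples` with a Fisher-Yates pass, then cuts the result
/// into (train, test) index lists. `test_fraction` is expected in `0.0..=1.0`.
pub fn train_test_indices(
    n_samples: usize,
    test_fraction: f64,
    seed: u64,
) -> (Vec<usize>, Vec<usize>) {
    let mut indices: Vec<usize> = (0..n_samples).collect();
    let mut rng = Lcg::new(seed);
    // Fisher-Yates: swap position i with a random position in 0..=i.
    for i in (1..n_samples).rev() {
        let j = rng.next_below(i + 1);
        indices.swap(i, j);
    }
    let n_test = ((n_samples as f64) * test_fraction).round() as usize;
    let n_test = n_test.min(n_samples);
    // The last `n_test` shuffled indices become the test set.
    let test = indices.split_off(n_samples - n_test);
    (indices, test)
}
```

Calling `train_test_indices(150, 0.2, 42)` twice yields the same 120/30 partition both times, which is the property a fixed seed is meant to provide.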

§Examples

§Loading toy datasets

use scirs2_datasets::{load_iris, load_boston};

// Load the classic Iris dataset
let iris = load_iris().unwrap();
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());

// Load the Boston housing dataset
let boston = load_boston().unwrap();
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());

§Generating synthetic datasets

use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};

// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
println!("Classification dataset: {} samples, {} features, {} classes",
         classification.n_samples(), classification.n_features(), 3);

// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).unwrap();
println!("Regression dataset: {} samples, {} features",
         regression.n_samples(), regression.n_features());

// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).unwrap();
println!("Blobs dataset: {} samples, {} features, {} clusters",
         blobs.n_samples(), blobs.n_features(), 4);

// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).unwrap();
let moons = make_moons(150, 0.05, Some(42)).unwrap();
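
To give a sense of what a "moons" generator produces, here is a rough standalone sketch of the two interleaving half-circles, following the layout commonly used for this kind of dataset. It is not the crate's `make_moons`: the function name `two_moons` is hypothetical, and the noise and seed parameters seen above are deliberately left out so the output is deterministic.

```rust
use std::f64::consts::PI;

/// Returns `(points, labels)` for a noise-free two-moons layout.
/// The first half of the points form the upper arc (label 0); the rest form
/// the mirrored, shifted lower arc (label 1).
pub fn two_moons(n_samples: usize) -> (Vec<[f64; 2]>, Vec<usize>) {
    let n_upper = n_samples / 2;
    let n_lower = n_samples - n_upper;
    let mut points = Vec::with_capacity(n_samples);
    let mut labels = Vec::with_capacity(n_samples);

    // Evenly spaced angle in [0, PI]; a single point sits at angle 0.
    let angle = |i: usize, n: usize| {
        if n > 1 {
            PI * (i as f64) / ((n - 1) as f64)
        } else {
            0.0
        }
    };

    for i in 0..n_upper {
        let t = angle(i, n_upper);
        points.push([t.cos(), t.sin()]);
        labels.push(0);
    }
    for i in 0..n_lower {
        let t = angle(i, n_lower);
        // Mirror the arc and shift it so the two moons interleave.
        points.push([1.0 - t.cos(), 0.5 - t.sin()]);
        labels.push(1);
    }
    (points, labels)
}
```

A real generator would typically add Gaussian noise to each coordinate afterwards, which is presumably the role of the noise argument in the calls above.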

§Cross-validation

use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};

let iris = load_iris().unwrap();

// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).unwrap();
println!("Created {} folds for K-fold CV", k_folds.len());

// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).unwrap();
    println!("Created {} stratified folds", stratified_folds.len());
}
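
For readers unfamiliar with K-fold cross-validation, the idea behind the splits above can be sketched without the crate. The function below is a hypothetical helper (`k_fold_indices` is not part of this crate) that omits shuffling and seeding; it only shows how sample indices are partitioned so that each sample lands in exactly one validation fold.

```rust
/// Splits `0..n_samples` into `k` consecutive folds and returns, for each
/// fold, a `(train_indices, validation_indices)` pair. No shuffling is done.
/// Returns an empty list when `k` is 0 or larger than `n_samples`.
pub fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    if k == 0 || k > n_samples {
        return Vec::new();
    }
    let base = n_samples / k;
    let extra = n_samples % k; // the first `extra` folds get one more sample
    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + if fold < extra { 1 } else { 0 };
        let end = start + size;
        // The current block is held out; everything else is training data.
        let validation: Vec<usize> = (start..end).collect();
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
        folds.push((train, validation));
        start = end;
    }
    folds
}
```

With 10 samples and 3 folds, the validation folds have sizes 4, 3, and 3. A stratified variant would additionally balance class labels across folds, which this sketch does not attempt.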

§Dataset manipulation

use scirs2_datasets::{load_iris, Dataset};

let iris = load_iris().unwrap();

// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(feature_names) = iris.feature_names() {
    println!("Features: {:?}", feature_names);
}
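
The accessors used above suggest a tabular container with optional metadata. As a purely illustrative sketch, and not the crate's actual `Dataset` definition, a container of this kind could look like the following. `TableSketch`, its fields, and its `row` method are all assumptions made for the example; the real type may store its data quite differently.

```rust
/// Hypothetical container: row-major feature matrix plus optional metadata.
pub struct TableSketch {
    data: Vec<f64>,
    n_features: usize,
    pub target: Option<Vec<f64>>,
    feature_names: Option<Vec<String>>,
}

impl TableSketch {
    /// Builds a table, returning `None` when the shapes are inconsistent.
    pub fn new(
        data: Vec<f64>,
        n_features: usize,
        target: Option<Vec<f64>>,
        feature_names: Option<Vec<String>>,
    ) -> Option<Self> {
        if n_features == 0 || data.len() % n_features != 0 {
            return None;
        }
        let n_samples = data.len() / n_features;
        // Optional metadata must match the matrix shape when present.
        if target.as_ref().map_or(false, |t| t.len() != n_samples) {
            return None;
        }
        if feature_names.as_ref().map_or(false, |f| f.len() != n_features) {
            return None;
        }
        Some(TableSketch { data, n_features, target, feature_names })
    }

    pub fn n_samples(&self) -> usize {
        self.data.len() / self.n_features
    }

    pub fn n_features(&self) -> usize {
        self.n_features
    }

    pub fn feature_names(&self) -> Option<&[String]> {
        self.feature_names.as_deref()
    }

    /// Returns one sample (row) as a slice, or `None` if out of range.
    pub fn row(&self, i: usize) -> Option<&[f64]> {
        let start = i.checked_mul(self.n_features)?;
        self.data.get(start..start + self.n_features)
    }
}
```

Keeping the target and feature names optional lets the same container represent both labeled and unlabeled data, which is why the examples above check for `Some(...)` before using them.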

§Re-exports

pub use cache::get_cache_dir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use generators::add_time_series_noise;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_corrupted_dataset;
pub use generators::make_hierarchical_clusters;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_time_series;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use registry::*;
pub use sample::*;
pub use toy::*;
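
Several of the re-exported utilities are scaling helpers. The signatures of `min_max_scale` and `robust_scale` are not shown on this page, so the following is only a standalone sketch of the general min-max technique applied to a single feature column. The name `min_max_scale_column` and its in-place slice interface are assumptions for illustration, not the crate's API.

```rust
/// Rescales `values` in place to the range [0, 1].
/// A constant column has no spread to rescale, so it is set to 0.0.
pub fn min_max_scale_column(values: &mut [f64]) {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    for v in values.iter_mut() {
        *v = if range > 0.0 { (*v - min) / range } else { 0.0 };
    }
}
```

For the column `[2.0, 4.0, 6.0, 10.0]` this produces `[0.0, 0.25, 0.5, 1.0]`. A robust scaler would instead typically center on the median and divide by an interquartile range, which makes it less sensitive to outliers.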

§Modules

cache
Dataset caching functionality
error
Error types for the datasets module
generators
Dataset generators
loaders
Data loading utilities
registry
Dataset registry system for managing dataset metadata and locations
sample
Sample datasets for testing and demonstration
time_series
Time series datasets
toy
Toy datasets for testing and examples
utils
Core utilities for working with datasets