Datasets module for SciRS2
This module provides dataset loading utilities similar to scikit-learn’s datasets module. It includes toy datasets, sample datasets, time series datasets, data generators, and utilities for loading and processing datasets.
§Features
- Toy datasets: Classic datasets like Iris, Boston Housing, Breast Cancer, and Digits
- Data generators: Create synthetic datasets for classification, regression, clustering, and time series
- Cross-validation utilities: K-fold, stratified, and time series cross-validation
- Dataset utilities: Train/test splitting, normalization, and metadata handling
- Caching: Efficient caching system for downloaded datasets
- Registry: Centralized registry for dataset metadata and locations
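As an illustration of the normalization utilities listed above, the following is a minimal sketch of column-wise min-max scaling written against the standard library only. The function name `min_max_scale_sketch` and its signature are hypothetical, and the row-major `Vec<Vec<f64>>` layout is an assumption made for the example; the crate's own `min_max_scale` may take different arguments and operate on a different array type.

```rust
/// Rescale every column of a row-major table into the range [lo, hi].
/// Hypothetical helper for illustration; not the signature of
/// `scirs2_datasets::min_max_scale`.
/// Assumes all rows have the same length (panics on a short row).
fn min_max_scale_sketch(rows: &mut [Vec<f64>], lo: f64, hi: f64) {
    if rows.is_empty() {
        return;
    }
    let n_features = rows[0].len();
    for j in 0..n_features {
        // First pass: find the minimum and maximum of column j.
        let mut min = f64::INFINITY;
        let mut max = f64::NEG_INFINITY;
        for row in rows.iter() {
            min = min.min(row[j]);
            max = max.max(row[j]);
        }
        let range = max - min;
        // Second pass: map [min, max] linearly onto [lo, hi].
        for row in rows.iter_mut() {
            row[j] = if range > 0.0 {
                lo + (row[j] - min) / range * (hi - lo)
            } else {
                lo // constant column: every value maps to the lower bound
            };
        }
    }
}
```

Scaling in place avoids allocating a second table; a constant column is mapped to `lo` rather than dividing by zero.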
§Examples
§Loading toy datasets
use scirs2_datasets::{load_iris, load_boston};
// Load the classic Iris dataset
let iris = load_iris().unwrap();
println!("Iris dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
// Load the Boston housing dataset
let boston = load_boston().unwrap();
println!("Boston dataset: {} samples, {} features", boston.n_samples(), boston.n_features());
§Generating synthetic datasets
use scirs2_datasets::{make_classification, make_regression, make_blobs, make_spirals, make_moons};
// Generate a classification dataset
let classification = make_classification(100, 5, 3, 2, 4, Some(42)).unwrap();
println!("Classification dataset: {} samples, {} features, {} classes",
    classification.n_samples(), classification.n_features(), 3);
// Generate a regression dataset
let regression = make_regression(50, 4, 3, 0.1, Some(42)).unwrap();
println!("Regression dataset: {} samples, {} features",
    regression.n_samples(), regression.n_features());
// Generate a clustering dataset
let blobs = make_blobs(80, 3, 4, 1.0, Some(42)).unwrap();
println!("Blobs dataset: {} samples, {} features, {} clusters",
    blobs.n_samples(), blobs.n_features(), 4);
// Generate non-linear patterns
let spirals = make_spirals(200, 2, 0.1, Some(42)).unwrap();
let moons = make_moons(150, 0.05, Some(42)).unwrap();
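To give a sense of what a "moons" generator produces, here is a minimal, noise-free sketch using only the standard library. It follows the common two-half-circles construction (the one scikit-learn documents for its `make_moons`); whether this crate uses exactly the same geometry is an assumption. The name `make_moons_sketch`, its signature, and its return type are hypothetical, and the noise and seed parameters of the real `make_moons` are omitted.

```rust
/// Generate `n_samples` points on two interleaving half circles.
/// Returns (points, labels), with label 0 for the upper moon and 1 for the lower.
/// Hypothetical illustration; not the implementation in `scirs2_datasets`.
fn make_moons_sketch(n_samples: usize) -> (Vec<[f64; 2]>, Vec<usize>) {
    let n_upper = n_samples / 2;
    let n_lower = n_samples - n_upper;
    let mut points = Vec::with_capacity(n_samples);
    let mut labels = Vec::with_capacity(n_samples);

    // Angle of the i-th of n points, evenly spaced over [0, pi].
    let angle = |i: usize, n: usize| -> f64 {
        if n > 1 {
            std::f64::consts::PI * i as f64 / (n - 1) as f64
        } else {
            0.0
        }
    };

    // Upper moon: the top half of the unit circle centred at the origin.
    for i in 0..n_upper {
        let t = angle(i, n_upper);
        points.push([t.cos(), t.sin()]);
        labels.push(0);
    }
    // Lower moon: a mirrored half circle shifted right by 1 and up by 0.5.
    for i in 0..n_lower {
        let t = angle(i, n_lower);
        points.push([1.0 - t.cos(), 0.5 - t.sin()]);
        labels.push(1);
    }
    (points, labels)
}
```

The two classes cannot be separated by a straight line, which is why this shape is a common test case for non-linear classifiers and clustering methods.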
§Cross-validation
use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split};
let iris = load_iris().unwrap();
// K-fold cross-validation
let k_folds = k_fold_split(iris.n_samples(), 5, true, Some(42)).unwrap();
println!("Created {} folds for K-fold CV", k_folds.len());
// Stratified K-fold cross-validation
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42)).unwrap();
    println!("Created {} stratified folds", stratified_folds.len());
}
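The idea behind K-fold splitting can be sketched in a few lines of standard-library Rust. This version partitions the indices `0..n_samples` into `k` contiguous folds without shuffling and returns one (train, validation) pair per fold. The name `k_fold_indices_sketch` and its return type are hypothetical; the crate's `k_fold_split` additionally takes a shuffle flag and a seed, and its actual return type is not shown in this page.

```rust
/// Split the indices `0..n_samples` into `k` contiguous folds.
/// Returns one (train_indices, validation_indices) pair per fold.
/// Hypothetical illustration without shuffling; not the crate's `k_fold_split`.
fn k_fold_indices_sketch(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n_samples, "need 2 <= k <= n_samples");
    let base = n_samples / k; // minimum size of each validation fold
    let extra = n_samples % k; // the first `extra` folds get one more sample
    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + usize::from(fold < extra);
        let end = start + size;
        // Validation set: the current contiguous block.
        let validation: Vec<usize> = (start..end).collect();
        // Training set: everything before and after that block.
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();
        folds.push((train, validation));
        start = end;
    }
    folds
}
```

With 10 samples and 3 folds, the validation blocks have sizes 4, 3, and 3, and every index appears in exactly one validation set.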
§Dataset manipulation
use scirs2_datasets::{load_iris, Dataset};
let iris = load_iris().unwrap();
// Access dataset properties
println!("Dataset: {} samples, {} features", iris.n_samples(), iris.n_features());
if let Some(feature_names) = iris.feature_names() {
    println!("Features: {:?}", feature_names);
}
Re-exports§
pub use cache::get_cache_dir;
pub use cache::BatchOperations;
pub use cache::BatchResult;
pub use cache::CacheFileInfo;
pub use cache::CacheManager;
pub use cache::CacheStats;
pub use cache::DatasetCache;
pub use cache::DetailedCacheStats;
pub use generators::add_time_series_noise;
pub use generators::inject_missing_data;
pub use generators::inject_outliers;
pub use generators::make_anisotropic_blobs;
pub use generators::make_blobs;
pub use generators::make_circles;
pub use generators::make_classification;
pub use generators::make_corrupted_dataset;
pub use generators::make_hierarchical_clusters;
pub use generators::make_moons;
pub use generators::make_regression;
pub use generators::make_spirals;
pub use generators::make_swiss_roll;
pub use generators::make_time_series;
pub use generators::MissingPattern;
pub use generators::OutlierType;
pub use utils::create_balanced_dataset;
pub use utils::create_binned_features;
pub use utils::generate_synthetic_samples;
pub use utils::importance_sample;
pub use utils::k_fold_split;
pub use utils::min_max_scale;
pub use utils::polynomial_features;
pub use utils::random_oversample;
pub use utils::random_sample;
pub use utils::random_undersample;
pub use utils::robust_scale;
pub use utils::statistical_features;
pub use utils::stratified_k_fold_split;
pub use utils::stratified_sample;
pub use utils::time_series_split;
pub use utils::BalancingStrategy;
pub use utils::BinningStrategy;
pub use utils::CrossValidationFolds;
pub use utils::Dataset;
pub use registry::*;
pub use sample::*;
pub use toy::*;
Modules§
- cache: Dataset caching functionality
- error: Error types for the datasets module
- generators: Dataset generators
- loaders: Data loading utilities
- registry: Dataset registry system for managing dataset metadata and locations
- sample: Sample datasets for testing and demonstration
- time_series: Time series datasets
- toy: Toy datasets for testing and examples
- utils: Core utilities for working with datasets