Crate dataset_core

A generic, thread-safe dataset container with lazy loading and caching.

dataset-core provides Dataset<T>, a lightweight wrapper that pairs a storage directory with a lazily-initialized value of any type T. The actual downloading and parsing logic is supplied by the caller through a loader closure, making Dataset<T> suitable for any data source — local files, remote URLs, databases, or in-memory generation.

On top of this core type, the crate offers optional feature-gated modules:

  • utils — helper functions for downloading files, extracting archives, verifying SHA-256 hashes, and managing temporary directories.
  • datasets — ready-to-use loaders for classic ML datasets (Iris, Boston Housing, Diabetes, Titanic, Wine Quality). These also serve as reference implementations showing how to wrap Dataset<T> for a concrete use case.

§Feature Flags

Feature     What it enables
utils       download_to, unzip, create_temp_dir, file_sha256_matches, acquire_dataset, and the error module
datasets    All built-in dataset loaders (implies utils)

With no features enabled, only Dataset<T> is available, and it depends only on std::sync::OnceLock.

§Quick Start — Dataset<T>

use dataset_core::Dataset;

fn my_loader(dir: &str) -> Result<Vec<String>, std::io::Error> {
    // In a real use case you would read/download files from `dir`.
    Ok(vec!["hello".to_string(), "world".to_string()])
}

let ds: Dataset<Vec<String>> = Dataset::new("./my_data");

// First call runs the loader; subsequent calls return the cached reference.
let data = ds.load(my_loader).unwrap();
assert_eq!(data.len(), 2);

let data_again = ds.load(my_loader).unwrap();
assert!(std::ptr::eq(data, data_again)); // same reference, no reload
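
The caching behaviour in the example above can be pictured with a small stand-alone sketch built on std::sync::OnceLock. This is not the crate's source: the type name LazyDataset and its internals are hypothetical, chosen only to illustrate how a loader closure and a once-initialized cell could fit together on stable Rust.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for `Dataset<T>`; the real type may be implemented differently.
pub struct LazyDataset<T> {
    dir: String,
    cell: OnceLock<T>,
}

impl<T> LazyDataset<T> {
    /// Pair a storage directory with an empty, not-yet-loaded cell.
    pub fn new(dir: &str) -> Self {
        Self { dir: dir.to_string(), cell: OnceLock::new() }
    }

    /// Return the cached value, running `loader` only if nothing is cached yet.
    pub fn load<E, F>(&self, loader: F) -> Result<&T, E>
    where
        F: FnOnce(&str) -> Result<T, E>,
    {
        // Fast path: a previous call already stored a value.
        if let Some(value) = self.cell.get() {
            return Ok(value);
        }
        // A failed load returns early and leaves the cell empty, so a later call can retry.
        let value = loader(&self.dir)?;
        // If another thread stored a value in the meantime, that one is kept and ours is dropped.
        Ok(self.cell.get_or_init(|| value))
    }
}
```

In this sketch, two threads that race on the first call may both run the loader, but only one result is stored and every caller sees the same reference afterwards. Whether Dataset<T> makes the same trade-off is not stated here.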

§Built-in Datasets (feature datasets)

Dataset                 Samples   Features   Task Type
Iris                    150       4          Classification
Boston Housing          506       13         Regression
Diabetes                768       8          Classification
Titanic                 891       11         Classification
Wine Quality (Red)      1,599     11         Regression
Wine Quality (White)    4,898     11         Regression

use dataset_core::datasets::iris::Iris;

let iris = Iris::new("./data");
let (features, labels) = iris.data().unwrap();
assert_eq!(features.shape(), &[150, 4]);

§Utility Functions (feature utils)

  • download_to — download a remote file into a directory
  • unzip — extract a ZIP archive
  • create_temp_dir — create a self-cleaning temporary directory
  • file_sha256_matches — verify a file’s SHA-256 hash
  • acquire_dataset — cache-aware dataset acquisition workflow (temp dir → prepare → optional hash check → move to final location)
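
The workflow listed for acquire_dataset (temp dir → prepare → optional hash check → move to final location) can be sketched with the standard library alone. The function name acquire_sketch, its parameters, and the caller-supplied verify callback are all assumptions made for illustration; the crate's actual signature and hashing are not shown on this page and may differ.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical sketch of a cache-aware acquisition step; not the crate's API.
/// `prepare` writes a file inside the temp dir and returns its path.
/// `verify` stands in for the optional hash check.
pub fn acquire_sketch(
    target: &Path,
    prepare: impl FnOnce(&Path) -> io::Result<PathBuf>,
    verify: Option<&dyn Fn(&Path) -> io::Result<bool>>,
) -> io::Result<PathBuf> {
    // Cache hit: the final file already exists, so nothing is prepared again.
    if target.exists() {
        return Ok(target.to_path_buf());
    }

    // Stage the work in a uniquely named temporary directory.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let tmp = std::env::temp_dir().join(format!("acquire-{}-{}", std::process::id(), nanos));
    fs::create_dir_all(&tmp)?;

    let result = (|| -> io::Result<PathBuf> {
        let staged = prepare(&tmp)?;
        // Optional integrity check before anything reaches the final location.
        if let Some(check) = verify {
            if !check(staged.as_path())? {
                return Err(io::Error::new(io::ErrorKind::InvalidData, "integrity check failed"));
            }
        }
        if let Some(parent) = target.parent() {
            fs::create_dir_all(parent)?;
        }
        // `rename` can fail across filesystems, so fall back to a copy.
        if fs::rename(&staged, target).is_err() {
            fs::copy(&staged, target)?;
        }
        Ok(target.to_path_buf())
    })();

    // Best-effort cleanup of the staging directory, on success or failure.
    let _ = fs::remove_dir_all(&tmp);
    result
}
```

Staging in a temporary directory means a failed download or a rejected check never leaves a partial file at the final path, which is what lets the existence test at the top act as a cache check.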

Re-exports§

pub use error::DataFormatErrorKind;
pub use error::DatasetError;
pub use utils::acquire_dataset;
pub use utils::create_temp_dir;
pub use utils::download_to;
pub use utils::file_sha256_matches;
pub use utils::unzip;

Modules§

datasets
Built-in dataset implementations.
error
Error handling module.
utils
Utility functions for dataset authors.

Structs§

Dataset
A generic, thread-safe dataset container with lazy loading and in-memory caching.