Crate dataset_core

A generic, thread-safe dataset container with lazy loading and caching.

dataset-core provides Dataset<T>, a lightweight wrapper that pairs a storage directory with a lazily-initialized value of any type T. The actual downloading and parsing logic is supplied by the caller through a loader closure, making Dataset<T> suitable for any data source — local files, remote URLs, databases, or in-memory generation.

On top of this core type, the crate offers an optional feature-gated module:

  • utils — helper functions for downloading files, extracting archives, verifying SHA-256 hashes, and managing temporary directories.

Ready-to-use loaders for classic ML datasets (Iris, Boston Housing, Diabetes, Titanic, Wine Quality) live in the companion crate dataset-ml, which depends on dataset-core with the utils feature enabled and serves as the reference implementation for wrapping Dataset<T>.

§Feature Flags

Feature | What it enables
utils   | download_to, unzip, create_temp_dir, file_sha256_matches, acquire_dataset, and the error module

With no features enabled, only Dataset<T> is available, and it depends on nothing beyond std::sync::OnceLock.

§Quick Start — Dataset<T>

use dataset_core::Dataset;

fn my_loader(dir: &str) -> Result<Vec<String>, std::io::Error> {
    // In a real use case you would read/download files from `dir`.
    Ok(vec!["hello".to_string(), "world".to_string()])
}

let ds: Dataset<Vec<String>> = Dataset::new("./my_data");

// First call runs the loader; subsequent calls return the cached reference.
let data = ds.load(my_loader).unwrap();
assert_eq!(data.len(), 2);

let data_again = ds.load(my_loader).unwrap();
assert!(std::ptr::eq(data, data_again)); // same reference, no reload
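
For readers curious how this load-once behaviour can be built on std::sync::OnceLock alone, here is a minimal sketch. It is not the crate's source: the LazyDataset type, its fields, and the choice to leave the cell empty after a failed load are assumptions made for illustration only.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for `Dataset<T>`; names and fields are illustrative.
pub struct LazyDataset<T> {
    storage_dir: String,
    cell: OnceLock<T>,
}

impl<T> LazyDataset<T> {
    /// Records the storage directory; nothing is loaded yet.
    pub fn new(dir: impl Into<String>) -> Self {
        Self {
            storage_dir: dir.into(),
            cell: OnceLock::new(),
        }
    }

    /// Runs `loader` only if no value is cached, then returns the cached reference.
    pub fn load<E>(&self, loader: impl FnOnce(&str) -> Result<T, E>) -> Result<&T, E> {
        // Fast path: a previous call already stored a value.
        if let Some(value) = self.cell.get() {
            return Ok(value);
        }
        // A loader error is returned without touching the cell,
        // so a later call can try again (an assumption of this sketch).
        let value = loader(&self.storage_dir)?;
        // If another thread stored a value in the meantime, that value wins
        // and ours is dropped; every caller sees the same reference.
        Ok(self.cell.get_or_init(|| value))
    }
}
```

In this sketch two threads racing on the first call may both run their loader, but only one result is kept. Whether Dataset<T> behaves the same way under contention is not stated on this page.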

§Utility Functions (feature utils)

  • download_to — download a remote file into a directory
  • unzip — extract a ZIP archive
  • create_temp_dir — create a self-cleaning temporary directory
  • file_sha256_matches — verify a file’s SHA-256 hash
  • acquire_dataset — cache-aware dataset acquisition workflow (temp dir → prepare → optional hash check → move to final location)
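
The workflow listed for acquire_dataset (temp dir → prepare → optional hash check → move to final location) can be sketched with the standard library only. The function below is hypothetical: its name, parameters, and temp-directory naming are assumptions, and the real acquire_dataset signature may differ. The hash check is represented by a caller-supplied predicate because std has no SHA-256 implementation.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Stages a file in `tmp`, optionally verifies it, then moves it to `final_path`.
fn stage_and_move(
    tmp: &Path,
    final_path: &Path,
    prepare: impl FnOnce(&Path) -> io::Result<PathBuf>,
    verify: Option<&dyn Fn(&Path) -> io::Result<bool>>,
) -> io::Result<()> {
    // `prepare` downloads or generates the file inside the temp directory.
    let staged = prepare(tmp)?;
    // Optional integrity check (for example, a SHA-256 comparison).
    if let Some(check) = verify {
        if !check(&staged)? {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "hash mismatch"));
        }
    }
    // Only a verified file reaches the final location.
    fs::rename(&staged, final_path)
}

/// Hypothetical cache-aware acquisition helper; not the crate's API.
pub fn acquire_sketch(
    final_path: &Path,
    prepare: impl FnOnce(&Path) -> io::Result<PathBuf>,
    verify: Option<&dyn Fn(&Path) -> io::Result<bool>>,
) -> io::Result<PathBuf> {
    // Cache hit: the file is already in place, so skip all work.
    if final_path.exists() {
        return Ok(final_path.to_path_buf());
    }
    let parent = final_path
        .parent()
        .filter(|p| !p.as_os_str().is_empty())
        .unwrap_or(Path::new("."));
    // The temp dir sits beside the target so the rename stays on one filesystem.
    // A process-id suffix is a simplification; real code would want a unique name.
    let tmp = parent.join(format!(".tmp-{}", std::process::id()));
    fs::create_dir_all(&tmp)?;
    let outcome = stage_and_move(&tmp, final_path, prepare, verify);
    // Clean up the temp dir whether or not staging succeeded.
    let _ = fs::remove_dir_all(&tmp);
    outcome.map(|()| final_path.to_path_buf())
}
```

Staging in a temporary directory means a failed download or a rejected hash never leaves a partial file at the final path, so the existence check at the top can serve as the cache test.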

Re-exports§

pub use error::DataFormatErrorKind;
pub use error::DatasetError;
pub use utils::acquire_dataset;
pub use utils::create_temp_dir;
pub use utils::download_to;
pub use utils::file_sha256_matches;
pub use utils::unzip;

Modules§

error
Error handling module.
utils
Utility functions for dataset authors.

Structs§

Dataset
A generic, thread-safe dataset container with lazy loading and in-memory caching.