dataset-core
A generic, thread-safe dataset container with lazy loading and caching for Rust.
Overview
dataset-core provides Dataset<T>, a lightweight wrapper that pairs a storage directory with a lazily-initialized value of any type T. The actual loading logic is supplied by the caller through a closure, so Dataset<T> works with any data source — local files, remote URLs, databases, or in-memory generation.
The first call to load() executes the closure and caches the result in a OnceLock; every subsequent call returns a reference to the cached value without re-running the closure, even across threads.
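The run-once behaviour described above can be illustrated with the standard library alone. This is a minimal sketch and not the crate's implementation: `expensive_load`, `cached`, `CACHE`, and `LOADER_RUNS` are hypothetical names, and the loader here is infallible so that `OnceLock::get_or_init` can be used directly.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Counts how many times the (hypothetical) loader body actually runs.
static LOADER_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Process-wide cache; `OnceLock` guarantees a single initialization.
static CACHE: OnceLock<Vec<u32>> = OnceLock::new();

/// Stand-in for an expensive load (file read, download, parse, ...).
fn expensive_load() -> Vec<u32> {
    LOADER_RUNS.fetch_add(1, Ordering::SeqCst);
    (1..=5).collect()
}

/// Returns the cached data, running `expensive_load` only on first use.
fn cached() -> &'static Vec<u32> {
    CACHE.get_or_init(expensive_load)
}

fn main() {
    // Eight threads race to read the data; only one executes the loader.
    thread::scope(|s| {
        for _ in 0..8 {
            s.spawn(|| assert_eq!(cached().len(), 5));
        }
    });
    println!("loader ran {} time(s)", LOADER_RUNS.load(Ordering::SeqCst));
}
```

Threads that lose the race block until initialization finishes and then read the same value, which is why later calls only pay for a cheap "already initialized" check.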
On top of this core type, two optional feature-gated modules are available:
- utils: helpers for downloading files, extracting archives, verifying SHA-256 hashes, and managing temporary directories.
- datasets: ready-to-use loaders for classic ML datasets that also serve as reference implementations showing how to wrap Dataset<T>.
Installation
Core only (zero dependencies):
[dependencies]
dataset-core = "*"
With utilities:
[dependencies]
dataset-core = { version = "*", features = ["utils"] }
With built-in datasets (implies utils):
[dependencies]
dataset-core = { version = "*", features = ["datasets"] }
Feature Flags
| Feature | What it enables | Extra dependencies |
|---|---|---|
| (none) | Dataset<T> only | none |
| utils | Download, unzip, temp dirs, SHA-256 validation, error types | ureq, zip, tempfile, sha2 |
| datasets | All built-in dataset loaders (implies utils) | ndarray, csv (+ everything in utils) |
Core Usage
use Dataset;
Dataset<T> API
| Method | Returns | Description |
|---|---|---|
| new(dir) | Dataset<T> | Create an instance (no I/O) |
| load(loader) | Result<&T, E> | Run loader on first call, return cached &T thereafter |
| is_loaded() | bool | Whether data has been loaded |
| storage_dir() | &str | The storage directory path |
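The method table can be read against a small stand-in type. The sketch below is not the crate's source: `LazyDataset` is a hypothetical name, and the loader signature (a closure that receives the storage directory and returns `Result<T, E>`) is an assumption. Only the four method names and their return types come from the table.

```rust
use std::sync::OnceLock;

/// Illustrative stand-in mirroring the methods in the table above.
/// NOT the crate's implementation; the loader signature is an assumption.
pub struct LazyDataset<T> {
    storage_dir: String,
    cell: OnceLock<T>,
}

impl<T> LazyDataset<T> {
    /// Create an instance; performs no I/O.
    pub fn new(storage_dir: &str) -> Self {
        Self { storage_dir: storage_dir.to_string(), cell: OnceLock::new() }
    }

    /// Run `loader` until it succeeds once; return the cached value afterwards.
    pub fn load<E, F>(&self, loader: F) -> Result<&T, E>
    where
        F: FnOnce(&str) -> Result<T, E>,
    {
        if let Some(value) = self.cell.get() {
            return Ok(value);
        }
        let value = loader(&self.storage_dir)?;
        // If another thread initialized the cell meanwhile, its value wins.
        Ok(self.cell.get_or_init(|| value))
    }

    /// Whether a value has been cached.
    pub fn is_loaded(&self) -> bool {
        self.cell.get().is_some()
    }

    /// The storage directory path.
    pub fn storage_dir(&self) -> &str {
        &self.storage_dir
    }
}

fn main() {
    let ds: LazyDataset<Vec<f64>> = LazyDataset::new("./data");
    assert!(!ds.is_loaded());

    // A failed load caches nothing, so a later call can retry.
    let failed = ds.load(|_dir| Err::<Vec<f64>, String>("network down".into()));
    assert!(failed.is_err());

    let values = ds.load(|_dir| Ok::<_, String>(vec![1.0, 2.0])).unwrap();
    println!("{} values cached for {}", values.len(), ds.storage_dir());
}
```

In this sketch a loader error is returned to the caller and leaves the cache empty; whether the real crate behaves the same way on failure is not stated in the table.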
Built-in Datasets (feature datasets)
| Dataset | Samples | Features | Task Type | Source |
|---|---|---|---|---|
| Iris | 150 | 4 | Classification | UCI ML Repository |
| Boston Housing | 506 | 13 | Regression | UCI ML Repository |
| Diabetes | 768 | 8 | Classification | Kaggle |
| Titanic | 891 | 11 | Classification | Kaggle |
| Wine Quality (Red) | 1,599 | 11 | Regression | UCI ML Repository |
| Wine Quality (White) | 4,898 | 11 | Regression | UCI ML Repository |
use Iris;
Each built-in dataset struct follows the same pattern:
- new(storage_dir): create instance (no I/O)
- features(): reference to feature matrix
- labels() / targets(): reference to label/target vector
- data(): all references at once
Note: Titanic's features() returns (&Array2<String>, &Array2<f64>) (string + numeric features), and data() returns a triple.
Utility Functions (feature utils)
| Function | Purpose |
|---|---|
| download_to | Download a remote file into a directory |
| unzip | Extract a ZIP archive |
| create_temp_dir | Create a self-cleaning temporary directory |
| file_sha256_matches | Verify a file's SHA-256 hash |
| acquire_dataset | Cache-aware acquisition: reuse valid local file, prepare in temp dir, hash check, move |
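The four steps listed for acquire_dataset (reuse, prepare, check, move) can be sketched with the standard library. This is an illustration of the flow only, not the crate's function: `acquire_file` and its parameters are hypothetical, the `is_valid` closure stands in for the SHA-256 comparison, `prepare` stands in for download and extraction, and a `.staging` subdirectory replaces the temporary directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of a cache-aware acquisition flow.
/// `is_valid` stands in for the hash check; `prepare` must write the file
/// at the path it is given.
fn acquire_file(
    dir: &Path,
    file_name: &str,
    is_valid: impl Fn(&Path) -> bool,
    prepare: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<PathBuf> {
    let target = dir.join(file_name);

    // 1. Reuse a local copy that passes validation.
    if target.is_file() && is_valid(&target) {
        return Ok(target);
    }

    // 2. Prepare in a staging directory next to the target, so the final
    //    rename stays on one filesystem.
    let staging = dir.join(".staging");
    fs::create_dir_all(&staging)?;
    let staged = staging.join(file_name);
    let outcome = prepare(&staged).and_then(|()| {
        // 3. Validate before anything reaches the final location.
        if is_valid(&staged) {
            // 4. Move the verified file into place.
            fs::rename(&staged, &target)
        } else {
            Err(io::Error::new(io::ErrorKind::InvalidData, "validation failed"))
        }
    });

    // Clean up the staging directory whether or not the steps succeeded.
    let _ = fs::remove_dir_all(&staging);
    outcome.map(|()| target)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("acquire-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let valid = |p: &Path| fs::read_to_string(p).map(|s| s == "a,b\n1,2\n").unwrap_or(false);

    // First call prepares the file; second call reuses it.
    let path = acquire_file(&dir, "demo.csv", valid, |p| fs::write(p, "a,b\n1,2\n"))?;
    let again = acquire_file(&dir, "demo.csv", valid, |_| panic!("should not re-prepare"))?;
    assert_eq!(path, again);
    fs::remove_dir_all(&dir)
}
```

Validating the staged copy before the rename means a corrupt or partial file never appears at the final path, so a later run cannot mistake it for a usable cache entry.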
Building Your Own Dataset
The built-in datasets in the datasets module demonstrate the recommended pattern for wrapping Dataset<T>. Here is a simplified outline:
use Dataset;
See src/datasets/iris.rs and others for complete, real-world examples including downloading, CSV parsing, SHA-256 validation, and ndarray integration.
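For readers without the source tree at hand, the general shape of such a wrapper might look like the sketch below. Everything in it is hypothetical: `Toy`, `Parsed`, `RAW`, and `parse` are invented for illustration, a plain `OnceLock` stands in for the `Dataset<T>` field a real wrapper would hold, an embedded string replaces the downloaded CSV, and `Vec` replaces ndarray so the example needs no dependencies.

```rust
use std::sync::OnceLock;

/// Parsed form of the toy dataset: feature rows plus one label per row.
type Parsed = (Vec<[f64; 2]>, Vec<String>);

/// In-memory CSV used in place of a downloaded file.
const RAW: &str = "1.0,2.0,a\n3.0,4.0,b\n";

/// Hypothetical wrapper; the `OnceLock` stands in for a `Dataset<Parsed>` field.
pub struct Toy {
    storage_dir: String,
    cache: OnceLock<Parsed>,
}

/// Parse `x,y,label` rows, reporting malformed input as an error string.
fn parse(raw: &str) -> Result<Parsed, String> {
    let mut features = Vec::new();
    let mut labels = Vec::new();
    for line in raw.lines() {
        let cols: Vec<&str> = line.split(',').collect();
        if cols.len() != 3 {
            return Err(format!("expected 3 columns, got {}", cols.len()));
        }
        let x = cols[0].parse::<f64>().map_err(|e| e.to_string())?;
        let y = cols[1].parse::<f64>().map_err(|e| e.to_string())?;
        features.push([x, y]);
        labels.push(cols[2].to_string());
    }
    Ok((features, labels))
}

impl Toy {
    /// Create an instance; performs no I/O or parsing.
    pub fn new(storage_dir: &str) -> Self {
        Self { storage_dir: storage_dir.to_string(), cache: OnceLock::new() }
    }

    pub fn storage_dir(&self) -> &str {
        &self.storage_dir
    }

    /// Parse on first access, then hand out references to the cached tuple.
    fn load(&self) -> Result<&Parsed, String> {
        if let Some(parsed) = self.cache.get() {
            return Ok(parsed);
        }
        let parsed = parse(RAW)?;
        Ok(self.cache.get_or_init(|| parsed))
    }

    pub fn features(&self) -> Result<&Vec<[f64; 2]>, String> {
        Ok(&self.load()?.0)
    }

    pub fn labels(&self) -> Result<&Vec<String>, String> {
        Ok(&self.load()?.1)
    }

    pub fn data(&self) -> Result<(&Vec<[f64; 2]>, &Vec<String>), String> {
        let (features, labels) = self.load()?;
        Ok((features, labels))
    }
}

fn main() {
    let toy = Toy::new("./toy-data");
    let (features, labels) = toy.data().expect("embedded CSV parses");
    println!("{} rows from {}", features.len(), toy.storage_dir());
    println!("first row {:?} has label {}", toy.features().unwrap()[0], labels[0]);
    println!("{} labels", toy.labels().unwrap().len());
}
```

Keeping `load` private and exposing only borrowed accessors means callers never trigger parsing explicitly and never receive a copy they did not ask for.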
Performance Considerations
- First access: downloads the file (if not on disk), validates SHA-256, parses, and caches in memory.
- Subsequent accesses: return a reference to the cached data — zero allocation, zero I/O.
- .to_owned(): clones cached data into a new owned value; use only when mutation is needed.
- Offline: once downloaded, datasets are stored on disk; no network required on subsequent runs.
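The difference between borrowing the cache and cloning it can be shown with a small standard-library sketch. The names `FEATURES`, `features`, `mean`, and `scaled` are hypothetical, and a `Vec<f64>` stands in for the ndarray types the built-in datasets return.

```rust
use std::sync::OnceLock;

static FEATURES: OnceLock<Vec<f64>> = OnceLock::new();

/// Stand-in for a cached accessor such as `features()`.
fn features() -> &'static Vec<f64> {
    FEATURES.get_or_init(|| vec![5.1, 3.5, 1.4, 0.2])
}

/// Read-only work borrows the cache: no copy is made.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}

/// Mutation needs an owned copy; the cached data stays untouched.
fn scaled(factor: f64) -> Vec<f64> {
    let mut owned = features().to_owned();
    for v in &mut owned {
        *v *= factor;
    }
    owned
}

fn main() {
    println!("mean of cached values: {}", mean(features()));
    println!("scaled copy: {:?}", scaled(2.0));
    println!("cache unchanged: {:?}", features());
}
```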
License
This project is licensed under the MIT License — see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Datasets Attribution
The built-in datasets are classic machine learning datasets widely used for educational and research purposes:
- Iris: Fisher's Iris dataset (1936)
- Boston Housing: Harrison & Rubinfeld (1978)
- Diabetes: Pima Indians Diabetes Database
- Titanic: Kaggle Titanic dataset
- Wine Quality: UCI Machine Learning Repository
Author
SomeB1oody — stanyin64@gmail.com