dataset-core
A generic, thread-safe dataset container with lazy loading and caching for Rust.
Overview
dataset-core provides Dataset<T>, a lightweight wrapper that pairs a storage directory with a lazily-initialized value of any type T. The actual loading logic is supplied by the caller through a closure, so Dataset<T> works with any data source — local files, remote URLs, databases, or in-memory generation.
The first call to load() executes the closure and caches the result via OnceLock; every subsequent call returns a reference to the cached value without re-running the loader, even across threads.
On top of this core type, an optional feature-gated module is available:
- utils: helpers for downloading files, extracting archives, verifying SHA-256 hashes, and managing temporary directories.
Looking for ready-to-use loaders for classic ML datasets (Iris, Boston Housing, Diabetes, Titanic, Wine Quality)? They live in the companion crate dataset-ml, which depends on dataset-core with the utils feature enabled.
Installation
Core only (zero dependencies):
[dependencies]
dataset-core = "0.2"
With utilities:
[dependencies]
dataset-core = { version = "0.2", features = ["utils"] }
Feature Flags
| Feature | What it enables | Extra dependencies |
|---|---|---|
| (none) | Dataset<T> only | none |
| utils | Download, unzip, temp dirs, SHA-256 validation, error types | ureq, zip, tempfile, sha2, thiserror |
Core Usage
use dataset_core::Dataset;
Dataset<T> API
| Method | Returns | Description |
|---|---|---|
| new(dir) | Dataset<T> | Create an instance (no I/O) |
| load(loader) | Result<&T, E> | Run loader on first call, return cached &T thereafter |
| is_loaded() | bool | Whether data has been loaded |
| storage_dir() | &str | The storage directory path |
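To make the semantics in the table concrete, here is a minimal sketch of how such a container could be built on std::sync::OnceLock. It is not the crate's source: the type name LazyDataset, the init_lock field, and the loader signature (a closure receiving the storage directory as &str) are assumptions made for illustration.

```rust
use std::sync::{Mutex, OnceLock};

/// Hypothetical stand-in for `Dataset<T>`; the real type may differ internally.
pub struct LazyDataset<T> {
    storage_dir: String,
    data: OnceLock<T>,
    /// Serializes first-time loading so only one loader runs at a time.
    init_lock: Mutex<()>,
}

impl<T> LazyDataset<T> {
    /// Records the directory only; performs no I/O.
    pub fn new(storage_dir: &str) -> Self {
        Self {
            storage_dir: storage_dir.to_string(),
            data: OnceLock::new(),
            init_lock: Mutex::new(()),
        }
    }

    /// Runs `loader` until it succeeds once, then always returns the cached value.
    /// Assumption: the loader receives the storage directory as `&str`.
    pub fn load<E>(&self, loader: impl FnOnce(&str) -> Result<T, E>) -> Result<&T, E> {
        // Fast path: already initialized, no locking needed.
        if let Some(value) = self.data.get() {
            return Ok(value);
        }
        // Slow path: take the lock, tolerating poisoning from a panicked loader.
        let _guard = self.init_lock.lock().unwrap_or_else(|e| e.into_inner());
        // Another thread may have finished loading while we waited for the lock.
        if let Some(value) = self.data.get() {
            return Ok(value);
        }
        // Errors are returned to the caller and are not cached.
        let value = loader(&self.storage_dir)?;
        Ok(self.data.get_or_init(|| value))
    }

    /// True once a loader has succeeded.
    pub fn is_loaded(&self) -> bool {
        self.data.get().is_some()
    }

    /// The directory passed to `new`.
    pub fn storage_dir(&self) -> &str {
        &self.storage_dir
    }
}
```

In this sketch a failed load leaves the container empty, so a later call can retry with a different loader. Whether the real Dataset<T> behaves the same way on errors is not stated above and should be checked against the crate documentation.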
Utility Functions (feature utils)
| Function | Purpose |
|---|---|
| download_to | Download a remote file into a directory |
| unzip | Extract a ZIP archive |
| create_temp_dir | Create a self-cleaning temporary directory |
| file_sha256_matches | Verify a file's SHA-256 hash |
| acquire_dataset | Cache-aware acquisition: reuse valid local file, prepare in temp dir, hash check, move |
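The cache-aware flow listed for acquire_dataset (reuse, prepare, check, move) can be sketched with the standard library alone. This is an illustration of the workflow, not the crate's function: the name acquire_cached and its parameters are hypothetical, is_valid stands in for the SHA-256 comparison, and prepare stands in for downloading or extracting into a scratch directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical cache-aware acquisition; NOT the signature of `acquire_dataset`.
/// `is_valid` stands in for a hash check, `prepare` for download/extraction.
fn acquire_cached(
    dir: &Path,
    file_name: &str,
    is_valid: impl Fn(&Path) -> bool,
    prepare: impl FnOnce(&Path) -> io::Result<PathBuf>,
) -> io::Result<PathBuf> {
    let target = dir.join(file_name);

    // 1. Reuse the local file if it exists and passes validation.
    if target.is_file() && is_valid(&target) {
        return Ok(target);
    }

    // 2. Prepare the file in a scratch directory so a failed attempt
    //    never leaves a partial file in `dir`.
    let scratch = std::env::temp_dir()
        .join(format!("acquire-{}-{}", std::process::id(), file_name));
    fs::create_dir_all(&scratch)?;

    let outcome = prepare(&scratch).and_then(|prepared| {
        // 3. Validate the freshly prepared file before installing it.
        if !is_valid(&prepared) {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "prepared file failed validation",
            ));
        }
        // 4. Move into place. Copying is used instead of renaming because
        //    a rename can fail when the two paths are on different filesystems.
        fs::create_dir_all(dir)?;
        fs::copy(&prepared, &target)?;
        Ok(())
    });

    // The scratch directory is removed whether or not the attempt succeeded.
    let _ = fs::remove_dir_all(&scratch);
    outcome.map(|()| target)
}
```

Validating before the move means the storage directory only ever holds files that passed the check, which is what lets the first step trust a cached copy after re-validating it.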
Building Your Own Dataset
Dataset<T> is designed to be wrapped. The companion crate dataset-ml demonstrates the recommended pattern; here is a simplified outline:
use dataset_core::Dataset;
See the dataset-ml source for complete, real-world examples including downloading, CSV parsing, SHA-256 validation, and ndarray integration.
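As a rough illustration of the wrapping pattern, the sketch below hides a lazily loaded container behind a struct with domain-specific accessors. Everything in it is hypothetical: TinyIris, Rows, and parse_rows are invented for the example, the data is an inline string rather than a downloaded file, and the small Dataset<T> defined at the top is only a stand-in so the snippet compiles without the crate (its loader signature is an assumption).

```rust
use std::sync::OnceLock;

/// Minimal stand-in for the crate's `Dataset<T>` so this sketch is self-contained.
struct Dataset<T> {
    storage_dir: String,
    data: OnceLock<T>,
}

impl<T> Dataset<T> {
    fn new(dir: &str) -> Self {
        Self { storage_dir: dir.to_string(), data: OnceLock::new() }
    }

    /// Assumed loader shape: a closure receiving the storage directory.
    fn load<E>(&self, loader: impl FnOnce(&str) -> Result<T, E>) -> Result<&T, E> {
        if let Some(value) = self.data.get() {
            return Ok(value);
        }
        let value = loader(&self.storage_dir)?;
        Ok(self.data.get_or_init(|| value))
    }
}

/// Hypothetical parsed records: (sepal length, species label).
type Rows = Vec<(f64, String)>;

/// Parses "length,label" lines; a real loader would read a file from disk.
fn parse_rows(text: &str) -> Result<Rows, String> {
    text.lines()
        .map(|line| -> Result<(f64, String), String> {
            let (len, label) = line
                .split_once(',')
                .ok_or_else(|| format!("bad line: {line}"))?;
            let len: f64 = len.trim().parse().map_err(|e| format!("bad number: {e}"))?;
            Ok((len, label.trim().to_string()))
        })
        .collect()
}

/// Hypothetical wrapper: callers never see the loader or the cache.
pub struct TinyIris {
    inner: Dataset<Rows>,
}

impl TinyIris {
    pub fn new(dir: &str) -> Self {
        Self { inner: Dataset::new(dir) }
    }

    /// Every accessor goes through `load`, so parsing happens on first use only.
    fn rows(&self) -> Result<&Rows, String> {
        self.inner.load(|_dir| parse_rows("5.1,setosa\n7.0,versicolor"))
    }

    pub fn labels(&self) -> Result<Vec<&str>, String> {
        Ok(self.rows()?.iter().map(|(_, label)| label.as_str()).collect())
    }

    pub fn mean_length(&self) -> Result<f64, String> {
        let rows = self.rows()?;
        Ok(rows.iter().map(|(len, _)| len).sum::<f64>() / rows.len() as f64)
    }
}
```

Keeping the loader private to the wrapper gives callers a plain, typed API (labels, mean_length) while the cached value stays borrowed from the wrapper, so no accessor copies the underlying rows.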
Performance Considerations
- First access: runs the loader once (potentially network + parse), caches the result.
- Subsequent accesses: return a reference to the cached data — zero allocation, zero I/O.
- Cross-thread safety: Dataset<T> is Send + Sync whenever T is; the internal OnceLock guarantees the loader runs at most once even under concurrent calls.
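The run-once guarantee mentioned in the last bullet comes from OnceLock itself and can be observed directly with the standard library. The sketch below does not use Dataset<T>; the function name count_initializer_runs is made up for the example, and the small vector stands in for an expensive load.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Spawns `threads` workers that all race to initialize one `OnceLock`
/// and returns how many times the initializer actually ran.
fn count_initializer_runs(threads: usize) -> usize {
    let cell: OnceLock<Vec<u32>> = OnceLock::new();
    let runs = AtomicUsize::new(0);

    // Scoped threads may borrow `cell` and `runs` from this stack frame.
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                // Only one caller executes the closure; the others block
                // until the value is ready and then receive the same reference.
                let data = cell.get_or_init(|| {
                    runs.fetch_add(1, Ordering::SeqCst);
                    vec![1, 2, 3] // stands in for an expensive load
                });
                assert_eq!(data.len(), 3);
            });
        }
    });

    runs.load(Ordering::SeqCst)
}
```

Calling count_initializer_runs(8) returns 1: eight threads observe the data, but the initializer body executes a single time.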
License
This project is licensed under the MIT License — see LICENSE for details.
Author
SomeB1oody — stanyin64@gmail.com