简体中文 | English
dataset-ml
Ready-to-use loaders for classic machine learning datasets, built on dataset-core.
Overview
dataset-ml ships with loaders for six classic ML datasets. Each loader:
- Downloads the source file on first access (with
ureq). - Verifies a pinned SHA-256 hash to detect corruption or upstream changes.
- Parses the CSV into
ndarrayArray1/Array2. - Caches the parsed result in memory via
dataset_core::Dataset<T>— subsequent accesses return a&reference with zero I/O.
Each module is also a complete reference implementation of the pattern for wrapping Dataset<T> for a concrete data source.
Installation
[]
= "0.1"
Datasets
| Struct | Module path | Samples | Features | Task Type | Source |
|---|---|---|---|---|---|
Iris |
dataset_ml::iris |
150 | 4 | Classification | UCI ML Repository |
BostonHousing |
dataset_ml::boston_housing |
506 | 13 | Regression | UCI ML Repository |
Diabetes |
dataset_ml::diabetes |
768 | 8 | Classification | Kaggle |
Titanic |
dataset_ml::titanic |
891 | 11 | Classification | Kaggle |
RedWineQuality |
dataset_ml::wine_quality::red_wine_quality |
1,599 | 11 | Regression | UCI ML Repository |
WhiteWineQuality |
dataset_ml::wine_quality::white_wine_quality |
4,898 | 11 | Regression | UCI ML Repository |
All structs are also re-exported at the crate root, so dataset_ml::Iris, dataset_ml::RedWineQuality, etc. work too.
Usage
use Iris;
Each dataset struct follows the same pattern:
new(storage_dir)— create instance (no I/O)features()— reference to feature matrixlabels()/targets()— reference to label/target vectordata()— all references at once
Note: Titanic's
features()returns(&Array2<String>, &Array2<f64>)(string + numeric features), anddata()returns a triple.
Migration from dataset-core 0.1.x
If you used the datasets feature of dataset-core 0.1.x, switch to this crate:
- dataset-core = { version = "0.1", features = ["datasets"] }
+ dataset-ml = "0.1"
| Old path | New path |
|---|---|
dataset_core::datasets::iris::Iris |
dataset_ml::iris::Iris |
dataset_core::datasets::boston_housing::BostonHousing |
dataset_ml::boston_housing::BostonHousing |
dataset_core::datasets::diabetes::Diabetes |
dataset_ml::diabetes::Diabetes |
dataset_core::datasets::titanic::Titanic |
dataset_ml::titanic::Titanic |
dataset_core::datasets::wine_quality::red_wine_quality::RedWineQuality |
dataset_ml::wine_quality::red_wine_quality::RedWineQuality |
dataset_core::datasets::wine_quality::white_wine_quality::WhiteWineQuality |
dataset_ml::wine_quality::white_wine_quality::WhiteWineQuality |
dataset_core::utils::* and dataset_core::DatasetError are unchanged — they remain in dataset-core under the utils feature.
Performance Considerations
- First access: downloads the file (if not on disk), validates SHA-256, parses, caches in memory.
- Subsequent accesses: return a reference to the cached data — zero allocation, zero I/O.
.to_owned(): clones cached data into a new owned value — use only when mutation is needed.- Offline: once downloaded, datasets stay on disk; no network needed on subsequent runs.
License
This project is licensed under the MIT License — see LICENSE for details.
Datasets Attribution
The bundled datasets are classic machine learning datasets widely used for educational and research purposes:
- Iris: Fisher's Iris dataset (1936)
- Boston Housing: Harrison & Rubinfeld (1978)
- Diabetes: Pima Indians Diabetes Database
- Titanic: Kaggle Titanic dataset
- Wine Quality: UCI Machine Learning Repository
Author
SomeB1oody — stanyin64@gmail.com