dataset-ml 0.1.0

Built-in machine learning dataset loaders (Iris, Boston Housing, Diabetes, Titanic, Wine Quality) built on top of dataset-core
Documentation

简体中文 | English

dataset-ml

Ready-to-use loaders for classic machine learning datasets, built on dataset-core.

Rust Version License: MIT crates.io

Overview

dataset-ml ships with loaders for six classic ML datasets. Each loader:

  • Downloads the source file on first access (with ureq).
  • Verifies a pinned SHA-256 hash to detect corruption or upstream changes.
  • Parses the CSV into ndarray Array1 / Array2.
  • Caches the parsed result in memory via dataset_core::Dataset<T> — subsequent accesses return a & reference with zero I/O.

Each module is also a complete reference implementation of the pattern for wrapping Dataset<T> for a concrete data source.

Installation

[dependencies]
dataset-ml = "0.1"

Datasets

Struct Module path Samples Features Task Type Source
Iris dataset_ml::iris 150 4 Classification UCI ML Repository
BostonHousing dataset_ml::boston_housing 506 13 Regression UCI ML Repository
Diabetes dataset_ml::diabetes 768 8 Classification Kaggle
Titanic dataset_ml::titanic 891 11 Classification Kaggle
RedWineQuality dataset_ml::wine_quality::red_wine_quality 1,599 11 Regression UCI ML Repository
WhiteWineQuality dataset_ml::wine_quality::white_wine_quality 4,898 11 Regression UCI ML Repository

All structs are also re-exported at the crate root, so dataset_ml::Iris, dataset_ml::RedWineQuality, etc. work too.

Usage

use dataset_ml::iris::Iris;

fn main() {
    let iris = Iris::new("./data");

    // Lazy: downloads and parses on first access, then cached.
    let features = iris.features().unwrap();  // &Array2<f64>
    let labels   = iris.labels().unwrap();    // &Array1<String>

    // Or get both at once:
    let (features, labels) = iris.data().unwrap();

    assert_eq!(features.shape(), &[150, 4]);
    assert_eq!(labels.len(), 150);

    // Call .to_owned() when you need a mutable copy.
    let mut owned = features.to_owned();
    owned[[0, 0]] = 5.5;
}

Each dataset struct follows the same pattern:

  • new(storage_dir) — create instance (no I/O)
  • features() — reference to feature matrix
  • labels() / targets() — reference to label/target vector
  • data() — all references at once

Note: Titanic's features() returns (&Array2<String>, &Array2<f64>) (string + numeric features), and data() returns a triple.

Migration from dataset-core 0.1.x

If you used the datasets feature of dataset-core 0.1.x, switch to this crate:

- dataset-core = { version = "0.1", features = ["datasets"] }
+ dataset-ml = "0.1"
Old path New path
dataset_core::datasets::iris::Iris dataset_ml::iris::Iris
dataset_core::datasets::boston_housing::BostonHousing dataset_ml::boston_housing::BostonHousing
dataset_core::datasets::diabetes::Diabetes dataset_ml::diabetes::Diabetes
dataset_core::datasets::titanic::Titanic dataset_ml::titanic::Titanic
dataset_core::datasets::wine_quality::red_wine_quality::RedWineQuality dataset_ml::wine_quality::red_wine_quality::RedWineQuality
dataset_core::datasets::wine_quality::white_wine_quality::WhiteWineQuality dataset_ml::wine_quality::white_wine_quality::WhiteWineQuality

dataset_core::utils::* and dataset_core::DatasetError are unchanged — they remain in dataset-core under the utils feature.

Performance Considerations

  • First access: downloads the file (if not on disk), validates SHA-256, parses, caches in memory.
  • Subsequent accesses: return a reference to the cached data — zero allocation, zero I/O.
  • .to_owned(): clones cached data into a new owned value — use only when mutation is needed.
  • Offline: once downloaded, datasets stay on disk; no network needed on subsequent runs.

License

This project is licensed under the MIT License — see LICENSE for details.

Datasets Attribution

The bundled datasets are classic machine learning datasets widely used for educational and research purposes:

  • Iris: Fisher's Iris dataset (1936)
  • Boston Housing: Harrison & Rubinfeld (1978)
  • Diabetes: Pima Indians Diabetes Database
  • Titanic: Kaggle Titanic dataset
  • Wine Quality: UCI Machine Learning Repository

Author

SomeB1oodystanyin64@gmail.com