sklears-datasets 0.1.0-alpha.2

Dataset utilities and generation for sklears

sklears-datasets

Latest release: 0.1.0-alpha.2 (December 22, 2025). See the workspace release notes for highlights and upgrade guidance.

Overview

sklears-datasets centralizes the dataset loaders, synthetic generators, and data utilities used throughout the sklears ecosystem. It mirrors scikit-learn’s sklearn.datasets module while adding Rust-first performance and IO enhancements.

Key Features

  • Classic Loaders: Diabetes, Iris, Digits, Wine, Breast Cancer, 20 Newsgroups, and more.
  • Synthetic Generators: make_blobs, make_moons, make_circles, Gaussian quantiles, regression surfaces, and streaming generators.
  • File IO: CSV, Parquet, Arrow IPC, and memory-mapped dataset support with Polars integration.
  • Benchmark Utilities: Deterministic dataset splits and sampling strategies for reproducible experiments.
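To make "deterministic dataset splits" concrete, here is a minimal standard-library sketch of a seeded train/test index split. It is not the crate's API: the SplitMix64 struct and the train_test_indices helper are hypothetical names introduced for illustration, and the crate may use a different generator or splitting strategy.

```rust
/// Small deterministic PRNG (SplitMix64), used here as a stand-in for
/// whatever seeded generator a real implementation would use.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: shuffle the indices 0..n with a fixed seed, then
/// cut off the last `test_fraction` of them as the test set.
fn train_test_indices(n: usize, test_fraction: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    let mut rng = SplitMix64(seed);
    let mut idx: Vec<usize> = (0..n).collect();

    // Fisher-Yates shuffle. The modulo introduces a tiny bias, which is
    // acceptable for a sketch but worth removing in production code.
    for i in (1..n).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }

    let n_test = (((n as f64) * test_fraction).round() as usize).min(n);
    let test = idx.split_off(n - n_test);
    (idx, test)
}

fn main() {
    let (train, test) = train_test_indices(150, 0.2, 42);
    let (train_again, test_again) = train_test_indices(150, 0.2, 42);

    // The same seed always yields the same partition.
    assert_eq!(train, train_again);
    assert_eq!(test, test_again);
    println!("{} train, {} test", train.len(), test.len());
}
```

Splitting indices rather than rows keeps the helper independent of the array type, so the same partition can be applied to features and targets alike.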

Quick Start

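use sklears_datasets::{load_iris, make_blobs};

// `?` needs an enclosing function that returns a Result; this assumes the
// crate's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Built-in dataset
    let iris = load_iris()?;
    println!("{} samples, {} features", iris.data.nrows(), iris.data.ncols());

    // Synthetic data: 1000 samples, 10 features, 4 clusters, fixed seed
    let blobs = make_blobs(1000)
        .n_features(10)
        .centers(4)
        .cluster_std(2.5)
        .random_state(Some(42))
        .build()?;

    Ok(())
}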

Status

  • All loaders/generators validated through the 11,292 passing workspace tests for 0.1.0-alpha.2.
  • Supports lazy loading and streaming for large-scale workflows.
  • Future work (federated dataset shards, synthetic time series) tracked in this crate’s TODO.md.
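The streaming idea mentioned above can be sketched with the standard library alone: read rows from any buffered source and hand them out in fixed-size batches, so the full file never has to sit in memory. The CsvBatches type below is a hypothetical illustration, not the crate's loader; it assumes headerless, purely numeric CSV input.

```rust
use std::error::Error;
use std::io::BufRead;

/// Hypothetical iterator that parses numeric CSV rows lazily,
/// yielding at most `batch_size` rows per step.
struct CsvBatches<R: BufRead> {
    lines: std::io::Lines<R>,
    batch_size: usize,
}

impl<R: BufRead> CsvBatches<R> {
    fn new(reader: R, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { lines: reader.lines(), batch_size }
    }
}

impl<R: BufRead> Iterator for CsvBatches<R> {
    type Item = Result<Vec<Vec<f64>>, Box<dyn Error>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut batch = Vec::with_capacity(self.batch_size);
        while batch.len() < self.batch_size {
            let line = match self.lines.next() {
                Some(Ok(line)) => line,
                Some(Err(e)) => return Some(Err(e.into())),
                None => break, // end of input
            };
            if line.trim().is_empty() {
                continue; // skip blank lines
            }
            // Parse every comma-separated field as f64; fail on the first bad one.
            let row: Result<Vec<f64>, _> =
                line.split(',').map(|f| f.trim().parse::<f64>()).collect();
            match row {
                Ok(values) => batch.push(values),
                Err(e) => return Some(Err(e.into())),
            }
        }
        if batch.is_empty() { None } else { Some(Ok(batch)) }
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    // An in-memory byte slice stands in for a file; any BufRead works.
    let csv = "5.1,3.5,1.4\n4.9,3.0,1.4\n4.7,3.2,1.3\n";
    for batch in CsvBatches::new(csv.as_bytes(), 2) {
        let batch = batch?;
        println!("batch of {} rows", batch.len());
    }
    Ok(())
}
```

To stream from disk, wrap a std::fs::File in std::io::BufReader and pass it to CsvBatches::new; only one batch is held in memory at a time.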