alimentar 0.2.5

Data Loading, Distribution and Tooling in Pure Rust

Features

  • Zero-copy Arrow: All data flows through Arrow RecordBatches for maximum performance
  • Multiple backends: Local filesystem, S3-compatible storage, HTTP/HTTPS sources
  • HuggingFace Hub: Import datasets directly from HuggingFace
  • Streaming: Memory-efficient lazy loading for large datasets
  • Transforms: Filter, sort, sample, shuffle, normalize, and more
  • Registry: Local dataset registry with versioning and metadata
  • Data Quality: Null detection, duplicate checking, outlier analysis
  • Drift Detection: KS test, Chi-square, and PSI for distribution monitoring (PSI is sketched after this list)
  • Federated Learning: Local, proportional, and stratified dataset splits
  • Built-in Datasets: MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, Iris
  • WASM support: Works in browsers via WebAssembly
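
Drift detection builds on standard metrics such as the Population Stability Index (PSI). As background, here is a minimal, self-contained sketch of how PSI can be computed over two numeric samples; it does not use alimentar's API, and the equal-width binning is an illustrative choice.

// Standalone PSI sketch (not alimentar's API): compare two samples over fixed bins.
fn psi(expected: &[f64], actual: &[f64], bins: usize) -> f64 {
    // Shared value range across both samples.
    let (min, max) = expected
        .iter()
        .chain(actual)
        .fold((f64::INFINITY, f64::NEG_INFINITY), |(lo, hi), &x| (lo.min(x), hi.max(x)));
    let width = ((max - min) / bins as f64).max(f64::EPSILON);

    // Bin a sample into proportions, floored at a small epsilon to avoid ln(0).
    let hist = |data: &[f64]| -> Vec<f64> {
        let mut counts = vec![0.0; bins];
        for &x in data {
            let i = (((x - min) / width) as usize).min(bins - 1);
            counts[i] += 1.0;
        }
        counts.iter().map(|c| (c / data.len() as f64).max(1e-6)).collect()
    };

    let (e, a) = (hist(expected), hist(actual));
    e.iter().zip(&a).map(|(e_i, a_i)| (a_i - e_i) * (a_i / e_i).ln()).sum()
}

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.2 as moderate shift, and above 0.2 as significant drift.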

Installation

Add to your Cargo.toml:

[dependencies]
alimentar = "0.2"

With specific features:

[dependencies]
alimentar = { version = "0.2", features = ["s3", "hf-hub"] }

Quick Start

use alimentar::{ArrowDataset, Dataset, DataLoader};

// Load a Parquet file
let dataset = ArrowDataset::from_parquet("data/train.parquet")?;
println!("Loaded {} rows", dataset.len());

// Create a DataLoader for batched iteration
let loader = DataLoader::new(dataset)
    .batch_size(32)
    .shuffle(true);

for batch in loader {
    // Process Arrow RecordBatch
    println!("Batch has {} rows", batch.num_rows());
}
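
The snippets in this README rely on the ? operator. One way to turn the quick start into a complete program is sketched below; using Box<dyn std::error::Error> as the error type is an assumption, and any error type that alimentar's errors convert into works just as well.

use alimentar::{ArrowDataset, Dataset, DataLoader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same steps as the quick start, wrapped so that ? has somewhere to go.
    let dataset = ArrowDataset::from_parquet("data/train.parquet")?;
    println!("Loaded {} rows", dataset.len());

    let loader = DataLoader::new(dataset)
        .batch_size(32)
        .shuffle(true);

    for batch in loader {
        println!("Batch has {} rows", batch.num_rows());
    }
    Ok(())
}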

Examples

Apply Transforms

use alimentar::{ArrowDataset, Dataset, Select, Filter, Shuffle};
use alimentar::Transform;

let dataset = ArrowDataset::from_parquet("data.parquet")?;

// Select specific columns
let select = Select::new(vec!["id", "text", "label"]);
let dataset = dataset.with_transform(Box::new(select));

// Shuffle with a seed for reproducibility, applied the same way
let shuffle = Shuffle::with_seed(42);
let dataset = dataset.with_transform(Box::new(shuffle));

Import from HuggingFace

use alimentar::hf_hub::HfDataset;

let dataset = HfDataset::builder("squad")
    .split("train")
    .build()?
    .load()?;

Use S3 Backend

use alimentar::backend::{S3Backend, StorageBackend};

let backend = S3Backend::new(
    "my-bucket",
    "us-east-1",
    None, // Use default endpoint
)?;

let data = backend.get("datasets/train.parquet")?;

Streaming Large Datasets

use alimentar::streaming::StreamingDataset;

// Load in chunks of 1024 rows
let dataset = StreamingDataset::from_parquet("large_data.parquet", 1024)?;

for batch in dataset {
    // Process without loading entire file into memory
}

Feature Flags

Feature        Description                    Default
local          Local filesystem backend       Yes
tokio-runtime  Async runtime support          Yes
cli            Command-line interface         Yes
mmap           Memory-mapped file support     Yes
s3             S3-compatible storage backend  No
http           HTTP/HTTPS sources             No
hf-hub         HuggingFace Hub integration    No
wasm           WebAssembly support            No
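
The default features can be disabled when a smaller build is needed. The selection below is illustrative; enable the features your application actually uses.

[dependencies]
alimentar = { version = "0.2", default-features = false, features = ["local", "mmap"] }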

CLI Usage

# Convert between formats
alimentar convert data.csv data.parquet

# View dataset info
alimentar info data.parquet

# Preview first N rows
alimentar head data.parquet --rows 10

# Import from HuggingFace
alimentar import hf squad --output ./data/squad
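
The subcommands compose into simple pipelines, for example importing a dataset and then inspecting it. The file layout under ./data/squad is an assumption; check the import output for the actual paths.

# Import, then inspect the result (path below is illustrative)
alimentar import hf squad --output ./data/squad
alimentar info ./data/squad/train.parquet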

Architecture

alimentar
├── ArrowDataset      # In-memory dataset backed by Arrow
├── StreamingDataset  # Lazy loading for large datasets
├── DataLoader        # Batched iteration with shuffle
├── Transforms        # Data transformations (Select, Filter, Sort, etc.)
├── Backends          # Storage (Local, S3, HTTP, Memory)
├── Registry          # Dataset versioning and metadata
└── HF Hub            # HuggingFace integration

Performance

  • Zero-copy data access via Arrow
  • Memory-mapped file support for large datasets
  • Parallel data loading (when not in WASM)
  • Efficient columnar operations (see the access sketch below)
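
As a concrete example of the columnar access pattern, the sketch below reads a single column out of an Arrow RecordBatch using the arrow crate directly; the column name, its Float64 type, and the availability of arrow as a dependency (or as a re-export) are assumptions.

use arrow::array::Float64Array;
use arrow::record_batch::RecordBatch;

// Columns are reference-counted Arrow arrays, so this borrows the underlying
// buffers instead of copying them. Returns None if the column is missing or
// is not a Float64 column.
fn column_sum(batch: &RecordBatch, name: &str) -> Option<f64> {
    let column = batch.column_by_name(name)?;
    let values = column.as_any().downcast_ref::<Float64Array>()?;
    Some(values.iter().flatten().sum())
}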

Quality Standards

This project follows extreme TDD practices:

  • 85%+ test coverage
  • 85%+ mutation score target
  • Zero unwrap()/expect() in library code
  • Comprehensive clippy lints

Citation

If you use alimentar in your research, please cite:

@software{alimentar2025,
  author       = {paiml},
  title        = {alimentar: Data Loading, Distribution and Tooling in Pure Rust},
  year         = {2025},
  publisher    = {GitHub},
  url          = {https://github.com/paiml/alimentar},
  version      = {0.2.5}
}

License

MIT License - see LICENSE for details.

Related Projects