<div align="center">

<p align="center">
  <img src=".github/alimentar-hero.svg" alt="alimentar" width="800">
</p>

<h1 align="center">alimentar</h1>

<p align="center">
  <b>Data Loading, Distribution and Tooling in Pure Rust</b>
</p>

<p align="center">
  <a href="https://crates.io/crates/alimentar"><img src="https://img.shields.io/crates/v/alimentar.svg" alt="Crates.io"></a>
  <a href="https://docs.rs/alimentar"><img src="https://docs.rs/alimentar/badge.svg" alt="Documentation"></a>
  <a href="https://github.com/paiml/alimentar/actions/workflows/ci.yml"><img src="https://github.com/paiml/alimentar/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License"></a>
</p>

<p align="center">
  <i>A sovereignty-first data loading library for the paiml AI stack. Provides HuggingFace-compatible functionality with local-first design.</i>
</p>

</div>

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Examples](#examples)
  - [Apply Transforms](#apply-transforms)
  - [Import from HuggingFace](#import-from-huggingface)
  - [Use S3 Backend](#use-s3-backend)
  - [Streaming Large Datasets](#streaming-large-datasets)
- [Feature Flags](#feature-flags)
- [CLI Usage](#cli-usage)
- [Architecture](#architecture)
- [Performance](#performance)
- [Quality Standards](#quality-standards)
- [Citation](#citation)
- [License](#license)
- [Related Projects](#related-projects)

## Features

- **Zero-copy Arrow**: All data flows through Arrow RecordBatches for maximum performance
- **Multiple backends**: Local filesystem, S3-compatible storage, HTTP/HTTPS sources
- **HuggingFace Hub**: Import datasets directly from HuggingFace
- **Streaming**: Memory-efficient lazy loading for large datasets
- **Transforms**: Filter, sort, sample, shuffle, normalize, and more
- **Registry**: Local dataset registry with versioning and metadata
- **Data Quality**: Null detection, duplicate checking, outlier analysis
- **Drift Detection**: KS test, Chi-square, and PSI for distribution monitoring (see the PSI sketch after this list)
- **Federated Learning**: Local, proportional, and stratified dataset splits
- **Built-in Datasets**: MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, Iris
- **WASM support**: Works in browsers via WebAssembly
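
The drift statistics themselves are standard. For intuition, here is a standalone sketch of the Population Stability Index (PSI) over pre-binned proportions; this is illustrative math, not alimentar's API, and the `psi` function is our own:

```rust
/// Population Stability Index over two binned distributions.
/// `reference` and `current` hold per-bin proportions that each sum to 1.
fn psi(reference: &[f64], current: &[f64]) -> f64 {
    const EPS: f64 = 1e-6; // guard against empty bins / ln(0)
    reference
        .iter()
        .zip(current)
        .map(|(&p, &q)| {
            let (p, q) = (p.max(EPS), q.max(EPS));
            (p - q) * (p / q).ln()
        })
        .sum()
}

fn main() {
    let reference = [0.25, 0.25, 0.25, 0.25]; // training-time distribution
    let current = [0.10, 0.20, 0.30, 0.40]; // distribution seen in production
    // Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 major drift.
    println!("PSI = {:.4}", psi(&reference, &current)); // ≈ 0.2282: moderate drift
}
```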

## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
alimentar = "0.2"
```

With specific features:

```toml
[dependencies]
alimentar = { version = "0.2", features = ["s3", "hf-hub"] }
```

## Quick Start

```rust
use alimentar::{ArrowDataset, DataLoader, Dataset};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load a Parquet file
    let dataset = ArrowDataset::from_parquet("data/train.parquet")?;
    println!("Loaded {} rows", dataset.len());

    // Create a DataLoader for batched iteration
    let loader = DataLoader::new(dataset)
        .batch_size(32)
        .shuffle(true);

    for batch in loader {
        // Process each Arrow RecordBatch
        println!("Batch has {} rows", batch.num_rows());
    }
    Ok(())
}
```

## Examples

### Apply Transforms

```rust
use alimentar::{ArrowDataset, Dataset, Select, Shuffle};
use alimentar::Transform;

let dataset = ArrowDataset::from_parquet("data.parquet")?;

// Select specific columns
let select = Select::new(vec!["id", "text", "label"]);
let dataset = dataset.with_transform(Box::new(select));

// Shuffle with a seed for reproducibility, applied the same way
let shuffle = Shuffle::with_seed(42);
let dataset = dataset.with_transform(Box::new(shuffle));
```

### Import from HuggingFace
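
Requires the `hf-hub` feature (see [Feature Flags](#feature-flags)).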

```rust
use alimentar::hf_hub::HfDataset;

let dataset = HfDataset::builder("squad")
    .split("train")
    .build()?
    .load()?;
```

### Use S3 Backend
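
Requires the `s3` feature (see [Feature Flags](#feature-flags)).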

```rust
use alimentar::backend::{S3Backend, StorageBackend};

let backend = S3Backend::new(
    "my-bucket",
    "us-east-1",
    None, // Use default endpoint
)?;

let data = backend.get("datasets/train.parquet")?;
```
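
Pass `None` for the default AWS endpoint, or supply a custom endpoint URL to target other S3-compatible stores such as MinIO.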

### Streaming Large Datasets

```rust
use alimentar::streaming::StreamingDataset;

// Load in chunks of 1024 rows
let dataset = StreamingDataset::from_parquet("large_data.parquet", 1024)?;

for batch in dataset {
    // Process without loading entire file into memory
}
```
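
The chunk size (1024 rows above) bounds peak memory: larger chunks amortize per-read overhead, while smaller chunks keep the resident footprint low.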

## Feature Flags

| Feature | Description | Default |
|---------|-------------|---------|
| `local` | Local filesystem backend | Yes |
| `tokio-runtime` | Async runtime support | Yes |
| `cli` | Command-line interface | Yes |
| `mmap` | Memory-mapped file support | Yes |
| `s3` | S3-compatible storage backend | No |
| `http` | HTTP/HTTPS sources | No |
| `hf-hub` | HuggingFace Hub integration | No |
| `wasm` | WebAssembly support | No |

## CLI Usage

```bash
# Convert between formats
alimentar convert data.csv data.parquet

# View dataset info
alimentar info data.parquet

# Preview first N rows
alimentar head data.parquet --rows 10

# Import from HuggingFace
alimentar import hf squad --output ./data/squad
```

## Architecture

```
alimentar
├── ArrowDataset      # In-memory dataset backed by Arrow
├── StreamingDataset  # Lazy loading for large datasets
├── DataLoader        # Batched iteration with shuffle
├── Transforms        # Data transformations (Select, Filter, Sort, etc.)
├── Backends          # Storage (Local, S3, HTTP, Memory)
├── Registry          # Dataset versioning and metadata
└── HF Hub            # HuggingFace integration
```
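
These layers compose into a single pipeline. A minimal sketch using only the calls shown in the examples above (the path is a placeholder):

```rust
use alimentar::{ArrowDataset, DataLoader, Dataset, Select};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Format layer: read Parquet into an in-memory Arrow dataset.
    let dataset = ArrowDataset::from_parquet("data/train.parquet")?;

    // Transform layer: keep only the columns downstream code needs.
    let dataset = dataset.with_transform(Box::new(Select::new(vec!["text", "label"])));

    // Loading layer: shuffled, batched iteration over RecordBatches.
    let loader = DataLoader::new(dataset).batch_size(64).shuffle(true);
    for batch in loader {
        println!("batch: {} rows", batch.num_rows());
    }
    Ok(())
}
```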

## Performance

- Zero-copy data access via Arrow
- Memory-mapped file support for large datasets
- Parallel data loading (when not in WASM)
- Efficient columnar operations

## Quality Standards

This project follows extreme TDD practices:

- 85%+ test coverage
- 85%+ mutation score target
- Zero `unwrap()`/`expect()` in library code
- Comprehensive clippy lints
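
The same no-panic style works at the application boundary: propagate with `?` and handle errors once at the edge. A minimal sketch, assuming alimentar's error type implements `std::error::Error`:

```rust
use alimentar::{ArrowDataset, Dataset};

/// Propagate failures instead of unwrapping them.
fn row_count(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let dataset = ArrowDataset::from_parquet(path)?;
    Ok(dataset.len())
}

fn main() {
    match row_count("data/train.parquet") {
        Ok(n) => println!("{n} rows"),
        Err(e) => eprintln!("failed to load dataset: {e}"),
    }
}
```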

## Citation

If you use alimentar in your research, please cite:

```bibtex
@software{alimentar2025,
  author       = {paiml},
  title        = {alimentar: Data Loading, Distribution and Tooling in Pure Rust},
  year         = {2025},
  publisher    = {GitHub},
  url          = {https://github.com/paiml/alimentar},
  version      = {0.2.5}
}
```

## License

MIT License - see [LICENSE](LICENSE) for details.

## Related Projects

- [trueno-db](https://github.com/paiml/trueno-db) - GPU-accelerated vector database
- [aprender](https://github.com/paiml/aprender) - ML/DL framework in Rust