alimentar 0.2.8

Data Loading, Distribution and Tooling in Pure Rust
# Introduction

**Alimentar** ("to feed" in Spanish) is a pure Rust data loading, transformation, and distribution library for the paiml sovereign AI stack. It provides HuggingFace-compatible functionality with a sovereignty-first design: data stays in local storage by default, with no mandatory cloud dependency.

## Why Alimentar?

The modern ML ecosystem often requires cloud connectivity, Python dependencies, and complex FFI bridges. Alimentar takes a different approach:

- **Sovereign-first** - Local storage by default, no mandatory cloud dependency
- **Pure Rust** - No Python, no FFI (fully WASM-compatible)
- **Zero-copy** - Arrow RecordBatch throughout for maximum efficiency
- **Ecosystem aligned** - Arrow 53, Parquet 53 (matches trueno, aprender)

## Key Features

### Data Loading
Load data from multiple sources with a unified API:

```rust
use alimentar::{ArrowDataset, DataLoader};

// Load from various formats
let csv_data = ArrowDataset::from_csv("data.csv", None)?;
let json_data = ArrowDataset::from_json("data.json", None)?;
let parquet_data = ArrowDataset::from_parquet("data.parquet")?;
```

### Transformations
Apply chainable transformations to your data:

```rust
use alimentar::{ArrowDataset, Chain, Dataset, Normalize, Select};

let dataset = ArrowDataset::from_parquet("train.parquet")?
    .with_transform(Chain::new(vec![
        Box::new(Select::new(vec!["feature1", "feature2", "label"])),
        Box::new(Normalize::zscore(vec!["feature1", "feature2"])),
    ]));
```
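
For intuition about what a z-score step does to a single numeric column, here is a minimal standalone sketch in plain Rust. It does not use alimentar. The function name `zscore` and the choice of population standard deviation are assumptions made for illustration, and `Normalize::zscore` may differ in details such as null handling or the variance estimator.

```rust
/// Z-score normalization of one column: (x - mean) / std_dev.
/// Assumption: population standard deviation (divide by n, not n - 1).
/// An empty column yields an empty result; a constant column maps to zeros
/// to avoid dividing by zero.
fn zscore(values: &[f64]) -> Vec<f64> {
    if values.is_empty() {
        return Vec::new();
    }
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    if std_dev == 0.0 {
        return vec![0.0; values.len()];
    }
    values.iter().map(|v| (v - mean) / std_dev).collect()
}
```

For example, `zscore(&[1.0, 2.0, 3.0])` maps the middle value to zero and places the other two symmetrically around it.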

### DataLoader
Iterate over batches with shuffling support:

```rust
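use alimentar::DataLoader;

// `dataset` is the transformed dataset from the Transformations example
let loader = DataLoader::new(dataset)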
    .batch_size(32)
    .shuffle(true);

for batch in loader {
    // Process batch
    println!("Batch with {} rows", batch.num_rows());
}
```
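
To make the idea of shuffled batching concrete, the following standalone sketch splits row indices into batches, optionally shuffling them first with a Fisher-Yates pass. It is not alimentar's implementation: `XorShift64`, `batch_indices`, and the seed parameter are hypothetical names introduced here, and the tiny generator exists only so the example needs no external crates.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Illustration only; not suitable for serious randomness.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // Xorshift state must be non-zero, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Groups the row indices `0..num_rows` into batches of at most `batch_size`.
/// With `shuffle_seed = Some(seed)` the indices are shuffled before batching;
/// with `None` they stay in their original order.
fn batch_indices(
    num_rows: usize,
    batch_size: usize,
    shuffle_seed: Option<u64>,
) -> Vec<Vec<usize>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut order: Vec<usize> = (0..num_rows).collect();
    if let Some(seed) = shuffle_seed {
        let mut rng = XorShift64::new(seed);
        // Fisher-Yates: swap position i with a random position in 0..=i.
        // The modulo introduces a slight bias, which is acceptable for a sketch.
        for i in (1..order.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            order.swap(i, j);
        }
    }
    // The final batch may be shorter than `batch_size`.
    order.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}
```

Calling `batch_indices(10, 4, None)` yields three batches of sizes 4, 4, and 2 in row order. Passing `Some(seed)` yields the same batch sizes over a permutation of the rows.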

### Storage Backends
Store and retrieve datasets from multiple backends:

```rust
use alimentar::backend::{LocalBackend, S3Backend, MemoryBackend};

// Local filesystem
let local = LocalBackend::new("/data/datasets")?;

// S3-compatible storage (`build()` is awaited, so this must run in an async context)
let s3 = S3Backend::builder()
    .bucket("my-datasets")
    .region("us-west-2")
    .build()
    .await?;

// In-memory (for WASM/testing)
let memory = MemoryBackend::new();
```
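
One way to think about interchangeable backends is a small trait that every storage target implements, so calling code never needs to know where the bytes live. The sketch below is a hypothetical illustration in plain Rust, not alimentar's actual trait: `SketchBackend`, `InMemory`, and `copy_object` are names invented for this example.

```rust
use std::collections::HashMap;

/// Hypothetical minimal storage interface: store, fetch, and list objects by key.
trait SketchBackend {
    fn put(&mut self, key: &str, bytes: Vec<u8>);
    fn get(&self, key: &str) -> Option<&[u8]>;
    fn list(&self, prefix: &str) -> Vec<String>;
}

/// In-memory implementation backed by a HashMap, handy for tests.
#[derive(Default)]
struct InMemory {
    objects: HashMap<String, Vec<u8>>,
}

impl SketchBackend for InMemory {
    fn put(&mut self, key: &str, bytes: Vec<u8>) {
        self.objects.insert(key.to_string(), bytes);
    }

    fn get(&self, key: &str) -> Option<&[u8]> {
        self.objects.get(key).map(|v| v.as_slice())
    }

    fn list(&self, prefix: &str) -> Vec<String> {
        let mut keys: Vec<String> = self
            .objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect();
        keys.sort(); // HashMap order is arbitrary, so sort for stable output
        keys
    }
}

/// Copies one object between any two backends; returns false if the key is missing.
fn copy_object(src: &dyn SketchBackend, dst: &mut dyn SketchBackend, key: &str) -> bool {
    match src.get(key) {
        Some(bytes) => {
            dst.put(key, bytes.to_vec());
            true
        }
        None => false,
    }
}
```

Because `copy_object` depends only on the trait, the same function would work for a filesystem or object-store implementation without changes.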

### Registry
Publish and discover datasets:

```rust
use alimentar::Registry;

let registry = Registry::new("/data/registry")?;

// Publish a dataset (`dataset` and `metadata` are assumed to be defined already)
registry.publish("my-dataset", dataset, metadata)?;

// Pull a dataset
let dataset = registry.pull("my-dataset", None)?;

// Search datasets
let results = registry.search("classification")?;
```

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        alimentar                            │
├─────────────────────────────────────────────────────────────┤
│  Importers          │  Core            │  Exporters         │
│  ─────────          │  ────            │  ─────────         │
│  • HuggingFace Hub  │  • Dataset       │  • Local FS        │
│  • Local files      │  • DataLoader    │  • S3-compatible   │
│  • S3-compatible    │  • Transforms    │  • Registry API    │
│  • HTTP/HTTPS       │  • Streaming     │                    │
└─────────────────────────────────────────────────────────────┘
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
   trueno                aprender              assetgen
   (SIMD/GPU)            (ML/DL)              (Content)
```

## Quick Example

Here's a complete example of a typical ML data pipeline:

```rust
use alimentar::{
    ArrowDataset, DataLoader, Dataset,
    Select, FillNull, FillStrategy, Normalize, Chain,
};

fn main() -> alimentar::Result<()> {
    // Load training data
    let dataset = ArrowDataset::from_parquet("train.parquet")?;

    // Apply preprocessing transforms
    let processed = dataset.with_transform(Chain::new(vec![
        // Select relevant columns
        Box::new(Select::new(vec!["age", "income", "score", "label"])),
        // Handle missing values
        Box::new(FillNull::new("age", FillStrategy::Mean)),
        Box::new(FillNull::new("income", FillStrategy::Median)),
        // Normalize features
        Box::new(Normalize::zscore(vec!["age", "income", "score"])),
    ]));

    // Create data loader with batching and shuffling
    let loader = DataLoader::new(processed)
        .batch_size(64)
        .shuffle(true);

    // Iterate over batches for training
    for batch in loader {
        println!("Training on batch with {} rows", batch.num_rows());
        // Train your model here
    }

    Ok(())
}
```
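
The missing-value step in the pipeline above replaces nulls with a statistic computed from the values that are present. As a rough standalone illustration in plain Rust, assuming a column modeled as `Option<f64>` values: `SketchFill` and `fill_missing` are hypothetical names, and the library's `FillNull` may handle edge cases (such as an all-null column) differently.

```rust
/// Hypothetical stand-in for the idea behind a fill strategy.
#[derive(Clone, Copy, Debug)]
enum SketchFill {
    Mean,
    Median,
}

/// Replaces every `None` with the mean or median of the present values.
/// Assumption: an all-null column is filled with 0.0, since no statistic exists.
fn fill_missing(column: &[Option<f64>], strategy: SketchFill) -> Vec<f64> {
    // Collect only the values that are present.
    let mut present: Vec<f64> = column.iter().flatten().copied().collect();
    if present.is_empty() {
        return vec![0.0; column.len()];
    }
    let fill = match strategy {
        SketchFill::Mean => present.iter().sum::<f64>() / present.len() as f64,
        SketchFill::Median => {
            present.sort_by(|a, b| a.total_cmp(b));
            let mid = present.len() / 2;
            if present.len() % 2 == 0 {
                // Even count: average the two middle values.
                (present[mid - 1] + present[mid]) / 2.0
            } else {
                present[mid]
            }
        }
    };
    column.iter().map(|&v| v.unwrap_or(fill)).collect()
}
```

The median is less sensitive to outliers than the mean, which is one plausible reason to pick it for a skewed column such as income.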

## Next Steps

- [Installation](./getting-started/installation.md) - Get alimentar set up
- [Quick Start](./getting-started/quick-start.md) - Your first data pipeline
- [Core Concepts](./getting-started/core-concepts.md) - Understand the fundamentals