alimentar 0.2.7

Data Loading, Distribution and Tooling in Pure Rust
# DataLoader & Batching (Examples 11-20)

This section covers the `DataLoader`, which batches, shuffles, and samples an `ArrowDataset` for ML training workflows.

## Example 11: Basic Batching

```rust
use alimentar::{ArrowDataset, DataLoader};

let dataset = ArrowDataset::from_parquet("data.parquet")?;
let loader = DataLoader::new(dataset).batch_size(100);

for batch in loader {
    println!("Batch rows: {}", batch.num_rows());
}
```

## Example 12: Shuffle with Determinism

```rust
use alimentar::{ArrowDataset, DataLoader};

let dataset = ArrowDataset::from_parquet("data.parquet")?;
let loader = DataLoader::new(dataset)
    .batch_size(100)
    .shuffle(true)
    .seed(42); // Reproducible

let batches: Vec<_> = loader.into_iter().collect();
```
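
To illustrate why a fixed seed makes the batch order reproducible, here is a minimal, crate-free sketch of a seeded Fisher-Yates shuffle over row indices. It is not alimentar's implementation: the `SplitMix64` generator and the `shuffled_indices` helper are assumptions introduced only for this example.

```rust
/// Minimal SplitMix64 generator, used here only so the sketch needs no crates.
/// Hypothetical: alimentar may use a different generator internally.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle of row indices `0..n`, driven entirely by `seed`.
fn shuffled_indices(n: usize, seed: u64) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    let mut idx: Vec<usize> = (0..n).collect();
    for i in (1..n).rev() {
        // Pick j in 0..=i (the small modulo bias is acceptable for a sketch).
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    idx
}
```

Because the generator state depends only on the seed, calling `shuffled_indices(n, 42)` twice yields the same permutation, which is the property the `.seed(42)` call above relies on.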

## Example 13: Drop Last

```rust
use alimentar::{ArrowDataset, DataLoader};

let dataset = ArrowDataset::from_parquet("data.parquet")?;
// Assuming data.parquet holds 1000 rows: batch_size 300 gives
// 3 full batches + 1 partial batch of 100 rows
let loader = DataLoader::new(dataset)
    .batch_size(300)
    .drop_last(true); // Drop incomplete last batch

let batches: Vec<_> = loader.into_iter().collect();
assert_eq!(batches.len(), 3);
```
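
The batch arithmetic behind `drop_last` can be sketched with plain integers. The helpers `batch_count` and `batch_ranges` below are hypothetical names for illustration and are not part of alimentar.

```rust
/// Number of batches produced for `rows` rows at the given `batch_size`.
/// Hypothetical helper for illustration; not part of alimentar.
fn batch_count(rows: usize, batch_size: usize, drop_last: bool) -> usize {
    assert!(batch_size > 0, "batch_size must be non-zero");
    if drop_last {
        // Floor division: the trailing partial batch is discarded.
        rows / batch_size
    } else {
        // Ceiling division: the trailing partial batch is kept.
        (rows + batch_size - 1) / batch_size
    }
}

/// Half-open row ranges `[start, end)` covered by each batch.
fn batch_ranges(rows: usize, batch_size: usize, drop_last: bool) -> Vec<(usize, usize)> {
    (0..batch_count(rows, batch_size, drop_last))
        .map(|i| (i * batch_size, ((i + 1) * batch_size).min(rows)))
        .collect()
}
```

With 1000 rows and a batch size of 300, `batch_count` gives 3 when `drop_last` is `true` and 4 otherwise; in the latter case the final range is `(900, 1000)`.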

## Examples 14-15: Parallel and Prefetch

```rust
use alimentar::{ArrowDataset, DataLoader};

let dataset = ArrowDataset::from_parquet("data.parquet")?;
let loader = DataLoader::new(dataset)
    .batch_size(100)
    .num_workers(4)      // Parallel loading
    .prefetch_factor(2); // 2x batch prefetch
```
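
As a rough picture of what prefetching buys, the sketch below uses a bounded channel from the standard library: a background thread prepares batches while the consumer processes them, and the channel capacity limits how far ahead the producer can run. This is a conceptual model only, not how alimentar schedules its workers; `prefetched_row_total` and its parameters are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Produce `num_batches` batches on a background thread, keeping at most
/// `prefetch` finished batches queued ahead of the consumer.
/// Returns the total number of rows the consumer received.
fn prefetched_row_total(num_batches: usize, batch_size: usize, prefetch: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u64>>(prefetch);

    let producer = thread::spawn(move || {
        for b in 0..num_batches {
            let start = (b * batch_size) as u64;
            let batch: Vec<u64> = (start..start + batch_size as u64).collect();
            // `send` blocks once `prefetch` batches are already waiting.
            if tx.send(batch).is_err() {
                break; // the consumer hung up
            }
        }
        // `tx` is dropped here, which ends the consumer's loop.
    });

    let mut rows = 0;
    for batch in rx {
        rows += batch.len(); // stands in for the training step
    }
    producer.join().expect("producer thread panicked");
    rows
}
```

For example, `prefetched_row_total(10, 100, 2)` moves 10 batches of 100 rows through a queue that never holds more than 2 pending batches.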

## Examples 16-17: Weighted and Stratified Sampling

```rust
use alimentar::{ArrowDataset, DataLoader, WeightedSampler};

let dataset = ArrowDataset::from_parquet("data.parquet")?;

// Weighted sampling by column
let sampler = WeightedSampler::from_column("weight");
// Clone so the dataset can be reused for the second loader below
let weighted_loader = DataLoader::new(dataset.clone())
    .batch_size(100)
    .sampler(sampler);

// Stratified by label
let stratified_loader = DataLoader::new(dataset)
    .batch_size(100)
    .stratify_by("label");
```
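
Weighted sampling generally means drawing each row with probability proportional to its weight. A minimal sketch of that idea, assuming non-negative weights and a uniform draw `u` supplied by the caller, follows. The `weighted_pick` function is hypothetical and says nothing about how `WeightedSampler` is implemented.

```rust
/// Map a uniform draw `u` in [0, 1) to a row index, chosen with probability
/// proportional to `weights[i]`. Assumes all weights are non-negative.
/// Returns `None` if there are no weights or they do not sum to a positive value.
fn weighted_pick(weights: &[f64], u: f64) -> Option<usize> {
    let total: f64 = weights.iter().sum();
    if weights.is_empty() || !(total > 0.0) {
        return None;
    }
    let target = u * total;
    let mut cumulative = 0.0;
    for (i, w) in weights.iter().enumerate() {
        cumulative += w;
        // The first index whose cumulative weight exceeds the target wins.
        if target < cumulative {
            return Some(i);
        }
    }
    // Guard against floating-point round-off when `u` is very close to 1.
    Some(weights.len() - 1)
}
```

With weights `[1.0, 3.0]`, draws below 0.25 select row 0 and all other draws select row 1, so row 1 is chosen three times as often on average.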

## Examples 18-19: Infinite Iterator and Collate

```rust
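use alimentar::{ArrowDataset, DataLoader};

let dataset = ArrowDataset::from_parquet("data.parquet")?;

// Infinite iteration for training; the caller decides when to stop
let infinite_loader = DataLoader::new(dataset.clone())
    .batch_size(100)
    .infinite(true);

// Custom collate function
let collating_loader = DataLoader::new(dataset)
    .batch_size(100)
    .collate_fn(|batches| {
        // Custom batch merging logic; `concat_batches` is a helper you
        // supply and must be in scope (it is not imported above)
        Ok(concat_batches(batches)?)
    });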
```
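
An infinite loader never returns `None`, so the training loop has to bound itself, typically by a step count. The sketch below models that pattern with standard iterator adapters over an in-memory list of batches; `batches_seen` is a hypothetical helper, not an alimentar API.

```rust
/// Run `steps` training steps over a finite list of batches, restarting
/// from the first batch whenever the list is exhausted.
/// Returns the size of the batch seen at each step.
fn batches_seen(batches: &[Vec<u32>], steps: usize) -> Vec<usize> {
    batches
        .iter()
        .cycle() // restarts at the beginning; endless for non-empty input
        .take(steps) // the caller decides when to stop
        .map(|batch| batch.len())
        .collect()
}
```

Given two batches of sizes 3 and 1, five steps visit sizes 3, 1, 3, 1, 3: the data wraps around rather than ending after one pass.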

## Example 20: Batch Size Benchmark

```rust
use alimentar::{ArrowDataset, DataLoader};
use std::time::Instant;

let dataset = ArrowDataset::from_parquet("large.parquet")?;

for batch_size in [32, 64, 128, 256, 512] {
    // Build the loader first so the dataset clone is not part of the measurement
    let loader = DataLoader::new(dataset.clone()).batch_size(batch_size);
    let start = Instant::now();
    let _: Vec<_> = loader.into_iter().collect();
    println!("batch_size={}: {:?}", batch_size, start.elapsed());
}
```
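
Raw elapsed times are hard to compare across datasets of different sizes, so it can help to report throughput as well. A small sketch using only the standard library follows; `rows_per_second` is a hypothetical helper, and the row count must come from your own data.

```rust
use std::time::Duration;

/// Convert a row count and an elapsed time into rows per second.
/// Returns `None` when no measurable time has elapsed.
fn rows_per_second(rows: usize, elapsed: Duration) -> Option<f64> {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        Some(rows as f64 / secs)
    } else {
        None
    }
}
```

In the benchmark loop, this would be called with the total number of rows iterated and `start.elapsed()`.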

## Key Concepts

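- **Batch size**: Larger batches typically use more memory per step but amortize per-batch overhead; Example 20 measures the effect
- **Shuffling**: Set a seed alongside `shuffle(true)` so the batch order is reproducible across training runs
- **Drop last**: Discards the trailing partial batch so every batch has exactly `batch_size` rows
- **Prefetch**: Prepares upcoming batches ahead of time so data loading overlaps with compute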