Table of Contents
- Features
- Installation
- Quick Start
- Examples
- Feature Flags
- CLI Usage
- Architecture
- Performance
- Quality Standards
- Citation
- License
- Related Projects
Features
- Zero-copy Arrow: All data flows through Arrow RecordBatches for maximum performance
- Multiple backends: Local filesystem, S3-compatible storage, HTTP/HTTPS sources
- HuggingFace Hub: Import datasets directly from HuggingFace
- Streaming: Memory-efficient lazy loading for large datasets
- Transforms: Filter, sort, sample, shuffle, normalize, and more
- Registry: Local dataset registry with versioning and metadata
- Data Quality: Null detection, duplicate checking, outlier analysis
- Drift Detection: KS test, Chi-square, and PSI for distribution monitoring (see the PSI sketch after this list)
- Federated Learning: Local, proportional, and stratified dataset splits
- Built-in Datasets: MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, Iris
- WASM support: Works in browsers via WebAssembly
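
Of these, the PSI used for drift detection is easy to illustrate standalone. The sketch below computes the statistic directly from two binned distributions; it shows only the formula and is not alimentar's API (the function name, clamping epsilon, and thresholds are illustrative):

```rust
/// Population Stability Index between two binned distributions.
/// `expected` and `actual` are per-bin proportions, each summing to 1.0.
fn psi(expected: &[f64], actual: &[f64]) -> f64 {
    expected
        .iter()
        .zip(actual)
        .map(|(&e, &a)| {
            // Clamp to avoid ln(0) on empty bins.
            let (e, a) = (e.max(1e-6), a.max(1e-6));
            (a - e) * (a / e).ln()
        })
        .sum()
}

fn main() {
    let baseline = [0.25, 0.25, 0.25, 0.25];
    let current = [0.40, 0.30, 0.20, 0.10];
    // Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift.
    println!("PSI = {:.4}", psi(&baseline, &current));
}
```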
Installation
Add to your Cargo.toml:
```toml
[dependencies]
alimentar = "0.1"
```
With specific features:
```toml
[dependencies]
alimentar = { version = "0.1", features = ["s3", "hf-hub"] }
```
Quick Start
```rust
use alimentar::{ArrowDataset, DataLoader}; // import paths assumed

// Load a Parquet file (path is illustrative)
let dataset = ArrowDataset::from_parquet("data.parquet")?;
println!("{} rows", dataset.num_rows()); // accessor name assumed

// Create a DataLoader for batched iteration
let loader = DataLoader::new(dataset)
    .batch_size(32)
    .shuffle(true);

for batch in loader {
    // process each Arrow RecordBatch
}
```
Examples
Apply Transforms
```rust
use alimentar::{ArrowDataset, Transform, Select, Shuffle}; // import paths assumed

let dataset = ArrowDataset::from_parquet("data.parquet")?;

// Select specific columns (column names are illustrative)
let select = Select::new(vec!["feature_a", "label"]);
let dataset = dataset.with_transform(select);

// Shuffle with a seed for reproducibility
let shuffle = Shuffle::with_seed(42);
let dataset = dataset.with_transform(shuffle);
```
Import from HuggingFace
```rust
use alimentar::HfDataset; // import path assumed

// Repo id and split name are illustrative
let dataset = HfDataset::builder("user/dataset")
    .split("train")
    .build()?
    .load()?;
```
Use S3 Backend
```rust
use alimentar::S3Backend; // type name and import path assumed

// Constructor and key arguments are illustrative
let backend = S3Backend::new("my-bucket")?;
let data = backend.get("datasets/train.parquet")?;
```
Streaming Large Datasets
```rust
use alimentar::StreamingDataset; // import path assumed

// Load in chunks of 1024 rows (chunk-size argument is illustrative)
let dataset = StreamingDataset::from_parquet("large.parquet", 1024)?;

for batch in dataset {
    // each `batch` is an Arrow RecordBatch
}
```
Feature Flags
| Feature | Description | Default |
|---|---|---|
| `local` | Local filesystem backend | Yes |
| `tokio-runtime` | Async runtime support | Yes |
| `cli` | Command-line interface | Yes |
| `mmap` | Memory-mapped file support | Yes |
| `s3` | S3-compatible storage backend | No |
| `http` | HTTP/HTTPS sources | No |
| `hf-hub` | HuggingFace Hub integration | No |
| `wasm` | WebAssembly support | No |
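
These flags can also be set from the command line with `cargo add`:

```bash
# Enable optional features on top of the defaults
cargo add alimentar --features s3,hf-hub

# Or opt out of the defaults and pick features explicitly
cargo add alimentar --no-default-features --features local
```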
CLI Usage
```bash
# (Subcommand names below are illustrative assumptions, not verified syntax.)
# Convert between formats
alimentar convert input.csv output.parquet
# View dataset info
alimentar info data.parquet
# Preview first N rows
alimentar head data.parquet --rows 10
# Import from HuggingFace
alimentar import user/dataset
```
Architecture
```text
alimentar
├── ArrowDataset     # In-memory dataset backed by Arrow
├── StreamingDataset # Lazy loading for large datasets
├── DataLoader       # Batched iteration with shuffle
├── Transforms       # Data transformations (Select, Filter, Sort, etc.)
├── Backends         # Storage (Local, S3, HTTP, Memory)
├── Registry         # Dataset versioning and metadata
└── HF Hub           # HuggingFace integration
```
Performance
- Zero-copy data access via Arrow (see the sketch after this list)
- Memory-mapped file support for large datasets
- Parallel data loading (when not in WASM)
- Efficient columnar operations
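
To make the zero-copy point concrete, `RecordBatch::slice` in the underlying `arrow` crate returns a view over the same buffers rather than copying rows. A minimal sketch using plain arrow-rs (not alimentar's API):

```rust
use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5]))],
    )?;

    // `slice` adjusts offsets into the same underlying buffers; no data is copied.
    let window = batch.slice(1, 3);
    assert_eq!(window.num_rows(), 3);
    Ok(())
}
```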
Quality Standards
This project follows extreme TDD practices:
- 85%+ test coverage
- 85%+ mutation score target
- Zero `unwrap()`/`expect()` in library code
- Comprehensive clippy lints
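
As a generic illustration of the `unwrap()`/`expect()` rule (not code from this crate), fallible functions return `Result` and propagate errors with `?`:

```rust
use std::{fs, io};

// Library code hands the error to the caller instead of
// panicking via fs::read_to_string(path).unwrap().
fn load_config(path: &str) -> io::Result<String> {
    let contents = fs::read_to_string(path)?;
    Ok(contents)
}
```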
Citation
If you use alimentar in your research, please cite:
License
MIT License - see LICENSE for details.