docs.rs failed to build alimentar-0.4.0
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
Visit the last successful build:
alimentar-0.3.0
Table of Contents
- Features
- Installation
- Quick Start
- Examples
- Feature Flags
- CLI Usage
- Architecture
- Performance
- Quality Standards
- Citation
- License
- Related Projects
Features
- Zero-copy Arrow: All data flows through Arrow RecordBatches for maximum performance
- Multiple backends: Local filesystem, S3-compatible storage, HTTP/HTTPS sources
- HuggingFace Hub: Import datasets directly from HuggingFace
- Streaming: Memory-efficient lazy loading for large datasets
- Transforms: Filter, sort, sample, shuffle, normalize, and more
- Registry: Local dataset registry with versioning and metadata
- Data Quality: Null detection, duplicate checking, outlier analysis
- Drift Detection: KS test, Chi-square, PSI for distribution monitoring
- Federated Learning: Local, proportional, and stratified dataset splits
- Built-in Datasets: MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, Iris
- WASM support: Works in browsers via WebAssembly
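To make the drift-detection bullet concrete, here is a minimal, standard-library-only sketch of two of the statistics it names: the Population Stability Index over binned counts and the two-sample Kolmogorov-Smirnov statistic. The function names `psi` and `ks_statistic`, the epsilon floor, and the `Option` return type are assumptions made for illustration; they are not alimentar's API.

```rust
/// Population Stability Index over two histograms that share bin edges.
/// Returns None when the inputs cannot be compared.
fn psi(expected: &[u64], actual: &[u64]) -> Option<f64> {
    if expected.is_empty() || expected.len() != actual.len() {
        return None;
    }
    let e_total: u64 = expected.iter().sum();
    let a_total: u64 = actual.iter().sum();
    if e_total == 0 || a_total == 0 {
        return None;
    }
    // Assumption: floor each proportion at a small epsilon to avoid ln(0).
    let eps = 1e-6_f64;
    let mut total = 0.0;
    for (&e, &a) in expected.iter().zip(actual) {
        let p = (e as f64 / e_total as f64).max(eps);
        let q = (a as f64 / a_total as f64).max(eps);
        total += (q - p) * (q / p).ln();
    }
    Some(total)
}

/// Two-sample KS statistic: the largest gap between the empirical CDFs.
/// Returns None for empty samples or samples containing NaN.
fn ks_statistic(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() || b.is_empty() || a.iter().chain(b).any(|v| v.is_nan()) {
        return None;
    }
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    let (mut i, mut j, mut d) = (0usize, 0usize, 0.0f64);
    while i < a.len() && j < b.len() {
        // Advance both CDFs past the next smallest value, then compare them.
        let x = a[i].min(b[j]);
        while i < a.len() && a[i] <= x {
            i += 1;
        }
        while j < b.len() && b[j] <= x {
            j += 1;
        }
        let gap = (i as f64 / a.len() as f64 - j as f64 / b.len() as f64).abs();
        d = d.max(gap);
    }
    Some(d)
}

fn main() {
    let baseline: [u64; 2] = [50, 50];
    let shifted: [u64; 2] = [90, 10];
    println!("PSI: {:?}", psi(&baseline, &shifted));
    println!("KS:  {:?}", ks_statistic(&[1.0, 2.0], &[3.0, 4.0]));
}
```

Identical distributions give a value of zero for both statistics, and larger values indicate a larger shift. Alert thresholds depend on the application and are not specified here.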
Installation
Add to your Cargo.toml:
[dependencies]
alimentar = "0.2"
With specific features:
[dependencies]
alimentar = { version = "0.2", features = ["s3", "hf-hub"] }
Quick Start
use ...;
// Load a Parquet file
let dataset = ArrowDataset::from_parquet(...)?;
println!(...);
// Create a DataLoader for batched iteration
let loader = DataLoader::new(...)
    .batch_size(...)
    .shuffle(...);
for batch in loader { ... }
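The idea behind batched iteration with an optional shuffle can be sketched without any dependencies. The following is an illustration only, not alimentar's `DataLoader`: `batch_indices` and `XorShift64` are hypothetical names, and the tiny xorshift generator stands in for a real random number generator so the example compiles on its own.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // Xorshift must not start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Returns row indices grouped into batches, shuffled first when a seed is given.
fn batch_indices(num_rows: usize, batch_size: usize, shuffle_seed: Option<u64>) -> Vec<Vec<usize>> {
    let mut indices: Vec<usize> = (0..num_rows).collect();
    if let Some(seed) = shuffle_seed {
        let mut rng = XorShift64::new(seed);
        // Fisher-Yates shuffle (the modulo introduces a slight bias, fine for a sketch).
        for i in (1..indices.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            indices.swap(i, j);
        }
    }
    // A batch size of zero is treated as one to avoid a panic in `chunks`.
    indices
        .chunks(batch_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    for (n, batch) in batch_indices(10, 4, Some(42)).iter().enumerate() {
        println!("batch {n}: {batch:?}");
    }
}
```

Using a fixed seed makes the batch order reproducible across runs, which is the property the seeded shuffle in the examples relies on.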
Examples
Apply Transforms
use ...;
use ...::Transform;
let dataset = ArrowDataset::from_parquet(...)?;
// Select specific columns
let select = Select::new(...);
let dataset = dataset.with_transform(...);
// Shuffle with a seed for reproducibility
let shuffle = Shuffle::with_seed(...);
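As a rough illustration of how composable transforms can be structured, the sketch below defines a small trait and two implementations over a toy column table. Everything here is hypothetical (`Table`, `Head`, `run_pipeline`, and this particular `Transform` signature); the real crate operates on Arrow RecordBatches and its trait may look different.

```rust
/// Toy column-oriented table: (column name, values) pairs.
#[derive(Debug, Clone, PartialEq)]
struct Table {
    columns: Vec<(String, Vec<i64>)>,
}

/// Hypothetical transform interface: consume a table, produce a new one.
trait Transform {
    fn apply(&self, input: Table) -> Result<Table, String>;
}

/// Keeps only the named columns, in the order requested.
struct Select {
    names: Vec<String>,
}

impl Transform for Select {
    fn apply(&self, input: Table) -> Result<Table, String> {
        let mut columns = Vec::with_capacity(self.names.len());
        for name in &self.names {
            let col = input
                .columns
                .iter()
                .find(|(n, _)| n == name)
                .ok_or_else(|| format!("unknown column: {name}"))?;
            columns.push(col.clone());
        }
        Ok(Table { columns })
    }
}

/// Keeps only the first `rows` rows of every column.
struct Head {
    rows: usize,
}

impl Transform for Head {
    fn apply(&self, input: Table) -> Result<Table, String> {
        let columns = input
            .columns
            .into_iter()
            .map(|(name, mut values)| {
                values.truncate(self.rows);
                (name, values)
            })
            .collect();
        Ok(Table { columns })
    }
}

/// Applies each step in order, stopping at the first error.
fn run_pipeline(table: Table, steps: &[Box<dyn Transform>]) -> Result<Table, String> {
    steps.iter().try_fold(table, |t, step| step.apply(t))
}

fn main() {
    let table = Table {
        columns: vec![
            ("id".to_string(), vec![1, 2, 3]),
            ("score".to_string(), vec![10, 20, 30]),
        ],
    };
    let steps: Vec<Box<dyn Transform>> = vec![
        Box::new(Select { names: vec!["score".to_string()] }),
        Box::new(Head { rows: 2 }),
    ];
    println!("{:?}", run_pipeline(table, &steps));
}
```

Keeping each transform behind a trait object lets a pipeline be assembled at runtime, at the cost of one dynamic dispatch per step.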
Import from HuggingFace
use ...::HfDataset;
let dataset = HfDataset::builder(...)
    .split(...)
    .build(...)?
    .load(...)?;
Use S3 Backend
use ...;
let backend = ...::new(...)?;
let data = backend.get(...)?;
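One way to think about pluggable storage is a small trait that every backend implements, so calling code never depends on where the bytes live. The sketch below assumes a hypothetical `StorageBackend` trait with `get` and `put`, plus an in-memory implementation; the names and signatures are illustrative and are not taken from alimentar.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical storage interface keyed by string paths.
trait StorageBackend {
    fn get(&self, key: &str) -> io::Result<Vec<u8>>;
    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()>;
}

/// In-memory backend, handy for tests and for targets without a filesystem.
#[derive(Default)]
struct MemoryBackend {
    objects: HashMap<String, Vec<u8>>,
}

impl StorageBackend for MemoryBackend {
    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        self.objects.get(key).cloned().ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, format!("no object at key {key}"))
        })
    }

    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()> {
        self.objects.insert(key.to_string(), data.to_vec());
        Ok(())
    }
}

/// Code written against the trait works with any backend implementation.
fn copy_object(
    src: &dyn StorageBackend,
    dst: &mut dyn StorageBackend,
    key: &str,
) -> io::Result<usize> {
    let bytes = src.get(key)?;
    dst.put(key, &bytes)?;
    Ok(bytes.len())
}

fn main() -> io::Result<()> {
    let mut src = MemoryBackend::default();
    src.put("train.parquet", b"placeholder bytes")?;
    let mut dst = MemoryBackend::default();
    let copied = copy_object(&src, &mut dst, "train.parquet")?;
    println!("copied {copied} bytes");
    Ok(())
}
```

A local-filesystem or S3 implementation would satisfy the same trait, which is what lets `copy_object` stay unchanged when the storage location changes.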
Streaming Large Datasets
use ...::StreamingDataset;
// Load in chunks of 1024 rows
let dataset = StreamingDataset::from_parquet(...)?;
for batch in dataset { ... }
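The chunked, lazy loading described above can be approximated with a plain iterator that reads at most a fixed number of rows per step. This sketch reads text lines from any `BufRead` source instead of Parquet row groups, and `ChunkedRows` is a hypothetical name, so treat it as an analogy for the streaming pattern, not as the crate's implementation.

```rust
use std::io::{self, BufRead, Cursor};

/// Lazily yields rows in fixed-size chunks so that only one chunk
/// needs to be held in memory at a time.
struct ChunkedRows<R: BufRead> {
    lines: io::Lines<R>,
    chunk_rows: usize,
}

impl<R: BufRead> ChunkedRows<R> {
    fn new(reader: R, chunk_rows: usize) -> Self {
        // A chunk size of zero is treated as one so iteration always makes progress.
        ChunkedRows { lines: reader.lines(), chunk_rows: chunk_rows.max(1) }
    }
}

impl<R: BufRead> Iterator for ChunkedRows<R> {
    type Item = io::Result<Vec<String>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut chunk = Vec::new();
        while chunk.len() < self.chunk_rows {
            match self.lines.next() {
                Some(Ok(line)) => chunk.push(line),
                Some(Err(e)) => return Some(Err(e)),
                None => break,
            }
        }
        // An empty chunk means the source is exhausted.
        if chunk.is_empty() { None } else { Some(Ok(chunk)) }
    }
}

fn main() -> io::Result<()> {
    let source = Cursor::new("a\nb\nc\nd\ne\n");
    for chunk in ChunkedRows::new(source, 2) {
        println!("chunk of {} rows", chunk?.len());
    }
    Ok(())
}
```

Peak memory is bounded by the chunk size and not by the size of the source, which is the property that makes this pattern suitable for datasets larger than RAM.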
Feature Flags
| Feature | Description | Default |
|---|---|---|
| local | Local filesystem backend | Yes |
| tokio-runtime | Async runtime support | Yes |
| cli | Command-line interface | Yes |
| mmap | Memory-mapped file support | Yes |
| s3 | S3-compatible storage backend | No |
| http | HTTP/HTTPS sources | No |
| hf-hub | HuggingFace Hub integration | No |
| wasm | WebAssembly support | No |
CLI Usage
# Convert between formats
# View dataset info
# Preview first N rows
# Import from HuggingFace
Architecture
alimentar
├── ArrowDataset # In-memory dataset backed by Arrow
├── StreamingDataset # Lazy loading for large datasets
├── DataLoader # Batched iteration with shuffle
├── Transforms # Data transformations (Select, Filter, Sort, etc.)
├── Backends # Storage (Local, S3, HTTP, Memory)
├── Registry # Dataset versioning and metadata
└── HF Hub # HuggingFace integration
Performance
| Operation | Throughput | Memory |
|---|---|---|
| Parquet load (10M rows) | 1.2 GB/s | O(batch) streaming |
| CSV to Parquet | 450 MB/s | 2x file size |
| Arrow IPC round-trip | 2.8 GB/s | Zero-copy |
| Shuffle (1M rows) | 12 ms | In-place |
Key design choices:
- Zero-copy data access via Arrow RecordBatch
- Memory-mapped file support for datasets larger than RAM
- Parallel data loading via Tokio (when not in WASM)
- Columnar storage eliminates row-by-row overhead
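The columnar and zero-copy points in the list above can be illustrated with plain Rust collections. In this sketch, `Row`, `Columns`, and `price_slice` are hypothetical names; Arrow's actual buffers are more sophisticated, but the core idea is the same: each field lives in one contiguous buffer, and a window over it is a borrowed slice, not a copy.

```rust
/// Row-oriented layout: each record stores all of its fields together.
struct Row {
    id: u32,
    price: f64,
}

/// Column-oriented layout: each field is one contiguous buffer.
struct Columns {
    id: Vec<u32>,
    price: Vec<f64>,
}

impl Columns {
    fn from_rows(rows: &[Row]) -> Self {
        Columns {
            id: rows.iter().map(|r| r.id).collect(),
            price: rows.iter().map(|r| r.price).collect(),
        }
    }

    fn row_count(&self) -> usize {
        self.id.len()
    }

    /// Borrows a window of the price column without copying any values.
    /// Returns None when the window falls outside the column.
    fn price_slice(&self, offset: usize, len: usize) -> Option<&[f64]> {
        let end = offset.checked_add(len)?;
        self.price.get(offset..end)
    }
}

/// Scans a single contiguous column; other fields are never touched.
fn total_price(prices: &[f64]) -> f64 {
    prices.iter().sum()
}

fn main() {
    let rows = vec![
        Row { id: 1, price: 2.0 },
        Row { id: 2, price: 3.5 },
        Row { id: 3, price: 4.5 },
    ];
    let cols = Columns::from_rows(&rows);
    if let Some(window) = cols.price_slice(1, 2) {
        println!("{} rows total, window sum = {}", cols.row_count(), total_price(window));
    }
}
```

Because `price_slice` returns a borrow into the existing buffer, taking a batch-sized view costs no allocation, and an aggregate over one column reads only that column's memory.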
Quality Standards
This project follows extreme TDD practices:
- 1,820+ tests passing
- 85%+ test coverage
- 85%+ mutation score target
- Zero unwrap()/expect() in library code
- Comprehensive clippy lints (0 warnings)
- Stratified split supports string labels (Utf8/LargeUtf8)
- Weighted sampling handles empty datasets gracefully
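Several of these points (no `unwrap()` in library code, string labels in stratified splits, graceful handling of empty input) can be shown together in a small sketch. The function `stratified_split` and the `SplitError` type below are hypothetical and deliberately simplified: the split is deterministic, with no shuffling, and works on label slices instead of Arrow Utf8 arrays.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum SplitError {
    EmptyDataset,
    InvalidFraction,
}

/// Splits row indices into (train, test) so that each label contributes
/// roughly `test_fraction` of its rows to the test set.
fn stratified_split(
    labels: &[&str],
    test_fraction: f64,
) -> Result<(Vec<usize>, Vec<usize>), SplitError> {
    if labels.is_empty() {
        return Err(SplitError::EmptyDataset);
    }
    // Also rejects NaN, because NaN is not contained in any range.
    if !(0.0..=1.0).contains(&test_fraction) {
        return Err(SplitError::InvalidFraction);
    }

    // Group row indices by label; BTreeMap keeps the output order stable.
    let mut by_label: BTreeMap<&str, Vec<usize>> = BTreeMap::new();
    for (i, label) in labels.iter().enumerate() {
        by_label.entry(*label).or_default().push(i);
    }

    let mut train = Vec::new();
    let mut test = Vec::new();
    for indices in by_label.values() {
        let n_test = (indices.len() as f64 * test_fraction).round() as usize;
        let (test_part, train_part) = indices.split_at(n_test.min(indices.len()));
        test.extend_from_slice(test_part);
        train.extend_from_slice(train_part);
    }
    Ok((train, test))
}

fn main() {
    let labels = ["cat", "dog", "cat", "dog", "cat", "cat"];
    // Errors are returned to the caller and handled explicitly, never unwrapped.
    match stratified_split(&labels, 0.5) {
        Ok((train, test)) => println!("train={train:?} test={test:?}"),
        Err(e) => eprintln!("split failed: {e:?}"),
    }
}
```

Returning a typed error for the empty and out-of-range cases keeps the failure modes visible in the signature, which is the practical effect of banning `unwrap()` and `expect()` in library code.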
Citation
If you use alimentar in your research, please cite:
Contributing
Contributions are welcome! Please see the CONTRIBUTING.md guide for details.
Security
Please see our Security Policy for reporting vulnerabilities.
MSRV
Minimum Supported Rust Version: 1.75
License
MIT License - see LICENSE for details.