alimentar - Data Loading, Distribution and Tooling in Pure Rust
A sovereign-first data loading library for the paiml AI stack. Provides HuggingFace-compatible functionality without mandatory cloud dependency.
§Design Principles
- Sovereign-first - Local storage default, no mandatory cloud dependency
- Pure Rust - No Python, no FFI (WASM-compatible)
- Zero-copy - Arrow RecordBatch throughout
- Ecosystem aligned - Arrow 53, Parquet 53
§Quick Start
use alimentar::{ArrowDataset, DataLoader};
// Load a parquet file
let dataset = ArrowDataset::from_parquet("data/train.parquet").unwrap();
// Create a data loader
let loader = DataLoader::new(dataset).batch_size(32).shuffle(true);
// Iterate over batches
for batch in loader {
    println!("Batch with {} rows", batch.num_rows());
}
Re-exports§
pub use async_prefetch::AsyncPrefetchBuilder;
pub use async_prefetch::AsyncPrefetchDataset;
pub use async_prefetch::SyncPrefetchDataset;
pub use dataloader::DataLoader;
pub use dataset::ArrowDataset;
pub use dataset::CsvOptions;
pub use dataset::Dataset;
pub use dataset::JsonOptions;
pub use drift::ColumnDrift;
pub use drift::DriftDetector;
pub use drift::DriftReport;
pub use drift::DriftSeverity;
pub use drift::DriftTest;
pub use error::Error;
pub use error::Result;
pub use federated::FederatedSplitCoordinator;
pub use federated::FederatedSplitStrategy;
pub use federated::GlobalSplitReport;
pub use federated::NodeSplitInstruction;
pub use federated::NodeSplitManifest;
pub use federated::NodeSummary;
pub use federated::SplitQualityIssue;
pub use imbalance::ClassDistribution;
pub use imbalance::ImbalanceDetector;
pub use imbalance::ImbalanceMetrics;
pub use imbalance::ImbalanceRecommendation;
pub use imbalance::ImbalanceReport;
pub use imbalance::ImbalanceSeverity;
pub use mmap::MmapDataset;
pub use mmap::MmapDatasetBuilder;
pub use parallel::ParallelDataLoader;
pub use parallel::ParallelDataLoaderBuilder;
pub use quality::ColumnQuality;
pub use quality::QualityChecker;
pub use quality::QualityIssue;
pub use quality::QualityProfile;
pub use quality::QualityReport;
pub use sketch::Centroid;
pub use sketch::DDSketch;
pub use sketch::DataSketch;
pub use sketch::DistributedDriftDetector;
pub use sketch::SketchDriftResult;
pub use sketch::SketchType;
pub use sketch::TDigest;
pub use split::DatasetSplit;
pub use transform::Cast;
pub use transform::Chain;
pub use transform::Drop;
pub use transform::FillNull;
pub use transform::FillStrategy;
pub use transform::Filter;
pub use transform::Map;
pub use transform::NormMethod;
pub use transform::Normalize;
pub use transform::Rename;
pub use transform::Select;
pub use transform::Skip;
pub use transform::Sort;
pub use transform::SortOrder;
pub use transform::Take;
pub use transform::Transform;
pub use transform::Unique;
pub use transform::Sample;
pub use transform::Shuffle;
pub use tui::DatasetAdapter;
pub use tui::DatasetViewer;
pub use tui::RowDetailView;
pub use tui::SchemaInspector;
pub use tui::TuiError;
pub use tui::TuiResult;
pub use weighted::WeightedDataLoader;
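The Quick Start configures the re-exported `DataLoader` with `batch_size(32)` and `shuffle(true)`. As a rough illustration of what shuffled, batched iteration means in general, the std-only sketch below groups row indices into batches after an optional Fisher-Yates shuffle. It is not alimentar's implementation: `batch_indices`, `XorShift64`, and the `seed` parameter are hypothetical names introduced only for this example, and the modulo-based index draw carries a small bias that a real loader would likely avoid.

```rust
// Hypothetical, std-only illustration; not alimentar's implementation.

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Returns row indices grouped into batches of at most `batch_size`,
/// optionally shuffled with a Fisher-Yates pass first.
fn batch_indices(
    num_rows: usize,
    batch_size: usize,
    shuffle: bool,
    seed: u64,
) -> Vec<Vec<usize>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut indices: Vec<usize> = (0..num_rows).collect();
    if shuffle {
        // The xorshift state must be non-zero, so a zero seed is bumped to 1.
        let mut rng = XorShift64(seed.max(1));
        for i in (1..indices.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            indices.swap(i, j);
        }
    }
    // The last batch may be shorter than `batch_size`.
    indices.chunks(batch_size).map(|c| c.to_vec()).collect()
}

fn main() {
    for (i, batch) in batch_indices(100, 32, true, 42).iter().enumerate() {
        println!("Batch {} with {} rows", i, batch.len());
    }
}
```

With 100 rows and a batch size of 32, this yields three full batches and a final batch of 4 rows; whether alimentar keeps or drops such a partial batch is not stated on this page.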
Modules§
- async_prefetch
- Async prefetch for parallel I/O in streaming datasets.
- backend
- Storage backends for alimentar.
- cli
- CLI module for the alimentar command-line interface: data loading, distribution and tooling.
- dataloader
- DataLoader for batched iteration over datasets.
- dataset
- Dataset types for alimentar.
- datasets
- Canonical ML dataset loaders
- drift
- Data drift detection for ML pipelines
- error
- Error types for alimentar.
- federated
- Federated Split Coordination for Privacy-Preserving ML
- format
- Alimentar Dataset Format (.ald)
- imbalance
- Imbalanced dataset detection for ML pipelines
- mmap
- Memory-mapped dataset for efficient large file access.
- parallel
- Parallel data loading with multi-worker support.
- quality
- Data quality assessment for ML pipelines
- registry
- Dataset registry for sharing and discovery.
- serve
- WASM Serve Module - Browser-based data serving and sharing
- sketch
- Sketch-based statistics for distributed/federated drift detection
- split
- Dataset splitting utilities
- streaming
- Streaming dataset for lazy/chunked data loading.
- tensor
- Tensor conversion utilities for ML framework integration.
- transform
- Data transforms for alimentar.
- tui
- TUI dataset viewer module.
- weighted
- Weighted DataLoader for importance sampling.
Structs§
- RecordBatch
- A two-dimensional batch of column-oriented data with a defined schema.
- Schema
- Describes the meta-data of an ordered sequence of relative types.
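To make "column-oriented data with a defined schema" concrete, the toy sketch below models a batch as equal-length typed columns checked against an ordered list of field names and types. `ToyBatch`, `ToySchema`, and `Column` are hypothetical stand-ins written for this example; the real `RecordBatch` and `Schema` are Arrow types with far more data types and features.

```rust
// Hypothetical, std-only toy types; not the Arrow `RecordBatch` or `Schema`.

/// One column of values, all of the same type.
#[derive(Debug)]
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Int64(v) => v.len(),
            Column::Utf8(v) => v.len(),
        }
    }

    fn type_name(&self) -> &'static str {
        match self {
            Column::Int64(_) => "Int64",
            Column::Utf8(_) => "Utf8",
        }
    }
}

/// Ordered (name, type) pairs: the toy stand-in for a schema.
#[derive(Debug)]
struct ToySchema {
    fields: Vec<(String, &'static str)>,
}

/// Equal-length columns plus the schema that describes them.
#[derive(Debug)]
struct ToyBatch {
    schema: ToySchema,
    columns: Vec<Column>,
}

impl ToyBatch {
    /// Rejects batches whose columns disagree with the schema or with each other.
    fn try_new(schema: ToySchema, columns: Vec<Column>) -> Result<Self, String> {
        if schema.fields.len() != columns.len() {
            return Err("schema and column count differ".to_string());
        }
        for ((name, ty), col) in schema.fields.iter().zip(&columns) {
            if *ty != col.type_name() {
                return Err(format!(
                    "column `{}` expected {}, got {}",
                    name,
                    ty,
                    col.type_name()
                ));
            }
        }
        let rows = columns.first().map_or(0, Column::len);
        if columns.iter().any(|c| c.len() != rows) {
            return Err("columns have different lengths".to_string());
        }
        Ok(ToyBatch { schema, columns })
    }

    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, Column::len)
    }

    fn num_columns(&self) -> usize {
        self.columns.len()
    }
}

fn main() {
    let schema = ToySchema {
        fields: vec![("id".to_string(), "Int64"), ("label".to_string(), "Utf8")],
    };
    let columns = vec![
        Column::Int64(vec![1, 2, 3]),
        Column::Utf8(vec!["a".into(), "b".into(), "c".into()]),
    ];
    let batch = ToyBatch::try_new(schema, columns).expect("valid batch");
    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    println!("fields: {:?}", batch.schema.fields);
}
```

Storing each column contiguously is what lets a whole batch be handed between components without copying row by row.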