Crate alimentar


alimentar - Data Loading, Distribution and Tooling in Pure Rust

A sovereign-first data loading library for the paiml AI stack. Provides HuggingFace-compatible functionality without mandatory cloud dependency.

Design Principles

  1. Sovereign-first - Local storage default, no mandatory cloud dependency
  2. Pure Rust - No Python, no FFI (WASM-compatible)
  3. Zero-copy - Arrow RecordBatch throughout
  4. Ecosystem aligned - Arrow 53, Parquet 53

Quick Start

use alimentar::{ArrowDataset, DataLoader};

// Load a parquet file
let dataset = ArrowDataset::from_parquet("data/train.parquet").unwrap();

// Create a data loader
let loader = DataLoader::new(dataset).batch_size(32).shuffle(true);

// Iterate over batches
for batch in loader {
    println!("Batch with {} rows", batch.num_rows());
}
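
To make the effect of batch_size concrete, here is a minimal standard-library sketch of how a row count is partitioned into batches. It is not alimentar's implementation: `batch_ranges` is a hypothetical helper introduced only for illustration, and it assumes the final, shorter batch is kept rather than dropped, which this page does not state.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of alimentar): splits `num_rows` rows into
/// consecutive index ranges of at most `batch_size` rows each.
/// Assumption: the last, possibly shorter, batch is kept.
fn batch_ranges(num_rows: usize, batch_size: usize) -> Vec<Range<usize>> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    (0..num_rows)
        .step_by(batch_size)
        .map(|start| start..(start + batch_size).min(num_rows))
        .collect()
}

fn main() {
    // 100 rows with a batch size of 32: three full batches, then one of 4 rows.
    for range in batch_ranges(100, 32) {
        println!("Batch with {} rows", range.len());
    }
}
```

With shuffling enabled, a loader would typically permute the row indices before cutting them into ranges like these; the batch sizes themselves stay the same.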

Re-exports

pub use async_prefetch::AsyncPrefetchBuilder;
pub use async_prefetch::AsyncPrefetchDataset;
pub use async_prefetch::SyncPrefetchDataset;
pub use dataloader::DataLoader;
pub use dataset::ArrowDataset;
pub use dataset::CsvOptions;
pub use dataset::Dataset;
pub use dataset::JsonOptions;
pub use drift::ColumnDrift;
pub use drift::DriftDetector;
pub use drift::DriftReport;
pub use drift::DriftSeverity;
pub use drift::DriftTest;
pub use error::Error;
pub use error::Result;
pub use federated::FederatedSplitCoordinator;
pub use federated::FederatedSplitStrategy;
pub use federated::GlobalSplitReport;
pub use federated::NodeSplitInstruction;
pub use federated::NodeSplitManifest;
pub use federated::NodeSummary;
pub use federated::SplitQualityIssue;
pub use imbalance::ClassDistribution;
pub use imbalance::ImbalanceDetector;
pub use imbalance::ImbalanceMetrics;
pub use imbalance::ImbalanceRecommendation;
pub use imbalance::ImbalanceReport;
pub use imbalance::ImbalanceSeverity;
pub use mmap::MmapDataset;
pub use mmap::MmapDatasetBuilder;
pub use parallel::ParallelDataLoader;
pub use parallel::ParallelDataLoaderBuilder;
pub use quality::ColumnQuality;
pub use quality::QualityChecker;
pub use quality::QualityIssue;
pub use quality::QualityProfile;
pub use quality::QualityReport;
pub use sketch::Centroid;
pub use sketch::DDSketch;
pub use sketch::DataSketch;
pub use sketch::DistributedDriftDetector;
pub use sketch::SketchDriftResult;
pub use sketch::SketchType;
pub use sketch::TDigest;
pub use split::DatasetSplit;
pub use transform::Cast;
pub use transform::Chain;
pub use transform::Drop;
pub use transform::FillNull;
pub use transform::FillStrategy;
pub use transform::Filter;
pub use transform::Map;
pub use transform::NormMethod;
pub use transform::Normalize;
pub use transform::Rename;
pub use transform::Select;
pub use transform::Skip;
pub use transform::Sort;
pub use transform::SortOrder;
pub use transform::Take;
pub use transform::Transform;
pub use transform::Unique;
pub use transform::Sample;
pub use transform::Shuffle;
pub use tui::DatasetAdapter;
pub use tui::DatasetViewer;
pub use tui::RowDetailView;
pub use tui::SchemaInspector;
pub use tui::TuiError;
pub use tui::TuiResult;
pub use weighted::WeightedDataLoader;

Modules

async_prefetch
Async prefetch for parallel I/O in streaming datasets.
backend
Storage backends for alimentar.
cli
Command-line interface for alimentar: data loading, distribution and tooling.
dataloader
DataLoader for batched iteration over datasets.
dataset
Dataset types for alimentar.
datasets
Canonical ML dataset loaders
drift
Data drift detection for ML pipelines
error
Error types for alimentar.
federated
Federated Split Coordination for Privacy-Preserving ML
format
Alimentar Dataset Format (.ald)
imbalance
Imbalanced dataset detection for ML pipelines
mmap
Memory-mapped dataset for efficient large file access.
parallel
Parallel data loading with multi-worker support.
quality
Data quality assessment for ML pipelines
registry
Dataset registry for sharing and discovery.
serve
WASM Serve Module - Browser-based data serving and sharing
sketch
Sketch-based statistics for distributed/federated drift detection
split
Dataset splitting utilities
streaming
Streaming dataset for lazy/chunked data loading.
tensor
Tensor conversion utilities for ML framework integration.
transform
Data transforms for alimentar.
tui
TUI dataset viewer module.
weighted
Weighted DataLoader for importance sampling.

Structs

RecordBatch
A two-dimensional batch of column-oriented data with a defined schema.
Schema
Describes the meta-data of an ordered sequence of relative types.

Type Aliases

SchemaRef
A reference-counted reference to a Schema.