torsh-data
Data loading and preprocessing framework for ToRSh with a PyTorch-compatible API.
Overview
This crate provides comprehensive data handling utilities including:
- Datasets: Abstract interfaces and common implementations
- DataLoader: Efficient multi-threaded data loading with batching
- Samplers: Various sampling strategies for data iteration
- Transformations: Common data preprocessing operations
- Domain-specific support: Vision, audio, and tabular data
Note: This crate can leverage scirs2-datasets for additional dataset utilities and sample datasets.
Features
Core Features
- std (default): Standard library support
- image-support (default): Enable image loading and vision datasets
- mmap-support (default): Enable memory-mapped file support for large datasets
Data Format Support
- audio-support: Enable audio processing capabilities
- dataframe: Enable tabular data support with Polars integration
- arrow-support: Apache Arrow integration for efficient data interchange
- hdf5-support: HDF5 file format support for scientific datasets
- parquet-support: Apache Parquet columnar storage format
Advanced Features
- sparse: Sparse tensor support for efficient handling of sparse data
- serialize: Serialization support with serde
- async-support: Asynchronous data loading with tokio
- privacy: Privacy-preserving data loading with differential privacy
- federated: Federated learning support for distributed datasets
GPU Acceleration (Experimental)
- gpu-acceleration: Enable GPU-accelerated data preprocessing
- cuda: CUDA backend support (implemented)
- opencl: OpenCL backend support (placeholder for future)
- vulkan: Vulkan backend support (placeholder for future)
- metal: Metal backend support for Apple Silicon (placeholder for future)
- webgpu: WebGPU backend support for web and cross-platform (placeholder for future)
WebAssembly
wasm: WebAssembly support for browser-based data loading
Convenience
full: Enable all optional features
Usage
Basic Dataset and DataLoader
use *;
use *;
// Create a simple tensor dataset
let data = tensor!;
let labels = tensor!;
let dataset = new;
// Create a dataloader
let dataloader = builder
.batch_size
.shuffle
.num_workers
.build?;
// Iterate through batches
for batch in dataloader
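To make the batching step concrete, here is a minimal std-only sketch of what a data loader does with a dataset: split the sample indices into batches, then gather each batch. `PairDataset`, `batch_indices`, and `gather` are hypothetical names used for illustration only; they are not part of the torsh-data API.

```rust
/// Hypothetical in-memory dataset: one feature row and one label per sample.
struct PairDataset {
    features: Vec<Vec<f32>>,
    labels: Vec<u32>,
}

impl PairDataset {
    fn len(&self) -> usize {
        self.labels.len()
    }

    fn get(&self, index: usize) -> (&[f32], u32) {
        (&self.features[index], self.labels[index])
    }
}

/// Groups the sample indices `0..len` into batches of at most `batch_size`.
/// The last batch is shorter when `len` is not a multiple of `batch_size`.
fn batch_indices(len: usize, batch_size: usize) -> Vec<Vec<usize>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let indices: Vec<usize> = (0..len).collect();
    indices.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}

/// Gathers the samples of one batch into (features, labels) vectors.
fn gather(dataset: &PairDataset, batch: &[usize]) -> (Vec<Vec<f32>>, Vec<u32>) {
    let mut features = Vec::with_capacity(batch.len());
    let mut labels = Vec::with_capacity(batch.len());
    for &i in batch {
        let (x, y) = dataset.get(i);
        features.push(x.to_vec());
        labels.push(y);
    }
    (features, labels)
}
```

Shuffling would amount to permuting the index list before it is chunked; the dataset itself stays untouched.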
Custom Dataset Implementation
use Dataset;
use Tensor;
use std::path::PathBuf;
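The general shape of a custom dataset is a length plus indexed access to samples. The sketch below illustrates that shape with a file-list dataset. The `IndexedDataset` trait and `FileListDataset` type are assumptions made for this example; the crate's actual `Dataset` trait may use different method names, item types, and error handling.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for a dataset trait: a length plus indexed access.
trait IndexedDataset {
    type Item;
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> Option<Self::Item>;
}

/// Dataset of (file path, class label) pairs. Files are not read here;
/// a real implementation would decode the file inside `get`.
struct FileListDataset {
    samples: Vec<(PathBuf, u32)>,
}

impl FileListDataset {
    /// Builds the dataset from file names under `root`, one label per file.
    fn new(root: &str, files: &[(&str, u32)]) -> Self {
        let samples = files
            .iter()
            .map(|(name, label)| (PathBuf::from(root).join(name), *label))
            .collect();
        Self { samples }
    }
}

impl IndexedDataset for FileListDataset {
    type Item = (PathBuf, u32);

    fn len(&self) -> usize {
        self.samples.len()
    }

    /// Returns `None` for an out-of-range index instead of panicking.
    fn get(&self, index: usize) -> Option<Self::Item> {
        self.samples.get(index).cloned()
    }
}
```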
Samplers
// Sequential sampling
let sampler = new;
// Random sampling with replacement
let sampler = new
.with_replacement
.num_samples;
// Batch sampling
let batch_sampler = new;
// Use with DataLoader
let dataloader = builder
.batch_sampler
.build?;
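The three sampling strategies above differ only in how they produce index sequences. The following std-only sketch shows one way each could work. All names here are hypothetical, the `drop_last` flag is an assumption borrowed from common data-loader designs, and the small xorshift generator stands in for a proper random number generator so that the example needs no external crate.

```rust
/// Small xorshift64 generator, for illustration only.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The state must be non-zero, otherwise the generator stays at zero.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in `0..bound`; slightly biased by the modulo, fine for a sketch.
    fn below(&mut self, bound: usize) -> usize {
        assert!(bound > 0, "bound must be positive");
        (self.next_u64() % bound as u64) as usize
    }
}

/// Sequential sampling: indices in order.
fn sequential(len: usize) -> Vec<usize> {
    (0..len).collect()
}

/// Random sampling with replacement: `num_samples` independent draws.
fn random_with_replacement(len: usize, num_samples: usize, rng: &mut XorShift64) -> Vec<usize> {
    (0..num_samples).map(|_| rng.below(len)).collect()
}

/// Random sampling without replacement: a Fisher-Yates shuffle of `0..len`.
fn random_permutation(len: usize, rng: &mut XorShift64) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..len).collect();
    for i in (1..len).rev() {
        let j = rng.below(i + 1);
        indices.swap(i, j);
    }
    indices
}

/// Batch sampling: wraps any index sequence into batches of `batch_size`,
/// optionally dropping a final batch that is not full.
fn into_batches(indices: &[usize], batch_size: usize, drop_last: bool) -> Vec<Vec<usize>> {
    indices
        .chunks(batch_size)
        .filter(|chunk| !drop_last || chunk.len() == batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Keeping samplers as plain index producers means a batch sampler can wrap any of them without knowing how the indices were drawn.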
Data Transformations
use ;
// Create a transformation pipeline
let transform = new;
// Apply to data
let transformed = transform.apply?;
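A transformation pipeline is essentially function composition with error propagation. The sketch below shows that idea on plain `Vec<f32>` samples. `Pipeline`, `scale`, and `normalize` are hypothetical names chosen for this example and do not reflect the crate's transform types.

```rust
/// A transformation maps one sample to another and may fail.
type Transform = Box<dyn Fn(Vec<f32>) -> Result<Vec<f32>, String>>;

/// Applies its transformations in order, stopping at the first error.
struct Pipeline {
    steps: Vec<Transform>,
}

impl Pipeline {
    fn new(steps: Vec<Transform>) -> Self {
        Self { steps }
    }

    fn apply(&self, input: Vec<f32>) -> Result<Vec<f32>, String> {
        self.steps.iter().try_fold(input, |sample, step| step(sample))
    }
}

/// Scales every element by a constant factor.
fn scale(factor: f32) -> Transform {
    Box::new(move |x: Vec<f32>| -> Result<Vec<f32>, String> {
        Ok(x.into_iter().map(|v| v * factor).collect())
    })
}

/// Shifts and rescales every element: (v - mean) / std_dev.
fn normalize(mean: f32, std_dev: f32) -> Transform {
    Box::new(move |x: Vec<f32>| -> Result<Vec<f32>, String> {
        if std_dev == 0.0 {
            return Err("std_dev must be non-zero".to_string());
        }
        Ok(x.into_iter().map(|v| (v - mean) / std_dev).collect())
    })
}
```

With these definitions, `Pipeline::new(vec![scale(2.0), normalize(1.0, 2.0)])` first doubles each value and then normalizes the result.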
Collate Functions
use ;
// Default collate function (stacks tensors)
let dataloader = builder
.collate_fn
.build?;
// Custom collate for variable-length sequences
let pad_collate = new;
let dataloader = builder
.collate_fn
.build?;
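The difference between the two collate styles can be sketched without any tensor type. Stacking requires every sample to have the same length, while padding extends shorter sequences to the longest one in the batch. The function names, the flat `(data, shape)` return value, and the returned length vector are assumptions for illustration, not the crate's collate API.

```rust
/// Stacks equally sized samples into one flat buffer with shape
/// [batch, width]. Fails when a sample has a different length.
fn stack_collate(batch: &[Vec<f32>]) -> Result<(Vec<f32>, [usize; 2]), String> {
    let width = batch.first().map_or(0, |sample| sample.len());
    let mut data = Vec::with_capacity(batch.len() * width);
    for (i, sample) in batch.iter().enumerate() {
        if sample.len() != width {
            return Err(format!(
                "sample {i} has length {}, expected {width}",
                sample.len()
            ));
        }
        data.extend_from_slice(sample);
    }
    Ok((data, [batch.len(), width]))
}

/// Pads variable-length sequences with `pad_value` up to the longest
/// sequence and returns the padded rows plus the original lengths.
fn pad_collate(batch: &[Vec<i64>], pad_value: i64) -> (Vec<Vec<i64>>, Vec<usize>) {
    let lengths: Vec<usize> = batch.iter().map(|seq| seq.len()).collect();
    let max_len = lengths.iter().copied().max().unwrap_or(0);
    let padded = batch
        .iter()
        .map(|seq| {
            let mut row = seq.clone();
            row.resize(max_len, pad_value);
            row
        })
        .collect();
    (padded, lengths)
}
```

Returning the original lengths alongside the padded rows lets downstream code mask out the padding.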
Vision Datasets (with image-support feature)
use ;
// Load from directory structure
let dataset = new?
.with_transform;
// Built-in datasets
let mnist = MNISTnew?;
let cifar = CIFAR10new?;
Tabular Data (with dataframe feature)
use CSVDataset;
let dataset = new?
.with_target_column
.with_features
.with_dtype;
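Selecting a target column and a set of feature columns boils down to resolving column names against the header and slicing each row. The sketch below does this for simple numeric CSV text using only the standard library. `split_csv` is a hypothetical helper, it assumes comma-separated values without quoting, and it is not how the crate or Polars parses CSV.

```rust
/// Splits CSV text with a header row into a feature matrix and a target
/// column. Assumes every cell parses as `f32` and no field contains commas.
fn split_csv(
    text: &str,
    target: &str,
    features: &[&str],
) -> Result<(Vec<Vec<f32>>, Vec<f32>), String> {
    let mut lines = text.lines().filter(|line| !line.trim().is_empty());
    let header: Vec<&str> = lines
        .next()
        .ok_or("empty input")?
        .split(',')
        .map(str::trim)
        .collect();

    // Resolves a column name to its position in the header.
    let column_index = |name: &str| -> Result<usize, String> {
        header
            .iter()
            .position(|h| *h == name)
            .ok_or_else(|| format!("unknown column: {name}"))
    };
    let target_idx = column_index(target)?;
    let feature_idx = features
        .iter()
        .map(|&name| column_index(name))
        .collect::<Result<Vec<usize>, String>>()?;

    let mut xs: Vec<Vec<f32>> = Vec::new();
    let mut ys: Vec<f32> = Vec::new();
    for line in lines {
        let cells = line
            .split(',')
            .map(|c| c.trim().parse::<f32>().map_err(|e| format!("{c:?}: {e}")))
            .collect::<Result<Vec<f32>, String>>()?;
        if cells.len() != header.len() {
            return Err(format!(
                "expected {} cells, found {}",
                header.len(),
                cells.len()
            ));
        }
        xs.push(feature_idx.iter().map(|&i| cells[i]).collect());
        ys.push(cells[target_idx]);
    }
    Ok((xs, ys))
}
```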
Audio Support (with audio-support feature)
use ;
let dataset = new?
.with_sample_rate
.with_transform;
Multi-Processing and Performance
// Parallel data loading
let dataloader = builder
.batch_size
.num_workers // Parallel loading threads
.prefetch_factor // Prefetch batches
.persistent_workers // Keep workers alive
.pin_memory // Pin memory for faster GPU transfer
.build?;
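The interaction between worker threads and prefetching can be sketched with a bounded channel: workers load batches in parallel, and the channel capacity limits how many finished batches wait for the consumer. `load_parallel` is a hypothetical function written for this example, and it collects all batches eagerly for simplicity, whereas a real loader would hand them out one at a time. Pinned memory and persistent workers are outside the scope of this sketch.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Loads batches on `num_workers` threads and returns them in batch order.
/// `prefetch` bounds how many finished batches may wait in the channel.
fn load_parallel<F>(
    batches: Vec<Vec<usize>>,
    num_workers: usize,
    prefetch: usize,
    load: F,
) -> Vec<Vec<f32>>
where
    F: Fn(usize) -> f32 + Send + Clone + 'static,
{
    assert!(num_workers > 0, "num_workers must be positive");
    let num_batches = batches.len();
    let (tx, rx) = sync_channel::<(usize, Vec<f32>)>(prefetch);

    // Assign batches to workers round-robin, remembering each batch id.
    let mut assignments: Vec<Vec<(usize, Vec<usize>)>> = vec![Vec::new(); num_workers];
    for (id, batch) in batches.into_iter().enumerate() {
        assignments[id % num_workers].push((id, batch));
    }

    let handles: Vec<thread::JoinHandle<()>> = assignments
        .into_iter()
        .map(|work| {
            let tx = tx.clone();
            let load = load.clone();
            thread::spawn(move || {
                for (id, batch) in work {
                    let loaded: Vec<f32> = batch.into_iter().map(|i| load(i)).collect();
                    // Blocks while the channel already holds `prefetch` batches.
                    if tx.send((id, loaded)).is_err() {
                        break;
                    }
                }
            })
        })
        .collect();
    // Drop the original sender so the receiver stops once all workers finish.
    drop(tx);

    // Workers finish out of order, so place each batch by its id.
    let mut out: Vec<Vec<f32>> = vec![Vec::new(); num_batches];
    for (id, loaded) in rx {
        out[id] = loaded;
    }
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    out
}
```

A small `prefetch` value keeps memory use low at the cost of workers stalling more often; a larger value smooths over uneven load times.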
Integration with Polars
When the dataframe feature is enabled, torsh-data integrates with Polars for efficient tabular data handling:
use *;
use DataFrameDataset;
let df = from_path?
.has_header
.finish?;
let dataset = from_dataframe
.with_features
.with_target;
License
Licensed under the Apache License, Version 2.0. See LICENSE for details.