TenfloweRS Dataset
Data loading and preprocessing utilities for TenfloweRS, providing efficient dataset management, transformations, and data pipelines for machine learning workflows.
Stable (v0.1.1 -- 2026-04-24) | 504 tests passing | 0 clippy warnings
Overview
tenflowers-dataset implements:
- Dataset Abstractions: Flexible trait-based dataset interface
- Data Transformations: Preprocessing and augmentation pipelines
- Batch Processing: Efficient batching with automatic tensor stacking
- Data Loading: Support for CSV, Parquet, HDF5, TFRecord, images, and audio formats
- Parallel Processing: Multi-threaded data loading and preprocessing
- Memory Efficiency: Lazy loading, memory mapping, and caching strategies
- Distributed Streaming: Sharded streaming for large-scale training
- Multimodal Support: Unified transforms for multi-modal data
- Synthetic Data Generation: Built-in generators for images, text, time series, and modern ML datasets
- Reproducibility: Deterministic data pipelines with seed control
Features
- Flexible Dataset Trait: Define custom datasets for any data source
- Composable Transforms: Chain preprocessing operations with a pipeline API
- Automatic Batching: Convert individual samples to batched tensors
- Data Augmentation: Techniques for images, text, and audio, including noise injection
- Prefetching: Overlap data loading with model computation
- Distributed Support: Sharding for multi-GPU training
- Caching with Telemetry: LRU caching with hit-rate monitoring
- Vision Transforms: Resize, crop, flip, color jitter, and normalization
Usage
Basic Tensor Dataset
use tenflowers_dataset::{Dataset, TensorDataset};
use tenflowers_core::Tensor;

// Create dataset from tensors
// (module paths, shapes, and values are illustrative; check the crate docs for exact signatures)
let features = Tensor::from_vec(vec![0.0f32; 400], &[100, 4])?;
let labels = Tensor::from_vec(vec![0.0f32; 100], &[100])?;
let dataset = TensorDataset::new(features, labels);

// Access individual samples
let (x, y) = dataset.get(0)?;

// Create batched iterator
for batch in dataset.batch(32) {
    // consume the batched tensors here
}
Data Transformations
use tenflowers_dataset::transforms::{Compose, Normalize};

// Normalize features with mean and std
// (type paths and values are illustrative)
let normalize = Normalize::new(vec![0.5], vec![0.25]);

// Chain multiple transforms
let transform = Compose::new(vec![Box::new(normalize)]);

// Apply to dataset
let transformed = dataset.transform(transform);
Custom Dataset Implementation
use tenflowers_dataset::Dataset;
use std::path::PathBuf;
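The trait implementation itself is sketched below, assuming a Dataset trait with len and get methods and a boxed error type; the exact associated types may differ. TextFileDataset is a hypothetical example type, not part of the crate.

// Hypothetical custom dataset: stores file paths, loads samples lazily in get()
struct TextFileDataset {
    paths: Vec<PathBuf>,
}

impl Dataset for TextFileDataset {
    type Item = String;

    fn len(&self) -> usize {
        self.paths.len()
    }

    fn get(&self, index: usize) -> Result<Self::Item, Box<dyn std::error::Error>> {
        // Read the file contents only when the sample is requested
        Ok(std::fs::read_to_string(&self.paths[index])?)
    }
}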
Data Augmentation Pipeline
use tenflowers_dataset::transforms::AugmentationPipeline;

// Create augmentation pipeline for images
// (the pipeline type and parameter values are illustrative)
let augmentation = AugmentationPipeline::builder()
    .random_crop(224, 224)
    .random_horizontal_flip(0.5)
    .random_rotation(15.0)
    .color_jitter(0.4, 0.4, 0.4, 0.1)
    .normalize(vec![0.485, 0.456, 0.406], vec![0.229, 0.224, 0.225])
    .build()?;

// Apply to dataset during training
let train_dataset = dataset.transform(augmentation);
Parallel Data Loading
use tenflowers_dataset::DataLoader;

// Create parallel data loader (builder path and values are illustrative)
let loader = DataLoader::builder()
    .dataset(dataset)
    .batch_size(32)
    .num_workers(4)  // Parallel loading threads
    .prefetch(2)     // Prefetch 2 batches
    .shuffle(true)
    .drop_last(true) // Drop incomplete final batch
    .build()?;

// Iterate with automatic prefetching
for batch in loader {
    // train on the prefetched batch here
}
Architecture
Core Components
- Dataset Trait: Unified interface for all data sources (see the trait sketch after this list)
- Transform Trait: Composable data transformations
- DataLoader: Efficient batching and prefetching
- Samplers: Control iteration order and distribution
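The usage examples above imply roughly the following trait shapes; this is an inferred sketch for orientation, not the crate's definitive API.

use std::error::Error;

// Inferred shapes of the core abstractions (illustrative only)
pub trait Dataset {
    type Item;
    fn len(&self) -> usize;
    fn get(&self, index: usize) -> Result<Self::Item, Box<dyn Error>>;
}

pub trait Transform {
    type Input;
    type Output;
    fn apply(&self, input: Self::Input) -> Result<Self::Output, Box<dyn Error>>;
}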
Supported Data Formats
- In-Memory: Tensor datasets, array datasets
- Files: Images (PNG, JPEG), CSV, JSON, Parquet (see the CSV example after this list)
- Binary: TFRecord, MessagePack
- Text: Plain text, tokenized sequences
- Audio: WAV, MP3, FLAC with on-the-fly processing
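As an example of the file-format loaders, reading a CSV file might look roughly like this; CsvDataset and from_path are assumed names for illustration and would require the csv_format feature.

use tenflowers_dataset::CsvDataset; // hypothetical type name, for illustration only

// Load a CSV file into an in-memory dataset (path is illustrative)
let dataset = CsvDataset::from_path("data/train.csv")?;
println!("loaded {} rows", dataset.len());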
Performance Features
- Memory Mapping: For large datasets that do not fit in RAM
- Caching: LRU cache with telemetry for frequently accessed samples (see the sketch after this list)
- Prefetching: Overlap I/O with computation
- Parallel Loading: Multi-threaded data loading
- NUMA-aware: Optional NUMA memory placement
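As a rough illustration of the caching idea only (not this crate's API), a sample-level cache keyed by index can be sketched as below; the crate's own cache layers LRU eviction and hit-rate telemetry on top of this.

use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sample cache keyed by index: loads on miss, returns cached data on hit
struct SampleCache {
    entries: Mutex<HashMap<usize, Vec<f32>>>,
}

impl SampleCache {
    fn get_or_load(&self, index: usize, load: impl FnOnce(usize) -> Vec<f32>) -> Vec<f32> {
        let mut entries = self.entries.lock().unwrap();
        entries.entry(index).or_insert_with(|| load(index)).clone()
    }
}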
Feature Flags
- parallel: Parallel data loading via Rayon
- serialize: Serialization support for datasets and transforms
- images: Image loading and processing
- mmap: Memory-mapped dataset support
- parquet: Parquet file format support
- csv_format: CSV file format support
- tfrecord: TFRecord file format support
- audio: Audio file loading and processing
- download: Dataset download utilities
- gpu: GPU direct data loading
- numa: NUMA-aware memory placement
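All flags are optional. Enabling a subset in Cargo.toml might look like this; the feature selection is only an example, and the version matches the release noted above.

[dependencies]
tenflowers-dataset = { version = "0.1.1", features = ["parallel", "images", "csv_format"] }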
Integration with TenfloweRS
- Seamless Tensor Creation: Direct conversion to tenflowers-core tensors
- Device Placement: Automatic CPU/GPU placement
- Gradient Tape: Compatible with autograd for data-dependent gradients
- Model Training: Direct integration with neural network training loops
License
Licensed under Apache-2.0