§TenfloweRS Dataset
Efficient data loading, preprocessing, and augmentation for machine learning in TenfloweRS. This crate provides high-performance data pipelines with support for various formats, transformations, and distributed loading strategies.
§Features
- Multiple Data Formats: CSV, image folders, HDF5, Arrow/Parquet, JSON, and custom formats
- Efficient Data Loading: Multi-threaded prefetching, NUMA-aware scheduling, zero-copy operations
- Rich Transformations: SIMD-accelerated transforms, GPU preprocessing, composition
- Advanced Sampling: Stratified, importance, and distributed sampling strategies
- Data Quality: Built-in quality analysis, outlier detection, and drift monitoring
- Production Features: Checkpointing, versioning, reproducibility, and debugging tools
- Streaming Support: Large dataset streaming with predictive prefetching
§Quick Start
§Loading Data from CSV
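At its core, CSV loading is line splitting plus header handling. A minimal std-only sketch of that idea (illustrative only; the crate's reader additionally deals with quoting, typing, and chunked reads):

```rust
// Illustrative CSV row splitter using only std; not the crate's parser.
fn parse_csv_line(line: &str, delimiter: char) -> Vec<String> {
    line.split(delimiter).map(|field| field.trim().to_string()).collect()
}

fn parse_csv(text: &str, has_header: bool, delimiter: char) -> (Vec<String>, Vec<Vec<String>>) {
    let mut lines = text.lines().filter(|l| !l.is_empty());
    let header = if has_header {
        lines.next().map(|l| parse_csv_line(l, delimiter)).unwrap_or_default()
    } else {
        Vec::new()
    };
    let rows = lines.map(|l| parse_csv_line(l, delimiter)).collect();
    (header, rows)
}
```

The crate-level builder below exposes the same knobs (`has_header`, `delimiter`) declaratively.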
use tenflowers_dataset::{CsvDataset, CsvDatasetBuilder};
// Load CSV data
let dataset = CsvDatasetBuilder::new("data.csv")
    .has_header(true)
    .delimiter(b',')
    .build()?;
println!("Dataset has {} samples", dataset.len());
§Image Folder Dataset
use tenflowers_dataset::{ImageFolderDataset, ImageFolderDatasetBuilder};
// Load images from directory structure:
// train/
//   cat/
//     img1.jpg
//     img2.jpg
//   dog/
//     img3.jpg
let dataset = ImageFolderDatasetBuilder::new("train/")
    .image_size((224, 224))
    .build()?;
println!("Found {} images in {} classes", dataset.len(), dataset.num_classes());
§Data Loader with Batching
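Batching groups sample indices into fixed-size chunks, with an optional shuffle of the index order first. A self-contained sketch of the mechanism (the function name and the tiny LCG shuffle are illustrative, not the crate's API):

```rust
// Illustrative batching with an optional deterministic index shuffle.
fn make_batches(num_samples: usize, batch_size: usize, shuffle_seed: Option<u64>) -> Vec<Vec<usize>> {
    let mut indices: Vec<usize> = (0..num_samples).collect();
    if let Some(seed) = shuffle_seed {
        // Fisher-Yates driven by a tiny LCG; real loaders use proper RNGs.
        let mut state = seed;
        for i in (1..indices.len()).rev() {
            state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            let j = (state >> 33) as usize % (i + 1);
            indices.swap(i, j);
        }
    }
    indices.chunks(batch_size).map(|c| c.to_vec()).collect()
}
```

The last batch may be smaller than `batch_size`; loaders typically offer a drop-last option for that case.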
use tenflowers_dataset::{DataLoader, DataLoaderBuilder};
use tenflowers_dataset::{CsvDataset, RandomSampler};
// Create a data loader
let loader = DataLoaderBuilder::new(dataset)
    .batch_size(32)
    .shuffle(true)
    .num_workers(4)
    .prefetch(2)
    .build()?;
// Iterate through batches
for batch in loader.iter() {
    let (features, labels) = batch?;
    // Training step...
}
§Data Transformations
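Composition applies transforms in sequence via trait objects. A minimal self-contained sketch of the pattern (the `Step` trait and struct names are illustrative; the crate's `Transform` trait differs in detail):

```rust
// Minimal trait-object composition: each step maps a sample to a new sample.
trait Step {
    fn apply(&self, x: Vec<f32>) -> Vec<f32>;
}

struct Scale(f32);
impl Step for Scale {
    fn apply(&self, x: Vec<f32>) -> Vec<f32> {
        x.into_iter().map(|v| v * self.0).collect()
    }
}

struct Shift(f32);
impl Step for Shift {
    fn apply(&self, x: Vec<f32>) -> Vec<f32> {
        x.into_iter().map(|v| v + self.0).collect()
    }
}

struct ComposeSketch(Vec<Box<dyn Step>>);
impl ComposeSketch {
    fn apply(&self, mut x: Vec<f32>) -> Vec<f32> {
        for step in &self.0 {
            x = step.apply(x); // output of one step feeds the next
        }
        x
    }
}
```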
use tenflowers_dataset::transforms::{Compose, Normalize, RandomCrop, ToTensor};
// Compose multiple transformations
let transform = Compose::new(vec![
    Box::new(RandomCrop::new((224, 224))),
    Box::new(ToTensor),
    Box::new(Normalize::new(vec![0.485, 0.456, 0.406], vec![0.229, 0.224, 0.225])),
]);
§Advanced Features
§Distributed Data Loading
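The core idea behind a distributed sampler is index partitioning: worker `rank` of `world_size` workers takes every `world_size`-th index. One common round-robin scheme can be sketched in a few lines (illustrative; the crate's sampler also handles shuffling and padding to equal shard sizes):

```rust
// Round-robin shard assignment: rank r of W workers gets indices r, r+W, r+2W, ...
fn shard_indices(num_samples: usize, world_size: usize, rank: usize) -> Vec<usize> {
    (rank..num_samples).step_by(world_size).collect()
}
```

Every index lands on exactly one worker, so the shards partition the dataset.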
use tenflowers_dataset::{DataLoaderBuilder, DistributedSampler};
use tenflowers_dataset::CsvDataset;
// Split dataset across multiple workers
let sampler = DistributedSampler::new(dataset.len(), 4, 0); // 4 workers, rank 0
let loader = DataLoaderBuilder::new(dataset)
    .sampler(Box::new(sampler))
    .batch_size(32)
    .build()?;
§Data Quality Analysis
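One typical signal a quality analyzer aggregates is outlier detection. A self-contained z-score check (illustrative only; the crate's analyzer supports several methods via `OutlierDetectionMethod`):

```rust
// Flag indices whose z-score exceeds a threshold; returns flagged sample indices.
fn zscore_outliers(values: &[f64], threshold: f64) -> Vec<usize> {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    values
        .iter()
        .enumerate()
        .filter(|(_, v)| std > 0.0 && ((**v - mean) / std).abs() > threshold)
        .map(|(i, _)| i)
        .collect()
}
```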
use tenflowers_dataset::{DataQualityAnalyzer, QualityAnalysisConfig};
use tenflowers_dataset::CsvDataset;
// Analyze data quality
let analyzer = DataQualityAnalyzer::new(QualityAnalysisConfig::default());
let report = analyzer.analyze(&dataset)?;
println!("Data quality score: {:.2}", report.overall_score());
println!("Issues found: {}", report.num_issues());
§Caching and Prefetching
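Sample caches are typically LRU: on a hit the entry moves to the front, and when full the least recently used entry is evicted. A deliberately simple sketch (O(n) recency updates; the crate's `LruCache`/`ThreadSafeLruCache` use constant-time structures and thread safety):

```rust
use std::collections::VecDeque;

// Minimal LRU: recency tracked in a deque, front = most recently used.
struct LruSketch {
    capacity: usize,
    entries: VecDeque<(usize, Vec<f32>)>,
}

impl LruSketch {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::new() }
    }
    fn get(&mut self, key: usize) -> Option<Vec<f32>> {
        let pos = self.entries.iter().position(|(k, _)| *k == key)?;
        let entry = self.entries.remove(pos).unwrap();
        let value = entry.1.clone();
        self.entries.push_front(entry); // touching an entry refreshes its recency
        Some(value)
    }
    fn put(&mut self, key: usize, value: Vec<f32>) {
        if let Some(pos) = self.entries.iter().position(|(k, _)| *k == key) {
            self.entries.remove(pos);
        } else if self.entries.len() == self.capacity {
            self.entries.pop_back(); // evict least recently used
        }
        self.entries.push_front((key, value));
    }
}
```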
use tenflowers_dataset::{EnhancedDataLoaderBuilder, CsvDataset};
// Use enhanced data loader with smart caching
let loader = EnhancedDataLoaderBuilder::new(dataset)
    .batch_size(64)
    .num_workers(8)
    .enable_caching(true)
    .cache_size_mb(512)
    .adaptive_prefetch(true)
    .build()?;
§Custom Dataset
struct MyDataset<T> {
    data: Vec<Vec<T>>,
}
impl<T: Clone> MyDataset<T> {
    fn len(&self) -> usize {
        self.data.len()
    }
    fn get(&self, index: usize) -> Option<&Vec<T>> {
        self.data.get(index)
    }
}
§Architecture Overview
The crate is organized into the following modules:
- formats: Data format readers (CSV, image, HDF5, Arrow, Parquet)
- dataloader: Multi-threaded data loading with batching and sampling
- transforms: Data transformation and augmentation operations
- cache: Caching strategies for frequently accessed data
- distributed_loading: Distributed and sharded data loading
- data_quality: Data quality analysis and validation
- statistics: Dataset statistics computation
- visualization: Dataset visualization utilities
- reproducibility: Reproducibility and versioning support
- debug_tools: Profiling and debugging utilities
§Performance Optimization
§SIMD Transformations
Many transformations use SIMD instructions for maximum performance:
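As an illustration of what such a transform computes, here is a scalar per-channel normalization, `(x - mean) / std`, written as flat loops a compiler can auto-vectorize (this is not the crate's explicit-SIMD path):

```rust
// Normalize a CHW-layout buffer per channel; layout and names are illustrative.
fn normalize_chw(pixels: &mut [f32], channels: usize, mean: &[f32], std: &[f32]) {
    let per_channel = pixels.len() / channels;
    for c in 0..channels {
        let (m, s) = (mean[c], std[c]);
        for v in &mut pixels[c * per_channel..(c + 1) * per_channel] {
            *v = (*v - m) / s;
        }
    }
}
```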
use tenflowers_dataset::simd_transforms::{SimdNormalize, SimdResize};
// SIMD-accelerated normalization
let normalize = SimdNormalize::new(vec![0.5, 0.5, 0.5], vec![0.5, 0.5, 0.5]);
§GPU Preprocessing
use tenflowers_dataset::gpu_transforms::{GpuResize, GpuNormalize};
use tenflowers_core::Device;
// Run transformations on GPU
let device = Device::gpu(0)?;
let resize = GpuResize::new((224, 224), &device)?;
§Zero-Copy Operations
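The zero-copy idea is to hand out borrowed views into one backing buffer instead of allocating per sample; memory-mapped datasets extend the same idea to file-backed memory. A std-only sketch (names are illustrative):

```rust
// Samples are fixed-length slices into a single flat buffer; get() allocates nothing.
struct FlatDataset {
    buffer: Vec<f32>,
    sample_len: usize,
}

impl FlatDataset {
    fn get(&self, index: usize) -> Option<&[f32]> {
        let start = index.checked_mul(self.sample_len)?;
        self.buffer.get(start..start + self.sample_len)
    }
}
```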
use tenflowers_dataset::zero_copy::{ZeroCopyLoader, MmapDataset};
// Memory-mapped dataset for large files
let dataset = MmapDataset::new("large_dataset.bin")?;
§Integration with TenfloweRS Ecosystem
This crate integrates seamlessly with:
- tenflowers-core: Tensor operations and device management
- tenflowers-neural: Neural network training pipelines
- tenflowers-autograd: Gradient-based transformations
- scirs2-core: Scientific computing utilities
§Supported Data Formats
- CSV: Comma-separated values with customizable delimiters
- Images: JPEG, PNG, BMP, TIFF via image folder structure
- HDF5: Hierarchical data format for scientific data
- Arrow/Parquet: Columnar data formats for analytics
- JSON: Structured JSON data
- Custom: Extensible format registry for custom formats
§Debugging and Profiling
use tenflowers_dataset::{DatasetDebugger, PipelineProfiler};
use tenflowers_dataset::CsvDataset;
// Profile data loading pipeline
let profiler = PipelineProfiler::new();
profiler.start();
// ... load data ...
let report = profiler.generate_report();
println!("Bottlenecks: {:?}", report.bottlenecks());
§Pipeline Inspection
InspectablePipeline instruments each transform step to record per-step latency,
input/output shapes, and error rates:
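The instrumentation pattern itself is simple: wrap each step, time it with a monotonic clock, and accumulate per-step counters. A self-contained sketch (illustrative; the crate's `InspectablePipeline` additionally records shapes and error rates):

```rust
use std::time::Instant;

// Wraps a step, accumulating elapsed time and call count.
struct TimedStep {
    name: String,
    total_micros: u128,
    calls: u64,
}

impl TimedStep {
    fn run(&mut self, input: Vec<f32>, f: impl Fn(Vec<f32>) -> Vec<f32>) -> Vec<f32> {
        let start = Instant::now();
        let output = f(input);
        self.total_micros += start.elapsed().as_micros();
        self.calls += 1;
        output
    }
    fn avg_micros(&self) -> u128 {
        if self.calls == 0 { 0 } else { self.total_micros / self.calls as u128 }
    }
}
```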
use tenflowers_dataset::InspectablePipeline;
let mut pipeline = InspectablePipeline::new();
// pipeline.add_step("norm", Box::new(my_transform));
// let report = pipeline.run_inspection_batch(&dataset, 100);
// println!("avg latency: {} μs", report.avg_latency_per_step_micros());
§Data Drift Metrics
Three statistical drift measures are available as free functions:
- population_stability_index: PSI over equal-width bins; < 0.1 stable, > 0.2 significant.
- ks_two_sample: Kolmogorov-Smirnov max |ECDF_a − ECDF_b|, range [0, 1].
- jensen_shannon_divergence: Symmetric KL-based divergence, range [0, 1].
- compute_drift: Convenience wrapper returning a DriftReport with all three.
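To make the KS definition concrete, a self-contained implementation of max |ECDF_a − ECDF_b| over the pooled sample (illustrative; the crate exposes this as `ks_two_sample`):

```rust
// KS two-sample statistic: largest gap between the two empirical CDFs.
fn ks_statistic(a: &[f64], b: &[f64]) -> f64 {
    let ecdf = |sorted: &[f64], x: f64| {
        sorted.iter().filter(|v| **v <= x).count() as f64 / sorted.len() as f64
    };
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by(|x, y| x.partial_cmp(y).unwrap());
    b.sort_by(|x, y| x.partial_cmp(y).unwrap());
    a.iter()
        .chain(b.iter())
        .map(|&x| (ecdf(&a, x) - ecdf(&b, x)).abs())
        .fold(0.0, f64::max)
}
```

Identical samples give 0, fully disjoint samples give 1, matching the stated [0, 1] range.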
use tenflowers_dataset::compute_drift;
let reference: Vec<f64> = (0..100).map(|i| i as f64).collect();
let current: Vec<f64> = (0..100).map(|i| i as f64 + 50.0).collect();
let report = compute_drift(&reference, &current).expect("drift computation failed");
println!("PSI: {:.4}, KS: {:.4}, significant: {}", report.psi, report.ks_statistic, report.is_significant_drift);
§Adaptive Prefetch PID Controller
PidAdaptiveController adjusts prefetch depth based on cache-hit rate telemetry
using a classic PID algorithm with anti-windup integral clamping:
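A minimal self-contained PID tick with anti-windup clamping of the integral term; the fields mirror the (kp, ki, kd, setpoint, depth, min, max) shape of the constructor shown below, but this is an illustrative sketch, not the crate's implementation:

```rust
// One PID step: error = setpoint - measurement, integral clamped to limit windup.
struct PidSketch {
    kp: f64, ki: f64, kd: f64,
    setpoint: f64,
    integral: f64,
    prev_error: f64,
    depth: f64,
    min: f64, max: f64,
}

impl PidSketch {
    fn tick(&mut self, hit_rate: f64) -> usize {
        let error = self.setpoint - hit_rate; // below setpoint => positive error
        self.integral = (self.integral + error).clamp(-10.0, 10.0); // anti-windup
        let derivative = error - self.prev_error;
        self.prev_error = error;
        self.depth += self.kp * error + self.ki * self.integral + self.kd * derivative;
        self.depth = self.depth.clamp(self.min, self.max);
        self.depth.round() as usize
    }
}
```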
use tenflowers_dataset::PidAdaptiveController;
let mut ctrl = PidAdaptiveController::new(0.5, 0.05, 0.01, 0.80, 4, 1, 32);
let new_depth = ctrl.tick(0.65); // below setpoint → depth increases
println!("Recommended prefetch depth: {}", new_depth);
§Schema Validation
SchemaValidator::validate_full returns a SchemaValidationReport with structured
FieldDiff entries for every field:
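The underlying comparison produces one diff per field: missing, unexpected, type mismatch, or OK. A self-contained sketch over string-typed fields (illustrative; the crate's `FieldDiff` carries richer variants such as type widening under the lenient policy):

```rust
use std::collections::BTreeMap;

// Per-field schema diff between an actual and an expected schema.
#[derive(Debug, PartialEq)]
enum DiffSketch {
    Missing,
    Unexpected,
    TypeMismatch { expected: String, actual: String },
    Ok,
}

fn diff_schemas(
    actual: &BTreeMap<String, String>,
    expected: &BTreeMap<String, String>,
) -> BTreeMap<String, DiffSketch> {
    let mut out = BTreeMap::new();
    for (name, ty) in expected {
        out.insert(
            name.clone(),
            match actual.get(name) {
                None => DiffSketch::Missing,
                Some(a) if a == ty => DiffSketch::Ok,
                Some(a) => DiffSketch::TypeMismatch { expected: ty.clone(), actual: a.clone() },
            },
        );
    }
    for name in actual.keys() {
        if !expected.contains_key(name) {
            out.insert(name.clone(), DiffSketch::Unexpected);
        }
    }
    out
}
```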
use tenflowers_dataset::{SchemaValidator, FieldDiff};
let validator = SchemaValidator::lenient(); // widening allowed
// let report = validator.validate_full(&actual_metadata, &expected_fields);
// for (name, diff) in &report.diffs { println!("{}: {:?}", name, diff); }
§Quick Start (Runnable Doctest)
The following example builds a tiny in-memory dataset and verifies all samples can be retrieved — no file I/O or external dependencies required:
use tenflowers_dataset::{Dataset, TensorDataset};
use tenflowers_core::Tensor;
let features = Tensor::<f32>::from_vec(
    vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    &[4, 2],
).expect("tensor creation failed");
let labels = Tensor::<f32>::from_vec(vec![0.0, 1.0, 0.0, 1.0], &[4])
    .expect("tensor creation failed");
let dataset = TensorDataset::new(features, labels);
assert_eq!(dataset.len(), 4);
for i in 0..dataset.len() {
    let (feat, lbl) = dataset.get(i).expect("get should succeed");
    assert_eq!(feat.shape().dims()[0], 2);
    assert_eq!(lbl.shape().size(), 1);
}
Re-exports§
pub use data_quality::{compute_drift, jensen_shannon_divergence, ks_two_sample, population_stability_index, DataQualityAnalyzer, DataQualityExt, DataQualityIssue, DataQualityMetrics, DriftDetectionConfig, DriftDetectionResult, DriftReport, DriftType, IssueCategory, IssueSeverity, OutlierDetectionMethod, QualityAnalysisConfig, StatisticalTest};
pub use dataloader::{BatchResult, BucketCollate, CollateFn, DataLoader, DataLoaderBuilder, DataLoaderConfig, DefaultCollate, DistributedSampler, ImportanceSampler, PaddingCollate, PaddingStrategy, RandomSampler, Sampler, SequentialSampler, StratifiedSampler};
pub use debug_tools::{Bottleneck, BottleneckCategory, ConsistencyReport, DatasetDebugger, EventType, InspectablePipeline, InspectionEvent, PipelineInspectionReport, PipelineProfiler, ProfileEvent, ProfileReport, ProfilerConfig, SampleInfo as DebugSampleInfo, Severity, StageStatistics, StageTimer};
pub use enhanced_dataloader::{EnhancedDataLoader, EnhancedDataLoaderBuilder, LoaderStats, WorkerStats};
pub use error_taxonomy::{classification, helpers as error_helpers, DatasetErrorBuilder, DatasetErrorCategory, DatasetErrorContext};
pub use formats::common::{MissingValueStrategy, NamingPattern};
pub use formats::csv::{ChunkedCsvDataset, CsvChunk, CsvDataset, CsvDatasetBuilder};
pub use formats::image::{image_folder_dataset_with_transform, ImageFolderConfig, ImageFolderDataset, ImageFolderDatasetBuilder};
pub use formats::registry::{global as format_registry, register_format_factory, FormatInfo, GlobalFormatRegistry};
pub use formats::schema_validator::{FieldDiff, SchemaValidator, ValidationPolicy, ValidationReport as SchemaValidationReport};
pub use transforms::{AddNoise, BackgroundNoise, DatasetExt, GaussianNoise, GlobalNormalize, MinMaxScale, NoiseType, Normalize, PerChannelNormalize, RealTimeAudioAugmentation, RobustScaler, Transform, TransformedDataset};
pub use formats::json::{JsonConfig, JsonDataset, JsonDatasetBuilder, JsonDatasetInfo, JsonLDataset};
pub use formats::text::{LabelStrategy, TextConfig, TextDataset, TextDatasetBuilder, TextDatasetInfo, TokenizationStrategy, TokenizedDataset, Vocabulary};
pub use active_learning::{ActiveLearningDataset, ActiveLearningSampler, DiversityStrategy, LabeledSubset, UncertaintyStrategy, UnlabeledSubset};
pub use adaptive_prefetch::{AdaptationStrategy, AdaptivePrefetchPolicy, AdaptivePrefetchTuner, PidAdaptiveController, PrefetchMetrics as AdaptivePrefetchMetrics, TuningDecision};
pub use advanced_benchmarks::{AdvancedBenchmarkSuite, BenchmarkConfig, BenchmarkResult, CpuStats, GpuStats, MemoryStats, MemoryTracker as BenchmarkMemoryTracker, SystemInfo, ThroughputStats, TimingStats};
pub use advanced_sampling::{AdvancedImportanceSampler, BalancingStrategy, ClassBalancedSampler, CurriculumScheduler, CurriculumStrategy, HardNegativeMiner, MiningStrategy};
pub use attention_optimized::{AttentionOptimizedConfig, AttentionOptimizedDataset, AttentionOptimizedDatasetBuilder, AttentionPattern, AttentionSequence, SequenceMetadata as AttentionSequenceMetadata};
pub use benchmarks::{BenchmarkDatasets, CifarDataset, DatasetInfo, IrisDataset, MnistDataset};
pub use cache::{AggregatedStats, AlertSeverity, AlertThresholds, AlertType, CacheEvent, CacheEventType, CacheExt, CacheStats, CacheTelemetryCollector, CacheTelemetryMetrics, CachedDataset, EnhancedTelemetryCollector, LruCache, MetricsSnapshot, PerformanceAlert, PerformanceBaselines, TelemetryConfig, ThreadSafeLruCache, WarmingStrategy, PersistentCache, PersistentlyCachedDataset, TensorPersistentCache};
pub use distributed_loading::{create_distributed_dataloader, CollectiveOpType, CommunicationManager, DistributedLoadingConfig, DistributedLoadingStats, DistributedMessage, EnhancedDistributedSampler, NodeInfo};
pub use distributed_sharding::{DatasetShardingExt, ShardConfig, ShardStatistics, ShardStrategy, ShardableDataset, ShardedDataset};
pub use distributed_streaming::{CheckpointState, PartitionStrategy, StreamCoordinator, StreamingConfig, StreamingShardIterator, StreamingShardLoader, StreamingStats, WorkerHealth, WorkerMetrics, WorkerStatus};
pub use federated::{AggregationStrategy, ClientConfig, ClientId, ClientIndexedDataset, ClientStats, DataDistribution, FederatedAggregator, FederatedClientDataset, FederatedDatasetExt, FederatedFeatureStats, FederatedPartitioner, NoiseMechanism, PartitioningStrategy, PrivacyConfig, PrivacyManager, PrivateStats, QualityMetrics};
pub use formats::arrow::{ArrowArrayExt, ArrowConfig, ArrowDataset, ArrowDatasetBuilder, ArrowFormatFactory, ArrowFormatReader, ArrowTensorView};
pub use formats::audio::{AudioConfig, AudioDataset, AudioDatasetBuilder, AudioDatasetInfo, AudioInfo, AudioLabelStrategy, FeatureType as AudioFeatureType};
pub use formats::parquet::{ParquetConfig, ParquetDataset, ParquetDatasetBuilder, ParquetDatasetInfo};
pub use formats::tfrecord::{Feature, FeatureInfo, FeatureType, TFRecord, TFRecordConfig, TFRecordDataset, TFRecordDatasetBuilder, TFRecordDatasetInfo};
pub use formats::webdataset::{StreamingWebDataset, WebDataset, WebDatasetBuilder, WebDatasetConfig, WebDatasetSample};
pub use formats::zarr::{ZarrArrayInfo, ZarrCompressionType, ZarrConfig, ZarrDataset, ZarrDatasetBuilder, ZarrDatasetExt};
pub use gpu_transforms::{GpuColorJitter, GpuContext, GpuGaussianBlur, GpuGaussianNoise, GpuRandomCrop, GpuRandomHorizontalFlip, GpuResize, GpuRotation};
pub use memory_pool::{GlobalMemoryPool, MemoryPool, MemoryPoolExt, PoolStats, PooledMemory};
pub use multimodal::{FusionStrategy, Modality, MultimodalConfig, MultimodalDataset, MultimodalDatasetBuilder, MultimodalSample, MultimodalTransform, MultimodalTransformedDataset};
pub use numa_scheduler::{NumaAssignmentStats, NumaAssignmentStrategy, NumaConfig, NumaNode, NumaScheduler, NumaTopology, NumaWorkerAssignment};
pub use online_learning::{ADWINDetector, DriftDetectionMethod, DriftDetector, ErrorRateDetector, KSDetector, OnlineLearningConfig, OnlineLearningDataset, OnlineStats, PageHinkleyDetector};
pub use predictive_prefetch::{AccessPattern, AccessStats, PredictivePrefetchDataset, PredictivePrefetcher, PrefetchConfig};
pub use real_datasets::{AgNewsConfig, Cifar10Config, ImageNetConfig, ImdbConfig, MnistConfig, RealAgNewsBuilder, RealAgNewsDataset, RealCifar10Builder, RealCifar10Dataset, RealImageNetBuilder, RealImageNetDataset, RealImdbBuilder, RealImdbDataset, RealMnistBuilder, RealMnistDataset};
pub use reproducibility::{DatasetConfig, DeterministicDataset, DeterministicOps, DeterministicOrdering, EnvironmentInfo, ExperimentConfig, ExperimentTracker, OperationRecord, OrderingStrategy, ReproducibilityExt, SamplingConfig, SeedInfo, SeedManager, TransformConfig};
pub use schema_inference::{FieldStatistics, InferenceConfig, InferredDataType, InferredField, InferredSchema, SchemaInferenceEngine};
pub use simd_transforms::{BenchmarkResult as SimdBenchmarkResult, SimdBenchmark, SimdColorConvert, SimdConvolution, SimdElementWise, SimdHistogram, SimdHistogramTransform, SimdMatrixOps, SimdNormalize, SimdOperation, SimdStats};
pub use smart_cache::{AccessPatternPredictor, CacheConfig, CacheLevel, EvictionPolicy, PredictiveSmartCache, SmartCache, SmartCachedDataset};
pub use statistics::{AdvancedStatistics, AdvancedStatisticsExt, CorrelationAnalyzer, DatasetStatisticsComputer, DatasetStatisticsExt, DatasetStats, Histogram, MultivariateStatistics, PCAResult, StatisticsConfig};
pub use stream_prefetch_optimizer::{AccessEvent, AccessPatternAnalyzer, AccessType, PatternPrediction, PatternSignature, PrefetchMetrics, PrefetchOptimizerConfig, StreamPrefetchOptimizer};
pub use streaming_optimized::{AdaptiveBuffer, CompressionType, StreamingOptimizedConfig, StreamingOptimizedDataset, StreamingOptimizedDatasetBuilder, StreamingOptimizedIterator, StreamingStats as OptimizedStreamingStats};
pub use synthetic::{ContrastiveLearningDataset, DatasetGenerator, Episode, FewShotDataset, GeometricShape, GradientDirection, ImagePatternConfig, ImagePatternGenerator, ImagePatternType, MetaLearningDataset, ModernMLConfig, NoiseDistribution, SelfSupervisedDataset, StripeOrientation, SyntheticConfig, SyntheticDataset, SyntheticTextCorpus, TaskDataset, TextCorpusConfig, TextSynthesisTask, TimeSeriesPattern};
pub use throughput_benchmark::{MemoryStats as ThroughputMemoryStats, ThreadStats as ThroughputThreadStats, ThroughputBenchmarkConfig, ThroughputBenchmarkHarness, ThroughputBenchmarkResult};
pub use validation::{DataValidator, DatasetValidationExt, RangeConstraint, SchemaInfo, ValidationConfig, ValidationResult};
pub use versioning::{DatasetLineage, DatasetSizeInfo, DatasetVersionManager, LineageTree, TransformationRecord, VersionId, VersionMetadata, VersionedDataset};
pub use visualization::{ClassDistribution, DatasetVisualizationExt, DatasetVisualizer, DistributionInfo, FeatureHistogram, FeatureStats, SampleInfo, SamplePreview};
pub use work_stealing::WorkStealingQueue;
pub use zero_copy::{MemoryMappedDataset, TensorView, ZeroCopyDataset, MemoryMappedFileDataset, MemoryMappedFileStats};
pub use dataset_core::{BatchedDataset, ConcatDataset, Dataset, DatasetSplit, DatasetSplitter, DatasetUtilsExt, FilteredDataset, MergeStrategy, MergedDataset, SubsetDataset, TensorDataset};
Modules§
- active_learning - Active Learning module for intelligent data selection in machine learning pipelines
- adaptive_prefetch - Adaptive prefetch auto-tuning policy
- advanced_benchmarks - Advanced benchmarking system for comprehensive performance analysis
- advanced_sampling - Advanced sampling strategies for training optimization
- attention_optimized - Attention-optimized datasets for transformer architectures
- benchmarks
- cache - Caching utilities for datasets
- config - Configuration management system for TenfloweRS Dataset
- data_quality - Data quality metrics and drift detection
- dataloader - DataLoader Module
- dataset_core - Core dataset traits, types, and utility implementations.
- debug_tools - Debug and profiling tools for data pipeline analysis
- distributed_loading - Enhanced distributed loading with multi-node support, RDMA optimization, and collective operations
- distributed_sharding - Deterministic shard loader for distributed training
- distributed_streaming - Distributed streaming loaders for large-scale data processing
- enhanced_dataloader - Enhanced DataLoader with work stealing queue for improved multi-threaded performance
- error_taxonomy - Error taxonomy and standardized error handling for dataset operations
- federated - Federated learning dataset utilities for privacy-preserving distributed ML
- formats - File format support for datasets
- gpu_transforms - GPU-accelerated image transforms using WGPU compute shaders - Modular Architecture
- memory_pool - Memory pooling utilities for dataset operations
- multimodal - Multimodal dataset support for modern LLM training
- numa_scheduler - NUMA-aware scheduling for multi-threaded data loading
- online_learning - Online Learning module for concept drift detection and real-time processing
- predictive_prefetch - Predictive prefetching with access pattern learning
- real_datasets - Real dataset loaders with automatic downloading
- reproducibility - Reproducibility utilities for deterministic dataset operations
- schema_inference - Unified schema inference and auto-detection across data formats
- simd_transforms - SIMD-accelerated transforms for high-performance data processing
- smart_cache - Smart caching system with adaptive policies and multi-tier caching
- statistics - Dataset statistics computation module
- stream_prefetch_optimizer - Advanced streaming data prefetching optimization system
- streaming_optimized - Optimized streaming datasets for large-scale training
- synthetic - Synthetic Dataset Generation - Modular Architecture
- throughput_benchmark - Throughput benchmark performance harness for datasets
- transforms - Data transformation utilities for datasets
- validation
- versioning - Dataset versioning with snapshots and lineage tracking
- visualization - Dataset visualization module
- work_stealing - Work-stealing queue for efficient multi-threaded data loading
- zero_copy - Zero-copy operations for memory-efficient dataset loading
Macros§
- compose - Macro for easy pipeline composition
- conditional - Macro for conditional transforms
- profile_transform - Macro for easy transform profiling
- random_choice - Macro for random choice
Enums§
- TensorError - Enhanced error handling with contextual information and recovery strategies