§TenfloweRS Core
The core tensor operations and device management library for the TenfloweRS machine learning framework. This crate provides the foundation for building, training, and deploying deep learning models in pure Rust, with an emphasis on safety, performance, and cross-platform GPU acceleration.
§Features
- Tensor Operations: Comprehensive n-dimensional array operations with automatic broadcasting
- Device Management: Unified CPU/GPU abstraction with automatic memory management
- Performance: SIMD vectorization, parallel execution, and GPU compute kernels
- Cross-Platform GPU: WGPU-based GPU support (Metal, Vulkan, DirectX, WebGPU)
- Advanced Optimizations: Mixed precision, quantization, kernel fusion, memory pooling
- Production Features: Checkpointing, serialization, deterministic execution, profiling
- SciRS2 Integration: Built on the robust SciRS2 scientific computing ecosystem
§Quick Start
§Basic Tensor Creation and Operations
use tenflowers_core::{Tensor, Device};
// Create tensors
let a = Tensor::<f32>::zeros(&[2, 3]);
let b = Tensor::<f32>::ones(&[2, 3]);
// Arithmetic operations
let c = tenflowers_core::ops::add(&a, &b)?;
let d = tenflowers_core::ops::mul(&a, &b)?;
// Matrix multiplication
let x = Tensor::<f32>::ones(&[2, 3]);
let y = Tensor::<f32>::ones(&[3, 4]);
let z = tenflowers_core::ops::matmul(&x, &y)?;
§GPU Acceleration
use tenflowers_core::{Tensor, Device};
// Create tensor on GPU
let device = Device::gpu(0)?;
let gpu_tensor = Tensor::<f32>::zeros(&[1000, 1000]).to_device(&device)?;
// Operations automatically run on GPU
let result = tenflowers_core::ops::matmul(&gpu_tensor, &gpu_tensor)?;
§Advanced Features
§Mixed Precision Training
use tenflowers_core::{Tensor, f16, MixedPrecisionConfig};
// Use f16 for faster training with less memory
let fp16_tensor = Tensor::<f16>::ones(&[1024, 1024]);
let result = tenflowers_core::ops::matmul(&fp16_tensor, &fp16_tensor)?;
§Quantization
use tenflowers_core::{Tensor, quantize, QuantizationParams};
let tensor = Tensor::<f32>::ones(&[100, 100]);
// Quantize to 8-bit for inference
let quantized = quantize(&tensor, 8)?;
§Deterministic Execution
use tenflowers_core::{set_deterministic_mode, set_global_seed};
// Enable deterministic mode for reproducible results
set_deterministic_mode(true);
set_global_seed(42);
§Architecture Overview
The crate is organized into the following modules:
- tensor: Core tensor type with device placement and memory management
- ops: Tensor operations (arithmetic, linear algebra, neural network primitives)
- device: Device abstraction (CPU, GPU, custom accelerators)
- dtype: Data type system (f32, f64, f16, bf16, i32, etc.)
- shape: Shape inference and validation
- memory: Memory management, pooling, and optimization
- graph: Computation graph construction and optimization
- session: Graph execution engine
- quantization: Model quantization for deployment
- mixed_precision: Mixed precision training utilities
- checkpointing: Model checkpointing and restoration
- deterministic: Deterministic execution controls
- monitoring: Performance monitoring and profiling
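To make the shape-inference rules concrete, here is a minimal, self-contained sketch of NumPy-style broadcasting of the kind the shape and ops modules apply. This is a simplified stand-in written for illustration, not the crate's actual implementation:

```rust
/// Compute the broadcast result shape of two n-dimensional shapes, or None
/// if they are incompatible. Dimensions are aligned from the trailing end;
/// a dimension of size 1 broadcasts against any size.
fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = vec![0; rank];
    for i in 0..rank {
        // Missing leading dimensions are treated as size 1.
        let da = if i < rank - a.len() { 1 } else { a[i - (rank - a.len())] };
        let db = if i < rank - b.len() { 1 } else { b[i - (rank - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible dimension pair
        };
    }
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shapes(&[2, 3], &[1, 3]), Some(vec![2, 3]));
    assert_eq!(broadcast_shapes(&[4, 1], &[3]), Some(vec![4, 3]));
    assert_eq!(broadcast_shapes(&[2, 3], &[4]), None);
    println!("broadcast rules ok");
}
```

The same trailing-dimension alignment governs which element-wise operations (such as ops::add above) accept mismatched shapes.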
§Performance Features
§SIMD Optimization
The crate automatically uses SIMD instructions when available for maximum performance:
use tenflowers_core::{Tensor, SimdCapabilities};
// Check available SIMD features
let capabilities = SimdCapabilities::detect();
println!("SIMD support: {:?}", capabilities);
// Operations automatically use SIMD when beneficial
let a = Tensor::<f32>::ones(&[10000]);
let b = Tensor::<f32>::ones(&[10000]);
let c = tenflowers_core::ops::add(&a, &b)?;
§Memory Optimization
use tenflowers_core::{Tensor, Device};
use tenflowers_core::memory::{BufferPool, GlobalBufferPool};
// Use buffer pooling for efficient memory reuse
let pool = GlobalBufferPool::get();
pool.set_max_pool_size(1024 * 1024 * 1024); // 1GB
// Tensors automatically use the pool
let tensor = Tensor::<f32>::zeros(&[1000, 1000]);
§Integration with TenfloweRS Ecosystem
This crate integrates seamlessly with:
- tenflowers-autograd: Automatic differentiation engine
- tenflowers-neural: High-level neural network layers
- tenflowers-dataset: Data loading and preprocessing
- scirs2-core: Scientific computing primitives
- scirs2-autograd: Static graph optimization
§GPU Support
TenfloweRS Core uses WGPU for cross-platform GPU acceleration, supporting:
- Metal (macOS, iOS)
- Vulkan (Windows, Linux, Android)
- DirectX 12 (Windows)
- WebGPU (browsers)
Enable GPU support with the gpu feature flag:
[dependencies]
tenflowers-core = { version = "0.1.0", features = ["gpu"] }
§Safety and Correctness
TenfloweRS Core is designed with safety as a primary concern:
- Memory-safe by default (no unsafe code in core tensor operations)
- Extensive shape validation and error handling
- Gradient checking utilities for numerical correctness
- Deterministic execution modes for reproducibility
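To illustrate the style of shape validation and error handling described above, here is a self-contained sketch using a hypothetical stand-in error type; the crate's real TensorError and validation helpers (such as validate_matmul_shapes) will differ in detail:

```rust
use std::fmt;

/// Hypothetical stand-in for the crate's shape-mismatch error.
#[derive(Debug, PartialEq)]
struct ShapeMismatch {
    lhs: Vec<usize>,
    rhs: Vec<usize>,
}

impl fmt::Display for ShapeMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "matmul shape mismatch: {:?} x {:?}", self.lhs, self.rhs)
    }
}

/// Validate that two 2-D shapes are matmul-compatible and return the
/// result shape, mirroring the kind of check performed before a kernel
/// is dispatched: (m, k) x (k, n) -> (m, n).
fn validate_matmul(lhs: &[usize; 2], rhs: &[usize; 2]) -> Result<[usize; 2], ShapeMismatch> {
    if lhs[1] != rhs[0] {
        return Err(ShapeMismatch { lhs: lhs.to_vec(), rhs: rhs.to_vec() });
    }
    Ok([lhs[0], rhs[1]])
}

fn main() {
    assert_eq!(validate_matmul(&[2, 3], &[3, 4]), Ok([2, 4]));
    // Errors carry both shapes so diagnostics can report exactly what failed.
    let err = validate_matmul(&[2, 3], &[5, 4]).unwrap_err();
    println!("{err}");
}
```

Validating shapes up front, rather than inside kernels, is what lets errors surface as recoverable Result values instead of panics.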
§Performance Benchmarking
Use the built-in benchmarking utilities to measure performance:
use tenflowers_core::{Tensor, Device};
use tenflowers_core::profiling::Profiler;
let profiler = Profiler::new();
profiler.start("matmul");
let a = Tensor::<f32>::ones(&[1000, 1000]);
let b = Tensor::<f32>::ones(&[1000, 1000]);
let c = tenflowers_core::ops::matmul(&a, &b)?;
profiler.stop("matmul");
profiler.print_summary();
§Re-exports
pub use complex::{Complex32, Complex64};
pub use device::Device;
pub use dtype::{dtype_from_type, DType};
pub use error::{Result, TensorError};
pub use fallback::{cleanup_memory_and_retry, execute_binary_op_with_fallback, execute_unary_op_with_fallback, get_fallback_config, is_auto_fallback_enabled, set_auto_fallback_enabled, set_fallback_config, FallbackConfig, FallbackWrapper};
pub use half_precision::{HalfPrecision, MixedPrecisionConfig as HalfMixedPrecisionConfig};
pub use integration::{BaselinePerformance, OptimizationBreakdown, PerformanceTargets, UltraPerformanceValidator, ValidationReport, ValidationResult, ValidationTestSuite};
pub use layout::{convert_layout, infer_layout, DataLayout, LayoutOptimizer, OperationType};
pub use quantization::{dequantize, dynamic_quantize, fake_quantize, per_channel_quantize, quantize, QuantizationParams};
pub use shape::Shape;
pub use shape_error_taxonomy::{validate_broadcast_shapes, validate_elementwise_shapes, validate_matmul_shapes, validate_reduction_axis, validate_reshape, ShapeErrorBuilder, ShapeErrorCategory, ShapeErrorUtils};
pub use simd::{global_simd_engine, AdvancedKernelRegistry, CacheFriendlyMatMul, CacheOptimizedTensorOps, ConvolutionParams, CpuFeatures, ElementWiseOp, KernelOptimizationStrategy, MemoryAccessPattern, ReductionOp as SimdReductionOp, SimdEngineConfig, SpecializedKernel, UltraSimdEngine};
pub use tensor::Tensor;
pub use adaptive_tuning::{execute_with_adaptive_tuning, AdaptiveTuner, ExecutionStrategy, OperationMetrics, PerformancePredictor, GLOBAL_TUNER};
pub use collective::{all_gather, all_reduce, broadcast, create_process_group, init_collective, CollectiveManager, CollectiveOp, CommunicationGroup, ReductionOp};
pub use context::{get_context, set_context, Context};
pub use cross_platform_optimization::{get_global_optimizer, get_optimal_configuration, initialize_cross_platform_optimizer, CrossPlatformOptimizer, OptimalConfiguration, TargetArchitecture, TargetPlatform};
pub use deterministic::{clear_operation_log, get_global_seed, get_operation_log, get_operation_seed, get_state_snapshot, is_deterministic_mode, is_strict_mode, mark_non_deterministic, reset_operation_counter, restore_state_snapshot, set_deterministic_mode, set_global_seed, set_strict_mode, should_use_deterministic_gpu_ops, DeterministicConfig, DeterministicScope, DeterministicSnapshot, DeterministicState};
pub use dispatch_init::ensure_initialized as ensure_dispatch_initialized;
pub use dispatch_registry::{get_registry, BackendType, BinaryKernelFn, DispatchBenchmarkResult, DispatchRegistry, KernelImplementation, OperationDescriptor, UnaryKernelFn, F32_REGISTRY, F64_REGISTRY, I32_REGISTRY};
pub use eager_execution::{CacheStatistics, EagerExecutionConfig, EagerExecutionEngine, EagerPerformanceReport, ExecutionMetrics, EAGER_ENGINE};
pub use gpu_memory_metrics::{generate_memory_report, get_gpu_memory_snapshot, get_gpu_memory_usage, get_gpu_peak_memory, print_memory_report, reset_gpu_memory_metrics, GpuMemoryMetrics, GpuMemoryReport, GpuMemorySnapshot, GPU_MEMORY_METRICS};
pub use gradient_clipping::{GradientClipper, GradientClippingConfig, GradientStatistics, NormType};
pub use graph::{AttributeValue, AttributeValueDef, EdgeId, Graph, GraphDef, GraphEdge, GraphNode, NodeDef, NodeId, NodeType};
pub use large_model_optimization::{LargeModelConfig, LargeModelOptimizationReport, LargeModelOptimizer, MemoryOptimizationStats, ModelExecutionPlan, LARGE_MODEL_OPTIMIZER};
pub use memory::{global_monitor, global_monitor_arc, IntegratedDiagnosticReport, KernelOccupancyStats, MemoryAliasDetector, MemoryPool, MemoryPoolStats, MultiStreamMemoryManager, OperationTimer, OptimizationResult, PerformanceMonitor, PoolHealthMetrics, PoolHealthStatus, PoolOptimizationConfig, StridedView};
pub use memory_tensorflow_comparison::{MemoryComparisonReport, MemoryOptimizationSuggestion, MemoryProfilingConfig, MemorySnapshot, TensorFlowMemoryProfiler, MEMORY_PROFILER};
pub use mixed_precision::{disable_autocast, enable_autocast, enable_autocast_bfloat16, from_bfloat16_f32, from_bfloat16_f64, from_half, from_half_f32, from_half_f64, to_bfloat16_f32, to_bfloat16_f64, to_half, to_half_f32, to_half_f64, AutocastContext, GradientScaler, MixedPrecisionConfig, MixedPrecisionState};
pub use monitoring::{AlertSeverity, BottleneckType, MonitoringConfig as UltraMonitoringConfig, MonitoringReport, OperationMetrics as MonitoringOperationMetrics, OptimizationOpportunity, PerformanceAlert, PerformanceDashboard, PerformancePrediction, PerformancePredictor as MonitoringPerformancePredictor, PerformanceSnapshot, SystemBottleneck, SystemMetrics, TrendDirection, TrendType, UltraPerformanceMonitor};
pub use neural_optimization::{LayerPerformanceMetrics, NetworkPerformanceReport, OptimizationBreakdown as NeuralOptimizationBreakdown, UltraOptimizedActivations, UltraOptimizedDenseLayer, UltraOptimizedNeuralNetwork};
pub use onnx_interop::{OnnxConfig, OnnxExporter, OnnxImporter, OnnxModel};
pub use ops::{execute_fused_graph, get_fusion_stats, infer_binary_elementwise, infer_binary_elementwise_validated, infer_concat, infer_conv2d, infer_matmul, infer_reduction, infer_reshape, print_framework_comparison_results, print_fusion_report, record_fusion_opportunity, reset_fusion_stats, run_framework_comparison_benchmark, BroadcastableConstraint, ElementwiseOpType, ExactShapeConstraint, FrameworkBenchmarkConfig, FrameworkComparisonResult, FusionGraph, FusionNode, FusionPassBuilder, FusionStats, MatMulCompatibleConstraint, MinRankConstraint, RankConstraint, ShapeConstraint, ShapeContext, ShapeValidator};
pub use performance_gates::{get_baseline, list_baselines, register_baseline, OperationBaseline, PerformanceGate, PerformanceGateSuite, PerformanceMeasurement};
pub use production_benchmarks::{run_comprehensive_production_benchmarks, BenchmarkConfig, BenchmarkResult, BenchmarkSummary as ProductionBenchmarkSummary, OptimizationBreakdown as ProductionOptimizationBreakdown, ProblemSize, ProductionBenchmarkReport, ProductionBenchmarkSuite, QualityMetrics};
pub use production_performance_monitoring::{get_global_monitor, initialize_performance_monitoring, record_performance_event, AlertThresholds, MonitoringConfig, PerformanceEvent, PerformanceMetrics, ProductionPerformanceMonitor};
pub use session::{create_session, DefaultSession, FeedDict, FetchSpec, Session, SessionConfig};
pub use simplified_benchmarks::{run_simple_benchmarks, validate_optimizations, BenchmarkReport, BenchmarkSummary, SimpleBenchmarkConfig, SimpleBenchmarkResult, SimpleBenchmarkSuite};
pub use strided::{SliceParams, StridedLayout};
pub use structured_arrays::{FieldDescriptor, FieldValue, StructuredArray};
pub use system_health::{run_quick_health_check, run_system_health_check, FeaturesInfo, GpuMemoryInfo, HealthCheckConfig, HealthStatus, MemoryInfo, PerformanceBenchmarks, SystemHealthChecker, SystemInfo};
pub use tensor_view::{MemoryStats, TensorView, TensorViewOps};
pub use wasm::{utils as wasm_utils, WasmContext};
§Modules
- adaptive_tuning - Adaptive Performance Tuning System
- buffer
- checkpointing - Activation Checkpointing for Memory-Efficient Training
- collective
- complex - Complex number types and operations for TenfloweRS
- context
- cross_platform_optimization
- deployment - Model deployment optimization for TenfloweRS
- deterministic
- device
- dispatch_init
- dispatch_registry
- dispatch_registry_examples
- dispatch_registry_extended
- dtype
- eager_execution - Ultra-Performance Eager Execution Optimization Module
- error
- fallback - Automatic fallback mechanisms for operation recovery
- gpu_memory_metrics
- gpu_stub
- gradient_clipping - Advanced Gradient Clipping with Adaptive Scaling
- gradient_coverage_audit
- gradient_validation_framework
- graph - Computation Graph Module
- half_precision - Half precision floating point support
- integration - Integration Module for Ultra-Performance Validation
- large_model_optimization - Large Model Optimization Module
- layout
- memory - Memory management infrastructure for TenfloweRS
- memory_tensorflow_comparison - Memory Usage Profiling and TensorFlow Comparison Module
- mixed_precision
- monitoring - Ultra-Advanced Production Performance Monitoring
- neural_optimization - Ultra-Optimized Neural Network Layer Integration
- numerical_gradient - Numerical Gradient Validation Utilities
- onnx_interop - ONNX import/export functionality for model interoperability
- ops
- performance_benchmarks - Comprehensive Performance Benchmarking Suite for TenfloweRS Optimizations
- performance_gates
- production_benchmarks - Comprehensive Production Benchmarks
- production_performance_monitoring
- quantization - Quantization operations for TenfloweRS
- session
- shape
- shape_error_taxonomy
- shape_inference_helpers - Shape Inference Helpers and Enhanced Diagnostics
- simd - Ultra-High-Performance SIMD Optimizations powered by SciRS2-Core
- simplified_benchmarks
- strided
- structured_arrays
- system_health
- tensor - Tensor Module - Modular Architecture
- tensor_view
- ultra_performance_profiler - 🚀 Ultra-Performance Profiler Integration
- wasm - WebAssembly platform support for TenfloweRS
- wasm_optimization - WebAssembly optimization module for edge deployment
§Macros
- eager_execute - Convenience macro for eager operation execution
- exact_shape - Macro for creating exact shape constraints
- measure_performance - Performance measurement macro
- min_rank - Macro for creating minimum rank constraints
- rank - Macro for creating rank constraints
- register_binary_kernel - Macro to register a binary kernel
- register_kernel_with_backend - Macro to simplify kernel registration with feature gates
- register_operation - Macro to simplify operation registration
- register_unary_kernel - Macro to register a unary kernel
- time_operation - Macro for easily timing operations
- validate_shapes - Compile-time shape validation macro
- with_device - Macro for device scope