Crate tenflowers_core

§TenfloweRS Core

The core tensor operations and device management library for the TenfloweRS machine learning framework. This crate provides the foundational building blocks for constructing, training, and deploying deep learning models in pure Rust, with an emphasis on safety, performance, and cross-platform GPU acceleration.

§Features

  • Tensor Operations: Comprehensive n-dimensional array operations with automatic broadcasting
  • Device Management: Unified CPU/GPU abstraction with automatic memory management
  • Performance: SIMD vectorization, parallel execution, and GPU compute kernels
  • Cross-Platform GPU: WGPU-based GPU support (Metal, Vulkan, DirectX, WebGPU)
  • Advanced Optimizations: Mixed precision, quantization, kernel fusion, memory pooling
  • Production Features: Checkpointing, serialization, deterministic execution, profiling
  • SciRS2 Integration: Built on the robust SciRS2 scientific computing ecosystem
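The automatic broadcasting mentioned above follows the usual NumPy-style rule: shapes are aligned at their trailing dimensions, compared pairwise, and a dimension of 1 stretches to match its partner. A minimal, framework-independent sketch of that rule (an illustration, not the crate's implementation):

```rust
// Sketch of NumPy-style broadcast shape inference.
// Illustrative only; not the crate's actual implementation.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Shapes are right-aligned; missing leading dims behave like 1.
        let da = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let db = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible dimensions
        };
    }
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shape(&[2, 3], &[3]), Some(vec![2, 3]));
    assert_eq!(broadcast_shape(&[4, 1], &[1, 5]), Some(vec![4, 5]));
    assert_eq!(broadcast_shape(&[2, 3], &[4]), None);
    println!("ok");
}
```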

§Quick Start

§Basic Tensor Creation and Operations

use tenflowers_core::Tensor;

// Create tensors
let a = Tensor::<f32>::zeros(&[2, 3]);
let b = Tensor::<f32>::ones(&[2, 3]);

// Arithmetic operations
let c = tenflowers_core::ops::add(&a, &b)?;
let d = tenflowers_core::ops::mul(&a, &b)?;

// Matrix multiplication
let x = Tensor::<f32>::ones(&[2, 3]);
let y = Tensor::<f32>::ones(&[3, 4]);
let z = tenflowers_core::ops::matmul(&x, &y)?;

§GPU Acceleration

use tenflowers_core::{Tensor, Device};

// Create tensor on GPU
let device = Device::gpu(0)?;
let gpu_tensor = Tensor::<f32>::zeros(&[1000, 1000]).to_device(&device)?;

// Operations automatically run on GPU
let result = tenflowers_core::ops::matmul(&gpu_tensor, &gpu_tensor)?;

§Advanced Features

§Mixed Precision Training

use tenflowers_core::{Tensor, f16};

// Use f16 for faster training with less memory
let fp16_tensor = Tensor::<f16>::ones(&[1024, 1024]);
let result = tenflowers_core::ops::matmul(&fp16_tensor, &fp16_tensor)?;
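Half precision comes in two flavors here: IEEE binary16 (f16) and bfloat16 (bf16). For intuition, bfloat16 is simply an f32 with the low 16 mantissa bits dropped, which is why conversion is cheap. A bit-level sketch of that relationship (the crate's real conversions are the `to_bfloat16_f32`/`from_bfloat16_f32` re-exports; this truncating version omits round-to-nearest for clarity):

```rust
// bfloat16 keeps the top 16 bits of an IEEE 754 f32
// (sign, 8-bit exponent, 7-bit mantissa).
// Illustrative only; the crate's `mixed_precision` conversions also round.
fn f32_to_bf16_bits(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16 // plain truncation
}

fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

fn main() {
    let x = 3.140625f32; // exactly representable in bf16's 7-bit mantissa
    let b = f32_to_bf16_bits(x);
    assert_eq!(bf16_bits_to_f32(b), x); // exact round-trip for this value
    println!("ok");
}
```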

§Quantization

use tenflowers_core::{Tensor, quantize};

let tensor = Tensor::<f32>::ones(&[100, 100]);

// Quantize to 8-bit for inference
let quantized = quantize(&tensor, 8)?;
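Under the hood, affine 8-bit quantization maps floats to integers via a scale and a zero point. A self-contained sketch of the arithmetic (illustrative of the general scheme; the crate's `quantize`/`dequantize` may differ in rounding and clamping details):

```rust
// Affine (asymmetric) 8-bit quantization: q = round(x / scale) + zero_point.
// A sketch of the general scheme, not the crate's exact implementation.
fn quant_params(min: f32, max: f32) -> (f32, i32) {
    let scale = (max - min) / 255.0;
    let zero_point = (-min / scale).round() as i32;
    (scale, zero_point)
}

fn quantize_one(x: f32, scale: f32, zp: i32) -> u8 {
    ((x / scale).round() as i32 + zp).clamp(0, 255) as u8
}

fn dequantize_one(q: u8, scale: f32, zp: i32) -> f32 {
    (q as i32 - zp) as f32 * scale
}

fn main() {
    let (scale, zp) = quant_params(-1.0, 1.0);
    let q = quantize_one(0.5, scale, zp);
    let x = dequantize_one(q, scale, zp);
    // Round-trip error is bounded by the quantization step.
    assert!((x - 0.5).abs() < scale);
    println!("ok");
}
```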

§Deterministic Execution

use tenflowers_core::{set_deterministic_mode, set_global_seed};

// Enable deterministic mode for reproducible results
set_deterministic_mode(true);
set_global_seed(42);
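Reproducibility ultimately comes down to seeding: the same seed must drive the same pseudo-random sequence on every run. A tiny standalone demonstration of that property, using a SplitMix64 generator (not the crate's RNG):

```rust
// Two generators seeded identically produce identical sequences —
// the property a global seed relies on. SplitMix64 is used here purely
// for illustration; it is not the crate's RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let mut a = SplitMix64(42);
    let mut b = SplitMix64(42);
    let run1: Vec<u64> = (0..4).map(|_| a.next()).collect();
    let run2: Vec<u64> = (0..4).map(|_| b.next()).collect();
    assert_eq!(run1, run2); // identical sequences from the same seed
    println!("ok");
}
```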

§Architecture Overview

The crate is organized into the following modules:

  • tensor: Core tensor type with device placement and memory management
  • ops: Tensor operations (arithmetic, linear algebra, neural network primitives)
  • device: Device abstraction (CPU, GPU, custom accelerators)
  • dtype: Data type system (f32, f64, f16, bf16, i32, etc.)
  • shape: Shape inference and validation
  • memory: Memory management, pooling, and optimization
  • graph: Computation graph construction and optimization
  • session: Graph execution engine
  • quantization: Model quantization for deployment
  • mixed_precision: Mixed precision training utilities
  • checkpointing: Model checkpointing and restoration
  • deterministic: Deterministic execution controls
  • monitoring: Performance monitoring and profiling

§Performance Features

§SIMD Optimization

The crate automatically uses SIMD instructions when they are available:

use tenflowers_core::{Tensor, SimdCapabilities};

// Check available SIMD features
let capabilities = SimdCapabilities::detect();
println!("SIMD support: {:?}", capabilities);

// Operations automatically use SIMD when beneficial
let a = Tensor::<f32>::ones(&[10000]);
let b = Tensor::<f32>::ones(&[10000]);
let c = tenflowers_core::ops::add(&a, &b)?;

§Memory Optimization

use tenflowers_core::Tensor;
use tenflowers_core::memory::GlobalBufferPool;

// Use buffer pooling for efficient memory reuse
let pool = GlobalBufferPool::get();
pool.set_max_pool_size(1024 * 1024 * 1024); // 1GB

// Tensors automatically use the pool
let tensor = Tensor::<f32>::zeros(&[1000, 1000]);
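The idea behind buffer pooling is to hand previously freed allocations back out instead of returning them to the allocator. A minimal standalone sketch of that pattern (hypothetical; the crate's `memory::BufferPool` is more sophisticated, with sizing policies and health metrics):

```rust
// Minimal buffer-pool sketch: reuse allocations rather than freeing them.
// Hypothetical illustration of the idea behind `memory::BufferPool`.
struct BufferPool {
    free: Vec<Vec<f32>>,
    max_buffers: usize,
}

impl BufferPool {
    fn new(max_buffers: usize) -> Self {
        Self { free: Vec::new(), max_buffers }
    }

    /// Take a zeroed buffer of `len` elements, reusing a pooled one if possible.
    fn acquire(&mut self, len: usize) -> Vec<f32> {
        if let Some(pos) = self.free.iter().position(|b| b.capacity() >= len) {
            let mut buf = self.free.swap_remove(pos);
            buf.clear();
            buf.resize(len, 0.0); // no reallocation: capacity is sufficient
            buf
        } else {
            vec![0.0; len]
        }
    }

    /// Return a buffer to the pool for later reuse.
    fn release(&mut self, buf: Vec<f32>) {
        if self.free.len() < self.max_buffers {
            self.free.push(buf);
        } // else drop it: the pool is full
    }
}

fn main() {
    let mut pool = BufferPool::new(4);
    let a = pool.acquire(1024);
    let ptr = a.as_ptr();
    pool.release(a);
    let b = pool.acquire(512); // reuses the 1024-element allocation
    assert_eq!(b.as_ptr(), ptr);
    println!("ok");
}
```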

§Integration with TenfloweRS Ecosystem

This crate integrates seamlessly with:

  • tenflowers-autograd: Automatic differentiation engine
  • tenflowers-neural: High-level neural network layers
  • tenflowers-dataset: Data loading and preprocessing
  • scirs2-core: Scientific computing primitives
  • scirs2-autograd: Static graph optimization

§GPU Support

TenfloweRS Core uses WGPU for cross-platform GPU acceleration, supporting:

  • Metal (macOS, iOS)
  • Vulkan (Windows, Linux, Android)
  • DirectX 12 (Windows)
  • WebGPU (browsers)

Enable GPU support with the gpu feature flag:

[dependencies]
tenflowers-core = { version = "0.1.0", features = ["gpu"] }
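Even with the `gpu` feature compiled in, a binary can land on a machine with no usable GPU; the crate's `fallback` utilities (e.g. `set_auto_fallback_enabled`) exist for that case. A hypothetical sketch of the fallback pattern itself, with a stubbed-out probe in place of a real device query:

```rust
// Graceful CPU fallback when no GPU backend is available.
// Hypothetical sketch; `try_gpu` stands in for a real device probe.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Backend {
    Gpu(usize),
    Cpu,
}

fn try_gpu(index: usize) -> Result<Backend, String> {
    // Stand-in probe: pretend no GPU is present on this machine.
    Err(format!("no GPU backend available for device {index}"))
}

fn select_backend() -> Backend {
    try_gpu(0).unwrap_or(Backend::Cpu)
}

fn main() {
    assert_eq!(select_backend(), Backend::Cpu);
    println!("ok");
}
```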

§Safety and Correctness

TenfloweRS Core is designed with safety as a primary concern:

  • Memory-safe by default (no unsafe code in core tensor operations)
  • Extensive shape validation and error handling
  • Gradient checking utilities for numerical correctness
  • Deterministic execution modes for reproducibility
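Shape validation in practice means returning a descriptive error instead of panicking. A standalone sketch of a matmul shape check in that style (illustrative; the crate exposes `validate_matmul_shapes` and `TensorError` for the real thing):

```rust
// A matmul shape check that reports errors via Result, in the spirit of
// the crate's shape validation. Illustrative only.
fn matmul_shape(a: &[usize], b: &[usize]) -> Result<Vec<usize>, String> {
    match (a, b) {
        ([m, k1], [k2, n]) if k1 == k2 => Ok(vec![*m, *n]),
        ([_, k1], [k2, _]) => Err(format!(
            "matmul inner dimensions mismatch: {k1} vs {k2}"
        )),
        _ => Err(format!(
            "matmul expects rank-2 operands, got ranks {} and {}",
            a.len(),
            b.len()
        )),
    }
}

fn main() {
    assert_eq!(matmul_shape(&[2, 3], &[3, 4]), Ok(vec![2, 4]));
    assert!(matmul_shape(&[2, 3], &[4, 5]).is_err());
    println!("ok");
}
```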

§Performance Benchmarking

Use the built-in benchmarking utilities to measure performance:

use tenflowers_core::Tensor;
use tenflowers_core::profiling::Profiler;

let profiler = Profiler::new();
profiler.start("matmul");

let a = Tensor::<f32>::ones(&[1000, 1000]);
let b = Tensor::<f32>::ones(&[1000, 1000]);
let c = tenflowers_core::ops::matmul(&a, &b)?;

profiler.stop("matmul");
profiler.print_summary();
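The profiler above wraps the same start/stop pattern you can reproduce directly with `std::time::Instant`; a minimal standalone version for quick, ad-hoc measurements:

```rust
use std::time::Instant;

// Minimal timing helper: the start/stop pattern the profiler wraps,
// reduced to a single closure-based function.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

fn main() {
    let sum = time_it("dot", || {
        let v = vec![1.0f64; 1_000_000];
        v.iter().map(|x| x * x).sum::<f64>()
    });
    assert_eq!(sum, 1_000_000.0);
}
```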

Re-exports§

pub use complex::Complex32;
pub use complex::Complex64;
pub use device::Device;
pub use dtype::dtype_from_type;
pub use dtype::DType;
pub use error::Result;
pub use error::TensorError;
pub use fallback::cleanup_memory_and_retry;
pub use fallback::execute_binary_op_with_fallback;
pub use fallback::execute_unary_op_with_fallback;
pub use fallback::get_fallback_config;
pub use fallback::is_auto_fallback_enabled;
pub use fallback::set_auto_fallback_enabled;
pub use fallback::set_fallback_config;
pub use fallback::FallbackConfig;
pub use fallback::FallbackWrapper;
pub use half_precision::HalfPrecision;
pub use half_precision::MixedPrecisionConfig as HalfMixedPrecisionConfig;
pub use integration::BaselinePerformance;
pub use integration::OptimizationBreakdown;
pub use integration::PerformanceTargets;
pub use integration::UltraPerformanceValidator;
pub use integration::ValidationReport;
pub use integration::ValidationResult;
pub use integration::ValidationTestSuite;
pub use layout::convert_layout;
pub use layout::infer_layout;
pub use layout::DataLayout;
pub use layout::LayoutOptimizer;
pub use layout::OperationType;
pub use quantization::dequantize;
pub use quantization::dynamic_quantize;
pub use quantization::fake_quantize;
pub use quantization::per_channel_quantize;
pub use quantization::quantize;
pub use quantization::QuantizationParams;
pub use shape::Shape;
pub use shape_error_taxonomy::validate_broadcast_shapes;
pub use shape_error_taxonomy::validate_elementwise_shapes;
pub use shape_error_taxonomy::validate_matmul_shapes;
pub use shape_error_taxonomy::validate_reduction_axis;
pub use shape_error_taxonomy::validate_reshape;
pub use shape_error_taxonomy::ShapeErrorBuilder;
pub use shape_error_taxonomy::ShapeErrorCategory;
pub use shape_error_taxonomy::ShapeErrorUtils;
pub use simd::global_simd_engine;
pub use simd::AdvancedKernelRegistry;
pub use simd::CacheFriendlyMatMul;
pub use simd::CacheOptimizedTensorOps;
pub use simd::ConvolutionParams;
pub use simd::CpuFeatures;
pub use simd::ElementWiseOp;
pub use simd::KernelOptimizationStrategy;
pub use simd::MemoryAccessPattern;
pub use simd::ReductionOp as SimdReductionOp;
pub use simd::SimdEngineConfig;
pub use simd::SpecializedKernel;
pub use simd::UltraSimdEngine;
pub use tensor::Tensor;
pub use adaptive_tuning::execute_with_adaptive_tuning;
pub use adaptive_tuning::AdaptiveTuner;
pub use adaptive_tuning::ExecutionStrategy;
pub use adaptive_tuning::OperationMetrics;
pub use adaptive_tuning::PerformancePredictor;
pub use adaptive_tuning::GLOBAL_TUNER;
pub use collective::all_gather;
pub use collective::all_reduce;
pub use collective::broadcast;
pub use collective::create_process_group;
pub use collective::init_collective;
pub use collective::CollectiveManager;
pub use collective::CollectiveOp;
pub use collective::CommunicationGroup;
pub use collective::ReductionOp;
pub use context::get_context;
pub use context::set_context;
pub use context::Context;
pub use cross_platform_optimization::get_global_optimizer;
pub use cross_platform_optimization::get_optimal_configuration;
pub use cross_platform_optimization::initialize_cross_platform_optimizer;
pub use cross_platform_optimization::CrossPlatformOptimizer;
pub use cross_platform_optimization::OptimalConfiguration;
pub use cross_platform_optimization::TargetArchitecture;
pub use cross_platform_optimization::TargetPlatform;
pub use deterministic::clear_operation_log;
pub use deterministic::get_global_seed;
pub use deterministic::get_operation_log;
pub use deterministic::get_operation_seed;
pub use deterministic::get_state_snapshot;
pub use deterministic::is_deterministic_mode;
pub use deterministic::is_strict_mode;
pub use deterministic::mark_non_deterministic;
pub use deterministic::reset_operation_counter;
pub use deterministic::restore_state_snapshot;
pub use deterministic::set_deterministic_mode;
pub use deterministic::set_global_seed;
pub use deterministic::set_strict_mode;
pub use deterministic::should_use_deterministic_gpu_ops;
pub use deterministic::DeterministicConfig;
pub use deterministic::DeterministicScope;
pub use deterministic::DeterministicSnapshot;
pub use deterministic::DeterministicState;
pub use dispatch_init::ensure_initialized as ensure_dispatch_initialized;
pub use dispatch_registry::get_registry;
pub use dispatch_registry::BackendType;
pub use dispatch_registry::BinaryKernelFn;
pub use dispatch_registry::DispatchBenchmarkResult;
pub use dispatch_registry::DispatchRegistry;
pub use dispatch_registry::KernelImplementation;
pub use dispatch_registry::OperationDescriptor;
pub use dispatch_registry::UnaryKernelFn;
pub use dispatch_registry::F32_REGISTRY;
pub use dispatch_registry::F64_REGISTRY;
pub use dispatch_registry::I32_REGISTRY;
pub use eager_execution::CacheStatistics;
pub use eager_execution::EagerExecutionConfig;
pub use eager_execution::EagerExecutionEngine;
pub use eager_execution::EagerPerformanceReport;
pub use eager_execution::ExecutionMetrics;
pub use eager_execution::EAGER_ENGINE;
pub use gpu_memory_metrics::generate_memory_report;
pub use gpu_memory_metrics::get_gpu_memory_snapshot;
pub use gpu_memory_metrics::get_gpu_memory_usage;
pub use gpu_memory_metrics::get_gpu_peak_memory;
pub use gpu_memory_metrics::print_memory_report;
pub use gpu_memory_metrics::reset_gpu_memory_metrics;
pub use gpu_memory_metrics::GpuMemoryMetrics;
pub use gpu_memory_metrics::GpuMemoryReport;
pub use gpu_memory_metrics::GpuMemorySnapshot;
pub use gpu_memory_metrics::GPU_MEMORY_METRICS;
pub use gradient_clipping::GradientClipper;
pub use gradient_clipping::GradientClippingConfig;
pub use gradient_clipping::GradientStatistics;
pub use gradient_clipping::NormType;
pub use graph::AttributeValue;
pub use graph::AttributeValueDef;
pub use graph::EdgeId;
pub use graph::Graph;
pub use graph::GraphDef;
pub use graph::GraphEdge;
pub use graph::GraphNode;
pub use graph::NodeDef;
pub use graph::NodeId;
pub use graph::NodeType;
pub use large_model_optimization::LargeModelConfig;
pub use large_model_optimization::LargeModelOptimizationReport;
pub use large_model_optimization::LargeModelOptimizer;
pub use large_model_optimization::MemoryOptimizationStats;
pub use large_model_optimization::ModelExecutionPlan;
pub use large_model_optimization::LARGE_MODEL_OPTIMIZER;
pub use memory::global_monitor;
pub use memory::global_monitor_arc;
pub use memory::IntegratedDiagnosticReport;
pub use memory::KernelOccupancyStats;
pub use memory::MemoryAliasDetector;
pub use memory::MemoryPool;
pub use memory::MemoryPoolStats;
pub use memory::MultiStreamMemoryManager;
pub use memory::OperationTimer;
pub use memory::OptimizationResult;
pub use memory::PerformanceMonitor;
pub use memory::PoolHealthMetrics;
pub use memory::PoolHealthStatus;
pub use memory::PoolOptimizationConfig;
pub use memory::StridedView;
pub use memory_tensorflow_comparison::MemoryComparisonReport;
pub use memory_tensorflow_comparison::MemoryOptimizationSuggestion;
pub use memory_tensorflow_comparison::MemoryProfilingConfig;
pub use memory_tensorflow_comparison::MemorySnapshot;
pub use memory_tensorflow_comparison::TensorFlowMemoryProfiler;
pub use memory_tensorflow_comparison::MEMORY_PROFILER;
pub use mixed_precision::disable_autocast;
pub use mixed_precision::enable_autocast;
pub use mixed_precision::enable_autocast_bfloat16;
pub use mixed_precision::from_bfloat16_f32;
pub use mixed_precision::from_bfloat16_f64;
pub use mixed_precision::from_half;
pub use mixed_precision::from_half_f32;
pub use mixed_precision::from_half_f64;
pub use mixed_precision::to_bfloat16_f32;
pub use mixed_precision::to_bfloat16_f64;
pub use mixed_precision::to_half;
pub use mixed_precision::to_half_f32;
pub use mixed_precision::to_half_f64;
pub use mixed_precision::AutocastContext;
pub use mixed_precision::GradientScaler;
pub use mixed_precision::MixedPrecisionConfig;
pub use mixed_precision::MixedPrecisionState;
pub use monitoring::AlertSeverity;
pub use monitoring::BottleneckType;
pub use monitoring::MonitoringConfig as UltraMonitoringConfig;
pub use monitoring::MonitoringReport;
pub use monitoring::OperationMetrics as MonitoringOperationMetrics;
pub use monitoring::OptimizationOpportunity;
pub use monitoring::PerformanceAlert;
pub use monitoring::PerformanceDashboard;
pub use monitoring::PerformancePrediction;
pub use monitoring::PerformancePredictor as MonitoringPerformancePredictor;
pub use monitoring::PerformanceSnapshot;
pub use monitoring::SystemBottleneck;
pub use monitoring::SystemMetrics;
pub use monitoring::TrendDirection;
pub use monitoring::TrendType;
pub use monitoring::UltraPerformanceMonitor;
pub use neural_optimization::LayerPerformanceMetrics;
pub use neural_optimization::NetworkPerformanceReport;
pub use neural_optimization::OptimizationBreakdown as NeuralOptimizationBreakdown;
pub use neural_optimization::UltraOptimizedActivations;
pub use neural_optimization::UltraOptimizedDenseLayer;
pub use neural_optimization::UltraOptimizedNeuralNetwork;
pub use onnx_interop::OnnxConfig;
pub use onnx_interop::OnnxExporter;
pub use onnx_interop::OnnxImporter;
pub use onnx_interop::OnnxModel;
pub use ops::execute_fused_graph;
pub use ops::get_fusion_stats;
pub use ops::infer_binary_elementwise;
pub use ops::infer_binary_elementwise_validated;
pub use ops::infer_concat;
pub use ops::infer_conv2d;
pub use ops::infer_matmul;
pub use ops::infer_reduction;
pub use ops::infer_reshape;
pub use ops::print_framework_comparison_results;
pub use ops::print_fusion_report;
pub use ops::record_fusion_opportunity;
pub use ops::reset_fusion_stats;
pub use ops::run_framework_comparison_benchmark;
pub use ops::BroadcastableConstraint;
pub use ops::ElementwiseOpType;
pub use ops::ExactShapeConstraint;
pub use ops::FrameworkBenchmarkConfig;
pub use ops::FrameworkComparisonResult;
pub use ops::FusionGraph;
pub use ops::FusionNode;
pub use ops::FusionPassBuilder;
pub use ops::FusionStats;
pub use ops::MatMulCompatibleConstraint;
pub use ops::MinRankConstraint;
pub use ops::RankConstraint;
pub use ops::ShapeConstraint;
pub use ops::ShapeContext;
pub use ops::ShapeValidator;
pub use performance_gates::get_baseline;
pub use performance_gates::list_baselines;
pub use performance_gates::register_baseline;
pub use performance_gates::OperationBaseline;
pub use performance_gates::PerformanceGate;
pub use performance_gates::PerformanceGateSuite;
pub use performance_gates::PerformanceMeasurement;
pub use production_benchmarks::run_comprehensive_production_benchmarks;
pub use production_benchmarks::BenchmarkConfig;
pub use production_benchmarks::BenchmarkResult;
pub use production_benchmarks::BenchmarkSummary as ProductionBenchmarkSummary;
pub use production_benchmarks::OptimizationBreakdown as ProductionOptimizationBreakdown;
pub use production_benchmarks::ProblemSize;
pub use production_benchmarks::ProductionBenchmarkReport;
pub use production_benchmarks::ProductionBenchmarkSuite;
pub use production_benchmarks::QualityMetrics;
pub use production_performance_monitoring::get_global_monitor;
pub use production_performance_monitoring::initialize_performance_monitoring;
pub use production_performance_monitoring::record_performance_event;
pub use production_performance_monitoring::AlertThresholds;
pub use production_performance_monitoring::MonitoringConfig;
pub use production_performance_monitoring::PerformanceEvent;
pub use production_performance_monitoring::PerformanceMetrics;
pub use production_performance_monitoring::ProductionPerformanceMonitor;
pub use session::create_session;
pub use session::DefaultSession;
pub use session::FeedDict;
pub use session::FetchSpec;
pub use session::Session;
pub use session::SessionConfig;
pub use simplified_benchmarks::run_simple_benchmarks;
pub use simplified_benchmarks::validate_optimizations;
pub use simplified_benchmarks::BenchmarkReport;
pub use simplified_benchmarks::BenchmarkSummary;
pub use simplified_benchmarks::SimpleBenchmarkConfig;
pub use simplified_benchmarks::SimpleBenchmarkResult;
pub use simplified_benchmarks::SimpleBenchmarkSuite;
pub use strided::SliceParams;
pub use strided::StridedLayout;
pub use structured_arrays::FieldDescriptor;
pub use structured_arrays::FieldValue;
pub use structured_arrays::StructuredArray;
pub use system_health::run_quick_health_check;
pub use system_health::run_system_health_check;
pub use system_health::FeaturesInfo;
pub use system_health::GpuMemoryInfo;
pub use system_health::HealthCheckConfig;
pub use system_health::HealthStatus;
pub use system_health::MemoryInfo;
pub use system_health::PerformanceBenchmarks;
pub use system_health::SystemHealthChecker;
pub use system_health::SystemInfo;
pub use tensor_view::MemoryStats;
pub use tensor_view::TensorView;
pub use tensor_view::TensorViewOps;
pub use wasm::utils as wasm_utils;
pub use wasm::WasmContext;

Modules§

adaptive_tuning
Adaptive Performance Tuning System
buffer
checkpointing
Activation Checkpointing for Memory-Efficient Training
collective
complex
Complex number types and operations for TenfloweRS
context
cross_platform_optimization
deployment
Model deployment optimization for TenfloweRS
deterministic
device
dispatch_init
dispatch_registry
dispatch_registry_examples
dispatch_registry_extended
dtype
eager_execution
Ultra-Performance Eager Execution Optimization Module
error
fallback
Automatic fallback mechanisms for operation recovery
gpu_memory_metrics
gpu_stub
gradient_clipping
Advanced Gradient Clipping with Adaptive Scaling
gradient_coverage_audit
gradient_validation_framework
graph
Computation Graph Module
half_precision
Half precision floating point support
integration
Integration Module for Ultra-Performance Validation
large_model_optimization
Large Model Optimization Module
layout
memory
Memory management infrastructure for TenfloweRS
memory_tensorflow_comparison
Memory Usage Profiling and TensorFlow Comparison Module
mixed_precision
monitoring
Ultra-Advanced Production Performance Monitoring
neural_optimization
Ultra-Optimized Neural Network Layer Integration
numerical_gradient
Numerical Gradient Validation Utilities
onnx_interop
ONNX import/export functionality for model interoperability
ops
performance_benchmarks
Comprehensive Performance Benchmarking Suite for TenfloweRS Optimizations
performance_gates
production_benchmarks
Comprehensive Production Benchmarks
production_performance_monitoring
quantization
Quantization operations for TenfloweRS
session
shape
shape_error_taxonomy
shape_inference_helpers
Shape Inference Helpers and Enhanced Diagnostics
simd
Ultra-High-Performance SIMD Optimizations powered by SciRS2-Core
simplified_benchmarks
strided
structured_arrays
system_health
tensor
Tensor Module - Modular Architecture
tensor_view
ultra_performance_profiler
🚀 Ultra-Performance Profiler Integration
wasm
WebAssembly platform support for TenfloweRS
wasm_optimization
WebAssembly optimization module for edge deployment

Macros§

eager_execute
Convenience macro for eager operation execution
exact_shape
Macro for creating exact shape constraints
measure_performance
Performance measurement macro
min_rank
Macro for creating minimum rank constraints
rank
Macro for creating rank constraints
register_binary_kernel
Macro to register a binary kernel
register_kernel_with_backend
Macro to simplify kernel registration with feature gates
register_operation
Macro to simplify operation registration
register_unary_kernel
Macro to register a unary kernel
time_operation
Macro for easily timing operations
validate_shapes
Compile-time shape validation macro
with_device
Macro for device scope

Structs§

bf16
A 16-bit floating point type implementing the bfloat16 format.
f16
A 16-bit floating point type implementing the IEEE 754-2008 standard binary16 (a.k.a. “half”) format.