
Crate tenflowers_autograd


§TenfloweRS Automatic Differentiation

tenflowers-autograd provides a comprehensive automatic differentiation engine for the TenfloweRS machine learning framework. This crate implements both forward-mode and reverse-mode automatic differentiation with support for higher-order derivatives, custom gradients, and advanced optimization techniques.

§Features

  • Complete Gradient Operations: All fundamental tensor operations with mathematically correct gradients
  • Higher-Order Derivatives: Efficient computation of Hessians, third-order derivatives, and beyond
  • Performance Optimization: Kernel fusion, memory optimization, and distributed gradient computation
  • Advanced Differentiation: Mixed-mode AD, implicit differentiation, and custom gradient functions
  • Neural Network Integration: Seamless integration with tenflowers-neural for deep learning
  • Distributed Training: Parameter servers, gradient compression, and cross-datacenter replication
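
The reverse-mode engine described above can be illustrated with a toy tape over scalars, written in plain Rust and independent of this crate's API: each operation records its inputs and local derivatives, and a single backward sweep accumulates gradients in reverse order. (The real GradientTape operates on tensors; this is only a conceptual sketch.)

```rust
// A toy reverse-mode tape over scalars. Each node stores, per input,
// the local derivative d(output)/d(input).
struct Tape {
    nodes: Vec<Vec<(usize, f64)>>,
}

impl Tape {
    fn new() -> Self {
        Tape { nodes: vec![] }
    }

    // A leaf variable has no parents.
    fn var(&mut self) -> usize {
        self.nodes.push(vec![]);
        self.nodes.len() - 1
    }

    // d(a + b)/da = 1, d(a + b)/db = 1
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.nodes.push(vec![(a, 1.0), (b, 1.0)]);
        self.nodes.len() - 1
    }

    // d(a * b)/da = b, d(a * b)/db = a (values passed in explicitly)
    fn mul(&mut self, a: usize, b: usize, va: f64, vb: f64) -> usize {
        self.nodes.push(vec![(a, vb), (b, va)]);
        self.nodes.len() - 1
    }

    // Reverse pass: seed d(output)/d(output) = 1, then propagate
    // backwards through the recorded nodes.
    fn gradients(&self, output: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.nodes.len()];
        grads[output] = 1.0;
        for i in (0..self.nodes.len()).rev() {
            for &(parent, local) in &self.nodes[i] {
                grads[parent] += local * grads[i];
            }
        }
        grads
    }
}

fn main() {
    // z = x * y + x, with x = 3, y = 4
    let mut tape = Tape::new();
    let (x, y) = (3.0, 4.0);
    let xi = tape.var();
    let yi = tape.var();
    let prod = tape.mul(xi, yi, x, y);
    let z = tape.add(prod, xi);
    let g = tape.gradients(z);
    // dz/dx = y + 1 = 5, dz/dy = x = 3
    assert_eq!(g[xi], 5.0);
    assert_eq!(g[yi], 3.0);
}
```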

§Quick Start

use tenflowers_autograd::{GradientTape, TrackedTensor};
use tenflowers_core::{Tensor, Device};

let device = Device::Cpu;
let mut tape = GradientTape::new();

// Create tracked tensors
let x = tape.watch(Tensor::<f32>::ones(&[2, 2]));
let y = tape.watch(Tensor::<f32>::ones(&[2, 2]));

// Compute gradients using GradientTape::gradient. In a real program `z`
// would be produced by tensor ops on `x` and `y`; a fresh tensor stands
// in for that result here.
let z = tape.watch(Tensor::<f32>::ones(&[2, 2]));
let gradients = tape.gradient(&[z], &[x, y])?;
println!("Gradient of x: {:?}", gradients[0]);

§Advanced Usage

§Custom Gradients

use tenflowers_autograd::{CustomGradientFunction, GradientTape};
use tenflowers_core::{Tensor, Result};

struct MyCustomOp;

impl CustomGradientFunction<f32> for MyCustomOp {
    fn forward(&self, inputs: &[&Tensor<f32>]) -> Result<Tensor<f32>> {
        // Custom forward implementation: y = x^2 + sin(x)
        let x = inputs[0];
        let x_squared = tenflowers_core::ops::mul(x, x)?;
        let sin_x = tenflowers_core::ops::sin(x)?;
        tenflowers_core::ops::add(&x_squared, &sin_x)
    }

    fn backward(&self, grad_output: &Tensor<f32>, inputs: &[&Tensor<f32>], output: &Tensor<f32>) -> Result<Vec<Tensor<f32>>> {
        // Custom backward implementation: dy/dx = 2x + cos(x)
        let x = inputs[0];
        let two = tenflowers_core::Tensor::from_array(scirs2_core::ndarray::arr0(2.0f32).into_dyn());
        let two_x = tenflowers_core::ops::mul(&two, x)?;
        let cos_x = tenflowers_core::ops::cos(x)?;
        let grad_x = tenflowers_core::ops::add(&two_x, &cos_x)?;
        let final_grad = tenflowers_core::ops::mul(grad_output, &grad_x)?;
        Ok(vec![final_grad])
    }

    fn name(&self) -> &str {
        "MyCustomOp"
    }
}
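
The gradient pair above (forward y = x² + sin(x), backward dy/dx = 2x + cos(x)) can be sanity-checked against a central finite difference, in the spirit of the crate's numerical_checker module. A plain-Rust scalar sketch, independent of the tensor API:

```rust
// Forward function from the example: y = x^2 + sin(x)
fn f(x: f64) -> f64 {
    x * x + x.sin()
}

// Analytic gradient implemented in MyCustomOp::backward: dy/dx = 2x + cos(x)
fn grad_f(x: f64) -> f64 {
    2.0 * x + x.cos()
}

// Central finite difference: (f(x + h) - f(x - h)) / (2h)
fn numerical_grad(x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}

fn main() {
    let x = 0.7;
    let analytic = grad_f(x);
    let numeric = numerical_grad(x, 1e-5);
    // The two should agree to well under 1e-6 for this smooth function.
    assert!((analytic - numeric).abs() < 1e-6);
    println!("analytic = {analytic:.6}, numeric = {numeric:.6}");
}
```

This is the same check NumericalChecker automates over whole tensors: compare each analytic gradient entry against a finite-difference estimate and flag mismatches.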

§Higher-Order Derivatives

use tenflowers_autograd::{GradientTape, TrackedTensor};
use tenflowers_core::Tensor;

let mut tape = GradientTape::new();
let x = TrackedTensor::new(Tensor::<f32>::ones(&[1]));
let target = TrackedTensor::new(Tensor::<f32>::ones(&[1]));

// Compute third-order derivatives
let third_order = tape.third_derivative(&target, &x)?;

// Compute nth-order derivatives
let nth_order = tape.nth_derivative(&target, &x, 3)?;
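
Results from third_derivative or nth_derivative can be cross-checked on scalars by repeated central differencing, which is exact (up to rounding) for low-degree polynomials. A plain-Rust sketch, not using the crate's API:

```rust
// nth-order derivative estimate by recursively applying the central
// difference (f(x + h) - f(x - h)) / (2h).
fn nth_derivative<F: Fn(f64) -> f64 + Copy>(f: F, x: f64, n: u32, h: f64) -> f64 {
    if n == 0 {
        f(x)
    } else {
        (nth_derivative(f, x + h, n - 1, h) - nth_derivative(f, x - h, n - 1, h)) / (2.0 * h)
    }
}

fn main() {
    let f = |x: f64| x.powi(4);
    // d^3/dx^3 of x^4 is 24x; at x = 1.0 the exact value is 24.
    let third = nth_derivative(f, 1.0, 3, 1e-3);
    assert!((third - 24.0).abs() < 1e-3);
    println!("third derivative at 1.0 ≈ {third:.4}");
}
```

Note that each extra differencing order amplifies rounding error, which is one reason an AD engine's exact higher-order derivatives are preferable to stacked finite differences in practice.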

§Performance Features

  • Kernel Fusion: Automatically fuses operations to reduce memory bandwidth
  • Gradient Compression: Quantization and sparsification for distributed training
  • Memory Optimization: Checkpointing and in-place operations for large models
  • JIT Compilation: Runtime kernel optimization for specific tensor shapes
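
As one concrete instance of the sparsification mentioned under Gradient Compression, top-k sparsification keeps only the k largest-magnitude gradient entries and transmits them as (index, value) pairs. A plain-Rust sketch of the idea (not this crate's CompressionMethod API):

```rust
// Keep the k largest-magnitude gradient entries as (index, value) pairs;
// everything else is implicitly zero on the wire.
fn top_k_sparsify(grad: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = grad.iter().copied().enumerate().collect();
    // Sort by descending magnitude, then truncate to the k largest.
    indexed.sort_by(|a, b| b.1.abs().partial_cmp(&a.1.abs()).unwrap());
    indexed.truncate(k);
    indexed
}

// Reconstruct a dense gradient on the receiving side.
fn densify(sparse: &[(usize, f32)], len: usize) -> Vec<f32> {
    let mut out = vec![0.0; len];
    for &(i, v) in sparse {
        out[i] = v;
    }
    out
}

fn main() {
    let grad = [0.01, -2.5, 0.3, 0.0, 1.7, -0.02];
    let sparse = top_k_sparsify(&grad, 2);
    let dense = densify(&sparse, grad.len());
    // Only the two largest-magnitude entries survive; the rest are zeroed.
    assert_eq!(dense, vec![0.0, -2.5, 0.0, 0.0, 1.7, 0.0]);
}
```

Production schemes typically add error feedback (accumulating the dropped residual into the next step's gradient) so the discarded mass is not lost.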

§Integration with TenfloweRS Ecosystem

This crate integrates seamlessly with:

  • tenflowers-core: Core tensor operations and device management
  • tenflowers-neural: Neural network layers and training loops
  • tenflowers-dataset: Data loading and preprocessing
  • scirs2-autograd: Static graph optimization and analysis

§Re-exports

pub use boolean_indexing::boolean_mask_backward;
pub use boolean_indexing::integer_array_indexing_backward;
pub use boolean_indexing::where_backward;
pub use checkpointing::checkpoint_sequence;
pub use checkpointing::ActivationCheckpointPolicy;
pub use checkpointing::ActivationCheckpointing;
pub use checkpointing::ActivationRecomputeManager;
pub use checkpointing::CheckpointManager;
pub use checkpointing::CheckpointStrategy;
pub use checkpointing::CheckpointedFunction;
pub use checkpointing::CheckpointedGradientTape;
pub use checkpointing::CheckpointingStats;
pub use checkpointing::LayerMetadata;
pub use checkpointing::RecomputationContext;
pub use context::AutogradContext;
pub use context::ShapeInferenceRule;
pub use context::StaticShapeInference;
pub use coverage_matrix::CategoryCoverage;
pub use coverage_matrix::CoverageMatrix;
pub use coverage_matrix::CoverageReport;
pub use coverage_matrix::OperationCategory;
pub use coverage_matrix::OperationMetadata;
pub use custom_gradients::CustomGradientFunction;
pub use custom_gradients::CustomGradientOp;
pub use custom_gradients::GradientClipFunction;
pub use custom_gradients::GradientScaleFunction;
pub use custom_gradients::StopGradientFunction;
pub use debug::GradientDebugInfo;
pub use debug::GradientDebugger;
pub use deterministic::clear_operation_seeds;
pub use deterministic::get_global_seed;
pub use deterministic::get_operation_seed;
pub use deterministic::get_seeded_operation_count;
pub use deterministic::hash_tensor_data;
pub use deterministic::is_deterministic;
pub use deterministic::reset_deterministic_state;
pub use deterministic::set_deterministic;
pub use deterministic::set_global_seed;
pub use deterministic::set_operation_seed;
pub use deterministic::DeterministicConfig;
pub use deterministic::DeterministicContext;
pub use deterministic::DeterministicOperation;
pub use deterministic::ReproducibilityChecker;
pub use deterministic::ReproducibilityStats;
pub use deterministic::SeedManager;
pub use device_placement::DevicePlacementConfig;
pub use device_placement::DevicePlacementOptimizer;
pub use device_placement::GraphOperation;
pub use device_placement::PlacementDecision;
pub use device_placement::PlacementResult;
pub use device_placement::PlacementStrategy;
pub use ellipsis_newaxis::ellipsis_newaxis_backward;
pub use ellipsis_newaxis::AdvancedIndexer;
pub use ellipsis_newaxis::IndexSpec;
pub use error_taxonomy::utils as error_utils;
pub use error_taxonomy::AutogradErrorBuilder;
pub use error_taxonomy::ErrorPatternValidator;
pub use error_taxonomy::GradientContext;
pub use error_taxonomy::ValidationResult;
pub use forward_ad::forward_ops;
pub use forward_ad::DualTensor;
pub use forward_ad::ForwardADContext;
pub use forward_ad::ForwardMode;
pub use forward_reverse::ComplexityEstimate;
pub use forward_reverse::DifferentiationMode;
pub use forward_reverse::ForwardReverseConfig;
pub use forward_reverse::ForwardReverseDifferentiator;
pub use global_pooling::adaptive_avg_pool2d_backward;
pub use global_pooling::adaptive_max_pool2d_backward;
pub use global_pooling::fractional_adaptive_avg_pool2d_backward;
pub use global_pooling::global_avg_pool2d_backward;
pub use global_pooling::global_max_pool2d_backward;
pub use gpu_gradient_expansion::GpuCategoryCoverage;
pub use gpu_gradient_expansion::GpuCoverageAnalysis;
pub use gpu_gradient_expansion::GpuGradInfo;
pub use gpu_gradient_expansion::GpuGradStatus;
pub use gpu_gradient_expansion::GpuGradientPlanner;
pub use gpu_gradient_expansion::ImplementationPlan;
pub use gpu_gradient_expansion::ImplementationTask;
pub use gpu_gradient_expansion::Priority;
pub use grad_ops::batch_fused_activations_forward_backward;
pub use grad_ops::fused_gelu_forward_backward;
pub use grad_ops::fused_log_softmax_forward_backward;
pub use grad_ops::fused_tanh_forward_backward;
pub use gradient_accumulation::accumulate_gradients_over_batch;
pub use gradient_accumulation::GradientAccumulator;
pub use gradient_buffer_manager_simple::global_gradient_buffer_manager;
pub use gradient_buffer_manager_simple::AllocationMetrics;
pub use gradient_buffer_manager_simple::EfficiencyMetrics;
pub use gradient_buffer_manager_simple::GradientBufferAllocation;
pub use gradient_buffer_manager_simple::GradientBufferConfig;
pub use gradient_buffer_manager_simple::GradientBufferManager;
pub use gradient_buffer_manager_simple::GradientMemoryStatistics;
pub use gradient_buffer_manager_simple::MemoryPressureStatistics;
pub use gradient_compression::CompressedGradient;
pub use gradient_compression::CompressionConfig;
pub use gradient_compression::CompressionMethod;
pub use gradient_compression::CompressionStats;
pub use gradient_compression::GradientCompressor;
pub use gradient_ops::accumulate_gradients;
pub use gradient_ops::add_gradient_noise;
pub use gradient_ops::average_gradients;
pub use gradient_ops::clip_by_global_norm;
pub use gradient_ops::clip_by_value;
pub use gradient_ops::compute_gradient_statistics;
pub use gradient_ops::has_invalid_gradients;
pub use gradient_ops::scale_gradients;
pub use gradient_ops::zero_gradients;
pub use gradient_ops::GradientPipeline;
pub use gradient_ops::GradientStatistics;
pub use gradient_ops::NamedGradientAccumulator;
pub use gradient_visualization::ColorScheme;
pub use gradient_visualization::EdgeType;
pub use gradient_visualization::GradientFlowAnalysis;
pub use gradient_visualization::GradientFlowEdge;
pub use gradient_visualization::GradientFlowIssue;
pub use gradient_visualization::GradientFlowNode;
pub use gradient_visualization::GradientFlowVisualizer;
pub use gradient_visualization::GradientStats;
pub use gradient_visualization::IssueType;
pub use gradient_visualization::LayoutAlgorithm;
pub use gradient_visualization::NodeType;
pub use gradient_visualization::OutputFormat;
pub use gradient_visualization::Severity;
pub use gradient_visualization::ValueStats;
pub use gradient_visualization::VisualizationSettings;
pub use graph_optimization::CommunicationPlan;
pub use graph_optimization::EnhancedGraphOptimizer;
pub use graph_optimization::GradientFusion;
pub use graph_optimization::GraphOptimizationConfig;
pub use graph_optimization::GraphOptimizationResult;
pub use graph_optimization::MemoryOptimization;
pub use hybrid_scheduler::ExecutionStats;
pub use hybrid_scheduler::ExecutionSummary;
pub use hybrid_scheduler::GraphAnalysis;
pub use hybrid_scheduler::HybridScheduler;
pub use hybrid_scheduler::SchedulerConfig;
pub use hybrid_scheduler::StrategyCost;
pub use implicit_differentiation::FixedPointFunction;
pub use implicit_differentiation::GradientInfo;
pub use implicit_differentiation::ImplicitDiffConfig;
pub use implicit_differentiation::ImplicitDifferentiator;
pub use implicit_differentiation::ImplicitFunction;
pub use implicit_differentiation::OptimizationLayer;
pub use inplace_ops::InPlaceOptimizer;
pub use inplace_ops::InPlaceSequenceOptimizer;
pub use jit_compiler::CompiledKernel;
pub use jit_compiler::DeviceFeatures;
pub use jit_compiler::GradientKernelTemplate;
pub use jit_compiler::JitCompiler;
pub use jit_compiler::KernelPerformance;
pub use jit_compiler::KernelSignature;
pub use jit_compiler::OptimizationLevel;
pub use jit_integration::utils as jit_utils;
pub use jit_integration::JitConfig;
pub use jit_integration::JitGradientContext;
pub use jit_integration::JitGradientTapeExt;
pub use kernel_fusion::FusableOp;
pub use kernel_fusion::FusedKernel;
pub use kernel_fusion::FusionStats;
pub use kernel_fusion::KernelFusionOptimizer;
pub use kernel_fusion::OpSequence;
pub use memory_diff_reporter::MemoryDiff;
pub use memory_diff_reporter::MemoryDiffReporter;
pub use memory_diff_reporter::MemorySnapshot;
pub use memory_profiler::get_global_profiler;
pub use memory_profiler::GradientMemoryProfiler;
pub use memory_profiler::MemoryReport;
pub use memory_profiler::MemoryStats;
pub use neural_integration::AutogradLayer;
pub use neural_integration::AutogradOptimizer;
pub use neural_integration::AutogradTrainer;
pub use neural_integration::OptimizerType;
pub use neural_integration::TrainingMetrics;
pub use no_grad::enable_grad;
pub use no_grad::is_grad_enabled;
pub use no_grad::no_grad;
pub use no_grad::set_grad_enabled;
pub use no_grad::EnableGradGuard;
pub use no_grad::NoGradGuard;
pub use numerical_checker::CheckerConfig;
pub use numerical_checker::ErrorAnalysis;
pub use numerical_checker::FiniteDifferenceMethod;
pub use numerical_checker::GradientCheckResult;
pub use numerical_checker::NumericalChecker;
pub use parameter_server::FaultToleranceMode;
pub use parameter_server::LoadBalancingStrategy;
pub use parameter_server::ParameterServer;
pub use parameter_server::ParameterServerClient;
pub use parameter_server::ParameterServerConfig;
pub use parameter_server::ParameterServerStats;
pub use performance_benchmark::BenchmarkConfig;
pub use performance_benchmark::BenchmarkReport;
pub use performance_benchmark::BenchmarkResult;
pub use performance_benchmark::BenchmarkStatistics;
pub use performance_benchmark::BenchmarkSummary;
pub use performance_benchmark::ComparisonResult;
pub use performance_benchmark::PerformanceBenchmark;
pub use performance_benchmark::RegressionReport;
pub use performance_benchmark::RegressionSeverity;
pub use performance_benchmark::ThroughputMetrics;
pub use simd_grad_ops_simple::global_simd_grad_ops;
pub use simd_grad_ops_simple::SimdGradConfig;
pub use simd_grad_ops_simple::SimdGradOps;
pub use simd_grad_ops_simple::SimdPerformanceMetrics;
pub use special_functions::bessel_j0_backward;
pub use special_functions::bessel_j1_backward;
pub use special_functions::beta_backward;
pub use special_functions::digamma_backward;
pub use special_functions::erf_backward;
pub use special_functions::erfc_backward;
pub use special_functions::gamma_backward;
pub use special_functions::lgamma_backward;
pub use subgraph_extraction::ExtractionStrategy;
pub use subgraph_extraction::Subgraph;
pub use subgraph_extraction::SubgraphConfig;
pub use subgraph_extraction::SubgraphExtractionResult;
pub use subgraph_extraction::SubgraphExtractor;
pub use subgraph_extraction::SubgraphOperation;
pub use tape::GradientTape;
pub use tape::Operation;
pub use tape::TapeNode;
pub use tape::TrackedTensor;
pub use tape_optimization::TapeOptimizationConfig;
pub use tape_optimization::TapeOptimizationStats;
pub use tape_optimization::TapeOptimizer;
pub use tensor_ext::TensorAutograd;
pub use tensor_networks::ContractionEdge;
pub use tensor_networks::ContractionPath;
pub use tensor_networks::ContractionStep;
pub use tensor_networks::ContractionStrategy;
pub use tensor_networks::TensorNetwork;
pub use tensor_networks::TensorNetworkGradient;
pub use tensor_networks::TensorNetworkNode;
pub use tensor_networks::TensorNetworkOptimizer;
pub use ultra_gradient_engine_simple::global_ultra_gradient_engine;
pub use ultra_gradient_engine_simple::GradientMemoryStats;
pub use ultra_gradient_engine_simple::GradientPerformanceMetrics;
pub use ultra_gradient_engine_simple::OptimizationInsights;
pub use ultra_gradient_engine_simple::UltraGradientConfig;
pub use ultra_gradient_engine_simple::UltraGradientEngine;
pub use ultra_gradient_engine_simple::UltraGradientResult;
pub use ultra_gradient_engine_simple::UltraGradientTapeExt;
pub use advanced_grad_ops::gradient_clipping;
pub use advanced_grad_ops::higher_order as advanced_higher_order;
pub use advanced_grad_ops::jacobian;
pub use advanced_grad_ops::optimization;
pub use advanced_grad_ops::AdaptiveGradientAccumulator;
pub use amp_policy::AMPConfig;
pub use amp_policy::AMPPolicy;
pub use amp_policy::AMPStabilityMetrics;
pub use amp_policy::ScaleAdjustment;
pub use amp_policy::ScaleAdjustmentReason;
pub use efficient_memory::AggregationStats;
pub use efficient_memory::CheckpointStats;
pub use efficient_memory::GradientCheckpointer;
pub use efficient_memory::GradientMemoryManager;
pub use efficient_memory::GradientMemoryPool;
pub use efficient_memory::LazyGradient;
pub use efficient_memory::MemoryManagerStats;
pub use efficient_memory::MemoryPoolStats;
pub use efficient_memory::StreamingGradientAggregator;
pub use gradient_analyzer::AnalysisConfig;
pub use gradient_analyzer::GradientAnalysisReport;
pub use gradient_analyzer::GradientAnalyzer;
pub use gradient_analyzer::GradientFlowAnalysis as AdvancedGradientFlowAnalysis;
pub use gradient_analyzer::GradientIssue;
pub use gradient_analyzer::GradientStatistics as AnalyzerGradientStatistics;
pub use gradient_analyzer::PerformanceMetrics;
pub use second_order_utils::compute_hessian;
pub use second_order_utils::compute_hessian_diagonal;
pub use second_order_utils::compute_jacobian;
pub use second_order_utils::compute_laplacian;
pub use second_order_utils::directional_second_derivative;
pub use second_order_utils::hessian_vector_product;

§Modules

advanced_grad_ops
Advanced Gradient Operations
amp_policy
Automatic Mixed Precision (AMP) Policy for Gradient Computation
boolean_indexing
checkpointing
context
coverage_matrix
Gradient Coverage Matrix Generator
custom_gradients
debug
deterministic
Deterministic Execution Mode
device_placement
Device Placement Optimization
efficient_memory
Memory-Efficient Gradient Computation
ellipsis_newaxis
error_taxonomy
Error Taxonomy Alignment for Autograd
forward_ad
forward_reverse
Forward-Reverse Mode Integration
global_pooling
Global pooling gradient operations
gpu_gradient_expansion
GPU Gradient Expansion Strategy
grad_ops
Gradient operations module
gradient_accumulation
gradient_analyzer
Gradient Analysis and Debugging Tools
gradient_buffer_manager_simple
Simplified Gradient Buffer Manager for Ultra-High-Performance Memory Management
gradient_compression
gradient_compression_advanced
Advanced Gradient Compression for Distributed Training
gradient_ops
Gradient Operation Utilities
gradient_utils
Gradient Computation Utilities
gradient_visualization
Gradient Flow Visualization
graph_optimization
Enhanced Graph Optimization Framework
higher_order
Higher-order derivative computation
hybrid_scheduler
Hybrid Differentiation Scheduler
implicit_differentiation
Implicit Differentiation
inplace_ops
jit_compiler
JIT Compilation System for Gradient Kernels
jit_integration
JIT Compiler Integration with Autograd System
kernel_fusion
memory_diff_reporter
Memory Diff Reporter
memory_profiler
neural_integration
Neural network integration layer for autograd
no_grad
No-gradient context for disabling gradient computation
numerical_checker
Numerical Gradient Checker
ops
Gradient operation modules
parameter_server
Parameter server for distributed training.
performance_benchmark
Performance Benchmarking Framework for Autograd
second_order
Second-Order Gradient Methods
second_order_utils
Second-Order Derivative Utilities
simd_grad_ops_simple
Simplified SIMD-Accelerated Gradient Operations for Ultra-High-Performance
special_functions
Gradient operations for special mathematical functions
subgraph_extraction
Subgraph Extraction and Parallelization
tape
Automatic differentiation tape for gradient computation
tape_optimization
tensor_ext
tensor_networks
Tensor Network Gradients
ultra_gradient
Ultra-High-Performance Gradient Computation Engine
ultra_gradient_engine_simple
Simplified Ultra-High-Performance Gradient Computation Engine

§Macros

enable_grad
Macro for enable-grad context (convenience)
no_grad
Macro for no-grad context (convenience)
profile_gradient_op
Convenience macro for profiling gradient operations

§Traits

Differentiable