Expand description
§TrustformeRS Optimization
This crate provides state-of-the-art optimization algorithms for training transformer models, including distributed training support and memory-efficient techniques.
§Overview
TrustformeRS Optim includes:
- Core Optimizers: Adam, AdamW, SGD, LAMB, AdaFactor
- Cutting-Edge 2024-2025 Optimizers: HN-Adam, AdEMAMix, Muon, CAME, MicroAdam for state-of-the-art performance
- Schedule-Free Optimizers: Schedule-Free SGD and Adam (no LR scheduling needed)
- Advanced Quantization: 4-bit optimizers with NF4 and block-wise quantization
- Memory-Efficient Optimization: MicroAdam with compressed gradients and low space overhead
- Learning Rate Schedulers: Linear, Cosine, Polynomial, Step, Exponential
- Distributed Training: ZeRO optimization stages, multi-node support
- Memory Optimization: Gradient accumulation, mixed precision, CPU offloading
§Optimizers
§Adam and AdamW
Adaptive Moment Estimation with optional weight decay:
use trustformers_optim::{AdamW, OptimizerState};
use trustformers_core::traits::Optimizer;
let mut optimizer = AdamW::new(
1e-3, // learning_rate
(0.9, 0.999), // (beta1, beta2)
1e-8, // epsilon
0.01, // weight_decay
);
// Ready to use in training loop with .zero_grad(), .update(), and .step()
§SGD
Stochastic Gradient Descent with momentum and Nesterov acceleration:
use trustformers_optim::SGD;
let optimizer = SGD::new(
0.1, // learning_rate
0.9, // momentum
1e-4, // weight_decay
true, // nesterov
);
§Schedule-Free Optimizers
Revolutionary optimizers that eliminate the need for learning rate scheduling:
use trustformers_optim::{ScheduleFreeAdam, ScheduleFreeSGD};
use trustformers_core::traits::Optimizer;
// Schedule-Free Adam - no learning rate scheduling needed!
let optimizer = ScheduleFreeAdam::for_language_models();
// Higher learning rates work better (e.g., 0.25-1.0 instead of 0.001)
let optimizer = ScheduleFreeAdam::new(0.5, 0.9, 0.95, 1e-8, 0.1);
// Schedule-Free SGD for simpler models
let optimizer = ScheduleFreeSGD::for_large_models();
// No learning rate scheduler needed! Just use .zero_grad(), .update(), .step()
// eval_mode() can be used to switch to averaged weights
§Cutting-Edge 2024-2025 Optimizers
The latest state-of-the-art optimizers for superior performance:
§🌟 NEW: Latest 2025 Research Algorithms 🚀
Self-Scaled BFGS (SSBFGS) - Revolutionary quasi-Newton method:
use trustformers_optim::{SSBFGS, SSBFGSConfig};
// For Physics-Informed Neural Networks (PINNs)
let optimizer = SSBFGS::for_physics_informed();
// For challenging non-convex problems
let optimizer = SSBFGS::for_non_convex();
// Custom configuration
let optimizer = SSBFGS::from_config(SSBFGSConfig {
learning_rate: 0.8,
history_size: 15,
scaling_factor: 1.2,
momentum: 0.95,
});
// Get optimization statistics
let stats = optimizer.get_stats();
println!("Current scaling factor: {:.3}", stats.current_scaling_factor);Self-Scaled Broyden (SSBroyden) - Efficient rank-1 updates:
use trustformers_optim::{SSBroyden, SSBroydenConfig};
// Optimized for PINNs with rank-1 efficiency
let optimizer = SSBroyden::for_physics_informed();
// More computationally efficient than BFGS
let optimizer = SSBroyden::new(); // Default configuration
PDE-aware Optimizer - Specialized for Physics-Informed Neural Networks:
use trustformers_optim::{PDEAwareOptimizer, PDEAwareConfig};
// Specialized configurations for different PDEs
let burgers_opt = PDEAwareOptimizer::for_burgers_equation(); // Burgers' equation
let allen_cahn_opt = PDEAwareOptimizer::for_allen_cahn(); // Allen-Cahn equation
let kdv_opt = PDEAwareOptimizer::for_kdv_equation(); // Korteweg-de Vries
let sharp_grad_opt = PDEAwareOptimizer::for_sharp_gradients(); // Sharp gradient regions
// Get PDE-specific optimization statistics
let stats = sharp_grad_opt.get_pde_stats();
println!("Average residual variance: {:.6}", stats.average_residual_variance);🔬 Research Breakthrough Features:
- Orders-of-magnitude improvements in PINN training accuracy
- Dynamic rescaling based on gradient history and PDE residual variance
- Sharp gradient handling for challenging PDE optimization landscapes
- Lower computational cost than second-order methods like SOAP
- Specialized presets for different equation types (Burgers, Allen-Cahn, KdV)
§BGE-Adam (2024) - Revolutionary Performance Optimization! 🚀
Enhanced Adam with entropy weighting and adaptive gradient strategy, now featuring OptimizedBGEAdam with 3-5x speedup:
use trustformers_optim::{BGEAdam, OptimizedBGEAdam, BGEAdamConfig, OptimizedBGEAdamConfig};
// 🚀 RECOMMENDED: Use the optimized version for 3-5x better performance!
let optimizer = OptimizedBGEAdam::new(); // 3-5x faster than original!
// Performance-optimized presets for different use cases
let llm_optimizer = OptimizedBGEAdam::for_large_models(); // For LLMs (optimized settings)
let vision_optimizer = OptimizedBGEAdam::for_vision(); // For computer vision
let perf_optimizer = OptimizedBGEAdam::for_high_performance(); // Maximum speed
// Built-in performance monitoring and entropy statistics
println!("{}", optimizer.performance_stats());
let (min_entropy, max_entropy, avg_entropy) = optimizer.get_entropy_stats();
// Original BGE-Adam still available (but much slower)
let original_optimizer = BGEAdam::new(
1e-3, // learning rate
(0.9, 0.999), // (β1, β2)
1e-8, // epsilon
0.01, // weight decay
0.1, // entropy scaling factor
0.05, // β1 adaptation factor
0.05, // β2 adaptation factor
);
🔥 Performance Improvements in OptimizedBGEAdam:
- ⚡ 3.4-4.9x faster execution (16.3ms → 4.7ms per iteration for 50k params)
- 💾 85-87x memory reduction through optimized buffer management
- 🔥 Single-pass processing eliminates redundant calculations
- 🚀 Vectorized operations with SIMD-friendly processing patterns
§HN-Adam (2024)
Hybrid Norm Adam with adaptive step size:
use trustformers_optim::{HNAdam, HNAdamConfig};
// Automatically adjusts step size based on update norms
let optimizer = HNAdam::new(1e-3, (0.9, 0.999), 1e-8, 0.01, 0.1);
// Or use presets for specific tasks
let transformer_opt = HNAdam::for_transformers(); // Optimized for transformers
let vision_opt = HNAdam::for_vision(); // Optimized for computer vision
// Better convergence speed and accuracy than standard Adam
§AdEMAMix (2024)
Dual EMA system for better gradient utilization:
use trustformers_optim::AdEMAMix;
// Revolutionary dual EMA optimizer from Apple/EPFL
let optimizer = AdEMAMix::for_llm_training(); // Optimized for LLMs
// Or for vision tasks
let optimizer = AdEMAMix::for_vision_training();
// 95% data efficiency improvement demonstrated in research
§Muon (2024)
Second-order optimizer for hidden layers:
use trustformers_optim::Muon;
// Used in NanoGPT and CIFAR-10 speed records
let optimizer = Muon::for_nanogpt(); // <1% FLOP overhead
// For large language models
let optimizer = Muon::for_large_lm();
// Automatically chooses 2D optimization for matrices, 1D fallback for vectors
§CAME (2023)
Confidence-guided, memory-efficient optimization:
use trustformers_optim::CAME;
// Memory efficient with fast convergence
let optimizer = CAME::for_bert_training();
// For memory-constrained environments
let optimizer = CAME::for_memory_constrained();
// Check memory savings
println!("Memory savings: {:.1}%", optimizer.memory_savings_ratio() * 100.0);§MicroAdam (NeurIPS 2024)
Memory-efficient Adam with compressed gradients:
use trustformers_optim::MicroAdam;
// Standard configuration with adaptive compression
let optimizer = MicroAdam::new();
// For large language models (higher compression)
let optimizer = MicroAdam::for_large_models();
// Memory-constrained environments (aggressive compression)
let optimizer = MicroAdam::for_memory_constrained();
// Check compression statistics
println!("{}", optimizer.compression_statistics());
println!("Memory savings: {:.1}%", optimizer.memory_savings_ratio() * 100.0);§Advanced Quantization
Ultra-low memory usage with 4-bit quantization:
use trustformers_optim::{Adam4bit, AdvancedQuantizationConfig, QuantizationMethod};
// 4-bit Adam with NF4 quantization (75% memory savings)
let optimizer = Adam4bit::new(0.001, 0.9, 0.999, 1e-8, 0.01);
// Custom quantization configuration
let quant_config = AdvancedQuantizationConfig {
method: QuantizationMethod::NF4,
block_size: 64,
adaptation_rate: 0.01,
double_quantization: true,
..Default::default()
};
let optimizer = Adam4bit::with_quantization_config(
Default::default(),
quant_config,
);
// Massive memory savings for large models
println!("Memory savings: {:.1}%", optimizer.memory_savings() * 100.0);§Learning Rate Schedules
Control learning rate during training:
use trustformers_optim::{AdamW, CosineScheduler, LRScheduler};
let base_lr = 1e-3;
let optimizer = AdamW::new(base_lr, (0.9, 0.999), 1e-8, 0.01);
// Cosine annealing with warmup
let scheduler = CosineScheduler::new(
base_lr,
1000, // num_warmup_steps
10000, // num_training_steps
1e-5, // min_lr
);
// Update learning rate each step
for step in 0..10000 {
let current_lr = scheduler.get_lr(step);
// Use current_lr with optimizer.set_lr(current_lr)
}
§ZeRO Optimization
Memory-efficient distributed training:
// ZeRO distributed training (requires distributed environment)
use trustformers_optim::{AdamW};
let optimizer = AdamW::new(1e-4, (0.9, 0.999), 1e-8, 0.01);
// ZeRO configuration and distributed setup would go here
§ZeRO Stages
- Stage 1: Optimizer state partitioning (4x memory reduction)
- Stage 2: Optimizer + gradient partitioning (8x memory reduction)
- Stage 3: Full parameter partitioning (Nx memory reduction)
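The stages above can be combined with any of the core optimizers. Below is a minimal sketch of wrapping AdamW with ZeRO Stage 2; the ZeROConfig fields, the ZeROStage variant name, and the ZeROOptimizer::new signature are assumptions for illustration only, so consult the zero module documentation for the exact API.
use trustformers_optim::{AdamW, ZeROConfig, ZeROOptimizer, ZeROStage};
// Assumed configuration shape: pick a stage and the number of participating ranks.
let config = ZeROConfig {
    stage: ZeROStage::Stage2,  // partition optimizer states and gradients
    world_size: 8,             // assumed field: ranks sharing the partitions
    ..Default::default()       // assumes ZeROConfig implements Default
};
let base = AdamW::new(1e-4, (0.9, 0.999), 1e-8, 0.01);
// Assumed constructor: wrap the base optimizer so its state is partitioned across ranks.
let mut optimizer = ZeROOptimizer::new(base, config);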
§Multi-Node Training
Scale training across multiple machines:
// Multi-node distributed training setup
// Configuration and training would require distributed environment
// Example: MultiNodeTrainer::new(config)
§Advanced Features
§Gradient Accumulation
// Example: accumulate gradients over multiple micro-batches before stepping
// for (step, batch) in data.iter().enumerate() {
//     // ... forward/backward on `batch` accumulates into the gradients ...
//     if (step + 1) % accumulation_steps == 0 {
//         optimizer.step(&mut model.parameters())?;
//         optimizer.zero_grad();
//     }
// }
§Mixed Precision Training
// Mixed precision optimizers can provide memory savings and speed improvements
// Configuration example:
// MixedPrecisionOptimizer::new(base_optimizer, scale_config)
§Performance Tips
- Choose the Right Optimizer:
  - AdamW for most transformer training
  - SGD for fine-tuning with small learning rates
  - LAMB for large batch training
- Learning Rate Scheduling:
  - Use warmup for stable training start
  - Cosine schedule for most cases
  - Linear decay for fine-tuning
- Memory Optimization:
  - Enable ZeRO Stage 2 for models > 1B parameters
  - Use gradient accumulation for larger effective batch sizes (a combined sketch follows this list)
  - Consider CPU offloading for very large models
- Distributed Training:
  - Use data parallelism for models < 10B parameters
  - Add model parallelism for larger models
  - Enable communication overlap for better throughput
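These tips compose in a single training loop. The sketch below reuses the AdamW constructor, CosineScheduler, set_lr, get_lr, step, and zero_grad calls shown earlier on this page; the forward/backward pass and the model itself are placeholders, and error handling is elided.
use trustformers_optim::{AdamW, CosineScheduler, LRScheduler};
use trustformers_core::traits::Optimizer;
let accumulation_steps = 4;
let mut optimizer = AdamW::new(1e-3, (0.9, 0.999), 1e-8, 0.01);
let scheduler = CosineScheduler::new(1e-3, 1000, 10_000, 1e-5);
for step in 0..10_000 {
    // Warmup followed by cosine decay: apply the scheduled learning rate each step.
    optimizer.set_lr(scheduler.get_lr(step));
    // ... run forward/backward on one micro-batch here (placeholder) ...
    // Step only every `accumulation_steps` micro-batches for a larger effective batch size.
    if (step + 1) % accumulation_steps == 0 {
        // optimizer.step(&mut model.parameters())?;
        // optimizer.zero_grad();
    }
}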
Re-exports§
pub use adafactor_new::AdaFactor;pub use adafactor_new::AdaFactorConfig;pub use adafisher_simple::AdaFisher;pub use adafisher_simple::AdaFisherConfig;pub use adam::AdaBelief;pub use adam::Adam;pub use adam::AdamW;pub use adam::NAdam;pub use adam::RAdam;pub use adam_v2::AdamConfig;pub use adam_v2::StandardizedAdam;pub use adam_v2::StandardizedAdamW;pub use adamax_plus::AdaMaxPlus;pub use adamax_plus::AdaMaxPlusConfig;pub use adan::Adan;pub use adan::AdanConfig;pub use adaptive::create_ranger;pub use adaptive::create_ranger_with_config;pub use adaptive::AMSBound;pub use adaptive::AdaBound;pub use adaptive::Ranger;pub use ademamix::AdEMAMix;pub use ademamix::AdEMAMixConfig;pub use advanced_2025_research::AdaWin;pub use advanced_2025_research::AdaWinConfig;pub use advanced_2025_research::DiWo;pub use advanced_2025_research::DiWoConfig;pub use advanced_2025_research::MeZOV2;pub use advanced_2025_research::MeZOV2Config;pub use advanced_distributed_features::AutoScaler;pub use advanced_distributed_features::AutoScalerConfig;pub use advanced_distributed_features::CheckpointConfig as AdvancedCheckpointConfig;pub use advanced_distributed_features::CheckpointInfo;pub use advanced_distributed_features::CostOptimizer;pub use advanced_distributed_features::MLOptimizerConfig;pub use advanced_distributed_features::OptimizationResult;pub use advanced_distributed_features::OptimizationType;pub use advanced_distributed_features::PerformanceMLOptimizer;pub use advanced_distributed_features::ScalingDecision;pub use advanced_distributed_features::ScalingStrategy;pub use advanced_distributed_features::SmartCheckpointManager;pub use advanced_distributed_features::WorkloadPredictor;pub use advanced_features::CheckpointConfig;pub use advanced_features::FusedOptimizer;pub use advanced_features::MemoryBandwidthOptimizer;pub use advanced_features::MultiOptimizerStats;pub use advanced_features::MultiOptimizerTrainer;pub use advanced_features::ResourceUtilization;pub use advanced_features::WarmupOptimizer;pub use advanced_features::WarmupStrategy;pub use amacp::AMacP;pub use amacp::AMacPConfig;pub use amacp::AMacPStats;pub use async_optim::AsyncSGD;pub use async_optim::AsyncSGDConfig;pub use async_optim::DelayCompensationMethod;pub use async_optim::DelayedGradient;pub use async_optim::DelayedGradientConfig;pub use async_optim::ElasticAveraging;pub use async_optim::ElasticAveragingConfig;pub use async_optim::Hogwild;pub use async_optim::HogwildConfig;pub use async_optim::ParameterServer;pub use averaged_adam::AveragedAdam;pub use averaged_adam::AveragedAdamConfig;pub use bge_adam::BGEAdam;pub use bge_adam::BGEAdamConfig;pub use bge_adam_optimized::OptimizedBGEAdam;pub use bge_adam_optimized::OptimizedBGEAdamConfig;pub use cache_friendly::CacheConfig;pub use cache_friendly::CacheFriendlyAdam;pub use cache_friendly::CacheFriendlyState;pub use cache_friendly::CacheStats;pub use cache_friendly::ParameterMetadata;pub use came::CAMEConfig;pub use came::CAME;pub use common::BiasCorrection;pub use common::GradientProcessor;pub use common::OptimizerState;pub use common::ParameterIds;pub use common::ParameterUpdate;pub use common::StateMemoryStats;pub use common::WeightDecayMode;pub use compression::CompressedAllReduce;pub use compression::CompressedGradient;pub use compression::CompressionMethod;pub use compression::GradientCompressor;pub use continual_learning::AllocationStrategy;pub use continual_learning::EWCConfig;pub use continual_learning::FisherMethod;pub use continual_learning::L2Regularization;pub use 
continual_learning::L2RegularizationConfig;pub use continual_learning::MemoryReplay;pub use continual_learning::MemoryReplayConfig;pub use continual_learning::MemorySelectionStrategy;pub use continual_learning::PackNet;pub use continual_learning::PackNetConfig;pub use continual_learning::UpdateStrategy;pub use continual_learning::EWC;pub use convergence::AggMo;pub use convergence::AggMoConfig;pub use convergence::FISTAConfig;pub use convergence::HeavyBall;pub use convergence::HeavyBallConfig;pub use convergence::NesterovAcceleratedGradient;pub use convergence::NesterovAcceleratedGradientConfig;pub use convergence::QHMConfig;pub use convergence::VarianceReduction;pub use convergence::VarianceReductionConfig;pub use convergence::VarianceReductionMethod;pub use convergence::FISTA;pub use convergence::QHM;pub use cpu_offload::create_cpu_offloaded_adam;pub use cpu_offload::create_cpu_offloaded_adamw;pub use cpu_offload::create_cpu_offloaded_sgd;pub use cpu_offload::CPUOffloadConfig;pub use cpu_offload::CPUOffloadStats;pub use cpu_offload::CPUOffloadedOptimizer;pub use cross_framework::ConfigSource;pub use cross_framework::ConfigTarget;pub use cross_framework::CrossFrameworkConverter;pub use cross_framework::Framework;pub use cross_framework::JAXOptimizerConfig;pub use cross_framework::PyTorchOptimizerConfig;pub use cross_framework::TrustformeRSOptimizerConfig;pub use cross_framework::UniversalOptimizerConfig;pub use cross_framework::UniversalOptimizerState;pub use deep_distributed_qp::DeepDistributedQP;pub use deep_distributed_qp::DeepDistributedQPConfig;pub use enhanced_distributed_training::Bottleneck;pub use enhanced_distributed_training::CompressionConfig;pub use enhanced_distributed_training::CompressionType;pub use enhanced_distributed_training::DistributedConfig;pub use enhanced_distributed_training::DistributedTrainingStats;pub use enhanced_distributed_training::DynamicBatchingConfig;pub use enhanced_distributed_training::EnhancedDistributedTrainer;pub use enhanced_distributed_training::FaultToleranceConfig;pub use enhanced_distributed_training::MemoryOptimizationConfig;pub use enhanced_distributed_training::MonitoringConfig as DistributedMonitoringConfig;pub use enhanced_distributed_training::PerformanceMetrics as DistributedPerformanceMetrics;pub use enhanced_distributed_training::PerformanceTrend;pub use enhanced_distributed_training::TrainingStepResult;pub use eva::EVAConfig;pub use eva::EVA;pub use federated::ClientInfo;pub use federated::ClientSelectionStrategy;pub use federated::DifferentialPrivacy;pub use federated::DifferentialPrivacyConfig;pub use federated::FedAvg;pub use federated::FedAvgConfig;pub use federated::FedProx;pub use federated::FedProxConfig;pub use federated::NoiseMechanism;pub use federated::SecureAggregation;pub use fusion::simd;pub use fusion::FusedOperation;pub use fusion::FusedOptimizerState;pub use fusion::FusionConfig;pub use fusion::FusionStats;pub use genie_stub::DomainStats;pub use genie_stub::GENIEConfig;pub use genie_stub::GENIEStats;pub use genie_stub::GENIE;pub use gradient_processing::AdaptiveClippingConfig;pub use gradient_processing::GradientProcessedOptimizer;pub use gradient_processing::GradientProcessingConfig;pub use gradient_processing::HessianApproximationType;pub use gradient_processing::HessianPreconditioningConfig;pub use gradient_processing::NoiseInjectionConfig;pub use gradient_processing::NoiseType;pub use gradient_processing::SmoothingConfig;pub use hardware_aware::create_edge_optimizer;pub use hardware_aware::create_gpu_adam;pub 
use hardware_aware::create_mobile_optimizer;pub use hardware_aware::create_tpu_optimizer;pub use hardware_aware::CompressionRatio;pub use hardware_aware::EdgeOptimizer;pub use hardware_aware::GPUAdam;pub use hardware_aware::HardwareAwareConfig;pub use hardware_aware::HardwareTarget;pub use hardware_aware::MobileOptimizer;pub use hardware_aware::TPUOptimizer;pub use hardware_aware::TPUVersion;pub use hierarchical_aggregation::AggregationStats;pub use hierarchical_aggregation::AggregationStrategy;pub use hierarchical_aggregation::ButterflyStructure;pub use hierarchical_aggregation::CommunicationGroups;pub use hierarchical_aggregation::FaultDetector;pub use hierarchical_aggregation::HierarchicalAggregator;pub use hierarchical_aggregation::HierarchicalConfig;pub use hierarchical_aggregation::NodeTopology;pub use hierarchical_aggregation::RecoveryStrategy;pub use hierarchical_aggregation::RingStructure;pub use hierarchical_aggregation::TreeStructure;pub use hn_adam::HNAdam;pub use hn_adam::HNAdamConfig;pub use hyperparameter_tuning::BayesianOptimizer;pub use hyperparameter_tuning::HyperparameterSample;pub use hyperparameter_tuning::HyperparameterSpace;pub use hyperparameter_tuning::HyperparameterTuner;pub use hyperparameter_tuning::MultiObjectiveOptimizer;pub use hyperparameter_tuning::OptimizationTask;pub use hyperparameter_tuning::OptimizerType;pub use hyperparameter_tuning::PerformanceMetrics as HyperparameterPerformanceMetrics;pub use hyperparameter_tuning::TaskType as HyperparameterTaskType;pub use jax_compat::JAXAdam;pub use jax_compat::JAXAdamW;pub use jax_compat::JAXChain;pub use jax_compat::JAXCosineDecay;pub use jax_compat::JAXCosineDecaySchedule;pub use jax_compat::JAXExponentialDecay;pub use jax_compat::JAXGradientTransformation;pub use jax_compat::JAXLearningRateSchedule;pub use jax_compat::JAXOptState;pub use jax_compat::JAXOptimizerFactory;pub use jax_compat::JAXOptimizerState;pub use jax_compat::JAXWarmupCosineDecay;pub use jax_compat::JAXSGD;pub use kernel_fusion::CoalescingLevel;pub use kernel_fusion::FusedGPUState;pub use kernel_fusion::GPUMemoryStats;pub use kernel_fusion::KernelFusedAdam;pub use kernel_fusion::KernelFusionConfig;pub use lamb::LAMB;pub use lancbio::LancBiO;pub use lancbio::LancBiOConfig;pub use lion::Lion;pub use lion::LionConfig;pub use lookahead::Lookahead;pub use lookahead::LookaheadAdam;pub use lookahead::LookaheadAdamW;pub use lookahead::LookaheadNAdam;pub use lookahead::LookaheadRAdam;pub use lookahead::LookaheadSGD;pub use lora::create_lora_adam;pub use lora::create_lora_adamw;pub use lora::create_lora_sgd;pub use lora::LoRAAdapter;pub use lora::LoRAConfig;pub use lora::LoRAOptimizer;pub use lora_rite_stub::LoRARITE;pub use lora_rite_stub::LoRARITEConfig;pub use lora_rite_stub::LoRARITEStats;pub use lora_rite_stub::TransformationStats;pub use memory_layout::AlignedAllocator;pub use memory_layout::AlignmentConfig;pub use memory_layout::LayoutOptimizedAdam;pub use memory_layout::LayoutStats;pub use memory_layout::SoAOptimizerState;pub use microadam::MicroAdam;pub use microadam::MicroAdamConfig;pub use monitoring::ConvergenceIndicators;pub use monitoring::ConvergenceSpeed;pub use monitoring::HyperparameterSensitivity;pub use monitoring::HyperparameterSensitivityConfig;pub use monitoring::HyperparameterSensitivityMetrics;pub use monitoring::MemoryStats;pub use monitoring::MemoryUsage;pub use monitoring::MetricStats;pub use monitoring::MonitoringConfig;pub use monitoring::OptimizerMetrics;pub use monitoring::OptimizerMonitor;pub use 
monitoring::OptimizerRecommendation;pub use monitoring::OptimizerSelector;pub use monitoring::PerformanceStats;pub use monitoring::PerformanceTier;pub use muon::Muon;pub use muon::MuonConfig;pub use pde_aware::PDEAwareConfig;pub use pde_aware::PDEAwareOptimizer;pub use pde_aware::PDEAwareStats;pub use prodigy::Prodigy;pub use prodigy::ProdigyConfig;pub use performance_validation::BenchmarkScenario;pub use performance_validation::ConvergenceAnalysisResults;pub use performance_validation::CorrectnessResults;pub use performance_validation::DistributedValidationResults;pub use performance_validation::MathematicalProperty;pub use performance_validation::MathematicalTestCase;pub use performance_validation::MemoryValidationResults;pub use performance_validation::PerformanceBenchmarkResults;pub use performance_validation::PerformanceValidator;pub use performance_validation::RegressionAnalysisResults;pub use performance_validation::StatisticalMetrics;pub use performance_validation::ValidationConfig;pub use performance_validation::ValidationResults;pub use pytorch_compat::PyTorchAdam;pub use pytorch_compat::PyTorchAdamW;pub use pytorch_compat::PyTorchLRScheduler;pub use pytorch_compat::PyTorchOptimizer;pub use pytorch_compat::PyTorchOptimizerFactory;pub use pytorch_compat::PyTorchOptimizerState;pub use pytorch_compat::PyTorchParamGroup;pub use pytorch_compat::PyTorchSGD;pub use quantized::Adam8bit;pub use quantized::AdamW8bit;pub use quantized::QuantizationConfig;pub use quantized::QuantizedState;pub use quantized_advanced::Adam4bit;pub use quantized_advanced::Adam4bitOptimizerConfig;pub use quantized_advanced::AdvancedQuantizationConfig;pub use quantized_advanced::GradientStatistics;pub use quantized_advanced::QuantizationMethod;pub use quantized_advanced::QuantizationUtils;pub use quantized_advanced::QuantizedTensor;pub use quantum_inspired::QuantumAnnealingConfig;pub use quantum_inspired::QuantumAnnealingOptimizer;pub use quantum_inspired::QuantumAnnealingStats;pub use schedule_free::ScheduleFreeAdam;pub use schedule_free::ScheduleFreeAdamConfig;pub use schedule_free::ScheduleFreeSGD;pub use schedule_free::ScheduleFreeSGDConfig;pub use scheduler::AdaptiveScheduler;pub use scheduler::CompositeScheduler;pub use scheduler::ConstantWithWarmupScheduler;pub use scheduler::CosineScheduler;pub use scheduler::CosineWithRestartsScheduler;pub use scheduler::CyclicalMode;pub use scheduler::CyclicalScheduler;pub use scheduler::DynamicScheduler;pub use scheduler::ExponentialScheduler;pub use scheduler::LRScheduler;pub use scheduler::LinearScheduler;pub use scheduler::OneCycleScheduler;pub use scheduler::Phase;pub use scheduler::PhaseBasedScheduler;pub use scheduler::PolynomialScheduler;pub use scheduler::StepScheduler;pub use scheduler::SwitchCondition;pub use scheduler::TaskSpecificScheduler;pub use scheduler::TaskType as SchedulerTaskType;pub use second_order::LineSearchMethod;pub use second_order::NewtonCG;pub use second_order::SSBFGSConfig;pub use second_order::SSBFGSStats;pub use second_order::SSBroyden;pub use second_order::SSBroydenConfig;pub use second_order::LBFGS;pub use second_order::SSBFGS;pub use sgd::SGD;pub use simd_optimizations::SIMDConfig;pub use simd_optimizations::SIMDOptimizer;pub use simd_optimizations::SIMDPerformanceInfo;pub use sofo_stub::ForwardModeStats;pub use sofo_stub::MemoryStats as SOFOMemoryStats;pub use sofo_stub::SOFOConfig;pub use sofo_stub::SOFOStats;pub use sofo_stub::SOFO;pub use sophia::Sophia;pub use sophia::SophiaConfig;pub use sparse::SparseAdam;pub use 
sparse::SparseConfig;pub use sparse::SparseMomentumState;pub use sparse::SparseSGD;pub use task_specific::create_bert_optimizer;pub use task_specific::create_gan_optimizer;pub use task_specific::create_maml_optimizer;pub use task_specific::create_ppo_optimizer;pub use task_specific::BERTOptimizer;pub use task_specific::GANOptimizer;pub use task_specific::MetaOptimizer as TaskMetaOptimizer;pub use task_specific::RLOptimizer;pub use tensorflow_compat::TensorFlowAdam;pub use tensorflow_compat::TensorFlowAdamW;pub use tensorflow_compat::TensorFlowCosineDecay;pub use tensorflow_compat::TensorFlowExponentialDecay;pub use tensorflow_compat::TensorFlowLearningRateSchedule;pub use tensorflow_compat::TensorFlowOptimizer;pub use tensorflow_compat::TensorFlowOptimizerConfig;pub use tensorflow_compat::TensorFlowOptimizerFactory;pub use traits::AdaptiveMomentumOptimizer;pub use traits::AsyncOptimizer;pub use traits::ClassicalMomentumOptimizer;pub use traits::CompositeOptimizer;pub use traits::DistributedOptimizer;pub use traits::FederatedOptimizer;pub use traits::GPUOptimizer;pub use traits::GradientCompressionOptimizer;pub use traits::HardwareOptimizer;pub use traits::HardwareStats;pub use traits::LookaheadOptimizer;pub use traits::MetaOptimizer;pub use traits::MomentumOptimizer;pub use traits::OptimizerFactory;pub use traits::ScheduledOptimizer;pub use traits::SecondOrderOptimizer;pub use traits::SerializableOptimizer;pub use traits::StalenessCompensation;pub use traits::StatefulOptimizer;pub use zero::all_gather_gradients;pub use zero::gather_parameters;pub use zero::partition_gradients;pub use zero::partition_parameters;pub use zero::reduce_scatter_gradients;pub use zero::GradientBuffer;pub use zero::ParameterGroup;pub use zero::ParameterPartition;pub use zero::ZeROConfig;pub use zero::ZeROImplementationStage;pub use zero::ZeROMemoryStats;pub use zero::ZeROOptimizer;pub use zero::ZeROStage;pub use zero::ZeROStage1;pub use zero::ZeROStage2;pub use zero::ZeROStage3;pub use zero::ZeROState;pub use multinode::MultiNodeConfig;pub use multinode::MultiNodeStats;pub use multinode::MultiNodeTrainer;pub use novograd::MemoryEfficiencyStats;pub use novograd::NovoGrad;pub use novograd::NovoGradConfig;pub use novograd::NovoGradStats;pub use onnx_export::ONNXExportConfig;pub use onnx_export::ONNXGraph;pub use onnx_export::ONNXModel;pub use onnx_export::ONNXNode;pub use onnx_export::ONNXOptimizerExporter;pub use onnx_export::ONNXOptimizerMetadata;pub use onnx_export::OptimizerConfig;pub use parallel::BatchUpdate;pub use parallel::ParallelAdam;pub use parallel::ParallelConfig;pub use parallel::ParallelStats;
Modules§
- adafactor_new - AdaFactor Optimizer
- adafisher_simple - AdaFisher: Adaptive Second Order Optimization via Fisher Information (Simplified)
- adam - Adam and AdamW Optimizers
- adam_v2 - Standardized Adam and AdamW Optimizers
- adamax_plus - AdaMax+ Optimizer
- adan - Adan Optimizer
- adaptive - Advanced Adaptive Optimizers
- ademamix - AdEMAMix Optimizer
- advanced_2025_research - Advanced 2025 Research Optimizers
- advanced_distributed_features - Advanced Distributed Training Features
- advanced_features - Advanced Optimizer Features
- amacp - aMacP: Adaptive Momentum and Consecutive Parameters Optimizer
- async_optim - Asynchronous Optimization Methods
- averaged_adam - Averaged Adam Optimizer (2025)
- bge_adam - BGE-Adam Optimizer
- bge_adam_optimized - Optimized BGE-Adam Optimizer
- cache_friendly - Cache-friendly optimization algorithms for improved memory performance.
- came - CAME Optimizer
- common - Common optimization operations and utilities.
- compression
- continual_learning - Continual Learning Optimizers
- convergence - Convergence Improvement Methods
- cpu_offload - CPU-Offloaded Optimizers
- cross_framework - Cross-Framework Optimizer Conversion
- deep_distributed_qp - DeepDistributedQP: Deep Learning-Aided Distributed Optimization
- enhanced_distributed_training - Enhanced Multi-GPU Distributed Training Framework
- eva - EVA Optimizer
- federated - Federated Learning Optimization
- fusion - Optimizer Fusion Techniques
- genie_stub - GENIE: Generalization-ENhancing Iterative Equalizer Optimizer (Stub Implementation)
- gradient_processing - Gradient Processing Enhancements
- hardware_aware
- hierarchical_aggregation
- hn_adam - HN-Adam Optimizer
- hyperparameter_tuning - Automated Hyperparameter Tuning Framework
- jax_compat - JAX Optimizer API Compatibility Layer
- kernel_fusion - GPU kernel fusion optimizations for high-performance optimization.
- lamb
- lancbio - LancBiO: Dynamic Lanczos-aided Bilevel Optimization
- lion - Lion Optimizer (EvoLved Sign Momentum)
- lookahead - Lookahead Optimizer
- lora - Low-Rank Adaptation (LoRA) Optimizers
- lora_rite_stub - LoRA-RITE: LoRA Done RITE Optimizer (Stub Implementation)
- memory_layout - Memory layout optimizations for improved cache performance.
- microadam - MicroAdam Optimizer
- monitoring - Optimizer Monitoring and Analysis Tools
- multinode - Multi-Node Distributed Training Support
- muon - Muon Optimizer
- novograd - NovoGrad: Memory-Efficient Adaptive Optimizer
- onnx_export - ONNX Optimizer Export
- optimizer
- parallel - Parallel optimization algorithms for multi-threaded training.
- pde_aware
- performance_validation - Comprehensive Performance Validation Framework
- prodigy - Prodigy Optimizer
- pytorch_compat - PyTorch Optimizer API Compatibility Layer
- quantized
- quantized_advanced - Advanced Quantization Techniques
- quantum_inspired - Quantum-Inspired Optimization Algorithms
- schedule_free - Schedule-Free Optimizers
- scheduler - Learning Rate Schedulers
- second_order
- sgd
- simd_optimizations - SIMD Optimizations for Optimizers
- sofo_stub - SOFO: Second-Order Forward Optimizer (Stub Implementation)
- sophia - Sophia Optimizer (Second-order Clipped Stochastic Optimization)
- sparse - Sparse Momentum Methods
- task_specific
- tensorflow_compat - TensorFlow Optimizer API Compatibility Layer
- traits - Advanced optimizer trait hierarchy for TrustformeRS.
- zero - ZeRO (Zero Redundancy Optimizer) Implementation for TrustformeRS