axonml-optim 0.6.2

Optimizers and learning-rate schedulers for the AxonML framework

Overview

axonml-optim provides optimization algorithms for training neural networks in the AxonML framework: five optimizers (SGD, Adam, AdamW, RMSprop, LAMB), seven learning-rate schedulers, a dynamic GradScaler for mixed-precision training, and a training health monitor that watches the run for NaNs, exploding or vanishing gradients, and stalled convergence.

Features

  • SGD - Stochastic Gradient Descent with optional momentum, Nesterov acceleration, weight decay, and dampening.
  • Adam - Adaptive Moment Estimation with bias correction and optional AMSGrad variant.
  • AdamW - Adam with decoupled weight decay regularization for improved generalization.
  • RMSprop - Root Mean Square Propagation with optional momentum and centered gradient normalization.
  • LAMB - Layer-wise Adaptive Moments for large-batch training (batch sizes of 32k+); Adam plus a per-layer trust ratio.
  • Learning Rate Schedulers - StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, OneCycleLR, WarmupLR, and ReduceLROnPlateau.
  • GradScaler - Dynamic loss scaling for AMP; doubles the scale after a run of healthy steps (the growth interval) and halves it on inf/NaN. Pairs with autocast / AutocastGuard from axonml-autograd::amp.
  • Builder Pattern - Fluent API (Adam::new(...).betas(...).eps(...).weight_decay(...).amsgrad(true)) for configuring optimizer hyperparameters.
  • Unified Interface - Common Optimizer trait (step, zero_grad, get_lr, set_lr) for interoperability; see the sketch after this list.
  • Fused Optimizer Loops - Adam, SGD, and RMSprop apply momentum, weight decay, and parameter updates in a single pass per tensor, reducing memory traffic.
  • GPU-Resident State - Optimizer state (e.g. LAMB's exp_avg / exp_avg_sq) is allocated on the same device as the parameter; no CPU round-trips.
  • Training Health Monitor - TrainingMonitor records per-step loss, gradient norm, and LR; emits TrainingAlerts (NaN, exploding/vanishing grad, stalled loss) with AlertSeverity; exports HealthReport including LossTrend and a convergence score with suggested LR.
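
All optimizers implement the same Optimizer trait, so training utilities can stay generic over the concrete algorithm. A minimal sketch using only the trait methods listed above (halve_lr is an illustrative helper, not part of the crate):

use axonml_optim::Optimizer;

// Works unchanged with SGD, Adam, AdamW, RMSprop, or LAMB.
fn halve_lr<O: Optimizer>(optimizer: &mut O) {
    let lr = optimizer.get_lr();
    optimizer.set_lr(lr * 0.5);
}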

Modules

Module        Description
optimizer     Core Optimizer trait
sgd           SGD with momentum, Nesterov, dampening, weight decay
adam          Adam and AdamW
rmsprop       RMSprop with optional centering and momentum
lamb          LAMB layer-wise adaptive moments
lr_scheduler  LRScheduler trait + seven concrete schedulers
grad_scaler   GradScaler / GradScalerState for AMP loss scaling
health        TrainingMonitor, MonitorConfig, HealthReport, TrainingAlert, AlertKind, AlertSeverity, LossTrend

Cargo Features

Feature  Purpose
cuda     Forwards CUDA support to axonml-core / axonml-tensor so optimizer state stays GPU-resident

Usage

Add to your Cargo.toml:

[dependencies]
axonml-optim = "0.6.2"
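
To keep optimizer state GPU-resident, enable the cuda feature listed under Cargo Features:

[dependencies]
axonml-optim = { version = "0.6.2", features = ["cuda"] }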

Basic Training Loop

use axonml_optim::prelude::*;
use axonml_nn::{Linear, Module, Sequential, MSELoss};
use axonml_autograd::Variable;
use axonml_tensor::Tensor;

// Create model
let model = Sequential::new()
    .add(Linear::new(784, 128))
    .add(Linear::new(128, 10));

// Create optimizer
let mut optimizer = Adam::new(model.parameters(), 0.001);
let loss_fn = MSELoss::new();

// Training loop (input and target come from your data pipeline)
for _epoch in 0..100 {
    let output = model.forward(&input);
    let loss = loss_fn.compute(&output, &target);

    optimizer.zero_grad();
    loss.backward();
    optimizer.step();
}

SGD with Momentum

use axonml_optim::{SGD, Optimizer};

// Basic SGD
let mut optimizer = SGD::new(model.parameters(), 0.01);

// SGD with momentum
let mut optimizer = SGD::new(model.parameters(), 0.01)
    .momentum(0.9)
    .weight_decay(0.0001)
    .nesterov(true);

Adam with Custom Configuration

use axonml_optim::{Adam, AdamW, Optimizer};

// Adam with custom betas
let mut optimizer = Adam::new(model.parameters(), 0.001)
    .betas((0.9, 0.999))
    .eps(1e-8)
    .weight_decay(0.01)
    .amsgrad(true);

// AdamW for decoupled weight decay
let mut optimizer = AdamW::new(model.parameters(), 0.001)
    .weight_decay(0.01);

LAMB for Large-Batch Training

use axonml_optim::{LAMB, Optimizer};

let mut optimizer = LAMB::new(model.parameters(), 0.001);
// LAMB scales each parameter's update by a per-layer trust ratio,
// enabling stable training at batch sizes of 32k+.
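
For intuition, here is a plain-Rust sketch of the per-layer trust-ratio step (illustrative only, not the crate's internal code; it follows the usual LAMB formulation, with the Adam-style update rescaled by the ratio of the weight norm to the update norm, weight decay omitted):

// L2 norm of one layer's values.
fn l2_norm(xs: &[f32]) -> f32 {
    xs.iter().map(|x| x * x).sum::<f32>().sqrt()
}

// Apply one LAMB-style update to a single layer.
fn lamb_layer_update(weights: &mut [f32], adam_update: &[f32], lr: f32) {
    let w_norm = l2_norm(weights);
    let u_norm = l2_norm(adam_update);
    // Trust ratio: size of the weights relative to the size of the update.
    let trust_ratio = if w_norm > 0.0 && u_norm > 0.0 {
        w_norm / u_norm
    } else {
        1.0
    };
    for (w, u) in weights.iter_mut().zip(adam_update) {
        *w -= lr * trust_ratio * u;
    }
}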

Learning Rate Scheduling

use axonml_optim::{SGD, StepLR, CosineAnnealingLR, OneCycleLR, LRScheduler};

let mut optimizer = SGD::new(model.parameters(), 0.1);

// Step decay every 10 epochs
let mut scheduler = StepLR::new(&optimizer, 10, 0.1);

// Cosine annealing
let mut scheduler = CosineAnnealingLR::new(&optimizer, 100);

// One-cycle policy for super-convergence
let mut scheduler = OneCycleLR::new(&optimizer, 0.1, 1000);

// In training loop
for epoch in 0..epochs {
    // ... training ...
    scheduler.step(&mut optimizer);
}

ReduceLROnPlateau

use axonml_optim::{SGD, ReduceLROnPlateau};

let mut optimizer = SGD::new(model.parameters(), 0.1);
let mut scheduler = ReduceLROnPlateau::with_options(
    &optimizer,
    "min",    // mode: minimize metric
    0.1,      // factor: reduce LR by 10x
    10,       // patience: wait 10 epochs
    1e-4,     // threshold
    0,        // cooldown
    1e-6,     // min_lr
);

// Call once per validation pass with the monitored metric
scheduler.step_with_metric(&mut optimizer, val_loss);

Mixed Precision (AMP) with GradScaler

use axonml_optim::{Adam, GradScaler, Optimizer};
use axonml_autograd::autocast;
use axonml_core::DType;

let mut optimizer = Adam::new(model.parameters(), 1e-3);
let mut scaler = GradScaler::new();

for batch in batches {
    optimizer.zero_grad();

    // Forward in F16
    let loss = autocast(DType::F16, || loss_fn.compute(&model.forward(&batch.x), &batch.y));

    // Scale the loss before backward to avoid F16 underflow
    let scaled = scaler.scale(&loss);
    scaled.backward();

    // Unscale and step if no inf/NaN; update scale factor
    scaler.step(&mut optimizer);
    scaler.update();
}
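
The scale update follows the policy described in the feature list: halve on inf/NaN, double after a run of healthy steps. A simplified sketch of that policy (illustrative; the struct and field names here are assumptions, not the crate's API):

// Simplified dynamic loss-scaling policy.
struct ScaleState {
    scale: f32,
    healthy_steps: u32,
    growth_interval: u32,
}

fn update_scale(state: &mut ScaleState, found_inf_or_nan: bool) {
    if found_inf_or_nan {
        // Overflow detected: the optimizer step was skipped, so back off.
        state.scale *= 0.5;
        state.healthy_steps = 0;
    } else {
        state.healthy_steps += 1;
        if state.healthy_steps >= state.growth_interval {
            // A full growth interval without overflow: grow the scale again.
            state.scale *= 2.0;
            state.healthy_steps = 0;
        }
    }
}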

Training Health Monitor

TrainingMonitor watches training health and flags problems before they ruin the run.

use axonml_optim::{TrainingMonitor, MonitorConfig, AlertSeverity};

let mut monitor = TrainingMonitor::new(MonitorConfig::default());

// Record metrics each training step
for step in 0..1000 {
    let loss = train_step(&model, &batch);
    let grad_norm = compute_grad_norm(&model);

    monitor.record_step(loss, grad_norm, optimizer.get_lr());

    // Check for alerts
    for alert in monitor.alerts_since_last_check() {
        match alert.severity {
            AlertSeverity::Critical => eprintln!("CRITICAL: {}", alert.message),
            AlertSeverity::Warning => eprintln!("WARNING: {}", alert.message),
            _ => {}
        }
    }
}

// Analyze training health
let report = monitor.health_report();
println!("Loss trend: {:?}", report.loss_trend);
println!("Convergence: {:.2}", monitor.convergence_score());
println!("Suggested LR: {:?}", monitor.suggest_lr());
println!("{}", monitor.summary());

Tests

Run the test suite:

cargo test -p axonml-optim

License

Licensed under either of:

at your option.


Last updated: 2026-04-16 (v0.6.2)