# trustformers-optim TODO List
## Overview
The `trustformers-optim` crate provides comprehensive optimization algorithms and learning rate schedulers
for training transformer models in the TrustformeRS ecosystem. It implements more than 20 optimizers,
from standard methods (SGD, Adam) and modern variants (AdamW, LAMB) to recent research algorithms
(Lion, Sophia, SAM).
**Key Responsibilities:**
- Optimization algorithms (SGD, Adam, AdamW, Lion, Sophia, LAMB, etc.)
- Learning rate schedulers (Linear, Cosine, OneCycle, etc.)
- Gradient clipping and normalization
- Weight decay (L2 regularization and decoupled)
- Optimizer state management (save/load checkpoints)
- Parameter groups for layer-wise learning rates
- Mixed precision training support
---
## Current Status
### Implementation Status
✅ **PRODUCTION-READY** - All major optimizers implemented and tested
✅ **COMPREHENSIVE TEST COVERAGE** - 417 tests with 100% pass rate
✅ **CUTTING-EDGE ALGORITHMS** - Latest research optimizers (2023-2024)
✅ **ZERO COMPILATION ERRORS** - Clean compilation across all platforms
✅ **MEMORY EFFICIENT** - Optimized for large models
### Test Metrics
- **Test Count:** 417 unit tests
- **Pass Rate:** 100%
- **Coverage:** Optimizer convergence, scheduler validation, gradient clipping, state save/load
- **Numerical Stability:** Extensive testing with edge cases
---
## Completed Optimizer Implementations
### Standard Optimizers
#### SGD (Stochastic Gradient Descent)
**Classic first-order optimization**
- ✅ **Algorithm**
- Update: `θ ← θ - lr * ∇L(θ)`
- With momentum: `v ← β * v + ∇L(θ), θ ← θ - lr * v`
- Momentum coefficient β (typically 0.9)
- ✅ **Features**
- Momentum support for acceleration
- Nesterov momentum option
- Weight decay (L2 regularization)
- Dampening for momentum
- ✅ **Use Cases**
- Simple baseline
- Works well with large batch sizes
- Computer vision tasks
- ✅ **Hyperparameters**
- Learning rate: 0.1 - 0.001 (typical range)
- Momentum: 0.9 (standard)
- Weight decay: 1e-4 - 1e-5
**Example:**
```rust
use trustformers_optim::SGD;
// Arguments shown positionally; the exact constructor signature may differ.
let optimizer = SGD::new(
    model.parameters(),
    0.01, // learning rate
    0.9,  // momentum
    1e-4, // weight decay
)?;
```
---
#### Adam (Adaptive Moment Estimation)
**Adaptive learning rate optimizer with momentum**
- ✅ **Algorithm**
- First moment: `m ← β1 * m + (1 - β1) * ∇L`
- Second moment: `v ← β2 * v + (1 - β2) * ∇L²`
- Bias correction: `m̂ ← m / (1 - β1^t), v̂ ← v / (1 - β2^t)`
- Update: `θ ← θ - lr * m̂ / (√v̂ + ε)`
- ✅ **Features**
- Per-parameter adaptive learning rates
- Momentum on gradients (first moment)
- Momentum on squared gradients (second moment)
- Bias correction for initial timesteps
- ✅ **Use Cases**
- Default optimizer for many tasks
- Works well with sparse gradients
- NLP and transformers
- ✅ **Hyperparameters**
- Learning rate: 1e-3 (default), 1e-4 - 3e-4 (transformers)
- β1: 0.9 (momentum for gradients)
- β2: 0.999 (momentum for squared gradients)
- ε: 1e-8 (numerical stability)
**Example:**
```rust
use trustformers_optim::Adam;
// Arguments shown positionally; the exact constructor signature may differ.
let optimizer = Adam::new(
    model.parameters(),
    1e-3,         // learning rate
    (0.9, 0.999), // betas
    1e-8,         // eps
    0.0,          // weight decay
)?;
```
---
#### AdamW (Adam with Decoupled Weight Decay)
**Adam with proper weight decay**
- ✅ **Algorithm**
- Same as Adam for moment estimates
- Decoupled weight decay: `θ ← θ - lr * λ * θ` (applied after Adam update)
- Fixes weight decay in Adam (L2 regularization ≠ weight decay for adaptive methods)
- ✅ **Features**
- Proper weight decay (not L2 regularization)
- Better generalization than Adam
- Recommended for transformer training
- ✅ **Use Cases**
- Transformer pretraining and fine-tuning
- Large language models (BERT, GPT, etc.)
- Default choice for modern NLP
- ✅ **Hyperparameters**
- Learning rate: 1e-4 (transformers), 3e-4 (LLMs)
- β1: 0.9, β2: 0.999 (or 0.98 for LLMs)
- Weight decay: 0.01 - 0.1 (typical)
**Example:**
```rust
use trustformers_optim::AdamW;
// Arguments shown positionally; the exact constructor signature may differ.
let optimizer = AdamW::new(
    model.parameters(),
    1e-4,         // learning rate
    (0.9, 0.999), // betas
    1e-8,         // eps
    0.01,         // weight decay
)?;
```
---
#### AdaGrad (Adaptive Gradient)
**Adaptive learning rate based on historical gradients**
- ✅ **Algorithm**
- Accumulate squared gradients: `G ← G + ∇L²`
- Update: `θ ← θ - lr * ∇L / (√G + ε)`
- ✅ **Features**
- Per-parameter learning rates
- Larger updates for infrequent parameters
- Good for sparse data
- ✅ **Use Cases**
- Sparse features (NLP with large vocabularies)
- Online learning
- Convex optimization
- ✅ **Limitations**
- Learning rate decays too aggressively
- May stop learning prematurely
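The accumulation rule above fits in a few lines. A minimal sketch over flat `f32` buffers (the helper name and buffer layout are illustrative, not the crate's API):
```rust
/// AdaGrad step: accumulate squared gradients, then scale each update
/// by the inverse square root of that running sum.
fn adagrad_step(params: &mut [f32], grads: &[f32], accum: &mut [f32], lr: f32, eps: f32) {
    for i in 0..params.len() {
        accum[i] += grads[i] * grads[i];                      // G <- G + g^2
        params[i] -= lr * grads[i] / (accum[i].sqrt() + eps); // theta <- theta - lr * g / (sqrt(G) + eps)
    }
}
```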
---
#### RMSProp (Root Mean Square Propagation)
**Adaptive learning rate with exponential moving average**
- ✅ **Algorithm**
- Exponential moving average: `E[∇L²] ← α * E[∇L²] + (1 - α) * ∇L²`
- Update: `θ ← θ - lr * ∇L / (√E[∇L²] + ε)`
- ✅ **Features**
- Fixes AdaGrad's aggressive decay
- Momentum on squared gradients only
- Good for non-stationary objectives
- ✅ **Use Cases**
- RNNs and LSTMs
- Non-stationary problems
- Online learning
- ✅ **Hyperparameters**
- Learning rate: 1e-3
- α (decay): 0.99
- ε: 1e-8
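For comparison with AdaGrad, a minimal sketch of the RMSProp update over flat `f32` buffers (helper name illustrative): the only change is that the squared-gradient statistic is an exponential moving average rather than an unbounded sum.
```rust
/// RMSProp step: EMA of squared gradients instead of AdaGrad's running sum,
/// so the effective learning rate does not decay toward zero.
fn rmsprop_step(params: &mut [f32], grads: &[f32], sq_avg: &mut [f32],
                lr: f32, alpha: f32, eps: f32) {
    for i in 0..params.len() {
        sq_avg[i] = alpha * sq_avg[i] + (1.0 - alpha) * grads[i] * grads[i];
        params[i] -= lr * grads[i] / (sq_avg[i].sqrt() + eps);
    }
}
```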
---
### Advanced Optimizers
#### Lion (Evolved Sign Momentum)
**Sign-based optimizer from evolutionary search**
- ✅ **Algorithm**
- Update: `θ ← θ - lr * sign(β1 * m + (1 - β1) * ∇L)` (sign of an interpolation between momentum and gradient)
- Momentum (refreshed after the parameter step): `m ← β2 * m + (1 - β2) * ∇L`
- ✅ **Features**
- Memory-efficient (only stores momentum, not variance)
- Discovered via evolutionary algorithm
- Competitive or better than AdamW
- ✅ **Use Cases**
- Large language model training
- Vision transformers
- Memory-constrained scenarios
- ✅ **Hyperparameters**
- Learning rate: 1e-4 (typically 3-10x smaller than AdamW)
- β1: 0.9, β2: 0.99
- Weight decay: 0.01 - 0.1 (typically larger than AdamW)
**Example:**
```rust
use trustformers_optim::Lion;
// Arguments shown positionally; the exact constructor signature may differ.
let optimizer = Lion::new(
    model.parameters(),
    1e-4,        // learning rate
    (0.9, 0.99), // betas
    0.1,         // weight decay
)?;
```
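The per-element update itself is small; a minimal sketch on flat `f32` buffers (not the crate's internal implementation):
```rust
/// Lion step: sign of an interpolation between momentum and gradient,
/// with decoupled weight decay folded into the same update.
fn lion_step(params: &mut [f32], grads: &[f32], momentum: &mut [f32],
             lr: f32, beta1: f32, beta2: f32, weight_decay: f32) {
    for i in 0..params.len() {
        let c = beta1 * momentum[i] + (1.0 - beta1) * grads[i];
        params[i] -= lr * (c.signum() + weight_decay * params[i]);
        // The momentum EMA is refreshed *after* the parameter update.
        momentum[i] = beta2 * momentum[i] + (1.0 - beta2) * grads[i];
    }
}
```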
---
#### Sophia (Second-Order Optimizer)
**Scalable second-order optimizer for LLMs**
- ✅ **Algorithm**
- Uses Hessian diagonal approximation
- Hutchinson's estimator for diagonal estimation
- Pre-conditioned gradient descent
- Update (simplified): `θ ← θ - lr * clip(m / (h + ε), 1)` where m is the gradient EMA, h is the Hessian diagonal estimate, and the clip is element-wise
- ✅ **Features**
- Second-order information without full Hessian
- Scalable to billion-parameter models
- Better convergence than first-order methods
- ✅ **Use Cases**
- Large language model pretraining
- When compute budget allows
- Better sample efficiency
- ✅ **Hyperparameters**
- Learning rate: 2e-4 - 1e-3
- Hessian update frequency: every 10-100 steps
- γ (EMA coefficient for Hessian): 0.95
---
#### LAMB (Layer-wise Adaptive Moments for Batch training)
**Optimizer for very large batch training**
- ✅ **Algorithm**
- Compute Adam-like update: `u = m / (√v + ε)`
- Layer-wise trust ratio: `r = ||θ|| / ||u||`
- Update: `θ ← θ - lr * r * u`
- Normalizes update by layer norm ratio
- ✅ **Features**
- Enables large batch training (32k+)
- Layer-wise learning rate adaptation
- Maintains accuracy with large batches
- ✅ **Use Cases**
- Large-scale distributed training
- BERT pretraining with large batches
- When wall-clock time is critical
- ✅ **Hyperparameters**
- Learning rate: 1e-3 - 2e-3
- β1: 0.9, β2: 0.999
- Weight decay: 0.01
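The distinctive part of LAMB is the per-layer trust ratio; a minimal sketch for a single layer, assuming the Adam-style update `u` has already been computed (names and buffer layout illustrative):
```rust
/// Scale a layer's Adam-style update `u = m_hat / (sqrt(v_hat) + eps)`
/// by the trust ratio ||theta|| / ||u|| before applying it.
fn lamb_layer_step(params: &mut [f32], update: &[f32], lr: f32) {
    let w_norm = params.iter().map(|x| x * x).sum::<f32>().sqrt();
    let u_norm = update.iter().map(|x| x * x).sum::<f32>().sqrt();
    // Fall back to a ratio of 1.0 when either norm is zero (e.g. zero-initialized weights).
    let ratio = if w_norm > 0.0 && u_norm > 0.0 { w_norm / u_norm } else { 1.0 };
    for i in 0..params.len() {
        params[i] -= lr * ratio * update[i];
    }
}
```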
---
#### SAM (Sharpness-Aware Minimization)
**Optimizer for flat minima (better generalization)**
- ✅ **Algorithm**
1. Compute gradient: `∇L(θ)`
2. Compute adversarial perturbation: `ε = ρ * ∇L / ||∇L||`
3. Compute gradient at perturbed point: `∇L(θ + ε)`
4. Update: `θ ← θ - lr * ∇L(θ + ε)`
- Seeks flat minima in loss landscape
- ✅ **Features**
- Better generalization (lower test error)
- Robust to label noise
- Finds flatter minima
- ✅ **Use Cases**
- Computer vision (state-of-the-art on ImageNet)
- When generalization is critical
- Fine-tuning tasks
- ✅ **Hyperparameters**
- Learning rate: same as base optimizer (SGD or Adam)
- ρ (perturbation radius): 0.05 - 0.1
- Base optimizer: typically SGD with momentum
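The two-gradient structure is the essential point; a minimal sketch in which `grad_fn` stands in for a model-specific forward/backward pass (illustrative, not the crate's API):
```rust
/// SAM step: perturb the weights toward the gradient direction, re-evaluate the
/// gradient at the perturbed point, then update the *original* weights with it.
fn sam_step(params: &mut [f32], grad_fn: &dyn Fn(&[f32]) -> Vec<f32>, lr: f32, rho: f32) {
    let g = grad_fn(&params[..]);
    let g_norm = g.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    // Adversarial perturbation: epsilon = rho * g / ||g||
    let perturbed: Vec<f32> = params.iter().zip(&g).map(|(p, gi)| p + rho * gi / g_norm).collect();
    let g_adv = grad_fn(&perturbed[..]);
    for (p, gi) in params.iter_mut().zip(&g_adv) {
        *p -= lr * gi; // descend using the gradient taken at theta + epsilon
    }
}
```
Note that each SAM step costs two forward/backward passes, which is the main overhead relative to the base optimizer.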
---
#### EVA (Exponential Moving Average with Variance Adaptation)
**Adam variant with better convergence**
- ✅ **Algorithm**
- Enhanced variance adaptation
- Exponential moving averages for both moments
- Adaptive variance scaling
- ✅ **Features**
- Better convergence than Adam on some tasks
- Reduced variance in gradients
- Stable training
---
#### Adan (Adaptive Nesterov Momentum)
**Nesterov momentum with adaptive learning rate**
- ✅ **Algorithm**
- Combines Nesterov momentum with adaptive learning rates
- Tracks three EMA buffers: the gradient, the gradient difference, and the squared Nesterov-corrected gradient
- Better convergence than Adam in some settings
- ✅ **Features**
- Faster convergence
- Better generalization
- Stable across learning rates
---
### Specialized Optimizers
#### LARS (Layer-wise Adaptive Rate Scaling)
**For very large batch distributed training**
- ✅ **Algorithm**
- Layer-wise learning rate: `lr_layer = lr * ||θ|| / (||∇L|| + λ||θ||)`
- Normalizes learning rate by layer weight/gradient norms
- Polynomial warmup
- ✅ **Features**
- Enables batch sizes of 32k+
- Linear scaling rule for learning rate
- Used in ResNet training
- ✅ **Use Cases**
- Computer vision with large batches
- Distributed training across many GPUs
- When throughput is critical
---
#### Lookahead
**Wrapper optimizer for improved stability**
- ✅ **Algorithm**
- Maintain slow and fast weights
- Fast weights updated by inner optimizer (e.g., Adam)
- Every k steps: `θ_slow ← θ_slow + α * (θ_fast - θ_slow)`
- Interpolate between fast and slow weights
- ✅ **Features**
- More stable training
- Better generalization
- Can wrap any optimizer
- ✅ **Use Cases**
- Wrap around Adam/SGD for stability
- Tasks with noisy gradients
- Improve generalization
- ✅ **Hyperparameters**
- k (lookahead steps): 5 - 10
- α (slow weight step size): 0.5
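The synchronization step is the only Lookahead-specific logic; a minimal sketch (names illustrative), called once every `k` inner-optimizer steps:
```rust
/// Lookahead synchronization: pull the slow weights toward the fast weights,
/// then restart the fast weights from the new slow weights.
fn lookahead_sync(slow: &mut [f32], fast: &mut [f32], alpha: f32) {
    for i in 0..slow.len() {
        slow[i] += alpha * (fast[i] - slow[i]); // theta_slow += alpha * (theta_fast - theta_slow)
        fast[i] = slow[i];
    }
}
```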
---
#### RAdam (Rectified Adam)
**Adam with variance rectification**
- ✅ **Algorithm**
- Rectifies variance in early training
- Automatic warmup based on variance
- Reduces need for manual warmup
- ✅ **Features**
- More stable than Adam in early training
- No manual warmup required
- Better convergence on some tasks
---
#### Ranger (RAdam + Lookahead)
**Combination of RAdam and Lookahead**
- ✅ **Features**
- Benefits of both RAdam and Lookahead
- Stable and robust training
- Good default choice
---
### Learning Rate Schedulers
#### Linear Scheduler
- ✅ **Algorithm**
- Warmup: linearly increase from 0 to lr over warmup_steps
- Decay: linearly decrease from lr to 0 over remaining steps
- `lr(t) = lr * min(t / warmup, 1 - (t - warmup) / (total - warmup))`
- ✅ **Use Cases**
- Transformer training
- BERT-style pretraining
- Simple and effective
**Example:**
```rust
use trustformers_optim::schedulers::LinearScheduler;
// Arguments shown positionally; the exact constructor signature may differ.
let scheduler = LinearScheduler::new(
    optimizer,
    10_000,  // warmup steps
    100_000, // total steps
)?;
```
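As a pure function, the schedule itself reduces to the sketch below (independent of the scheduler types above; names illustrative):
```rust
/// Linear warmup from 0 to `base_lr`, then linear decay back to 0.
fn linear_schedule(step: usize, warmup: usize, total: usize, base_lr: f32) -> f32 {
    if step < warmup {
        base_lr * step as f32 / warmup as f32
    } else if step < total {
        base_lr * (total - step) as f32 / (total - warmup) as f32
    } else {
        0.0
    }
}
```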
---
#### Cosine Annealing
- ✅ **Algorithm**
- Warmup: linear increase to lr
- Annealing: cosine decay to min_lr
- `lr(t) = min_lr + 0.5 * (lr - min_lr) * (1 + cos(π * t / T))`
- ✅ **Features**
- Smooth decay
- Popular for vision tasks
- Can use restarts (cosine annealing with warm restarts)
- ✅ **Use Cases**
- Computer vision
- Long training runs
- Fine-tuning
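A minimal sketch of the annealing phase as a pure function (warmup omitted; names illustrative):
```rust
/// Cosine decay from `base_lr` to `min_lr` over `total` steps.
fn cosine_schedule(step: usize, total: usize, base_lr: f32, min_lr: f32) -> f32 {
    let t = (step as f32 / total as f32).min(1.0);
    min_lr + 0.5 * (base_lr - min_lr) * (1.0 + (std::f32::consts::PI * t).cos())
}
```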
---
#### OneCycle Policy
- ✅ **Algorithm**
- Phase 1: Increase lr from max_lr/div_factor to max_lr
- Phase 2: Decrease lr from max_lr to max_lr/final_div_factor
- Also cycles momentum inversely
- Single cycle over entire training
- ✅ **Features**
- Fast convergence
- Discovered empirically
- Works well with SGD
- ✅ **Use Cases**
- Fast training
- Computer vision
- Short training schedules
---
#### Polynomial Decay
- ✅ **Algorithm**
- `lr(t) = lr * (1 - t / T)^power`
- Power typically 1.0 (linear) or 2.0 (quadratic)
- ✅ **Use Cases**
- Object detection (e.g., YOLO)
- Semantic segmentation
- Custom decay schedules
---
#### Step Decay
- ✅ **Algorithm**
- Multiply lr by gamma every step_size steps
- `lr(t) = lr * gamma^floor(t / step_size)`
- ✅ **Use Cases**
- Traditional computer vision
- Milestone-based training
- Simple baseline
---
#### Exponential Decay
- ✅ **Algorithm**
- `lr(t) = lr * gamma^t`
- Continuous exponential decay
- ✅ **Use Cases**
- Reinforcement learning
- Online learning
- Continuous decay needed
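The three decay rules above (polynomial, step, exponential) each fit in a few lines; a sketch as pure functions (names illustrative):
```rust
/// Polynomial: lr * (1 - t/T)^power
fn polynomial_decay(step: usize, total: usize, base_lr: f32, power: f32) -> f32 {
    base_lr * (1.0 - (step as f32 / total as f32).min(1.0)).powf(power)
}

/// Step: lr * gamma^floor(t / step_size)
fn step_decay(step: usize, step_size: usize, base_lr: f32, gamma: f32) -> f32 {
    base_lr * gamma.powi((step / step_size) as i32)
}

/// Exponential: lr * gamma^t
fn exponential_decay(step: usize, base_lr: f32, gamma: f32) -> f32 {
    base_lr * gamma.powi(step as i32)
}
```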
---
## Advanced Features
### Gradient Clipping
- ✅ **By Norm**
- Clip gradient norm to maximum value
- `∇L ← ∇L * max_norm / max(||∇L||, max_norm)`
- Prevents exploding gradients
- ✅ **By Value**
- Clip each gradient element to [-max_val, max_val]
- Element-wise clipping
**Example:**
```rust
optimizer.clip_grad_norm(1.0)?;  // maximum gradient norm
optimizer.clip_grad_value(0.5)?; // maximum absolute gradient value
```
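Under the hood, clipping by norm amounts to the rescaling below; a minimal sketch over a flat gradient buffer (not the crate's implementation):
```rust
/// Rescale gradients only when their global L2 norm exceeds `max_norm`;
/// returns the pre-clipping norm, which is useful for logging.
fn clip_grad_norm(grads: &mut [f32], max_norm: f32) -> f32 {
    let total_norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if total_norm > max_norm {
        let scale = max_norm / (total_norm + 1e-6);
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
    total_norm
}
```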
---
### Weight Decay
- ✅ **L2 Regularization**
- Add to loss: `L_total = L + λ/2 * ||θ||²`
- Gradient includes regularization term
- ✅ **Decoupled Weight Decay**
- Update: `θ ← θ - lr * λ * θ` (separate from gradient)
- Correct for adaptive optimizers (AdamW)
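The difference is easiest to see side by side; a minimal sketch over flat buffers (names illustrative):
```rust
/// L2 regularization: fold lambda * theta into the gradient *before* the optimizer step,
/// so adaptive optimizers rescale it together with the rest of the gradient.
fn l2_into_grad(params: &[f32], grads: &mut [f32], lambda: f32) {
    for (g, p) in grads.iter_mut().zip(params) {
        *g += lambda * p;
    }
}

/// Decoupled weight decay (AdamW-style): shrink the weights directly,
/// untouched by the adaptive rescaling.
fn decoupled_decay(params: &mut [f32], lr: f32, lambda: f32) {
    for p in params.iter_mut() {
        *p -= lr * lambda * *p;
    }
}
```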
---
### Gradient Accumulation
- ✅ **Simulate Large Batches**
- Accumulate gradients over multiple forward/backward passes
- Update only after N accumulations
- Effective batch size = batch_size * accumulation_steps
**Example:**
```rust
for (i, batch) in dataloader.enumerate() {
    let loss = model.forward(batch)?;
    // Optionally divide the loss by accumulation_steps so the accumulated
    // gradient approximates the large-batch mean rather than the sum.
    loss.backward()?;
    if (i + 1) % accumulation_steps == 0 {
        optimizer.step()?;
        optimizer.zero_grad()?;
    }
}
```
---
### Mixed Precision Training
- ✅ **FP16/BF16 Support**
- Forward/backward in half precision
- Optimizer state in FP32
- Loss scaling to prevent underflow
- ✅ **Automatic Mixed Precision (AMP)**
- Dynamic loss scaling
- Automatic precision management
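The loss-scaling idea in its simplest (static) form, sketched with plain `f32` values; the AMP support described above additionally adjusts the scale dynamically:
```rust
/// Static loss scaling: multiply the loss before backprop so small half-precision
/// gradients do not underflow, then divide the FP32 gradients before the step.
const LOSS_SCALE: f32 = 1024.0;

fn scale_loss(loss: f32) -> f32 {
    loss * LOSS_SCALE
}

fn unscale_grads(grads: &mut [f32]) {
    for g in grads.iter_mut() {
        *g /= LOSS_SCALE;
    }
}
```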
---
### Optimizer State Management
- ✅ **Save State**
```rust
optimizer.save_state("checkpoint.pt")?;
```
- ✅ **Load State**
```rust
optimizer.load_state("checkpoint.pt")?;
```
- ✅ **State Dictionary**
- Includes all optimizer buffers
- Learning rate, momentum buffers, variance buffers
- Step count for schedulers
---
### Parameter Groups
- ✅ **Layer-wise Learning Rates**
```rust
let param_groups = vec![
    ParamGroup {
        params: embeddings.parameters(),
        lr: 1e-5,
        weight_decay: 0.01,
    },
    ParamGroup {
        params: encoder.parameters(),
        lr: 1e-4,
        weight_decay: 0.01,
    },
    ParamGroup {
        params: decoder.parameters(),
        lr: 3e-4,
        weight_decay: 0.0,
    },
];
let optimizer = AdamW::new(param_groups)?;
```
- ✅ **Use Cases**
- Fine-tuning (smaller lr for pretrained layers)
- Different regularization per layer
- Discriminative learning rates
---
## Testing
### Test Coverage
- ✅ **417 Unit Tests** - 100% pass rate
- ✅ **Optimizer Convergence** - Verify convergence on toy problems
- ✅ **Scheduler Validation** - Check learning rate schedules
- ✅ **Gradient Clipping** - Verify clipping correctness
- ✅ **State Save/Load** - Round-trip state verification
- ✅ **Memory Leak Detection** - No memory leaks
- ✅ **Numerical Stability** - Edge cases (zero gradients, NaN, Inf)
### Test Categories
1. **Correctness Tests**
- Optimizer updates match reference implementations
- Schedulers produce correct learning rates
- Gradient clipping works as expected
2. **Convergence Tests**
- Optimizers converge on convex problems
- Compare convergence speed across optimizers
- Test on simple neural networks
3. **State Tests**
- Save/load produces identical state
- Resuming training continues correctly
- State dictionary completeness
4. **Edge Case Tests**
- Zero gradients handled correctly
- NaN/Inf gradients detected and handled
- Empty parameter groups
- Single parameter optimization
---
## Known Limitations
- Some very recent optimizers (2024+) not yet implemented
- Distributed optimizer state sharding not fully optimized
---
## Future Enhancements
### High Priority
- Additional cutting-edge optimizers as they emerge
- Enhanced distributed optimizer state management
- Automatic hyperparameter tuning
### Performance
- Further optimization of optimizer state storage
- Fused optimizer kernels for GPU
- Lazy optimizer state allocation
### Features
- More sophisticated schedulers (cyclic with restarts)
- Automatic learning rate finder
- Optimizer surgery (change optimizer mid-training)
---
## Development Guidelines
### Code Standards
- **Use trustformers-core abstractions only** (no external deps directly)
- **File size limit:** <2000 lines per file
- **Error handling:** Use `Result<T, TrustformersError>`
- **Testing:** Convergence tests required for new optimizers
- **Naming:** snake_case for all identifiers
### Adding a New Optimizer
**Checklist:**
1. **Implement Optimizer Trait**
```rust
impl StatefulOptimizer for NewOptimizer {
    fn step(&mut self) -> Result<()> { todo!("apply the update rule to every parameter") }
    fn zero_grad(&mut self) -> Result<()> { todo!("clear gradient buffers") }
    fn state_dict(&self) -> StateDict { todo!("serialize momentum/variance buffers and step count") }
    fn load_state_dict(&mut self, state: StateDict) -> Result<()> { todo!("restore buffers") }
}
```
2. **Add State Buffers**
- Momentum buffers, variance buffers, etc.
- Per-parameter state
3. **Implement Update Rule**
- Follow algorithm from paper
- Handle edge cases (zero gradients, first step, etc.)
4. **Add Tests**
- Convergence test on toy problem
- State save/load test
- Compare with reference implementation
5. **Document**
- Algorithm description
- Hyperparameter recommendations
- Use cases
- Example code
### Build & Test Commands
```bash
# Run all tests
cargo nextest run -p trustformers-optim --all-features
# Test specific optimizer
cargo test -p trustformers-optim test_adam
# Benchmark
cargo bench -p trustformers-optim
# Check compilation
cargo check -p trustformers-optim --all-features
```
---
**Last Updated:** Refactored for alpha.1 release
**Status:** Production-ready optimization
**Test Coverage:** 417 tests, 100% pass rate