aprender 0.31.2

Next-generation ML framework in pure Rust — `cargo install aprender` for the `apr` CLI
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
<!-- PCU: ml-fundamentals-advanced-optimizers | contract: contracts/apr-page-ml-fundamentals-advanced-optimizers-v1.yaml -->
<!-- Example: cargo run -p aprender-core --example none -->
<!-- Status: enforced -->

# Advanced Optimizers Theory

Modern optimizers go beyond vanilla gradient descent by adapting learning rates, incorporating momentum, and using gradient statistics to achieve faster and more stable convergence. This chapter covers state-of-the-art optimization algorithms used in deep learning and machine learning.

## Why Advanced Optimizers?

Standard SGD with momentum works well but has limitations:

1. **Fixed learning rate**: Same η for all parameters
   - Problem: Different parameters may need different learning rates
   - Example: Rare features need larger updates than frequent ones

2. **Manual tuning required**: Finding optimal η is time-consuming
   - Grid search expensive
   - Different datasets need different learning rates

3. **Slow convergence**: Without careful tuning, training can be slow
   - Especially in non-convex landscapes
   - High-dimensional parameter spaces

**Solution**: Adaptive optimizers that automatically adjust learning rates per parameter.

## Optimizer Comparison Table

| Optimizer | Key Feature | Best For | Pros | Cons |
|-----------|-------------|----------|------|------|
| **SGD + Momentum** | Velocity accumulation | General purpose | Simple, well-understood | Requires manual tuning |
| **AdaGrad** | Per-parameter lr | Sparse gradients | Adapts to data | lr decays too aggressively |
| **RMSprop** | Exponential moving average | RNNs, non-stationary | Fixes AdaGrad decay | No bias correction |
| **Adam** | Momentum + RMSprop | Deep learning (default) | Fast, robust | Can overfit on small data |
| **AdamW** | Adam + decoupled weight decay | Transformers | Better generalization | Slightly slower |
| **Nadam** | Adam + Nesterov momentum | Computer vision | Faster convergence | More complex |

## AdaGrad: Adaptive Gradient Algorithm

**Key idea**: Accumulate squared gradients and divide learning rate by their square root, giving smaller updates to frequently updated parameters.

### Algorithm

```
Initialize:
  θ₀ = initial parameters
  G₀ = 0  (accumulated squared gradients)
  η = learning rate (typically 0.01)
  ε = 1e-8 (numerical stability)

For t = 1, 2, ...
  g_t = ∇L(θ_{t-1})             // Compute gradient
  G_t = G_{t-1} + g_t ⊙ g_t      // Accumulate squared gradients
  θ_t = θ_{t-1} - η / √(G_t + ε) ⊙ g_t  // Adaptive update
```

Where **⊙** denotes element-wise multiplication.

### How It Works

**Per-parameter learning rate**:
```
η_i(t) = η / √(Σ(g_i^2) + ε)
                s=1..t
```

- Frequently updated parameters → large accumulated gradient → small effective η
- Infrequently updated parameters → small accumulated gradient → large effective η

### Example

Consider two parameters with gradients:

```
Parameter θ₁: Gradients = [10, 10, 10, 10]  (frequent updates)
Parameter θ₂: Gradients = [1, 0, 0, 1]      (sparse updates)

After 4 iterations (η = 0.1):

θ₁: G = 10² + 10² + 10² + 10² = 400
    Effective η₁ = 0.1 / √400 = 0.1 / 20 = 0.005  (small)

θ₂: G = 1² + 0² + 0² + 1² = 2
    Effective η₂ = 0.1 / √2 = 0.1 / 1.41 ≈ 0.071  (large)
```

**Result**: θ₂ gets ~14x larger updates despite having smaller gradients!

### Advantages

1. **Automatic learning rate adaptation**: No manual tuning per parameter
2. **Great for sparse data**: NLP, recommender systems
3. **Handles different scales**: Features with different ranges

### Disadvantages

1. **Learning rate decay**: Accumulation never decreases
   - Eventually η → 0, stopping learning
   - Problem for deep learning (many iterations)

2. **Requires careful initialization**: Poor initial η can hurt performance

### When to Use

- **Sparse gradients**: NLP (word embeddings), recommender systems
- **Convex optimization**: Guaranteed convergence for convex functions
- **Short training**: If iteration count is small

**Not recommended for**: Deep neural networks (use RMSprop or Adam instead)

## RMSprop: Root Mean Square Propagation

**Key idea**: Fix AdaGrad's aggressive learning rate decay by using exponential moving average instead of sum.

### Algorithm

```
Initialize:
  θ₀ = initial parameters
  v₀ = 0  (moving average of squared gradients)
  η = learning rate (typically 0.001)
  β = decay rate (typically 0.9)
  ε = 1e-8

For t = 1, 2, ...
  g_t = ∇L(θ_{t-1})                    // Compute gradient
  v_t = β·v_{t-1} + (1-β)·(g_t ⊙ g_t)  // Exponential moving average
  θ_t = θ_{t-1} - η / √(v_t + ε) ⊙ g_t  // Adaptive update
```

### Key Difference from AdaGrad

**AdaGrad**: `G_t = G_{t-1} + g_t²` (sum, always increasing)
**RMSprop**: `v_t = β·v_{t-1} + (1-β)·g_t²` (exponential moving average)

The **exponential moving average** forgets old gradients, allowing learning rate to increase again if recent gradients are small.

### Effect of Decay Rate β

```
β = 0.9 (typical):
  - Averages over ~10 iterations
  - Balance between stability and adaptivity

β = 0.99:
  - Averages over ~100 iterations
  - More stable, slower adaptation

β = 0.5:
  - Averages over ~2 iterations
  - Fast adaptation, more noise
```

### Advantages

1. **No learning rate decay problem**: Can train indefinitely
2. **Works well for RNNs**: Handles non-stationary problems
3. **Less sensitive to initialization**: Compared to AdaGrad

### Disadvantages

1. **No bias correction**: Early iterations biased toward 0
2. **Still requires tuning**: η and β hyperparameters

### When to Use

- **RNNs and LSTMs**: Originally designed for this
- **Non-stationary problems**: Changing data distributions
- **Deep learning**: Better than AdaGrad for many epochs

## Adam: Adaptive Moment Estimation

**The most popular optimizer** in modern deep learning. Combines the best ideas from momentum and RMSprop.

### Core Concept

Adam maintains two moving averages:
1. **First moment** (m): Exponential moving average of gradients (momentum)
2. **Second moment** (v): Exponential moving average of squared gradients (RMSprop)

### Algorithm

```
Initialize:
  θ₀ = initial parameters
  m₀ = 0  (first moment: mean gradient)
  v₀ = 0  (second moment: uncentered variance)
  η = 0.001  (learning rate)
  β₁ = 0.9   (exponential decay for first moment)
  β₂ = 0.999 (exponential decay for second moment)
  ε = 1e-8

For t = 1, 2, ...
  g_t = ∇L(θ_{t-1})                     // Gradient

  m_t = β₁·m_{t-1} + (1-β₁)·g_t         // Update first moment
  v_t = β₂·v_{t-1} + (1-β₂)·(g_t ⊙ g_t)  // Update second moment

  m̂_t = m_t / (1 - β₁^t)                // Bias correction for m
  v̂_t = v_t / (1 - β₂^t)                // Bias correction for v

  θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε)  // Parameter update
```

### Why Bias Correction?

Initially, m and v are zero. Exponential moving averages are biased toward zero at the start.

**Example** (β₁ = 0.9, g₁ = 1.0):
```
Without bias correction:
  m₁ = 0.9 × 0 + 0.1 × 1.0 = 0.1  (underestimates true mean of 1.0)

With bias correction:
  m̂₁ = 0.1 / (1 - 0.9¹) = 0.1 / 0.1 = 1.0  (correct!)
```

The correction factor `1 - β^t` approaches 1 as t increases, so correction matters most early in training.

### Hyperparameters

**Default values** (from paper, work well in practice):
- **η = 0.001**: Learning rate (most important to tune)
- **β₁ = 0.9**: First moment decay (rarely changed)
- **β₂ = 0.999**: Second moment decay (rarely changed)
- **ε = 1e-8**: Numerical stability

**Tuning guidelines**:
1. Start with defaults
2. If unstable: reduce η to 0.0001
3. If slow: increase η to 0.01
4. Adjust β₁ for more/less momentum (rarely needed)

### aprender Implementation

```rust
use aprender::optim::{Adam, Optimizer};

// Create Adam optimizer with default hyperparameters
let mut adam = Adam::new(0.001)  // learning rate
    .with_beta1(0.9)             // optional: momentum coefficient
    .with_beta2(0.999)           // optional: RMSprop coefficient
    .with_epsilon(1e-8);         // optional: numerical stability

// Training loop
for epoch in 0..num_epochs {
    for batch in data_loader {
        // Forward pass
        let predictions = model.predict(&batch.x);
        let loss = loss_fn(predictions, batch.y);

        // Compute gradients
        let grads = compute_gradients(&model, &batch);

        // Adam step (handles momentum + adaptive lr internally)
        adam.step(&mut model.params, &grads);
    }
}
```

**Key methods**:
- `Adam::new(η)`: Create with learning rate
- `with_beta1(β₁)`, `with_beta2(β₂)`: Set moment decay rates
- `step(&mut params, &grads)`: Perform one update step
- `reset()`: Reset moment buffers (for multiple training runs)

### Advantages

1. **Robust**: Works well with default hyperparameters
2. **Fast convergence**: Combines momentum + adaptive lr
3. **Memory efficient**: Only 2x parameter memory (m and v)
4. **Well-studied**: Extensive empirical validation

### Disadvantages

1. **Can overfit**: On small datasets or with insufficient regularization
2. **Generalization**: Sometimes SGD with momentum generalizes better
3. **Memory overhead**: 2x parameter count

### When to Use

- **Default choice**: For most deep learning problems
- **Fast prototyping**: Converges quickly, minimal tuning
- **Large-scale training**: Handles high-dimensional problems well

**When to avoid**:
- Very small datasets (<1000 samples): Try SGD + momentum
- Need best generalization: Consider SGD with learning rate schedule

## AdamW: Adam with Decoupled Weight Decay

**Problem with Adam**: Weight decay (L2 regularization) interacts badly with adaptive learning rates.

**Solution**: Decouple weight decay from gradient-based optimization.

### Standard Adam with Weight Decay (Wrong)

```
g_t = ∇L(θ_{t-1}) + λ·θ_{t-1}  // Add L2 penalty to gradient
// ... normal Adam update with modified gradient
```

**Problem**: Weight decay gets adapted by second moment estimate, weakening regularization.

### AdamW (Correct)

```
// Normal Adam update (no λ in gradient)
m_t = β₁·m_{t-1} + (1-β₁)·g_t
v_t = β₂·v_{t-1} + (1-β₂)·(g_t ⊙ g_t)
m̂_t = m_t / (1 - β₁^t)
v̂_t = v_t / (1 - β₂^t)

// Separate weight decay step
θ_t = θ_{t-1} - η · (m̂_t / (√v̂_t + ε) + λ·θ_{t-1})
```

Weight decay applied directly to parameters, not through adaptive learning rates.

### When to Use

- **Transformers**: Essential for BERT, GPT models
- **Large models**: Better generalization on big networks
- **Transfer learning**: Fine-tuning pre-trained models

**Hyperparameters**:
- Same as Adam, plus:
- **λ = 0.01**: Weight decay coefficient (typical)

## Optimizer Selection Guide

### Decision Tree

```
Start
  │
  ├─ Need fast prototyping?
  │    └─ YES → Adam (default: η=0.001)
  │
  ├─ Training RNN/LSTM?
  │    └─ YES → RMSprop (default: η=0.001, β=0.9)
  │
  ├─ Working with transformers?
  │    └─ YES → AdamW (η=0.001, λ=0.01)
  │
  ├─ Sparse gradients (NLP embeddings)?
  │    └─ YES → AdaGrad (η=0.01)
  │
  ├─ Need best generalization?
  │    └─ YES → SGD + momentum (η=0.1, β=0.9) + lr schedule
  │
  └─ Small dataset (<1000 samples)?
       └─ YES → SGD + momentum (less overfitting)
```

### Practical Recommendations

| Task | Recommended Optimizer | Learning Rate | Notes |
|------|----------------------|---------------|-------|
| Image classification (CNN) | Adam or SGD+momentum | 0.001 (Adam), 0.1 (SGD) | SGD often better final accuracy |
| NLP (word embeddings) | AdaGrad or Adam | 0.01 (AdaGrad), 0.001 (Adam) | AdaGrad for sparse features |
| RNN/LSTM | RMSprop or Adam | 0.001 | RMSprop traditional choice |
| Transformers | AdamW | 0.0001-0.001 | Essential for BERT, GPT |
| Small dataset | SGD + momentum | 0.01-0.1 | Less prone to overfitting |
| Reinforcement learning | Adam or RMSprop | 0.0001-0.001 | Non-stationary problem |

## Learning Rate Schedules

Even with adaptive optimizers, **learning rate schedules** improve performance.

### 1. Step Decay

Reduce η by factor every K epochs:

```rust
let initial_lr = 0.001;
let decay_factor = 0.1;
let decay_epochs = 30;

for epoch in 0..num_epochs {
    let lr = initial_lr * decay_factor.powi((epoch / decay_epochs) as i32);
    let mut adam = Adam::new(lr);
    // ... training
}
```

### 2. Exponential Decay

Smooth exponential reduction:

```rust
let initial_lr = 0.001;
let decay_rate = 0.96;

for epoch in 0..num_epochs {
    let lr = initial_lr * decay_rate.powi(epoch as i32);
    let mut adam = Adam::new(lr);
    // ... training
}
```

### 3. Cosine Annealing

Smooth reduction following cosine curve:

```rust
use std::f32::consts::PI;

let lr_max = 0.001;
let lr_min = 0.00001;
let T_max = 100; // periods

for epoch in 0..num_epochs {
    let lr = lr_min + 0.5 * (lr_max - lr_min) *
             (1.0 + f32::cos(PI * (epoch as f32) / (T_max as f32)));
    let mut adam = Adam::new(lr);
    // ... training
}
```

### 4. Warm-up + Decay

Start small, increase, then decay (used in transformers):

```rust
fn learning_rate_schedule(step: usize, d_model: usize, warmup_steps: usize) -> f32 {
    let d_model = d_model as f32;
    let step = step as f32;
    let warmup_steps = warmup_steps as f32;

    let arg1 = 1.0 / step.sqrt();
    let arg2 = step * warmup_steps.powf(-1.5);

    d_model.powf(-0.5) * arg1.min(arg2)
}
```

## Comparison: SGD vs Adam

### When SGD + Momentum is Better

**Advantages**:
- Better final generalization (lower test error)
- Flatter minima (more robust to perturbations)
- Less memory (no moment estimates)

**Requirements**:
- Careful learning rate tuning
- Learning rate schedule essential
- More training time may be needed

### When Adam is Better

**Advantages**:
- Faster initial convergence
- Minimal hyperparameter tuning
- Works across many problem types

**Trade-offs**:
- Can overfit more easily
- May find sharper minima
- Slightly worse generalization on some tasks

### Empirical Rule

**Adam for**:
- Fast prototyping and experimentation
- Baseline models
- Large-scale problems (many parameters)

**SGD + momentum for**:
- Final production models (after tuning)
- When computational budget allows careful tuning
- Small to medium datasets

## Debugging Optimizer Issues

### Loss Not Decreasing

**Possible causes**:
1. Learning rate too small
   - **Fix**: Increase η by 10x
2. Vanishing gradients
   - **Fix**: Check gradient norms, adjust architecture
3. Bug in gradient computation
   - **Fix**: Use gradient checking

### Loss Exploding (NaN)

**Possible causes**:
1. Learning rate too large
   - **Fix**: Reduce η by 10x
2. Gradient explosion
   - **Fix**: Gradient clipping, better initialization

### Slow Convergence

**Possible causes**:
1. Poor learning rate
   - **Fix**: Try different optimizer (Adam if using SGD)
2. No momentum
   - **Fix**: Add momentum (β=0.9)
3. Suboptimal batch size
   - **Fix**: Try 32, 64, 128

### Overfitting

**Possible causes**:
1. Optimizer too aggressive (Adam on small data)
   - **Fix**: Switch to SGD + momentum
2. No regularization
   - **Fix**: Add weight decay (AdamW)

## aprender Optimizer Example

```rust
use aprender::optim::{Adam, SGD, Optimizer};
use aprender::linear_model::LogisticRegression;
use aprender::prelude::*;

// Example: Comparing Adam vs SGD
fn compare_optimizers(x_train: &Matrix<f32>, y_train: &Vector<i32>) {
    // Optimizer 1: Adam (fast convergence)
    let mut model_adam = LogisticRegression::new();
    let mut adam = Adam::new(0.001);

    println!("Training with Adam...");
    for epoch in 0..50 {
        let loss = train_epoch(&mut model_adam, x_train, y_train, &mut adam);
        if epoch % 10 == 0 {
            println!("  Epoch {}: Loss = {:.4}", epoch, loss);
        }
    }

    // Optimizer 2: SGD + momentum (better generalization)
    let mut model_sgd = LogisticRegression::new();
    let mut sgd = SGD::new(0.1).with_momentum(0.9);

    println!("\nTraining with SGD + Momentum...");
    for epoch in 0..50 {
        let loss = train_epoch(&mut model_sgd, x_train, y_train, &mut sgd);
        if epoch % 10 == 0 {
            println!("  Epoch {}: Loss = {:.4}", epoch, loss);
        }
    }
}

fn train_epoch<O: Optimizer>(
    model: &mut LogisticRegression,
    x: &Matrix<f32>,
    y: &Vector<i32>,
    optimizer: &mut O,
) -> f32 {
    // Compute loss and gradients
    let predictions = model.predict_proba(x);
    let loss = compute_cross_entropy_loss(&predictions, y);
    let grads = compute_gradients(model, x, y);

    // Update parameters
    optimizer.step(&mut model.coefficients_mut(), &grads);

    loss
}
```

## Further Reading

**Seminal Papers**:
- **Adam**: Kingma & Ba (2015). "Adam: A Method for Stochastic Optimization"
- **AdamW**: Loshchilov & Hutter (2019). "Decoupled Weight Decay Regularization"
- **RMSprop**: Hinton (unpublished, Coursera lecture)
- **AdaGrad**: Duchi et al. (2011). "Adaptive Subgradient Methods"

**Practical Guides**:
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms"
- CS231n Stanford: Optimization notes

## Related Chapters

- [Gradient Descent Theory]./gradient-descent.md - Foundation for all optimizers
- [Optimizer Demo]../examples/optimizer-demo.md - Visual comparison of SGD and Adam
- [Regularized Regression]../examples/regularized-regression.md - Coordinate descent alternative

## Summary

| Optimizer | Core Innovation | When to Use | aprender Support |
|-----------|----------------|-------------|------------------|
| **AdaGrad** | Per-parameter learning rates | Sparse gradients, convex problems | Not yet (v0.5.0) |
| **RMSprop** | Exponential moving average of squared gradients | RNNs, non-stationary | Not yet (v0.5.0) |
| **Adam** | Momentum + RMSprop + bias correction | Default choice, deep learning | ✅ Implemented |
| **AdamW** | Adam + decoupled weight decay | Transformers, large models | Not yet (v0.5.0) |

**Key Takeaways**:
1. **Adam is the default** for most deep learning: fast, robust, minimal tuning
2. **SGD + momentum** often achieves better final accuracy with proper tuning
3. **Learning rate schedules** improve all optimizers
4. **AdamW essential** for training transformers
5. **Monitor convergence**: Loss curves reveal optimizer issues

Modern optimizers dramatically accelerate machine learning by adapting learning rates automatically. Understanding their trade-offs enables choosing the right tool for each problem.