# Gradient Descent Theory
Gradient descent is the fundamental optimization algorithm used to train machine learning models. It iteratively adjusts model parameters to minimize a loss function by following the direction of steepest descent.
## Mathematical Foundation
### The Core Idea
Given a differentiable loss function **L(θ)** where **θ** represents model parameters, gradient descent finds parameters that minimize the loss:
```
θ* = argmin L(θ)
θ
```
The algorithm works by repeatedly taking steps proportional to the **negative gradient** of the loss function:
```
θ(t+1) = θ(t) - η ∇L(θ(t))
```
Where:
- **θ(t)**: Parameters at iteration t
- **η**: Learning rate (step size)
- **∇L(θ(t))**: Gradient of loss with respect to parameters
### Why the Negative Gradient?
The gradient **∇L(θ)** points in the direction of **steepest ascent** (maximum increase in loss). By moving in the **negative gradient direction**, we move toward the steepest descent (minimum loss).
**Intuition**: Imagine standing on a mountain in thick fog. You can feel the slope beneath your feet but can't see the valley. Gradient descent is like repeatedly taking a step in the direction that slopes most steeply downward.
## Variants of Gradient Descent
### 1. Batch Gradient Descent (BGD)
Computes the gradient using the **entire training dataset**:
```
∇L(θ) = (1/N) Σ ∇L_i(θ)
i=1..N
```
**Advantages**:
- Stable convergence (smooth trajectory)
- Guaranteed to converge to global minimum (convex functions)
- Theoretical guarantees
**Disadvantages**:
- Slow for large datasets (must process all samples)
- Memory intensive
- May converge to poor local minima (non-convex functions)
**When to use**: Small datasets (N < 10,000), convex optimization problems
### 2. Stochastic Gradient Descent (SGD)
Computes gradient using a **single random sample** at each iteration:
```
∇L(θ) ≈ ∇L_i(θ) where i ~ Uniform(1..N)
```
**Advantages**:
- Fast updates (one sample per iteration)
- Can escape shallow local minima (noise helps exploration)
- Memory efficient
- Online learning capable
**Disadvantages**:
- Noisy convergence (zig-zagging trajectory)
- May not converge exactly to minimum
- Requires learning rate decay
**When to use**: Large datasets, online learning, non-convex optimization
**aprender implementation**:
```rust
use aprender::optim::SGD;
let mut optimizer = SGD::new(0.01) // learning rate = 0.01
.with_momentum(0.9); // momentum coefficient
// In training loop:
let gradients = compute_gradients(¶ms, &data);
optimizer.step(&mut params, &gradients);
```
### 3. Mini-Batch Gradient Descent
Computes gradient using a **small batch of samples** (typically 32-256):
```
∇L(θ) ≈ (1/B) Σ ∇L_i(θ) where B << N
i∈batch
```
**Advantages**:
- Balance between BGD stability and SGD speed
- Vectorized operations (GPU/SIMD acceleration)
- Reduced variance compared to SGD
- Memory efficient
**Disadvantages**:
- Batch size is a hyperparameter to tune
- Still has some noise
**When to use**: Default choice for most ML problems, deep learning
**Batch size guidelines**:
- Small batches (32-64): Better generalization, more noise
- Large batches (128-512): Faster convergence, more stable
- Powers of 2: Better hardware utilization
## The Learning Rate
The learning rate **η** is the **most critical hyperparameter** in gradient descent.
### Too Small Learning Rate
```
η = 0.001 (very small)
Loss over time:
1000 ┤
900 ┤
800 ┤
700 ┤●
600 ┤ ●
500 ┤ ●
400 ┤ ●●●●●●●●●●●●●●● ← Slow convergence
└─────────────────────→
Iterations (10,000)
```
**Problem**: Training is very slow, may not converge within time budget.
### Too Large Learning Rate
```
η = 1.0 (very large)
Loss over time:
1000 ┤ ●
900 ┤ ● ●
800 ┤ ● ●
700 ┤ ● ● ← Oscillation
600 ┤● ●
└──────────→
Iterations
```
**Problem**: Loss oscillates or diverges, never converges to minimum.
### Optimal Learning Rate
```
η = 0.1 (just right)
Loss over time:
1000 ┤●
800 ┤ ●●
600 ┤ ●●●
400 ┤ ●●●● ← Smooth, fast convergence
200 ┤ ●●●●
└───────────────→
Iterations
```
**Guidelines for choosing η**:
1. Start with **η = 0.1** and adjust by factors of 10
2. Use **learning rate schedules** (decay over time)
3. Monitor loss: if exploding → reduce η; if stagnating → increase η
4. Try **adaptive methods** (Adam, RMSprop) that auto-tune η
## Convergence Analysis
### Convex Functions
For **convex loss functions** (e.g., linear regression with MSE), gradient descent with fixed learning rate converges to the **global minimum**:
```
L(θ(t)) - L(θ*) ≤ C / t
```
Where **C** is a constant. The gap to the optimal loss decreases as **1/t**.
**Convergence rate**: O(1/t) for fixed learning rate
### Non-Convex Functions
For **non-convex functions** (e.g., neural networks), gradient descent may converge to:
- Local minimum
- Saddle point
- Plateau region
**No guarantees** of finding the global minimum, but SGD's noise helps escape poor local minima.
### Stopping Criteria
**When to stop iterating?**
3. **Maximum iterations**: Stop after T iterations
- Prevents infinite loops
4. **Validation loss**: Stop when validation loss stops improving
- Prevents overfitting
**aprender example**:
```rust
// SGD with convergence monitoring
let mut optimizer = SGD::new(0.01);
let mut prev_loss = f32::INFINITY;
let tolerance = 1e-4;
for epoch in 0..max_epochs {
let loss = compute_loss(&model, &data);
// Check convergence
if (prev_loss - loss).abs() < tolerance {
println!("Converged at epoch {}", epoch);
break;
}
let gradients = compute_gradients(&model, &data);
optimizer.step(&mut model.params, &gradients);
prev_loss = loss;
}
```
## Common Pitfalls and Solutions
### 1. Exploding Gradients
**Problem**: Gradients become very large, causing parameters to explode.
**Symptoms**:
- Loss becomes NaN or infinity
- Parameters grow to extreme values
- Occurs in deep networks or RNNs
**Solutions**:
- Reduce learning rate
- Gradient clipping: `g = min(g, threshold)`
- Use batch normalization
- Better initialization (Xavier, He)
### 2. Vanishing Gradients
**Problem**: Gradients become very small, preventing parameter updates.
**Symptoms**:
- Loss stops decreasing but hasn't converged
- Parameters barely change
- Occurs in very deep networks
**Solutions**:
- Use ReLU activation (instead of sigmoid/tanh)
- Skip connections (ResNet architecture)
- Batch normalization
- Better initialization
### 3. Learning Rate Decay
**Strategy**: Start with large learning rate, gradually decrease it.
**Common schedules**:
```rust
// 1. Step decay: Reduce by factor every K epochs
η(t) = η₀ × 0.1^(floor(t / K))
// 2. Exponential decay: Smooth reduction
η(t) = η₀ × e^(-λt)
// 3. 1/t decay: Theoretical convergence guarantee
η(t) = η₀ / (1 + λt)
// 4. Cosine annealing: Cyclical with restarts
η(t) = η_min + 0.5(η_max - η_min)(1 + cos(πt/T))
```
**aprender pattern** (manual implementation):
```rust
let initial_lr = 0.1;
let decay_rate = 0.95;
for epoch in 0..num_epochs {
let lr = initial_lr * decay_rate.powi(epoch as i32);
let mut optimizer = SGD::new(lr);
// Training step
optimizer.step(&mut params, &gradients);
}
```
### 4. Saddle Points
**Problem**: Gradient is zero but point is not a minimum.
```
Surface shape at saddle point:
╱╲ (upward in one direction)
╱ ╲
╱ ╲
╱______╲ (downward in another)
```
**Solutions**:
- Add momentum (helps escape saddle points)
- Use SGD noise to explore
- Second-order methods (Newton, L-BFGS)
## Momentum Enhancement
Standard SGD can be slow in regions with:
- High curvature (steep in some directions, flat in others)
- Noisy gradients
**Momentum** accelerates convergence by accumulating past gradients:
```
v(t) = βv(t-1) + ∇L(θ(t)) // Velocity accumulation
θ(t+1) = θ(t) - η v(t) // Update with velocity
```
Where **β ∈ [0, 1]** is the momentum coefficient (typically 0.9).
**Effect**:
- Smooths out noisy gradients
- Accelerates in consistent directions
- Dampens oscillations
**Analogy**: A ball rolling down a hill builds momentum, doesn't stop at small bumps.
**aprender implementation**:
```rust
let mut optimizer = SGD::new(0.01)
.with_momentum(0.9); // β = 0.9
// Momentum is applied automatically in step()
optimizer.step(&mut params, &gradients);
```
## Practical Guidelines
### Choosing Gradient Descent Variant
| N < 1,000 | Batch GD | Fast enough, stable convergence |
| N = 1K-100K | Mini-batch GD (32-128) | Good balance |
| N > 100K | Mini-batch GD (128-512) | Leverage vectorization |
| Streaming data | SGD | Online learning required |
### Hyperparameter Tuning Checklist
1. **Learning rate η**:
- Start: 0.1
- Grid search: [0.001, 0.01, 0.1, 1.0]
- Use learning rate finder
2. **Momentum β**:
- Default: 0.9
- Range: [0.5, 0.9, 0.99]
3. **Batch size B**:
- Default: 32 or 64
- Range: [16, 32, 64, 128, 256]
- Powers of 2 for hardware efficiency
4. **Learning rate schedule**:
- Option 1: Fixed (simple baseline)
- Option 2: Step decay every 10-30 epochs
- Option 3: Cosine annealing (state-of-the-art)
### Debugging Convergence Issues
**Loss increasing**: Learning rate too large
→ Reduce η by 10x
**Loss stagnating**: Learning rate too small or stuck in local minimum
→ Increase η by 2x or add momentum
**Loss NaN**: Exploding gradients
→ Reduce η, clip gradients, check data preprocessing
**Slow convergence**: Poor learning rate or no momentum
→ Use adaptive optimizer (Adam), add momentum
## Connection to aprender
The `aprender::optim::SGD` implementation provides:
```rust
use aprender::optim::{SGD, Optimizer};
// Create SGD optimizer
let mut sgd = SGD::new(learning_rate)
.with_momentum(momentum_coef);
// In training loop:
for epoch in 0..num_epochs {
for batch in data_loader {
// 1. Forward pass
let predictions = model.predict(&batch.x);
// 2. Compute loss and gradients
let loss = loss_fn(predictions, batch.y);
let grads = compute_gradients(&model, &batch);
// 3. Update parameters using gradient descent
sgd.step(&mut model.params, &grads);
}
}
```
**Key methods**:
- `SGD::new(η)`: Create optimizer with learning rate
- `with_momentum(β)`: Add momentum coefficient
- `step(&mut params, &grads)`: Perform one gradient descent step
- `reset()`: Reset momentum buffers
## Further Reading
- **Theory**: Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent"
- **Momentum**: Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods"
- **Adaptive methods**: See [Advanced Optimizers Theory](./advanced-optimizers.md)
## Related Examples
- [Optimizer Demo](../examples/optimizer-demo.md) - Visualizing SGD with momentum
- [Logistic Regression](../examples/logistic-regression.md) - SGD for classification
- [Regularized Regression](../examples/regularized-regression.md) - Coordinate descent vs SGD
## Summary
| **Core algorithm** | θ(t+1) = θ(t) - η ∇L(θ(t)) |
| **Learning rate** | Most critical hyperparameter; start with 0.1 |
| **Variants** | Batch (stable), SGD (fast), Mini-batch (best of both) |
| **Momentum** | Accelerates convergence, smooths gradients |
| **Convergence** | Guaranteed for convex functions with proper η |
| **Debugging** | Loss ↑ → reduce η; Loss flat → increase η or add momentum |
Gradient descent is the workhorse of machine learning optimization. Understanding its variants, hyperparameters, and convergence properties is essential for training effective models.