# Optimizers — Online Parameter Optimization
Streaming parameter optimization. The user provides gradients,
the library manages the update rule. Designed for background threads,
not the hot path.
| `OnlineGdF64` | O(d) | 8d bytes | Fixed learning rate gradient descent |
| `AdaGradF64` | O(d) | 16d bytes | Per-coordinate adaptive rates |
| `AdamF64` | O(d) | 24d bytes | Adaptive moments + optional weight decay |
All require `alloc`. AdaGrad and Adam additionally require `std|libm`
(for `sqrt`). F64 only — optimization precision matters.
## What They Do
Optimizers adjust a parameter vector to minimize a loss function,
one gradient step at a time. You compute the gradient (how the loss
changes with each parameter), hand it to the optimizer, and it
updates the parameters using its specific strategy.
**Online GD** — simplest. Steps by `learning_rate × gradient`.
Same rate for every parameter, every step.
**AdaGrad** — adapts the learning rate per parameter. Parameters
with frequently large gradients get smaller rates. Parameters
with rare gradients keep large rates.
**Adam** — tracks both momentum (mean of recent gradients) and
adaptive rates (mean of recent squared gradients). Includes
bias correction for early steps. With `weight_decay > 0`, applies
decoupled weight decay (AdamW).
## When to Use Them
**Use an optimizer when:**
- You have a custom loss function (not squared error or cross-entropy)
- You need regularization (AdamW weight decay)
- You're tuning parameters that don't fit the regression/classifier API
**Use LinearRegression/RLS instead when:**
- Your loss is squared error — these are the optimal solution
- You need per-update speed (RLS is specialized, optimizer is general)
**Use LogisticRegression instead when:**
- Your loss is cross-entropy with binary outcomes — hardcoded is faster
**These are NOT hot-path types.** Run them on a background thread.
Accumulate gradients from a batch of observations, then step once.
## How They Work
### Gradient Descent
```
params[i] -= learning_rate * gradient[i]
```
### AdaGrad
```
sum_sq_grad[i] += gradient[i]²
params[i] -= learning_rate * gradient[i] / (sqrt(sum_sq_grad[i]) + epsilon)
```
The accumulated squared gradient makes the effective learning rate
shrink for frequently-updated parameters.
### Adam
```
m[i] = beta1 * m[i] + (1 - beta1) * gradient[i] // momentum
v[i] = beta2 * v[i] + (1 - beta2) * gradient[i]² // adaptive rate
m_hat = m[i] / (1 - beta1^t) // bias correction
v_hat = v[i] / (1 - beta2^t)
params[i] -= learning_rate * m_hat / (sqrt(v_hat) + epsilon)
// Optional weight decay (AdamW):
params[i] -= weight_decay * params[i]
```
## Configuration
### Online GD
```rust
use nexus_stats::learning::OnlineGdF64;
let mut gd = OnlineGdF64::builder()
.dimensions(5)
.learning_rate(0.01)
.build()
.unwrap();
// Compute gradient externally, then step:
gd.step(&gradient).unwrap();
println!("params: {:?}", gd.parameters());
```
### AdaGrad
```rust
use nexus_stats::learning::AdaGradF64;
let mut ag = AdaGradF64::builder()
.dimensions(5)
.learning_rate(0.01)
.epsilon(1e-8) // default
.build()
.unwrap();
ag.step(&gradient).unwrap();
```
### Adam (with optional weight decay)
```rust
use nexus_stats::learning::AdamF64;
// Standard Adam:
let mut adam = AdamF64::builder()
.dimensions(5)
.learning_rate(0.001) // default
.beta1(0.9) // default
.beta2(0.999) // default
.build()
.unwrap();
// AdamW (with weight decay):
let mut adamw = AdamF64::builder()
.dimensions(5)
.weight_decay(0.01)
.build()
.unwrap();
adam.step(&gradient).unwrap();
```
## Examples by Domain
### Monitoring — Adaptive Signal Combination
```rust
// 3 monitoring signals, learn optimal combination weights
let mut optimizer = AdamF64::builder()
.dimensions(3)
.learning_rate(0.001)
.build()
.unwrap();
// Periodically (not every tick):
let features = [signal_a, signal_b, signal_c];
let prediction: f64 = optimizer.parameters().iter()
.zip(&features)
.map(|(w, f)| w * f)
.sum();
// When outcome is observed:
let error = prediction - actual;
let gradient: Vec<f64> = features.iter()
.map(|f| 2.0 * error * f)
.collect();
optimizer.step(&gradient).unwrap();
```
### Networking — Adaptive Threshold Tuning
```rust
// Tune an alerting threshold based on false positive/negative feedback
let mut opt = OnlineGdF64::builder()
.dimensions(1)
.learning_rate(0.001)
.build()
.unwrap();
// On each alert evaluation:
let threshold = opt.parameters()[0];
if was_false_positive {
// Gradient: increase threshold (positive gradient)
opt.step(&[1.0]).unwrap();
} else if was_missed_alert {
// Gradient: decrease threshold (negative gradient)
opt.step(&[-1.0]).unwrap();
}
```
## Error Handling
All `step()` methods return `Result<(), DataError>`. NaN or Inf
in any gradient element is rejected — the parameters are unchanged
and the error is returned. This is a data error (the user's loss
function produced NaN), not a programmer error.
Dimension mismatch (gradient length != dimensions) panics.
## Choosing an Optimizer
| First attempt, no tuning | Adam (defaults work well) |
| Sparse gradients | AdaGrad |
| Need regularization | Adam with weight_decay |
| Simplest possible | OnlineGD |
| Squared error loss | Skip optimizer, use RLS/LinearRegression |
| Binary classification | Skip optimizer, use LogisticRegression |
## Performance
| `OnlineGdF64::step` | ~15 cycles |
| `AdaGradF64::step` | ~30 cycles |
| `AdamF64::step` | ~50 cycles |
These are cheap — the cost is in computing the gradient, not
applying it.