Information Geometry for Attention
Natural gradient methods using the Fisher information metric.
§Key Concepts
- Fisher Metric: F = diag(p) - p*p^T on probability simplex
- Natural Gradient: Solve F * delta = grad, then update params -= lr * delta
- Conjugate Gradient: Iterative solver for the Fisher system that needs only matrix-vector products with F
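The three concepts above can be sketched in a few dozen lines. This is a minimal illustration, not this module's implementation: the function names (`fisher_vec`, `solve_natural_gradient`, `natural_gradient_step`) and the `damping`, `max_iters`, and `tol` parameters are assumptions made for the example. Damping is included because F = diag(p) - p*p^T is singular (F applied to the all-ones vector is zero), so the sketch solves (F + damping * I) * delta = grad instead.

```rust
/// Numerically stable softmax over a slice of logits.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Computes F * v for F = diag(p) - p * p^T without forming the matrix:
/// (F v)_i = p_i * (v_i - <p, v>).
fn fisher_vec(p: &[f64], v: &[f64]) -> Vec<f64> {
    let pv: f64 = dot(p, v);
    p.iter().zip(v).map(|(pi, vi)| pi * (vi - pv)).collect()
}

/// Conjugate gradient solve of (F + damping * I) * delta = grad.
/// `damping`, `max_iters`, and `tol` are hypothetical parameters for this sketch.
fn solve_natural_gradient(
    p: &[f64],
    grad: &[f64],
    damping: f64,
    max_iters: usize,
    tol: f64,
) -> Vec<f64> {
    let n = grad.len();
    let mut delta = vec![0.0; n];
    let mut r = grad.to_vec(); // residual for the initial guess delta = 0
    let mut d = r.clone(); // search direction
    let mut rs_old = dot(&r, &r);
    for _ in 0..max_iters {
        if rs_old.sqrt() < tol {
            break;
        }
        // Apply the damped Fisher operator to the search direction.
        let mut ad = fisher_vec(p, &d);
        for i in 0..n {
            ad[i] += damping * d[i];
        }
        let curvature = dot(&d, &ad);
        if curvature <= 0.0 {
            break; // only possible without damping; stop rather than divide
        }
        let alpha = rs_old / curvature;
        for i in 0..n {
            delta[i] += alpha * d[i];
            r[i] -= alpha * ad[i];
        }
        let rs_new = dot(&r, &r);
        let beta = rs_new / rs_old;
        for i in 0..n {
            d[i] = r[i] + beta * d[i];
        }
        rs_old = rs_new;
    }
    delta
}

/// One update: params -= lr * delta, where params are softmax logits.
fn natural_gradient_step(logits: &mut [f64], grad: &[f64], lr: f64, damping: f64) {
    let p = softmax(logits);
    let delta = solve_natural_gradient(&p, grad, damping, 50, 1e-12);
    for (z, d) in logits.iter_mut().zip(&delta) {
        *z -= lr * d;
    }
}
```

Calling `natural_gradient_step(&mut logits, &grad, 0.1, 1e-3)` would apply one damped natural gradient update in place. Using the matrix-free product in `fisher_vec` keeps each conjugate gradient iteration at O(n) cost rather than the O(n^2) of a dense matrix-vector product.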
§Use Cases
- Training attention weights with proper geometry
- Routing probabilities in MoE
- Softmax logit optimization
Structs§
- FisherConfig - Fisher metric configuration
- FisherMetric - Fisher metric operations
- NaturalGradient - Natural gradient optimizer
- NaturalGradientConfig - Natural gradient configuration