pub struct Muon {
pub lr: f32,
pub momentum: f32,
pub nesterov: bool,
pub weight_decay: f32,
pub ns_steps: u32,
pub ns_coeffs: (f32, f32, f32),
/* private fields */
}Expand description
Muon — Momentum-Orthogonalized-by-Newton-Schulz.
Per-tensor state: one momentum buffer per matrix (half of Adam’s footprint, like Lion).
Fields§
§lr: f32Learning rate. The Newton–Schulz update has roughly unit
Frobenius norm per column, so this is on the same scale as
SGD’s lr — typically 2e-2 to 5e-2.
momentum: f32Polyak momentum coefficient. Default 0.95.
nesterov: boolUse Nesterov lookahead inside the matrix being orthogonalized.
Default true.
weight_decay: f32Decoupled weight-decay coefficient λ. Default 0.0.
ns_steps: u32Newton–Schulz iteration count. 5 is the published default;
3 is enough for most well-conditioned matrices.
ns_coeffs: (f32, f32, f32)(a, b, c) coefficients of the cubic Newton–Schulz iteration
X ← a·X + b·(XXᵀ)X + c·(XXᵀ)²X. Defaults match Jordan et al.
Implementations§
Source§impl Muon
impl Muon
Sourcepub fn new(lr: f32) -> Self
pub fn new(lr: f32) -> Self
Construct with (μ, nesterov, λ, ns_steps) = (0.95, true, 0.0, 5)
and the published NS coefficients.
Sourcepub fn with_momentum(self, mu: f32) -> Self
pub fn with_momentum(self, mu: f32) -> Self
Override the Polyak momentum coefficient.
Sourcepub fn with_weight_decay(self, wd: f32) -> Self
pub fn with_weight_decay(self, wd: f32) -> Self
Override the decoupled-decay coefficient.
Sourcepub fn with_ns_steps(self, n: u32) -> Self
pub fn with_ns_steps(self, n: u32) -> Self
Override the Newton–Schulz iteration count.
Trait Implementations§
Source§impl Optimizer for Muon
impl Optimizer for Muon
fn step(&mut self, name: &str, shape: &[usize], param: &mut [f32], grad: &[f32])
Source§fn end_iteration(&mut self)
fn end_iteration(&mut self)
step], so most implementations leave this a no-op.Source§fn lr_scale(&self, _name: &str) -> f32
fn lr_scale(&self, _name: &str) -> f32
1.0 for every name. Override when wrapping this crate to
support per-name LR schedules (e.g. embedding-vs-attention
splits, or the Gaussian-splat attribute-typed LR setup). The
CPU impls in this crate currently honor this only when the
caller passes a pre-scaled lr for the relevant call —
backends are encouraged to consult it inside their fused
kernel.