pub struct Sophia {
pub lr: f32,
pub beta1: f32,
pub beta2: f32,
pub gamma: f32,
pub rho: f32,
pub eps: f32,
pub weight_decay: f32,
/* private fields */
}Expand description
Sophia-H — Hessian-diagonal second-order optimizer.
Fields§
§lr: f32Learning rate. Typically slightly larger than the AdamW LR you’d use on the same model, because the clip bounds the step.
beta1: f32First-moment EMA decay β₁. Default 0.965.
beta2: f32Hessian-diagonal EMA decay β₂. Default 0.99.
gamma: f32Hessian scale γ (Liu et al. default 0.01). Multiplies the
Hessian estimate before forming the denominator.
rho: f32Per-coordinate clip threshold ρ. Default 0.04 — the
dimensionless cap on each step’s magnitude.
eps: f32Denominator floor. Default 1e-12.
weight_decay: f32Decoupled weight-decay coefficient λ. Default 0.1 (large by
AdamW standards — Sophia tolerates more decay).
Implementations§
Source§impl Sophia
impl Sophia
Sourcepub fn new(lr: f32) -> Self
pub fn new(lr: f32) -> Self
Construct with (β₁, β₂, γ, ρ, ε, λ) = (0.965, 0.99, 0.01, 0.04, 1e-12, 0.1).
Sourcepub fn with_betas(self, b1: f32, b2: f32) -> Self
pub fn with_betas(self, b1: f32, b2: f32) -> Self
Override (β₁, β₂).
Sourcepub fn with_weight_decay(self, wd: f32) -> Self
pub fn with_weight_decay(self, wd: f32) -> Self
Override the decoupled-decay coefficient.
Sourcepub fn update_hessian(&mut self, name: &str, h_hat: &[f32])
pub fn update_hessian(&mut self, name: &str, h_hat: &[f32])
Update the diagonal-Hessian estimate for parameter name.
h_hat should be a fresh estimate (typically H_diag from a
Hutchinson estimator or g² from a Gauss-Newton approximation).
Trait Implementations§
Source§impl Optimizer for Sophia
impl Optimizer for Sophia
fn step( &mut self, name: &str, _shape: &[usize], param: &mut [f32], grad: &[f32], )
Source§fn end_iteration(&mut self)
fn end_iteration(&mut self)
step], so most implementations leave this a no-op.Source§fn lr_scale(&self, _name: &str) -> f32
fn lr_scale(&self, _name: &str) -> f32
1.0 for every name. Override when wrapping this crate to
support per-name LR schedules (e.g. embedding-vs-attention
splits, or the Gaussian-splat attribute-typed LR setup). The
CPU impls in this crate currently honor this only when the
caller passes a pre-scaled lr for the relevant call —
backends are encouraged to consult it inside their fused
kernel.