rlx-optim
RLX training-step optimizers. Host-side f32 step functions for the
optimizer families surveyed in "A Systematic Review of Optimization
Algorithms for Modern Deep Learning" (arXiv:2509.02046v1).
| Struct | Family | Reference |
|---|---|---|
| Sgd | SGD ± momentum / Nesterov | Polyak '64 / Nesterov '83 |
| Adam | Adam | Kingma & Ba 2014 |
| AdamW | AdamW (decoupled decay) | Loshchilov & Hutter 2017 |
| NAdamW | Nesterov AdamW | Dozat 2016 + AdamW |
| RAdam | Rectified Adam | Liu et al. 2019 |
| QHAdamW | Quasi-hyperbolic AdamW | Ma & Yarats 2019 |
| Lamb | LAMB (layer-wise adaptive) | You et al. 2019 |
| Adafactor | Adafactor (factored 2nd moment) | Shazeer & Stern 2018 |
| Lion | Lion (sign of EMA) | Chen et al. 2023 |
| Soap | SOAP (Shampoo-in-Adam-basis) | Vyas et al. 2024 |
| KronPsgd | Kron / PSGD | Li 2018 |
| Muon | Muon (Newton–Schulz orthogonal) | Jordan et al. 2024 |
| Sophia | Sophia-H (diagonal-Hessian) | Liu et al. 2023 |
| Mars | MARS (variance-reduced) | Yuan et al. 2024 |
Usage
use ;
let mut opt = new.with_weight_decay;
let shape = ;
let mut w = vec!;
let g = vec!;
for _ in 0..100
Per-parameter moments are keyed by name, so one optimizer instance
holds the state for every tensor in a model. Matrix-aware
optimizers (Adafactor, SOAP, Muon, Kron-PSGD) look at shape and fall
back to a plain elementwise rule for 1-D / higher-rank tensors.
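To make the name-keyed state concrete, here is a minimal, self-contained sketch of an AdamW-style step. It is not this crate's source: the trait shape is inferred from the (name, shape, &mut [f32], &[f32]) signature described in this README, and `SketchAdamW`, `Moments`, `tracked_tensors`, and the default hyperparameters are hypothetical names and values chosen for illustration.

```rust
use std::collections::HashMap;

/// Assumed trait shape, inferred from the README; the real trait may differ.
pub trait Optimizer {
    fn step(&mut self, name: &str, shape: &[usize], w: &mut [f32], g: &[f32]);
}

/// Per-tensor state: first moment, second moment, and step counter.
struct Moments {
    m: Vec<f32>,
    v: Vec<f32>,
    t: u32,
}

/// Illustrative AdamW (hypothetical type, not the crate's `AdamW`).
pub struct SketchAdamW {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    weight_decay: f32,
    state: HashMap<String, Moments>, // one entry per parameter name
}

impl SketchAdamW {
    pub fn new(lr: f32) -> Self {
        Self { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, weight_decay: 0.0, state: HashMap::new() }
    }

    pub fn with_weight_decay(mut self, wd: f32) -> Self {
        self.weight_decay = wd;
        self
    }

    /// Number of distinct tensors this instance holds state for.
    pub fn tracked_tensors(&self) -> usize {
        self.state.len()
    }
}

impl Optimizer for SketchAdamW {
    fn step(&mut self, name: &str, shape: &[usize], w: &mut [f32], g: &[f32]) {
        let n: usize = shape.iter().product();
        assert_eq!(w.len(), n);
        assert_eq!(g.len(), n);
        // Moments are created lazily the first time a name is seen.
        let st = self.state.entry(name.to_string()).or_insert_with(|| Moments {
            m: vec![0.0; n],
            v: vec![0.0; n],
            t: 0,
        });
        st.t += 1;
        let bc1 = 1.0 - self.beta1.powi(st.t as i32); // bias corrections
        let bc2 = 1.0 - self.beta2.powi(st.t as i32);
        for i in 0..n {
            st.m[i] = self.beta1 * st.m[i] + (1.0 - self.beta1) * g[i];
            st.v[i] = self.beta2 * st.v[i] + (1.0 - self.beta2) * g[i] * g[i];
            let m_hat = st.m[i] / bc1;
            let v_hat = st.v[i] / bc2;
            w[i] -= self.lr * self.weight_decay * w[i]; // decoupled decay
            w[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```

Calling `step` with two different names on the same instance creates two independent `Moments` entries, which is the behaviour the paragraph above describes.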
Design notes
- No external dependencies. Reference Rust; backends that ship a fused step kernel (see rlx-metal::splat_adam) bypass this crate for their hot path.
- Pure &mut [f32] / &[f32] slices: call from anywhere holding a flat parameter buffer, including rlx-umap::WeightStore or a hand-rolled training loop.
- forbid(unsafe_code).
Implementing for a backend
The Optimizer trait is intentionally minimal — (name, shape, &mut [f32], &[f32])
— so backends can write a fused step kernel and impl the trait
without owning host buffers:
use Optimizer;
The existing rlx-metal::splat_adam kernel is the canonical
fused-step example. It currently exposes a free function rather than
an Optimizer impl because it carries per-attribute scaling specific
to Gaussian splat training; a thin adapter struct in rlx-metal
could wrap it into the trait if you want a uniform interface from a
generic trainer.
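A thin adapter of the kind described above might look like the following sketch. Everything here is hypothetical: `fused_step` is a host-side stand-in for a backend's free function (it is not the rlx-metal::splat_adam signature), `FusedAdapter` and `train_step` are illustrative names, and the trait shape is again assumed from this README.

```rust
/// Assumed trait shape, inferred from the README; the real trait may differ.
pub trait Optimizer {
    fn step(&mut self, name: &str, shape: &[usize], w: &mut [f32], g: &[f32]);
}

/// Hypothetical stand-in for a backend's fused free function. A real kernel
/// would run on the device; this one is plain scaled SGD on the host.
pub fn fused_step(w: &mut [f32], g: &[f32], lr: f32, attr_scale: f32) {
    for (wi, gi) in w.iter_mut().zip(g) {
        *wi -= lr * attr_scale * gi;
    }
}

/// Adapter that carries the per-attribute scaling so callers see only the trait.
pub struct FusedAdapter {
    lr: f32,
    scales: Vec<(String, f32)>, // (name prefix, scale); first match wins
}

impl FusedAdapter {
    pub fn new(lr: f32, scales: Vec<(String, f32)>) -> Self {
        Self { lr, scales }
    }

    fn scale_for(&self, name: &str) -> f32 {
        self.scales
            .iter()
            .find(|(prefix, _)| name.starts_with(prefix.as_str()))
            .map(|(_, s)| *s)
            .unwrap_or(1.0) // unlisted attributes are left unscaled
    }
}

impl Optimizer for FusedAdapter {
    fn step(&mut self, name: &str, shape: &[usize], w: &mut [f32], g: &[f32]) {
        debug_assert_eq!(shape.iter().product::<usize>(), w.len());
        let scale = self.scale_for(name);
        fused_step(w, g, self.lr, scale);
    }
}

/// A generic trainer only needs the trait bound.
pub fn train_step<O: Optimizer>(
    opt: &mut O,
    params: &mut [(String, Vec<usize>, Vec<f32>)],
    grads: &[Vec<f32>],
) {
    for ((name, shape, w), g) in params.iter_mut().zip(grads) {
        opt.step(name, shape, w, g);
    }
}
```

The design choice is that backend-specific knobs live in the adapter's fields rather than in the trait, which keeps the trait signature as small as described.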
Cross-crate integration
| Caller | Path |
|---|---|
| rlx prelude | rlx::optim::* behind feature optim |
| rlx-umap | rlx_umap::optim_adapter::step_weight_store behind feature optim (bridges WeightStore ↔ any Optimizer) |
Performance
Enable the parallel feature to dispatch the elementwise inner loops
of Adam, AdamW and Lion to rayon when a tensor crosses 64k elements.
LAMB and MARS cache their scratch buffers across iterations, so a
trainer running for thousands of steps allocates exactly once per
parameter (not per step).
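The allocate-once behaviour can be illustrated with a small scratch cache keyed by parameter name. This is a sketch under assumptions, not the crate's LAMB or MARS code: `ScratchCache`, `allocations`, and `trust_scaled_step` are hypothetical, and the update is a simplified layer-wise trust-ratio step that uses the raw gradient where LAMB would use an Adam direction.

```rust
use std::collections::HashMap;

/// Hypothetical scratch-buffer cache: one buffer per parameter name.
pub struct ScratchCache {
    buffers: HashMap<String, Vec<f32>>,
    allocations: usize, // counted only so the behaviour can be observed
}

impl ScratchCache {
    pub fn new() -> Self {
        Self { buffers: HashMap::new(), allocations: 0 }
    }

    /// Returns the cached buffer for `name`, allocating on first use only.
    pub fn get(&mut self, name: &str, len: usize) -> &mut [f32] {
        if !self.buffers.contains_key(name) {
            self.buffers.insert(name.to_string(), vec![0.0; len]);
            self.allocations += 1;
        }
        let buf = self.buffers.get_mut(name).unwrap();
        assert_eq!(buf.len(), len, "parameter length changed between steps");
        buf
    }

    pub fn allocations(&self) -> usize {
        self.allocations
    }
}

fn l2_norm(x: &[f32]) -> f32 {
    x.iter().map(|v| v * v).sum::<f32>().sqrt()
}

/// Simplified layer-wise step: the whole update vector must exist before its
/// norm is known, which is why a scratch buffer is needed at all.
pub fn trust_scaled_step(cache: &mut ScratchCache, name: &str, w: &mut [f32], g: &[f32], lr: f32) {
    let u = cache.get(name, w.len());
    u.copy_from_slice(g); // stand-in for the Adam direction
    let u: &[f32] = u;
    let (w_norm, u_norm) = (l2_norm(w), l2_norm(u));
    let ratio = if w_norm > 0.0 && u_norm > 0.0 { w_norm / u_norm } else { 1.0 };
    for (wi, ui) in w.iter_mut().zip(u) {
        *wi -= lr * ratio * ui;
    }
}
```

Running `trust_scaled_step` in a loop for the same name leaves `allocations()` at 1 regardless of the step count.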
Status
| Property | Notes |
|---|---|
| Numerical reference | Yes; matches PyTorch / Optax conventions |
| CPU parallelism | Optional via parallel feature (rayon) |
| Backend-fused kernels | Trait is impl'able from any backend crate; see "Implementing for a backend" above |
| Distributed reductions | No (single-host) |
| Mixed precision | Caller-side (cast to f32 before stepping) |
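For the caller-side mixed-precision row, one possible shape of the cast is sketched below, assuming weights and gradients are stored as raw bfloat16 bit patterns in `u16`. The helper names (`bf16_to_f32`, `f32_to_bf16`, `step_bf16`) are hypothetical and not part of this crate; the closure stands in for any f32 step function.

```rust
/// Widen a bfloat16 bit pattern to f32 (bf16 is the top 16 bits of an f32).
pub fn bf16_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Narrow f32 to bfloat16 with round-to-nearest-even.
pub fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040; // keep it a quiet NaN
    }
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding) >> 16) as u16
}

/// Cast to f32, run an f32 step, and cast the result back.
pub fn step_bf16<F>(w_bf16: &mut [u16], g_bf16: &[u16], mut step: F)
where
    F: FnMut(&mut [f32], &[f32]),
{
    let mut w: Vec<f32> = w_bf16.iter().map(|&b| bf16_to_f32(b)).collect();
    let g: Vec<f32> = g_bf16.iter().map(|&b| bf16_to_f32(b)).collect();
    step(&mut w, &g);
    for (dst, src) in w_bf16.iter_mut().zip(&w) {
        *dst = f32_to_bf16(*src);
    }
}
```

This version allocates temporary f32 buffers on every call for brevity; a long-running trainer would more likely keep persistent f32 master weights and reuse them across steps.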
License
GPL-3.0-only.