// RLX — versatile ML compiler + runtime.
// Copyright (C) 2026 Eugene Hauptmann, Nataliya Kosmyna.
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, version 3.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.
//! Lion — EvoLved Sign Momentum (Chen et al., 2023, "Symbolic
//! Discovery of Optimization Algorithms").
//!
//! # Idea
//!
//! Lion was *discovered* by a program-synthesis search over candidate
//! optimizer expressions. The found rule is shockingly simple — one
//! momentum buffer, and the update is the **sign** of an
//! interpolation between the momentum and the gradient.
//!
//! # Update rule
//!
//! ```text
//! c_t = β₁·m_{t-1} + (1 − β₁)·g_t
//! θ_t = θ_{t-1} − lr · ( sign(c_t) + λ·θ_{t-1} )
//! m_t = β₂·m_{t-1} + (1 − β₂)·g_t // note: different β₂!
//! ```
//!
//! Two distinct betas: `β₁` shapes the *update direction* (faster
//! adaptation), `β₂` shapes the *carried momentum* (slower memory).
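//!
//! A minimal sketch of one step of the rule above on flat `f32` slices.
//! The function `lion_step` and its parameter names are illustrative
//! assumptions, not this crate's API. It also assumes `sign(0) = 0`,
//! which differs from `f32::signum` (that returns `1.0` for `+0.0`).
//!
//! ```rust
//! // Hypothetical helper: applies one Lion update in place.
//! fn lion_step(
//!     theta: &mut [f32],
//!     m: &mut [f32],
//!     grad: &[f32],
//!     lr: f32,
//!     beta1: f32,
//!     beta2: f32,
//!     weight_decay: f32,
//! ) {
//!     assert_eq!(theta.len(), m.len());
//!     assert_eq!(theta.len(), grad.len());
//!     for ((p, m_i), &g) in theta.iter_mut().zip(m.iter_mut()).zip(grad) {
//!         // c_t = β₁·m_{t-1} + (1 − β₁)·g_t
//!         let c = beta1 * *m_i + (1.0 - beta1) * g;
//!         // Assumed convention: sign(0) = 0.
//!         let s = if c > 0.0 { 1.0 } else if c < 0.0 { -1.0 } else { 0.0 };
//!         // θ_t = θ_{t-1} − lr·(sign(c_t) + λ·θ_{t-1})
//!         *p -= lr * (s + weight_decay * *p);
//!         // m_t = β₂·m_{t-1} + (1 − β₂)·g_t, using the second beta
//!         *m_i = beta2 * *m_i + (1.0 - beta2) * g;
//!     }
//! }
//! ```
//!
//! The interpolation `c` is computed from the old momentum before the
//! buffer is overwritten, so the order of the last two statements matters.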
//!
//! # When to use
//!
//! Lion keeps half the optimizer state of Adam (one buffer instead
//! of two) and often converges to similar quality on transformers
//! when the LR is tuned 3–10× lower than the corresponding AdamW LR.
//! Sign updates get coarse on tiny problems, so favor large-batch /
//! large-model regimes.
use std::collections::HashMap;
use crateOptimizer;
use crate;
/// EvoLved sign-momentum optimizer.
///
/// Per-tensor state: **one** `f32` buffer (half of Adam's footprint).