// [#780 line-count gate] Cohesive softmax-entropy Gershgorin majorizer leaf
// helpers split out of `construction.rs` (which crossed the 10k-line gate).
// These are the #1410 per-row active-atom majorizer / dense-entropy-Hessian /
// logit-derivative entry functions: pure leaf math over a softmax row, no
// struct-private coupling. Included via `include!` from `construction.rs` so
// they keep the SAME module scope (`use super::*`), visibility, and the debug
// oracles that pin them to the dense library routines.
/// #1410 — single active-atom entry of the per-row softmax-entropy Gershgorin
/// Loewner majorizer `D_kk = Σ_j |H_kj|` (#1419), computed WITHOUT materialising
/// a full-`K` diagonal `d`.
///
/// The compact softmax assembly / θ-adjoint only ever read `D_kk` for the
/// `≤ top_k` active atoms, yet
/// [`SoftmaxAssignmentSparsityPenalty::psd_majorizer_abs_row_sums`] returns the
/// FULL-`K` `d` vector (and the SAE callers were additionally copying the
/// row's logits into a fresh length-`K` `Vec` just to feed it). At the SAE LLM
/// shape (`K ≈ 100k`) that is two `O(K)` per-row scratch allocations on the
/// compact (`O(top_k·d)`-per-token) path that the whole #1408/#1409/#1450
/// contract exists to keep `K`-free. This helper consumes the per-row softmax
/// assignments `a` (already in hand — it IS the softmax row) and an explicit
/// active atom `kk`, and returns only that atom's majorizer diagonal, allocating
/// nothing.
///
/// It reproduces `psd_majorizer_abs_row_sums` EXACTLY (same `(a, l, m)`
/// algebra, same `ENTROPY_LOG_PROBABILITY_FLOOR`, same scaled formula), so the
/// assembly, the criterion's `log|H|`, and the #1006 θ-adjoint still
/// differentiate ONE operator. The shared `m = Σ_j a_j l_j` is an `O(K)` pass
/// common to every active atom of the row; pass it in precomputed
/// (`softmax_majorizer_log_mean`) so a row that fills several active slots
/// pays it once. A debug oracle
/// (`active_softmax_gershgorin_matches_dense_majorizer_1410`) pins this to the
/// dense `psd_majorizer_abs_row_sums` so the two cannot drift.
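//
// A minimal sketch of the hoisted log-mean pass described above, not the
// crate's own `softmax_majorizer_log_mean`. Assumptions: `l_j` is taken to be
// `ln(max(a_j, floor))`, the names `floored_log_sketch` / `log_mean_sketch` are
// hypothetical, and `1e-300` is only a stand-in for
// `ENTROPY_LOG_PROBABILITY_FLOOR`, whose real value is not stated here.

```rust
/// Hypothetical stand-in for `ENTROPY_LOG_PROBABILITY_FLOOR` (assumed value).
const LOG_PROB_FLOOR_SKETCH: f64 = 1e-300;

/// Floored log-probability `l_j = ln(max(a_j, floor))` (assumed form of `l`).
fn floored_log_sketch(p: f64) -> f64 {
    p.max(LOG_PROB_FLOOR_SKETCH).ln()
}

/// Shared `m = Σ_j a_j l_j`: a single `O(K)` pass over the softmax row that
/// borrows the row and allocates nothing.
fn log_mean_sketch(a: &[f64]) -> f64 {
    a.iter().map(|&p| p * floored_log_sketch(p)).sum()
}
```

// Under these assumptions a caller would compute `m` once per row and hand the
// same value to every per-atom call for that row's active slots.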
/// Single `(kk, jj)` entry of the exact per-row dense softmax-entropy Hessian
/// `H_kj = scale·a_k·(δ_kj·(m−l_k−1) + a_j·(l_k+l_j+1−2m))` (mirrors
/// [`SoftmaxAssignmentSparsityPenalty::row_dense_hessian`] entry-for-entry). Used
/// by the #1418 exact-Hessian (`A = B + ΔC`) correction so the compact path can
/// read only the active `≤ top_k × top_k` sub-block of `H_entropy` without
/// materialising the full `K×K` dense block per row (#1410). `m` is the shared
/// [`softmax_majorizer_log_mean`]; `O(1)` per entry, zero allocation.
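//
// The entry formula above can be sketched as a standalone function. This is
// an illustration under stated assumptions, not the crate's implementation:
// `l_j = ln(max(a_j, floor))` is assumed, the `*_sketch` names are
// hypothetical, and `1e-300` merely stands in for
// `ENTROPY_LOG_PROBABILITY_FLOOR`.

```rust
/// Hypothetical stand-in for `ENTROPY_LOG_PROBABILITY_FLOOR` (assumed value).
const LOG_PROB_FLOOR_SKETCH: f64 = 1e-300;

/// Floored log-probability `l_j = ln(max(a_j, floor))` (assumed form of `l`).
fn floored_log_sketch(p: f64) -> f64 {
    p.max(LOG_PROB_FLOOR_SKETCH).ln()
}

/// Shared `m = Σ_j a_j l_j`, computed once per row.
fn log_mean_sketch(a: &[f64]) -> f64 {
    a.iter().map(|&p| p * floored_log_sketch(p)).sum()
}

/// `H_kj = scale·a_k·(δ_kj·(m − l_k − 1) + a_j·(l_k + l_j + 1 − 2m))`.
/// `O(1)` per entry; reads only `a[kk]` and `a[jj]`.
fn hessian_entry_sketch(a: &[f64], kk: usize, jj: usize, m: f64, scale: f64) -> f64 {
    let (l_k, l_j) = (floored_log_sketch(a[kk]), floored_log_sketch(a[jj]));
    // The δ_kj term contributes only on the diagonal.
    let diag = if kk == jj { m - l_k - 1.0 } else { 0.0 };
    scale * a[kk] * (diag + a[jj] * (l_k + l_j + 1.0 - 2.0 * m))
}
```

// Two properties follow from the formula and make cheap sanity checks: the
// entries are symmetric in `(kk, jj)`, and for a normalised row (`Σ_j a_j = 1`
// with `m = Σ_j a_j l_j`) each row of `H` sums to zero.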
/// Active-atom diagonal `D_kk` of the softmax-entropy Gershgorin majorizer; see
/// [`softmax_majorizer_log_mean`]. `a` is the per-row softmax assignment vector,
/// `kk` the (global) atom index, `m` the precomputed `Σ_j a_j l_j`, and `scale`
/// the `λ/τ²` penalty scale. `O(K)` time, zero allocation.
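//
// A hedged sketch of the active-atom diagonal `D_kk = Σ_j |H_kj|`, streaming
// over the row instead of building a length-`K` vector. It is not the crate's
// function: the name `active_majorizer_diag_sketch`, the assumed
// `l_j = ln(max(a_j, floor))`, and the floor value `1e-300` (a stand-in for
// `ENTROPY_LOG_PROBABILITY_FLOOR`) are all illustrative.

```rust
/// Hypothetical stand-in for `ENTROPY_LOG_PROBABILITY_FLOOR` (assumed value).
const LOG_PROB_FLOOR_SKETCH: f64 = 1e-300;

/// Floored log-probability `l_j = ln(max(a_j, floor))` (assumed form of `l`).
fn floored_log_sketch(p: f64) -> f64 {
    p.max(LOG_PROB_FLOOR_SKETCH).ln()
}

/// `D_kk = Σ_j |H_kj|` for one atom `kk`, with `m = Σ_j a_j l_j` precomputed by
/// the caller. `O(K)` time, no heap allocation.
fn active_majorizer_diag_sketch(a: &[f64], kk: usize, m: f64, scale: f64) -> f64 {
    let l_k = floored_log_sketch(a[kk]);
    let mut sum = 0.0;
    for (jj, &a_j) in a.iter().enumerate() {
        // H_kj = scale·a_k·(δ_kj·(m − l_k − 1) + a_j·(l_k + l_j + 1 − 2m))
        let diag = if jj == kk { m - l_k - 1.0 } else { 0.0 };
        let h = scale * a[kk] * (diag + a_j * (l_k + floored_log_sketch(a_j) + 1.0 - 2.0 * m));
        sum += h.abs();
    }
    sum
}
```

// For the two-atom uniform row `a = [0.5, 0.5]` with `scale = 2`, the formula
// gives `H_00 = -0.5` and `H_01 = 0.5`, so `D_00 = 1`.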
/// Active-atom diagonal entry `∂D_kk/∂z_w = Σ_j sign(H_kj)·∂H_kj/∂z_w` of the
/// softmax-entropy Gershgorin majorizer derivative (mirrors
/// [`SoftmaxAssignmentSparsityPenalty::row_psd_majorizer_logit_derivative`]'s
/// `out[[kk, kk]]` entry-for-entry; that operator's output is DIAGONAL, so only
/// its `[[kk, kk]]` entries are nonzero). The compact #1006 θ-adjoint needs this only
/// for the row's `≤ top_k` active atoms paired with its active logits, so this
/// computes one diagonal entry directly from the softmax row `a` instead of
/// materialising the full `K×K` derivative matrix per (row, logit) (#1410).
///
/// `a` is the per-row softmax row, `kk` the (global) atom index, `w` the (global)
/// logit being differentiated, `m` the shared [`softmax_majorizer_log_mean`],
/// `scale = λ/τ²`, and `inv_tau = 1/τ`. Uses the SAME `∂a_r/∂z_w =
/// a_r(δ_rw − a_w)/τ` convention as the dense library routine, so value and
/// adjoint stay on one operator (pinned by
/// `active_softmax_majorizer_logit_derivative_matches_dense_1410`). `O(K)` time,
/// zero allocation.
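//
// A small sketch of the `∂a_r/∂z_w = a_r(δ_rw − a_w)/τ` convention named above,
// assuming `a = softmax(z/τ)`. The names `softmax_sketch` and
// `softmax_logit_derivative_sketch` are hypothetical, and `softmax_sketch`
// allocates only because it exists to support a finite-difference check; it is
// not meant to model the allocation-free helper.

```rust
/// Temperature softmax `a_r = exp(z_r/τ) / Σ_j exp(z_j/τ)`, shifted by the row
/// maximum for numerical stability.
fn softmax_sketch(z: &[f64], tau: f64) -> Vec<f64> {
    let max = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = z.iter().map(|&v| ((v - max) / tau).exp()).collect();
    let total: f64 = exps.iter().sum();
    exps.iter().map(|e| e / total).collect()
}

/// `∂a_r/∂z_w = a_r·(δ_rw − a_w)·inv_tau`, read directly off the softmax row.
fn softmax_logit_derivative_sketch(a: &[f64], r: usize, w: usize, inv_tau: f64) -> f64 {
    let delta = if r == w { 1.0 } else { 0.0 };
    a[r] * (delta - a[w]) * inv_tau
}
```

// A central finite difference on `softmax_sketch` is one way to confirm that a
// closed-form entry follows this convention before chaining it through
// `∂H_kj/∂z_w`.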