RowMetric — the single provenance-carrying per-row inner product shared by
the SAE-manifold likelihood (residual whitening) and the gauge
(isometry pullback weight).
§Why this exists
The SAE-manifold machine historically carried two independent inner products:
- the likelihood measured reconstruction residuals isotropically: a
single scalar dispersion φ̂ = RSS / residual-dof, with the data-fit loop
summing the bare ½ rᵀr; there was no per-row metric at all; and
- the gauge carried its own per-row metric in
IsometryPenalty.weight: WeightField, a low-rank W_n = U_n U_nᵀ pullback
g_n = J_nᵀ W_n J_n, settable independently of anything the likelihood saw.
Nothing structurally forced “the metric the likelihood whitens by” to equal “the metric the gauge pulls back through”. That is exactly the objective↔gradient-desync bug class wearing geometry clothing: a likelihood-metric ≠ gauge-metric state was representable.
RowMetric collapses the two into one object. The likelihood whitens
through it; the gauge WeightField is constructed from it. A
divergent-metric state is therefore unrepresentable — there is only one
per-row factor stack U_n, with one MetricProvenance tag.
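To make the "one factor stack, two consumers" idea concrete, here is a minimal sketch. The type `RowMetricSketch` and its methods are hypothetical stand-ins, not the crate's API, and the row-major p×s layout is an assumption. Both the likelihood view (`whiten`) and the gauge view (`gauge_weight`) read the same stored `U_n`, so they cannot disagree.

```rust
/// Hypothetical stand-in for the single per-row metric object.
/// Assumes one row's factor U is stored row-major as p×s.
struct RowMetricSketch {
    p: usize,
    s: usize,
    u: Vec<f64>,
}

impl RowMetricSketch {
    /// Returns None when the buffer length does not match p×s.
    fn new(p: usize, s: usize, u: Vec<f64>) -> Option<Self> {
        if u.len() == p * s {
            Some(Self { p, s, u })
        } else {
            None
        }
    }

    /// Likelihood view: the whitened residual Uᵀ r (length s).
    fn whiten(&self, r: &[f64]) -> Vec<f64> {
        assert_eq!(r.len(), self.p, "residual must have length p");
        let mut out = vec![0.0; self.s];
        for i in 0..self.p {
            for j in 0..self.s {
                out[j] += self.u[i * self.s + j] * r[i];
            }
        }
        out
    }

    /// Gauge view: dense W = U Uᵀ (row-major p×p), built from the same U.
    /// Materialized here for illustration only.
    fn gauge_weight(&self) -> Vec<f64> {
        let mut w = vec![0.0; self.p * self.p];
        for a in 0..self.p {
            for b in 0..self.p {
                for j in 0..self.s {
                    w[a * self.p + b] += self.u[a * self.s + j] * self.u[b * self.s + j];
                }
            }
        }
        w
    }
}
```

With `U = [1, 2]ᵀ` and `r = [3, 1]`, the whitened residual is `[5]`, and `rᵀ W r` computed from the gauge weight gives the same value as `‖Uᵀ r‖²`, since both come from one buffer.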
§Magic-by-default selector
There is no flag. The provenance is chosen by whether per-row Fisher factors exist:
- no factors supplied ⇒ MetricProvenance::Euclidean; W_n = I_p; whitening
is the identity, so φ̂ and the data-fit loop are bit-for-bit the prior
isotropic path; and
- per-row Fisher factors supplied ⇒ MetricProvenance::OutputFisher; the
residual is whitened by U_nᵀ and the gauge pulls back through the same U_n.
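The flag-free selection rule can be sketched as a match on an `Option`. The names `ProvenanceSketch`, `select_provenance`, and `dispersion` are hypothetical and only mirror the two cases listed above; the crate's real constructor signature is not shown in this page.

```rust
/// Hypothetical mirror of the two provenances named in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProvenanceSketch {
    Euclidean,
    OutputFisher,
}

/// Provenance is decided only by whether factors were supplied; there is no flag.
fn select_provenance(factors: Option<&[f64]>) -> ProvenanceSketch {
    match factors {
        None => ProvenanceSketch::Euclidean,
        Some(_) => ProvenanceSketch::OutputFisher,
    }
}

/// Isotropic dispersion φ̂ = RSS / residual-dof, as used on the Euclidean path.
/// Returns None when there are no residual degrees of freedom.
fn dispersion(residuals: &[f64], residual_dof: usize) -> Option<f64> {
    if residual_dof == 0 {
        return None;
    }
    let rss: f64 = residuals.iter().map(|r| r * r).sum();
    Some(rss / residual_dof as f64)
}
```

Because the Euclidean branch applies no transformation to the residuals, `dispersion` sees exactly the numbers the isotropic path saw before the metric object existed.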
§Validation
Every metric block is constructed through
crate::normalize_fisher_rao_blocks, which
broadcasts and eigenvalue-validates PSD-ness. RowMetric does not
reimplement that validation; it materializes W_n = U_n U_nᵀ (which is PSD
by construction) and runs it through the shared normalizer as the
single point of truth for “is this a valid precision metric”.
Any rank floor used to make a block invertible for an internal solve is
solver-only (mirroring RidgePolicy::solver_only, #747): it never enters
the residual the objective sums, so δ cannot bias the criterion.
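A minimal sketch of the solver-only idea, under the simplifying assumption of a diagonal metric. The function names are hypothetical, and this does not reproduce RidgePolicy::solver_only. The floor δ appears only in the quantity that is inverted, never in the objective.

```rust
/// Objective contribution ½ Σ wᵢ rᵢ² using the unfloored diagonal metric.
/// δ is deliberately not a parameter here.
fn objective(w_diag: &[f64], r: &[f64]) -> f64 {
    assert_eq!(w_diag.len(), r.len());
    0.5 * w_diag.iter().zip(r).map(|(w, ri)| w * ri * ri).sum::<f64>()
}

/// Solver-side inverse of the floored diagonal (wᵢ + δ).
/// The floor keeps zero entries invertible for the internal solve only.
fn floored_inverse(w_diag: &[f64], delta: f64) -> Vec<f64> {
    w_diag.iter().map(|w| 1.0 / (w + delta)).collect()
}
```

With `w = [1, 0]` the second direction is singular. The floored inverse stays finite for any δ > 0, while the objective still charges nothing in that direction regardless of δ.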
§Rung 1 — the behavioral metric in the reconstruction loss (nats currency)
MetricProvenance::OutputFisher installs the output-Fisher inner product
as a gauge metric only: it whitens nothing (whitens_likelihood() is
false), by deliberate #980 contract, so reconstruction stays the isotropic
½‖r‖². That answers “what coordinate is canonical”, not “what does a
reconstruction error cost”.
MetricProvenance::BehavioralFisher is the opposite deliberate choice:
the same low-rank output-Fisher factors, but installed as the
reconstruction likelihood weight. Plain MSE prices a reconstruction error
e = x − x̂ by its Euclidean size; the model, however, reads the activation
only through the rest of the network, so the behavioral cost of e is the
KL between the clean and corrupted next-token distributions,
KL ≈ ½ eᵀ G(x) e with G = JᵀFJ the network-Jacobian pullback of the
output Fisher F (units: nats). Minimizing (x−x̂)ᵀ G (x−x̂) instead of
‖x−x̂‖² is generalized least squares: for a fixed per-row G it is
still a linear Gaussian model in the coefficients, so the entire
REML/evidence/EDF/certificate stack survives verbatim — this is why the
metric rides the identical whitens_likelihood() plumbing the
MetricProvenance::WhitenedStructured noise model uses, and why the G=I
limit reproduces the plain-MSE fit bit-for-bit (see the module tests).
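The generalized-least-squares claim can be illustrated with a one-coefficient model x ≈ β·d. All names below are hypothetical, and a single decoder direction `d` is an assumption made for brevity. Minimizing (x − βd)ᵀ U Uᵀ (x − βd) is ordinary least squares on the whitened vectors Uᵀd and Uᵀx, and U = I gives back the plain-MSE estimate.

```rust
/// Whiten a length-p vector by a row-major p×s factor U, returning Uᵀv (length s).
fn whiten(u: &[f64], p: usize, s: usize, v: &[f64]) -> Vec<f64> {
    assert_eq!(u.len(), p * s, "factor must be p×s");
    assert_eq!(v.len(), p, "vector must have length p");
    let mut out = vec![0.0; s];
    for i in 0..p {
        for j in 0..s {
            out[j] += u[i * s + j] * v[i];
        }
    }
    out
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// One-coefficient GLS estimate β = (dᵀGd)⁻¹ dᵀGx with G = U Uᵀ,
/// computed as OLS on the whitened vectors. Returns None when d lies
/// entirely in G's null space (β is then not identified).
fn gls_beta(u: &[f64], p: usize, s: usize, d: &[f64], x: &[f64]) -> Option<f64> {
    let dw = whiten(u, p, s, d);
    let xw = whiten(u, p, s, x);
    let denom = dot(&dw, &dw);
    if denom == 0.0 {
        None
    } else {
        Some(dot(&dw, &xw) / denom)
    }
}
```

For `d = [1, 1]` and `x = [1, 3]`, the identity factor yields β = 2, the same as `dot(d, x) / dot(d, d)`. A rank-one factor that only reads the first coordinate yields β = 1, because the second coordinate's error costs nothing.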
This is the principled form of Braun’s end-to-end KL + MSE objective.
Anchoring to the activation keeps it reconstruction (it does not collapse
to “match the logits by any means” — the decoder still has to reproduce x),
while pricing the residual in nats through G. The payoff is automatic
selection for mattering: G’s null directions — activation structure the
rest of the network cannot read — are penalized nothing, because
eᵀ G e = 0 there. MSE in a behaviorally-inert direction goes free, which is
the correct behavior, not a bug: nothing downstream changes, so nothing
should be paid.
The d×d G is never materialized. G is sketched by s random probes,
vᵢ = Jᵀ F^{1/2} uᵢ (uᵢ iid, s ≈ 4…16), computed by s backward passes
per token at harvest time (the model-interaction boundary) and stored as
the columns of the per-row factor U_n = [v₁ … v_s] ∈ ℝ^{p×s}. Then
G ≈ Σᵢ vᵢ vᵢᵀ = U_n U_nᵀ and the criterion-facing
eᵀ G e ≈ Σᵢ (vᵢᵀ e)² = ‖U_nᵀ e‖² is exactly what
RowMetric::quad_form / RowMetric::whiten_residual_row already
compute — zero train-time model cost, O(p·s) per row. See
RowMetric::behavioral_fisher and the probe-packing helper
pack_probe_factors.
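The probe-to-factor step and the O(p·s) quadratic form might look roughly like the sketch below. The signatures are assumptions: `pack_probes` and `quad_form` are hypothetical helpers, not the crate's `pack_probe_factors` or `RowMetric::quad_form`, and the row-major p×s layout is inferred from the description of the factor as having the probes as columns.

```rust
/// Pack s probe vectors (each of length p) as the columns of a row-major
/// p×s factor, so that u[i * s + j] = probes[j][i].
/// Returns None for an empty or ragged probe stack.
fn pack_probes(probes: &[Vec<f64>]) -> Option<(Vec<f64>, usize, usize)> {
    let s = probes.len();
    let p = probes.first()?.len();
    if p == 0 || probes.iter().any(|v| v.len() != p) {
        return None;
    }
    let mut u = vec![0.0; p * s];
    for (j, probe) in probes.iter().enumerate() {
        for (i, value) in probe.iter().enumerate() {
            u[i * s + j] = *value;
        }
    }
    Some((u, p, s))
}

/// eᵀ(U Uᵀ)e computed as ‖Uᵀe‖² in O(p·s), without forming the p×p matrix.
fn quad_form(u: &[f64], p: usize, s: usize, e: &[f64]) -> f64 {
    assert_eq!(u.len(), p * s, "factor must be p×s");
    assert_eq!(e.len(), p, "residual must have length p");
    let mut total = 0.0;
    for j in 0..s {
        let mut proj = 0.0;
        for i in 0..p {
            proj += u[i * s + j] * e[i];
        }
        total += proj * proj;
    }
    total
}
```

With probes `[1, 0]` and `[0, 2]` and residual `e = [3, 1]`, the cost is 3² + 2² = 13. With the single probe `[1, 0]` and `e = [0, 5]`, the cost is 0 even though ‖e‖² = 25, which is the "null directions are penalized nothing" behavior described above.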
Structs§
- RowMetric - The single per-row metric object. Holds one low-rank factor stack
U_n (or none, for Euclidean) plus the validated PSD blocks, tagged with its MetricProvenance.
Enums§
- MetricProvenance - Where the per-row metric came from: the provenance that makes “likelihood-metric ≠ gauge-metric” diagnosable instead of silent.
- WeightField - Per-observation behavioral-metric field
W_n ∈ ℝ^{p × p}, stored in low-rank factored form W_n = U_n U_nᵀ with U_n ∈ ℝ^{p × r_n}.
Functions§
- pack_probe_factors - Pack a harvest-emitted probe stack into the row-major factor layout
RowMetric::behavioral_fisher expects.