steer_delta — the steering primitive with output dosimetry: the
actionable LLM payload of the SAE-manifold machine.
§What this computes
Given a fitted SaeManifoldTerm and the per-row output-Fisher
RowMetric, a steering move is “drive atom k’s latent coordinate from
t_from to t_to”. The atom’s decoder curve g_k(t) = Φ_k(t) B_k maps that
latent move to an activation-space delta — the actual vector you add to
the residual stream / reconstruction to realize the move on the manifold:
δ = a · ( g_k(t_to) − g_k(t_from) ) (the on-manifold move)
where a is the atom’s amplitude (how loudly the atom is expressed). This is
the thing a downstream consumer adds to a hidden state.
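As a rough illustration of the formula for δ, the sketch below computes the delta for a toy curve. It does not use the crate's API: `on_manifold_delta` and `toy_curve` are hypothetical names, and the closure `decoder` merely stands in for the atom's decoder curve g_k(t).

```rust
/// Activation-space delta for moving one atom's latent coordinate.
/// `decoder` is a stand-in for g_k(t) (an assumption, not the crate's type);
/// `amplitude` is the atom's amplitude a.
fn on_manifold_delta<F>(decoder: F, amplitude: f64, t_from: f64, t_to: f64) -> Vec<f64>
where
    F: Fn(f64) -> Vec<f64>,
{
    let start = decoder(t_from);
    let end = decoder(t_to);
    // delta = a * (g(t_to) - g(t_from)), componentwise.
    start
        .iter()
        .zip(end.iter())
        .map(|(s, e)| amplitude * (e - s))
        .collect()
}

/// Hypothetical toy curve: a unit-circle arc embedded in three dimensions.
fn toy_curve(t: f64) -> Vec<f64> {
    vec![t.cos(), t.sin(), 0.0]
}
```

Calling `on_manifold_delta(toy_curve, 2.0, 0.0, std::f64::consts::FRAC_PI_2)` yields a vector close to `[-2.0, 2.0, 0.0]`: the chord of a quarter circle, scaled by the amplitude.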
§Dosimetry — how big is this push, in nats?
The headline number is the predicted output effect: how much behavioral
change (in nats of KL on the model’s output distribution) the move induces.
For a locally-quadratic output readout the KL of a parameter move Δ is
½ Δᵀ F Δ with F the output-Fisher information — exactly the inner product
RowMetric carries. The dose is the Fisher quadratic form of the move,
integrated along the decoder curve rather than read only at the endpoints:
predicted_nats = ½ ∫_{t_from}^{t_to} a² · g_k'(t)ᵀ M_n g_k'(t) dt
evaluated in small steps via the per-row pullback / fisher-mass methods. The
path integral is the honest dose: it follows the curved surface, so a long
arc that doubles back is not under-counted the way a straight endpoint chord
would be.
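The integral can be approximated numerically along these lines. This is only a sketch under stated assumptions, not the module's implementation: the metric is passed as a dense symmetric matrix rather than through `RowMetric`, the derivative g_k'(t) is replaced by a secant slope on each sub-interval, and the names `path_dose`, `metric_quad_form` and `circle_arc` are hypothetical. The sketch also uses the absolute step width, so the result is non-negative whichever direction the move runs.

```rust
/// v^T M v for a dense matrix M stored as rows (a stand-in for the Fisher metric).
fn metric_quad_form(metric: &[Vec<f64>], v: &[f64]) -> f64 {
    metric
        .iter()
        .zip(v)
        .map(|(row, vi)| vi * row.iter().zip(v).map(|(m, vj)| m * vj).sum::<f64>())
        .sum()
}

/// Approximates 0.5 * a^2 * integral of g'(t)^T M g'(t) dt in `steps` pieces.
/// `decoder` stands in for the decoder curve g_k(t).
fn path_dose<F>(
    decoder: F,
    metric: &[Vec<f64>],
    amplitude: f64,
    t_from: f64,
    t_to: f64,
    steps: usize,
) -> f64
where
    F: Fn(f64) -> Vec<f64>,
{
    assert!(steps > 0, "need at least one integration step");
    let h = (t_to - t_from) / steps as f64;
    let mut total = 0.0;
    for i in 0..steps {
        let t0 = t_from + h * i as f64;
        let g0 = decoder(t0);
        let g1 = decoder(t0 + h);
        // Secant slope over the sub-interval approximates g'(t) at its midpoint.
        let slope: Vec<f64> = g0.iter().zip(&g1).map(|(a, b)| (b - a) / h).collect();
        // Assumption: weight by |h| so a reversed move gives the same dose.
        total += metric_quad_form(metric, &slope) * h.abs();
    }
    0.5 * amplitude * amplitude * total
}

/// Hypothetical toy curve with unit speed: a unit-circle arc in three dimensions.
fn circle_arc(t: f64) -> Vec<f64> {
    vec![t.cos(), t.sin(), 0.0]
}
```

For the unit-speed toy curve with an identity metric, the integrand is constant, so the dose over a latent interval of length 1 with amplitude 2 comes out close to 2.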
§Validity radius — where local linearization stops being trusted
A consumer must know how far the move can be trusted as a linear push. The
validity radius is the latent step size at which the path-integrated dose
diverges from the straight endpoint quadratic form
½ a² δ̂ᵀ M δ̂ (the local-linear prediction) by more than
[VALIDITY_DIVERGENCE_FRACTION]. Beyond it the surface has curved enough that
the endpoint chord no longer represents the move. We report it; we do not
silently clip to it.
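One way to picture the search for this radius is a scan over candidate step sizes. The sketch below is an assumption-laden illustration rather than the module's algorithm: the two dose predictions are passed in as black-box closures, the divergence is measured relative to the chord prediction, the scan uses a uniform grid, and `validity_radius` with its parameters is a hypothetical name. The `fraction` argument plays the role of [VALIDITY_DIVERGENCE_FRACTION], whose actual value is not assumed here.

```rust
/// Smallest step on a uniform grid at which the path-integrated dose and the
/// chord (local-linear) prediction disagree by more than `fraction`,
/// measured relative to the chord prediction. Returns `None` when the two
/// agree over the whole scanned range up to `max_step`.
fn validity_radius<P, C>(
    path_dose_at: P,
    chord_dose_at: C,
    fraction: f64,
    max_step: f64,
    grid: usize,
) -> Option<f64>
where
    P: Fn(f64) -> f64,
    C: Fn(f64) -> f64,
{
    for i in 1..=grid {
        let step = max_step * i as f64 / grid as f64;
        let path = path_dose_at(step);
        let chord = chord_dose_at(step);
        // Relative divergence; the tiny floor avoids dividing by zero.
        let rel = (path - chord).abs() / chord.abs().max(f64::MIN_POSITIVE);
        if rel > fraction {
            return Some(step);
        }
    }
    None
}
```

The function only reports the radius. In keeping with the text above, it leaves the decision about what to do beyond that radius to the caller.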
§Off-manifold guard
δ is, by construction, a chord of the decoder curve, so it should lie in the
atom’s local tangent/frame at t_from (up to second-order curvature). The
off-manifold norm projects δ onto the span of the local decoder tangents
∂g_k/∂t at t_from and reports the residual norm — a self-check that the
steering move stays on the learned surface. It is ≈ 0 for small steps and
grows with arc curvature; a large value means the requested move left the
manifold and the dose number is not to be trusted.
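The guard amounts to a projection residual, which can be sketched as follows. The sketch assumes a plain Euclidean inner product and tangents supplied as explicit vectors; the real module may use a different inner product or frame. `off_manifold_norm` and `dot_euclid` are hypothetical helper names.

```rust
/// Euclidean dot product of two equal-length slices.
fn dot_euclid(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Norm of the component of `delta` lying outside span(`tangents`).
/// Assumption: Euclidean geometry; near-zero tangents are skipped.
fn off_manifold_norm(delta: &[f64], tangents: &[Vec<f64>]) -> f64 {
    // Build an orthonormal basis of the tangent span (modified Gram-Schmidt).
    let mut basis: Vec<Vec<f64>> = Vec::new();
    for t in tangents {
        let mut v = t.clone();
        for b in &basis {
            let c = dot_euclid(&v, b);
            for (vi, bi) in v.iter_mut().zip(b) {
                *vi -= c * bi;
            }
        }
        let n = dot_euclid(&v, &v).sqrt();
        if n > 1e-12 {
            for vi in v.iter_mut() {
                *vi /= n;
            }
            basis.push(v);
        }
    }
    // Subtract the in-span part of delta; what remains is off-manifold.
    let mut residual = delta.to_vec();
    for b in &basis {
        let c = dot_euclid(&residual, b);
        for (ri, bi) in residual.iter_mut().zip(b) {
            *ri -= c * bi;
        }
    }
    dot_euclid(&residual, &residual).sqrt()
}
```

For a unit circle (cos t, sin t, 0) with tangent (0, 1, 0) at t = 0, the chord to t = h is (cos h − 1, sin h, 0), so the residual is 1 − cos h. That is roughly h²/2 for small h, matching the second-order growth described above.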
§Read-only / no loss contact
This module is a pure read over the fitted term and the metric. It calls
only g_k(t) evaluation ([SaeManifoldAtom]’s decoder + installed
[SaeBasisEvaluator]) and the criterion-facing
RowMetric::fisher_mass / RowMetric::pullback. It never mutates the
model, never touches a likelihood / criterion / penalty, and the solver floor
δ of RowMetric never enters any number it reports (the fisher-mass /
pullback face is δ-free, #747).
Structs§
- SteerPlan - The actionable output of a steering query over one atom.
Functions§
- predicted_response - The model’s predicted output-mean response to an applied activation push δ, under the LOCAL-LINEAR reading of its fitted surface: the projection of δ onto the span of atom atom_k’s decoder tangents ∂g_k/∂t at the operating point t_at. A dictionary “predicts” exactly the component of a push it can carry along its learned surface; the transverse component is off-manifold and predicted to die (this is the same local model the off-manifold guard and the dosimetry chord trust, used in the same radius).
- steer_delta - Build a SteerPlan for driving atom atom_k from t_from to t_to.