Module steering

steer_delta — the steering primitive with output dosimetry: the actionable LLM payload of the SAE-manifold machine.

§What this computes

Given a fitted SaeManifoldTerm and the per-row output-Fisher RowMetric, a steering move is “drive atom k’s latent coordinate from t_from to t_to”. The atom’s decoder curve g_k(t) = Φ_k(t) B_k maps that latent move to an activation-space delta — the actual vector you add to the residual stream / reconstruction to realize the move on the manifold:

δ = a · ( g_k(t_to) − g_k(t_from) )          (the on-manifold move)

where a is the atom’s amplitude (how loudly the atom is expressed). This is the thing a downstream consumer adds to a hidden state.
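As a minimal sketch of the formula above, the following computes δ for a toy atom and adds it to a hidden state. Everything here is an assumption made for illustration: `decoder_curve` is a hypothetical stand-in for g_k(t) (a unit circle in two dimensions, not the real Φ_k(t) B_k), and `steering_delta` / `apply_delta` are hypothetical helpers, not this module's API.

```rust
/// Hypothetical stand-in for the decoder curve g_k(t): a unit circle in 2-D.
/// The real curve is Φ_k(t) B_k, evaluated by the fitted atom.
fn decoder_curve(t: f64) -> [f64; 2] {
    [t.cos(), t.sin()]
}

/// δ = a · (g_k(t_to) − g_k(t_from)): the amplitude-scaled chord of the curve.
fn steering_delta(a: f64, t_from: f64, t_to: f64) -> [f64; 2] {
    let from = decoder_curve(t_from);
    let to = decoder_curve(t_to);
    [a * (to[0] - from[0]), a * (to[1] - from[1])]
}

/// Add the steering delta to a hidden state in place, which is what a
/// downstream consumer would do with δ.
fn apply_delta(hidden: &mut [f64; 2], delta: &[f64; 2]) {
    for (h, d) in hidden.iter_mut().zip(delta.iter()) {
        *h += *d;
    }
}
```

For this toy curve, moving from t = 0 to t = π/2 with amplitude 2 gives a delta of approximately (−2, 2), up to floating-point rounding.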

§Dosimetry — how big is this push, in nats?

The headline number is the predicted output effect: how much behavioral change (in nats of KL on the model’s output distribution) the move induces. For a locally-quadratic output readout the KL of a parameter move Δ is ½ Δᵀ F Δ with F the output-Fisher information — exactly the inner product RowMetric carries. The dose is the Fisher quadratic form of the move, integrated along the decoder curve rather than read only at the endpoints:

predicted_nats = ½ ∫_{t_from}^{t_to} a² · g_k'(t)ᵀ M_n g_k'(t) dt

evaluated in small steps via the per-row pullback / fisher-mass methods. The path integral is the honest dose: it follows the curved surface, so a long arc that doubles back is not under-counted the way a straight endpoint chord would be.
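The integral as written can be approximated with a midpoint rule. The sketch below is a direct numerical reading of that formula under stated assumptions: `decoder_tangent` is a hypothetical analytic g_k'(t) for a toy unit-circle curve, the metric M_n is passed as a plain 2×2 array rather than obtained from RowMetric, and `path_dose` is a hypothetical helper. It does not reproduce how the module calls the pullback / fisher-mass methods.

```rust
type Vec2 = [f64; 2];
type Mat2 = [[f64; 2]; 2];

/// Hypothetical tangent g_k'(t) of a toy unit-circle decoder curve.
fn decoder_tangent(t: f64) -> Vec2 {
    [-t.sin(), t.cos()]
}

/// Quadratic form vᵀ M v.
fn quad_form(m: &Mat2, v: &Vec2) -> f64 {
    let mut acc = 0.0;
    for i in 0..2 {
        for j in 0..2 {
            acc += v[i] * m[i][j] * v[j];
        }
    }
    acc
}

/// Midpoint-rule approximation of ½ ∫ a² · g'(t)ᵀ M g'(t) dt.
/// Assumption: the step weight is |h|, so a backward move (t_to < t_from)
/// receives the same non-negative dose as the forward move.
fn path_dose(a: f64, m: &Mat2, t_from: f64, t_to: f64, steps: usize) -> f64 {
    assert!(steps > 0, "need at least one integration step");
    let h = (t_to - t_from) / steps as f64;
    let mut total = 0.0;
    for i in 0..steps {
        let t_mid = t_from + (i as f64 + 0.5) * h;
        total += quad_form(m, &decoder_tangent(t_mid)) * h.abs();
    }
    0.5 * a * a * total
}
```

Because the integrand is sampled along the curve, every segment of a curved path contributes to the total, which is the property the paragraph above relies on.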

§Validity radius — where local linearization stops being trusted

A consumer must know how far the move can be trusted as a linear push. The validity radius is the latent step size at which the path-integrated dose diverges from the straight endpoint quadratic form ½ a² δ̂ᵀ M_n δ̂ (the local-linear prediction, where δ̂ = g_k(t_to) − g_k(t_from) is the unscaled chord, so δ = a · δ̂) by more than [VALIDITY_DIVERGENCE_FRACTION]. Beyond it the surface has curved enough that the endpoint chord no longer represents the move. We report it; we do not silently clip to it.

§Off-manifold guard

δ is, by construction, a chord of the decoder curve, so it should lie in the atom’s local tangent/frame at t_from (up to second-order curvature). The off-manifold norm projects δ onto the span of the local decoder tangents ∂g_k/∂t at t_from and reports the residual norm — a self-check that the steering move stays on the learned surface. It is ≈ 0 for small steps and grows with arc curvature; a large value means the requested move left the manifold and the dose number is not to be trusted.
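The projection described above can be sketched as an orthogonal split of δ into a component inside the tangent span and a residual. This is an illustration under assumptions, not the module's implementation: it uses the plain Euclidean inner product (the real guard may use a different one), takes the tangents as explicit vectors, and `split_on_tangents` / `off_manifold_norm` are hypothetical names.

```rust
/// Euclidean dot product of two equal-length slices.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Split `delta` into (on-span, residual) with respect to the span of
/// `tangents`, using Gram-Schmidt to build an orthonormal basis first.
fn split_on_tangents(delta: &[f64], tangents: &[Vec<f64>]) -> (Vec<f64>, Vec<f64>) {
    let mut basis: Vec<Vec<f64>> = Vec::new();
    for t in tangents {
        let mut v = t.clone();
        // Remove the components already covered by earlier basis vectors.
        for b in &basis {
            let c = dot(&v, b);
            for (vi, bi) in v.iter_mut().zip(b) {
                *vi -= c * bi;
            }
        }
        let norm = dot(&v, &v).sqrt();
        // Skip (numerically) dependent tangents.
        if norm > 1e-12 {
            for vi in v.iter_mut() {
                *vi /= norm;
            }
            basis.push(v);
        }
    }
    let mut on = vec![0.0; delta.len()];
    for b in &basis {
        let c = dot(delta, b);
        for (oi, bi) in on.iter_mut().zip(b) {
            *oi += c * bi;
        }
    }
    let off: Vec<f64> = delta.iter().zip(&on).map(|(d, o)| d - o).collect();
    (on, off)
}

/// Norm of the part of `delta` that the tangent span cannot carry.
fn off_manifold_norm(delta: &[f64], tangents: &[Vec<f64>]) -> f64 {
    let (_, off) = split_on_tangents(delta, tangents);
    dot(&off, &off).sqrt()
}
```

For a toy unit-circle curve with tangent (0, 1) at t = 0, the chord to t = s is (cos s − 1, sin s), so the residual norm is 1 − cos s: close to zero for small steps and growing with the arc, matching the behavior described above.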

§Read-only / no loss contact

This module is a pure read over the fitted term and the metric. It calls only g_k(t) evaluation ([SaeManifoldAtom]’s decoder + installed [SaeBasisEvaluator]) and the criterion-facing RowMetric::fisher_mass / RowMetric::pullback. It never mutates the model and never touches a likelihood / criterion / penalty. The solver floor δ of RowMetric (not to be confused with the steering delta δ above) never enters any number it reports: the fisher-mass / pullback face is free of the solver floor (#747).

Structs§

SteerPlan: The actionable output of a steering query over one atom.

Functions§

predicted_response: The model’s predicted output-mean response to an applied activation push δ under the local-linear reading of its fitted surface: the projection of δ onto the span of atom atom_k’s decoder tangents ∂g_k/∂t at the operating point t_at. A dictionary “predicts” exactly the component of a push that it can carry along its learned surface; the transverse component is off-manifold and is predicted to die out. This is the same local model that the off-manifold guard and the dosimetry chord trust, and it holds within the same radius.
steer_delta: Build a SteerPlan for driving atom atom_k from t_from to t_to.
