# Rung 3 — interventions in the loop (calibration, guarded): DESIGN
Status: design only. Implementation is gated on (a) Rung-2's calibrated
two-block charts (unit-speed on the behavior block, Inc4) and (b) the DOSE
patch harness (model loading + activation splicing). This document pins the
estimator, the data contract, the Goodhart guards, and the Rust surfaces so
implementation is a translation exercise, not a design exercise.
## 1. What Rung 3 is (and is not)
Rungs 1 and 2 put behavior into the *estimator*: Rung 1 prices reconstruction
error in nats through the pulled-back output Fisher
(`MetricProvenance::BehavioralFisher`, `½ eᵀG e` with `G = JᵀFJ`); Rung 2 fits
a behavioral block jointly with the activation block, sharing gates and latent
coordinates, so the chart is oriented and unit-sped by how the output moves.
Both are **local**: they use the Fisher metric *at the data*, i.e. the
second-order Taylor expansion of KL around the clean activation.
Rung 3 closes the loop with **realized** behavior: sample `(token n, atom k,
dose Δt)`, decode the moved point `x'(Δt) = x_n + [g_k(t_nk + Δt) − g_k(t_nk)]
a_nk`, splice `x'` into the model at layer ℓ, run the rest of the network, and
record the realized `KL(p_clean ‖ p_patched)`. The model itself grades the
chart's currency.
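
To make the loop concrete, here is a minimal sketch of a single intervention under toy assumptions. The curve `g` (a `tanh`), the direction `a_nk`, the readout `W`, and `rest_of_network` are hypothetical stand-ins rather than anything in the DOSE harness; only the moved-point formula and the `KL(p_clean ‖ p_patched)` measurement come from the text above.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL(p ‖ q) in nats for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def g(t):
    # Hypothetical 1-d atom curve; the real g_k comes from the fitted chart.
    return math.tanh(t)

def moved_point(x, a, t, dt):
    """x'(dt) = x + [g(t + dt) - g(t)] * a."""
    step = g(t + dt) - g(t)
    return [xi + step * ai for xi, ai in zip(x, a)]

# Toy stand-in for "the rest of the network": linear readout, then softmax.
W = [
    [1.0, -0.5, 0.2],
    [0.3, 0.8, -1.0],
    [-0.7, 0.1, 0.6],
    [0.0, 0.4, 0.9],
]

def rest_of_network(x):
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

x_n = [0.5, -1.0, 0.25]   # clean activation (made up)
a_nk = [0.6, 0.0, 0.8]    # atom direction (made up)
t_nk = 0.1                # fitted coordinate of this token on the atom

p_clean = rest_of_network(x_n)
for dt in (0.0, 0.05, 0.5):
    p_patched = rest_of_network(moved_point(x_n, a_nk, t_nk, dt))
    print(f"dt={dt:<5} realized KL = {kl(p_clean, p_patched):.6f} nats")
```

In this toy the `dt = 0` splice reproduces `x_n` exactly, so its KL is zero; on a real model the control splices the decoded point and picks up reconstruction error and numerical noise.
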
Two modes, in order of priority:
* **Calibration mode (the default, and the only mode this design commits to):**
fit the map from *predicted* nats to *measured* nats and fold the correction
into the chart's coordinate speed. No gradients flow through the LM; the LM
is an oracle that is queried, never differentiated. The quadratic (Fisher)
prediction is accurate for small doses and degrades for large ones;
calibration measures exactly that degradation and re-speeds the chart so
`Δt = 0.1` means the same realized nats everywhere.
* **Training-gradient mode (explicitly out of scope here):** using realized KL
as a training signal. Deferred until calibration mode has demonstrated the
measurement is stable; any future design must re-derive the Goodhart
analysis of §4 from scratch for the training case.
What Rung 3 is **not**: it is not DAS. DAS searches an unconstrained supervised
rotation for a direction that moves a probe. Rung 3 restricts the
interventional query to **already-fitted, certificate-passing charts** produced
by the unsupervised, identifiable Rung-1/2 fit. The intervention never
*selects* structure (see guard G1); it only calibrates units on, and validates,
structure that earned its place through reconstruction + behavior evidence.
## 2. Predicted nats — the quantity being calibrated
Every intervention carries a prediction made *before* the model is queried,
from objects the fit already owns:
* **Rung-1 prediction (p-space):** with the per-row behavioral metric
`G_n = U_n U_nᵀ` (the s-probe sketch), a decoded move `Δx` predicts
`ν̂₁ = ½ Δxᵀ G_n Δx = ½ ‖U_nᵀ Δx‖²`. Computable for any Δx, needs only the
Rung-1 harvest shard.
* **Rung-2 prediction (chart-space):** with the behavior decoder `Ψ_k C_k` on
the √p-sphere tangent, a coordinate move Δt predicts
`ν̂₂ = 2 ‖Ψ_k'(t) C_k Δt‖²` (locally, KL ≈ 2‖Δq‖²). Under the Inc4
unit-speed gauge this is `ν̂₂ ≈ ‖Δt‖²·(unit speed)` — the "Δt is nats"
promise being tested.
Both predictions are recorded in every intervention record. Calibration mode
fits measured-vs-predicted for each; the *gap between the two predictions* is
itself a diagnostic (Rung-2's behavior decoder disagreeing with Rung-1's
pullback flags a chart whose behavioral block under-fit).
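
The two predictions reduce to short arithmetic once the fitted objects are in hand. The sketch below is illustrative only: `U`, `dx`, and the probability vectors are made-up numbers, and the function names are hypothetical. It checks that the sketch form `½‖UᵀΔx‖²` agrees with the dense form `½ Δxᵀ G Δx`, and that `2‖Δq‖²` on the √p-sphere tracks the exact KL for a small move.

```python
import math

def nu_hat_1(U, dx):
    """Rung-1 prediction 0.5 * ||U^T dx||^2, with U given as d rows of length s."""
    s = len(U[0])
    proj = [sum(U[i][j] * dx[i] for i in range(len(dx))) for j in range(s)]
    return 0.5 * sum(v * v for v in proj)

def nu_hat_1_dense(U, dx):
    """Same quantity through the explicit metric G = U U^T, for comparison."""
    d, s = len(dx), len(U[0])
    G = [[sum(U[i][k] * U[j][k] for k in range(s)) for j in range(d)]
         for i in range(d)]
    return 0.5 * sum(dx[i] * G[i][j] * dx[j] for i in range(d) for j in range(d))

def kl(p, q):
    """KL(p ‖ q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def nu_hat_2_from_dq(q_clean, q_moved):
    """Chart-space style prediction 2 * ||dq||^2 with q = sqrt(p)."""
    return 2.0 * sum((b - a) ** 2 for a, b in zip(q_clean, q_moved))

U = [[0.9, 0.1], [-0.2, 0.5], [0.3, -0.4]]   # made-up 3 x 2 s-probe sketch
dx = [0.02, -0.01, 0.03]                     # made-up decoded move

p = [0.5, 0.3, 0.2]
p_moved = [0.51, 0.295, 0.195]               # small move on the simplex
q = [math.sqrt(v) for v in p]
q_moved = [math.sqrt(v) for v in p_moved]

print("nu_hat_1 (sketch):", nu_hat_1(U, dx))
print("nu_hat_1 (dense): ", nu_hat_1_dense(U, dx))
print("2||dq||^2:        ", nu_hat_2_from_dq(q, q_moved))
print("exact KL:         ", kl(p, p_moved))
```

The sketch form never materializes the `d × d` metric, which is why it needs only the harvest shard.
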
## 3. The calibration estimator (all existing gam machinery)
Let `ν` = realized KL (measured, nats) and `ν̂` = predicted nats for one
intervention. The calibration model is a GAM — our own machinery, REML-fitted,
nothing new:
```text
log ν_i = β₀ + f(log ν̂_i) + b_{k(i)} + ε_i
```
* `f` — a monotone smooth (existing monotone-smooth machinery), capturing the
systematic quadratic-approximation decay with dose. Its departure from the
identity IS the calibration curve.
* `b_k` — a per-atom random effect (an ordinary penalized factor term): atom
k's log speed error. The fitted `exp(b_k/2)` is the **chart re-speed
factor** `s_k` folded back into the chart: `t ← s_k · t` (a scalar per
1-d atom; for d>1, a per-axis version of the same, one random effect per
axis). This is the ONLY thing calibration writes back (guard G1).
* Family: Gaussian on log-KL to first order; the realized-KL measurement noise
floor (§5, the Δt=0 controls) enters as a known lower bound on the response
variance, not as a tuned constant.
REML selects every smoothing/shrinkage level. No grid, no magic constants, no
wall-clock budgets — SPEC-compliant by construction because the estimator IS a
gam fit.
**Why log-log:** the quadratic prediction is asymptotically exact as dose→0 (measured/predicted → 1), so the curve
passes through the identity at small ν̂ and bends below it at large ν̂ (the
Fisher over-predicts once the softmax saturates). Log-log makes both the
identity anchoring and the bend low-order.
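
The re-speed arithmetic can be illustrated with a deliberately simplified stand-in for the estimator. This is not the REML-fitted GAM: it assumes the small-dose regime where `f` is the identity and `β₀ = 0`, and it estimates `b_k` as an unpenalized per-atom mean of `log ν − log ν̂`. The function name and the synthetic records are hypothetical; only the relation `s_k = exp(b_k/2)` is taken from the text.

```python
import math
from collections import defaultdict

def respeed_factors(atom, nu_hat, nu_measured):
    """Per-atom s_k = exp(b_k / 2), b_k = mean of log(nu) - log(nu_hat).

    Simplification: f is the identity, beta_0 = 0, and b_k is a plain mean
    rather than a REML-shrunk random effect.
    """
    resid = defaultdict(list)
    for k, pred, meas in zip(atom, nu_hat, nu_measured):
        resid[k].append(math.log(meas) - math.log(pred))
    return {k: math.exp(sum(r) / len(r) / 2.0) for k, r in resid.items()}

# Synthetic records: atom 0 is correctly sped, atom 1 moves 1.5x faster
# than its chart claims.
true_speed = {0: 1.0, 1: 1.5}
doses = [0.01, 0.02, 0.05, 0.1]
atom, nu_hat, nu_measured = [], [], []
for k, s in true_speed.items():
    for dt in doses:
        atom.append(k)
        nu_hat.append(dt ** 2)             # unit-speed promise: nats = dt^2
        nu_measured.append((s * dt) ** 2)  # realized nats under the true speed

s_k = respeed_factors(atom, nu_hat, nu_measured)
print(s_k)
```

The halving in the exponent is the point: nats scale as `Δt²`, so a log-nats error of `b_k` corresponds to a coordinate rescale of `exp(b_k/2)`.
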
## 4. The Goodhart guard (the reason for the structure)
"Predicted = measured" has a degenerate global optimum: prefer atoms where
both are zero. A fit allowed to *select* on calibration error will fill the
dictionary with behaviorally inert atoms that calibrate perfectly. Three
structural guards, all load-bearing:
* **G1 — the causal signal selects nothing.** Atom birth/death/gating stays
entirely Rung-1/2 evidence (reconstruction + behavior block, REML). Rung-3
writes back exactly one object: the per-atom coordinate re-speed `s_k`
(a gauge transformation — it changes units, not structure, not membership,
not decoders, not gates). Enforced by construction: the calibration output
type carries only `s_k` and diagnostics; there is no API path from realized
KL to the fit criterion.
* **G2 — the held-out intervention set is never trained on, eval forever.**
Interventions are split at the *document/question* level (matching the
harvest split manifest), before any calibration fit. The held-out half never
enters any fit, ever, across refits — it is the standing measurement of
"predicted nats mean what they say", reported as held-out calibration error
(nats, and as a fraction of realized effect).
* **G3 — min-effect floors, estimated not chosen.** An atom enters the
calibration fit only if its realized effect at the reference dose exceeds
the measurement floor. The floor is NOT a constant: it is the null
distribution of the measurement itself, estimated from **Δt = 0 control
interventions** (splice the *unmoved* decoded point; any nonzero measured
KL is reconstruction error + numerical noise). The floor is a quantile of
that null (the same one-sided evidence convention the certificates use).
Atoms below floor are reported as "unmeasurable at reference dose" — a
finding (possibly dormant/inert), never silently calibrated.
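
A rough sketch of the G3 screen, under stated assumptions: the floor is an empirical quantile of the `Δt = 0` control KLs, and an atom is measurable only if its reference-dose effect exceeds it. The quantile level `LEVEL` is a placeholder for whatever one-sided convention the certificates use, and the function names and numbers are invented for illustration.

```python
import math

def empirical_quantile(values, level):
    """Linear-interpolation quantile of a non-empty sample, level in [0, 1]."""
    xs = sorted(values)
    pos = level * (len(xs) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def screen_atoms(control_kl, reference_effect, level):
    """Split atoms into measurable / unmeasurable against the control null."""
    floor_nats = empirical_quantile(control_kl, level)
    measurable = {k for k, nu in reference_effect.items() if nu > floor_nats}
    unmeasurable = set(reference_effect) - measurable
    return floor_nats, measurable, unmeasurable

# Made-up control KLs (nats) from dt = 0 splices, and made-up per-atom
# realized effects at the reference dose.
control_kl = [1e-6, 3e-6, 2e-6, 5e-6, 4e-6]
reference_effect = {0: 2e-3, 1: 4.5e-6, 2: 8e-4}

LEVEL = 0.95  # hypothetical; stands in for the certificates' convention
floor_nats, measurable, unmeasurable = screen_atoms(
    control_kl, reference_effect, LEVEL
)
print("floor (nats):", floor_nats)
print("measurable:", sorted(measurable))
print("unmeasurable at reference dose:", sorted(unmeasurable))
```

Below-floor atoms are returned as their own set rather than dropped, so they can be reported as a finding.
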
## 5. Experimental design (what gets sampled)
* **Tokens:** the designed subsample discipline of the two-tier harvest
(`RowSamplingMeasure::designed_subsample`), reused verbatim. Calibration is
an estimation role: it needs a few thousand designed rows, not the corpus.
* **Atoms:** all atoms above the G3 floor screen (screening uses a pilot dose
at each atom, controls included).
* **Doses:** per-atom, at fixed quantiles of that atom's fitted coordinate
distribution `t_k` (e.g. moves spanning the interquartile range of occupied
chart territory). Quantile placement is measurement design tied to the data
distribution, not a hyperparameter search; the dose ladder is logarithmic in
predicted nats so the calibration curve is identified across scales.
* **Controls:** every batch interleaves Δt = 0 splices (the G3 null) and
repeat-doses (measurement repeatability).
* **Budget:** expressed in intervention *count* (a few thousand forward
passes), never wall-clock. This is the "short interventional phase" — the
e2e lesson that a few percent of total budget spent on interventions buys
the calibration, and more buys nothing further.
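
One possible reading of the dose design, sketched with hypothetical names: the largest move spans the interquartile range of the atom's coordinates, and the ladder is geometric in predicted nats, taking `ν̂ = Δt²` under the unit-speed gauge. The lower end `nu_hat_min` and the batch layout (one control, one repeat) are assumptions made for illustration.

```python
import math

def quantile(values, level):
    """Linear-interpolation quantile of a non-empty sample."""
    xs = sorted(values)
    pos = level * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def dose_ladder(t_samples, n_rungs, nu_hat_min):
    """Doses whose predicted nats (dt^2 under unit speed) are log-spaced.

    The top rung spans the interquartile range of the atom's coordinates;
    nu_hat_min is a hypothetical lower end, e.g. tied to the control floor.
    Requires n_rungs >= 2.
    """
    iqr = quantile(t_samples, 0.75) - quantile(t_samples, 0.25)
    nu_hat_max = iqr ** 2
    ratio = (nu_hat_max / nu_hat_min) ** (1.0 / (n_rungs - 1))
    return [math.sqrt(nu_hat_min * ratio ** j) for j in range(n_rungs)]

def batch_with_controls(ladder):
    """One batch: each rung once, plus a dt = 0 control and one repeat dose."""
    return list(ladder) + [0.0, ladder[-1]]

t_k = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.5, 0.9, 1.6, 2.0]  # made-up coordinates
ladder = dose_ladder(t_k, n_rungs=5, nu_hat_min=1e-4)
batch = batch_with_controls(ladder)
print("ladder:", [round(d, 4) for d in ladder])
print("batch: ", [round(d, 4) for d in batch])
```

Because the ladder is placed from quantiles of `t_k`, two atoms with different occupied ranges get different doses from the same rule.
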
## 6. Data contract (mirrors the harvest shard discipline)
One `.npz` intervention shard, emitted by the Python patch runner (the
model-interaction boundary, DOSE's harness), consumed by the Rust calibration
fit:
```text
row_id (m,) int64 — corpus row of the token
atom (m,) int64 — atom index k
dose (m, d) f64 — Δt applied (0 for controls)
nu_hat_1 (m,) f64 — Rung-1 predicted nats (½‖UᵀΔx‖²)
nu_hat_2 (m,) f64 — Rung-2 predicted nats (behavior decoder), NaN if no y-block
nu_measured (m,) f64 — realized KL(clean‖patched), nats
group (m,) int64 — document/question id (the G2 split unit)
is_control (m,) bool — Δt = 0 splice
layer, seed scalars
```
The patch runner reuses the splice hook of `_capture_activations`
(`gamfit/torch/harvest.py`): patching IS the splice path the downstream
harvest already exercises, with the spliced row now `x + Δ` instead of a probe.
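
As an illustration of what boundary validation might look like, the sketch below checks a shard held as a plain dict of Python lists, standing in for the arrays a real `.npz` loader would return. The specific checks and the `validate_shard` name are guesses at reasonable schema asserts, not the contract itself; the column names come from the listing above.

```python
import math

REQUIRED = ("row_id", "atom", "dose", "nu_hat_1", "nu_hat_2",
            "nu_measured", "group", "is_control")

def validate_shard(shard):
    """Return a list of human-readable schema violations (empty if none)."""
    missing = [name for name in REQUIRED if name not in shard]
    if missing:
        return [f"missing columns: {missing}"]
    m = len(shard["row_id"])
    errors = [f"{name}: length {len(shard[name])} != {m}"
              for name in REQUIRED if len(shard[name]) != m]
    if errors:
        return errors
    for i in range(m):
        dose_is_zero = all(v == 0.0 for v in shard["dose"][i])
        if shard["is_control"][i] != dose_is_zero:
            errors.append(f"row {i}: is_control disagrees with dose")
        nu = shard["nu_measured"][i]
        # Assumed check: KL is non-negative and must be finite.
        if not math.isfinite(nu) or nu < 0.0:
            errors.append(f"row {i}: nu_measured must be finite and >= 0")
        if not math.isfinite(shard["nu_hat_1"][i]):
            errors.append(f"row {i}: nu_hat_1 must be finite")
        # nu_hat_2 may legitimately be NaN when there is no y-block.
    return errors

good = {
    "row_id": [10, 10, 11],
    "atom": [0, 0, 2],
    "dose": [[0.0], [0.1], [0.05]],
    "nu_hat_1": [0.0, 0.011, 0.003],
    "nu_hat_2": [0.0, 0.010, float("nan")],
    "nu_measured": [2e-6, 0.0105, 0.0029],
    "group": [3, 3, 4],
    "is_control": [True, False, False],
}
bad = dict(good, is_control=[False, False, False])

print(validate_shard(good))
print(validate_shard(bad))
```
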
## 7. Rust surfaces (reusable, small)
* `intervention_shard.rs` — load/validate the shard (the `load_harvest_shard`
discipline: f32/f64 promotion at the boundary, schema asserts, group-level
split with a persisted, seeded manifest).
* `calibration_fit.rs` — assemble the §3 GAM from a shard's training split and
fit with the existing engine; output type
`ChartCalibration { respeed: Vec<f64> /* s_k per atom-axis */,
curve: MonotoneSmoothSummary, floor_nats: f64,
heldout_error: CalibrationHeldout }`. No method on it can touch a term's
gates/decoders — re-speed applies through the same chart-transfer path
`chart_canonicalization` uses for gauge moves (it IS a gauge move).
* `calibration_certificate.rs` — the G2 held-out report + G3 floor provenance,
attached beside the existing fit certificates.
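
The group-level split is the piece of these surfaces most easily shown in a few lines. The sketch below is one way to get a seeded, refit-stable assignment, by hashing `(seed, group)`. The hashing scheme and function names are assumptions, the split is only approximately half and half, and a real implementation would persist the manifest rather than recompute it.

```python
import hashlib

def is_heldout(group_id, seed):
    """Deterministic assignment of a group to the held-out side."""
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).digest()
    return digest[0] % 2 == 1

def split_manifest(groups, seed):
    """Map each distinct group id to 'heldout' or 'train'."""
    return {
        g: ("heldout" if is_heldout(g, seed) else "train")
        for g in sorted(set(groups))
    }

def split_rows(groups, manifest):
    """Row indices for each side; a group never straddles the split."""
    train = [i for i, g in enumerate(groups) if manifest[g] == "train"]
    heldout = [i for i, g in enumerate(groups) if manifest[g] == "heldout"]
    return train, heldout

groups = [3, 3, 4, 5, 5, 5, 8]   # made-up document/question ids per row
manifest = split_manifest(groups, seed=7)
train, heldout = split_rows(groups, manifest)
print(manifest)
print("train rows:", train, "heldout rows:", heldout)
```

Because the assignment depends only on the seed and the group id, adding new rows for an existing document cannot move that document across the split.
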
## 8. Acceptance tests (planned with the implementation)
1. **Synthetic oracle:** a toy "model" whose true KL under splice is computable
in closed form; check the calibration recovers a known per-atom speed
distortion `s_k` and that held-out error shrinks accordingly.
2. **Guard tests:** (G1) the public API provably cannot route realized KL into
the fit criterion — type-level check plus a test that a calibration run
leaves gates/decoders/membership bit-identical; (G2) the split manifest is
stable across refits; (G3) the floor equals the Δt=0 null quantile, and
below-floor atoms are excluded with the "unmeasurable" tag.
3. **Consistency:** as dose→0, measured/predicted → 1 within the control-null
band (the Fisher metric is the correct local limit — this doubles as an
end-to-end validation of the Rung-1 sketch on the real model).
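
The consistency test can be previewed on a toy softmax "model" where both sides are computable. This is a sketch under assumptions: the intervention acts directly on logits, so the Fisher metric is `diag(p) − ppᵀ`, and the logits `z` and `direction` are made-up numbers. It shows the ratio measured/predicted approaching 1 as the dose shrinks.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL(p ‖ q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def fisher_quadratic(p, delta):
    """0.5 * delta^T (diag(p) - p p^T) delta, the local KL prediction."""
    mean = sum(pi * di for pi, di in zip(p, delta))
    var = sum(pi * (di - mean) ** 2 for pi, di in zip(p, delta))
    return 0.5 * var

z = [1.0, 0.2, -0.5, 0.0]            # made-up clean logits
direction = [0.8, -0.3, 0.5, -1.0]   # made-up logit move per unit dose
p_clean = softmax(z)

ratios = []
for dose in (1.0, 0.1, 0.01):
    delta = [dose * d for d in direction]
    measured = kl(p_clean, softmax([zi + di for zi, di in zip(z, delta)]))
    predicted = fisher_quadratic(p_clean, delta)
    ratios.append(measured / predicted)
    print(f"dose={dose:<5} measured/predicted = {ratios[-1]:.4f}")
```

On the real model the acceptance criterion would compare the small-dose ratio against the control-null band rather than a fixed tolerance.
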