// gam_terms/inference/structure_evidence.rs
//! Anytime-valid structure discovery: e-process gates, universal-inference
//! atom tests, e-BH error control, and KL-optimal steering probes.
//!
//! Interpretability as sequential experimental design.
//!
//! # The thesis
//!
//! The dictionary-learning stack (#974–#981) DISCOVERS structure: atom
//! birth/death/fission/fusion (#976), geometry adjudication — circle vs
//! clusters vs line (#907), feature binding (#975). Today those decisions
//! are made by evidence heuristics (likelihood-ratio ladders, BIC-flavored
//! gates). Three facts, never previously combined, say that is not merely
//! informal but WRONG in a specific, fixable way — and that fixing it
//! upgrades the whole capstone from observational description to
//! error-controlled experimental science:
//!
//! 1. **Atom existence is a NON-REGULAR testing problem.** "Does a K+1-th
//!    dictionary atom exist?" is testing K vs K+1 mixture components — the
//!    textbook boundary/loss-of-identifiability case where the classical
//!    likelihood-ratio χ² asymptotics FAIL (the null sits on the boundary
//!    of the alternative; the nuisance parameters of the new atom vanish
//!    under the null — Davies' problem). Every SAE/dictionary paper that
//!    thresholds a likelihood improvement is running this broken test.
//!    **Universal inference** (Wasserman–Ramdas–Balakrishnan 2020) is the
//!    modern resolution: a split-likelihood-ratio e-value that is valid in
//!    finite samples with NO regularity conditions whatsoever — exactly
//!    the irregular regime atom birth lives in.
//!
//! 2. **Discovery happens on streams with optional stopping.** Dictionaries
//!    are trained until the features "look right" — data-dependent
//!    stopping that p-hacks any fixed-sample test by construction.
//!    **E-processes** (nonnegative supermartingales under the null,
//!    `E[E_τ] ≤ 1` at every stopping time) are immune: by Ville's
//!    inequality `P(sup_t E_t ≥ 1/α) ≤ α`, the guarantee survives stopping
//!    whenever you like, peeking included, streaming corpora (#973)
//!    included. Evidence is a RUNNING PRODUCT, resumable across shards.
//!
//! 3. **This laboratory is INTERVENTIONAL.** The steering primitive with
//!    output dosimetry and a validity radius is landed
//!    (`crate::inference::steering`); the per-token output-Fisher harvest
//!    (#980) gives the local information geometry of the model's output.
//!    So "which probe next?" is not a vibe — it is OPTIMAL EXPERIMENTAL
//!    DESIGN: choose the steering intervention that maximizes the expected
//!    log-growth of the e-process deciding a contested structural claim.
//!    Under the local Gaussian output-Fisher model, that growth rate IS a
//!    KL divergence with a closed form (below). The model chooses its own
//!    next experiment, optimally, inside its certified validity radius.
//!
//! The deliverable shape: **every discovered atom ships an e-value; every
//! dictionary ships an e-BH FDR certificate over its claimed structure;
//! contested claims get design-optimal probes until evidence resolves.**
//! To our knowledge, no other interpretability stack has finite-sample,
//! optional-stopping-safe error control over its discovered structure.
//! This module is the statistical substrate; the SAE structure search
//! plugs its gates in.
//!
//! The instruments, bottom-up: [`EProcess`] (running evidence, Ville
//! semantics) → [`PredictablePluginEProcess`] (streaming universal
//! inference) → [`AtomBirthGate`] + [`run_atom_birth_gate`] (the K vs K+1
//! gate with demote-never-reject [`GateVerdict`]s; the runner enforces
//! the predictability contract by call order) → [`StructureLedger`] (one
//! e-process per claim, serializable across #973 shards) →
//! [`StructureLedger::certify`] (the e-BH [`StructureCertificate`],
//! shipped beside the gauge report via
//! `crate::terms::sae::identifiability::dictionary_report`) →
//! [`plan_probe_for_contested_claim`] (the design loop: contested claims
//! get a [`ProbePlan`] whose δ runs through
//! `crate::inference::steering::steer_delta` and whose per-hypothesis
//! μ₀/μ₁ come from `crate::inference::steering::predicted_response`).
//!
//! # The math, fixed here so implementations cannot drift
//!
//! **E-value / e-process.** A nonnegative statistic `E` with `E_{H0}[E] ≤ 1`.
//! Calibration to tests: reject at level α when `E ≥ 1/α` (Markov). An
//! e-PROCESS compounds multiplicatively: `E_t = Π_{s≤t} e_s` where each
//! `e_s` is conditionally valid given the past (`E[e_s | F_{s−1}] ≤ 1`
//! under H0). Ville: `P_{H0}(∃t: E_t ≥ 1/α) ≤ α` — anytime validity.
//! Always accumulate in log space; evidence products underflow doubles.
//!
//! **Universal inference (the atom-birth test).** Split the data (or the
//! token stream) into D₀ (evaluation) and D₁ (estimation). Fit the
//! K+1-atom alternative on D₁ by ANY method (the production fitter, warm
//! starts, GPU, anything — no conditions). Fit the K-atom null by
//! CONSTRAINED MLE ON D₀ (this is the one side that must be honest). Then
//!
//! ```text
//!   E = L(θ̂₁ ; D₀) / sup_{θ ∈ H0} L(θ ; D₀)
//! ```
//!
//! satisfies `E_{H0}[E] ≤ 1` in finite samples, mixtures and boundaries
//! and all (the proof is three lines of Markov + tower; no asymptotics).
//! Sequential version: at each batch t, the alternative plug-in is fit on
//! data BEFORE t (predictable), the null sup is over the batch; the
//! product is an e-process. This is `SplitLikelihoodEValue` /
//! `PredictablePluginEProcess` below, generic over log-likelihood
//! closures so the SAE stack passes its own (manifold likelihoods,
//! superposition-aware residual models #974, whatever exists then).
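//!
//! Validity sketch for the single-split e-value (the standard argument,
//! written out for reference; it assumes D₀ is independent of D₁, writes
//! θ* ∈ H0 for the true parameter and p_θ for the density of D₀):
//!
//! ```text
//!   sup_{θ ∈ H0} L(θ ; D₀) ≥ L(θ* ; D₀)
//!       ⟹  E ≤ L(θ̂₁ ; D₀) / L(θ* ; D₀)
//!
//!   E_{θ*}[ L(θ̂₁ ; D₀) / L(θ* ; D₀) | D₁ ]
//!       = ∫ (p_{θ̂₁}(x) / p_{θ*}(x)) p_{θ*}(x) dx
//!       ≤ ∫ p_{θ̂₁}(x) dx = 1
//!
//!   E_{θ*}[E] = E_{θ*}[ E_{θ*}[E | D₁] ] ≤ 1        (tower)
//!   P_{θ*}(E ≥ 1/α) ≤ α · E_{θ*}[E] ≤ α             (Markov)
//! ```
//!
//! Nothing in these lines uses how θ̂₁ was obtained, which is why the
//! alternative fitter carries no conditions, while an under-maximized null
//! breaks the first inequality.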
//!
//! **Bayes factors are e-values (the #907 bridge).** A Bayes factor
//! `BF = ∫ L(θ) dΠ₁(θ) / L_{H0}` with a FIXED (data-independent) prior Π₁
//! and a SIMPLE (or sup-dominated) null has `E_{H0}[BF] ≤ 1` — the #907
//! geometry-adjudication harness (circle vs clusters vs line, with its
//! discrete-mixture null) is therefore ONE PRIOR-FREEZE away from anytime
//! validity. The integration contract: route its per-batch BFs through
//! [`EProcess::absorb`] instead of comparing a final BF to a threshold,
//! and geometry claims inherit optional-stopping safety for free.
//!
//! **e-BH (the dictionary certificate).** Given e-values e_1..e_m for m
//! structural claims (one per atom/edge/binding), sort descending and
//! reject the top k* where `k* = max{ k : e_(k) ≥ m/(α·k) }`. This
//! controls FDR ≤ α under ARBITRARY dependence between the e-values
//! (Wang–Ramdas 2022) — no independence assumptions about atoms sharing
//! tokens, which is good because they all share every token. That
//! arbitrary-dependence robustness is WHY the certificate uses e-BH and
//! not a p-value BH: the p-version needs PRDS, which atom statistics
//! flagrantly violate.
//!
//! **Design-optimal probing.** For a contested claim with competing
//! structural hypotheses H₀/H₁ (e.g. "feature f is one curved atom" vs
//! "two flat atoms"), a candidate steering intervention δ (within its
//! certified validity radius, `steering.rs`) produces predicted output
//! distributions P₀^δ, P₁^δ. The expected per-observation log-growth of
//! the likelihood-ratio e-process under H₁ is exactly `KL(P₁^δ ‖ P₀^δ)`,
//! so the optimal next experiment is
//!
//! ```text
//!   δ* = argmax_{δ : ‖δ‖ ≤ r_valid}  KL(P₁^δ ‖ P₀^δ)
//!       ≈ argmax_δ  ½ (μ₁(δ) − μ₀(δ))ᵀ F (μ₁(δ) − μ₀(δ))
//! ```
//!
//! under the local Gaussian model with output-Fisher metric F (#980
//! harvest) — the SAME quadratic form the steering dosimetry already
//! computes, repurposed: dosimetry measures nats delivered, design
//! maximizes nats of DISCRIMINATION. Probes that maximize raw effect are
//! not optimal; probes that maximize the *disagreement between the
//! hypotheses' predictions*, weighted by output information, are. (Full
//! KL-optimal design over the probe manifold is a research arc; the greedy
//! quadratic rule below is the correct first instrument and is exact in
//! the local model.)
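//!
//! Where the quadratic form comes from (a worked step, assuming both
//! hypotheses predict Gaussian outputs with the same covariance F⁻¹ and
//! writing Δ = μ₁(δ) − μ₀(δ); p₀, p₁ are the two predictive densities):
//!
//! ```text
//!   log p₁(y)/p₀(y) = −½ (y − μ₁)ᵀ F (y − μ₁) + ½ (y − μ₀)ᵀ F (y − μ₀)
//!                   = Δᵀ F (y − μ₀) − ½ Δᵀ F Δ
//!
//!   E_{y ~ P₁}[ log p₁(y)/p₀(y) ] = Δᵀ F Δ − ½ Δᵀ F Δ = ½ Δᵀ F Δ
//! ```
//!
//! As a rough planning rule under the same assumptions, n independent
//! probe outcomes add about `n · ½ ΔᵀFΔ` nats of expected log-evidence, so
//! a probe with twice the quadratic form needs roughly half the
//! observations to reach `log(1/α)`.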
//!
//! # What this kills
//!
//! - Birth/death gates that are threshold heuristics → replaced by
//!   anytime-valid tests with declared error rates (#976's "detectable,
//!   correctable misspecification" becomes a literal hypothesis test).
//! - "We trained until the features looked interpretable" → optional
//!   stopping is SAFE; the certificate survives it.
//! - "We found N features" → "we found N features at FDR ≤ α, certificate
//!   attached, reproducible from the e-value ledger."
//! - Probe selection by intuition → probe selection by information.

use ndarray::{Array1, Array2};
use serde::{Deserialize, Serialize};

/// Running anytime-valid evidence against one null hypothesis, in log
/// space. Multiplicative absorption of conditionally-valid e-values;
/// Ville's inequality converts the running product into a sequential test
/// that survives optional stopping. Serializable so evidence is resumable
/// across corpus shards (#973): persist, reload, keep absorbing.
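/**
Example (a standalone sketch, not this module's API): the hypothetical
`LogEvidence` type below re-implements only the log-space sum and the
running-supremum rule described above, so it runs without the crate. The
stream (five batches with e-value 2, then 2000 batches with e-value 0.5) is
made up for illustration.

```rust
/// Hypothetical stand-in for the log-space accumulator (illustration only).
struct LogEvidence {
    log_e: f64,
    log_e_max: f64,
}

impl LogEvidence {
    fn new() -> Self {
        Self { log_e: 0.0, log_e_max: 0.0 }
    }

    /// Multiply the running product by one e-value, in log space.
    fn absorb_log(&mut self, log_e_value: f64) {
        self.log_e += log_e_value;
        self.log_e_max = self.log_e_max.max(self.log_e);
    }

    /// Ville threshold on the running supremum: sup_t E_t >= 1/alpha.
    fn rejects_at(&self, alpha: f64) -> bool {
        alpha > 0.0 && self.log_e_max >= -alpha.ln()
    }
}

/// Feed `wins` batches with e-value 2, then `losses` batches with e-value 0.5.
fn run_stream(wins: usize, losses: usize) -> LogEvidence {
    let mut ev = LogEvidence::new();
    for _ in 0..wins {
        ev.absorb_log(2.0_f64.ln());
    }
    for _ in 0..losses {
        ev.absorb_log(0.5_f64.ln());
    }
    ev
}

fn main() {
    let ev = run_stream(5, 2000);
    // 2^5 = 32 >= 1/0.05 = 20: the supremum crossed, so the claim stays
    // proven even though the current evidence has since collapsed.
    println!("rejects at 0.05: {}", ev.rejects_at(0.05));
    println!("current log E: {}", ev.log_e);
    // The same stream as a raw f64 product underflows to zero.
    let raw = 2.0_f64.powi(5) * 0.5_f64.powi(2000);
    println!("raw product: {raw}");
}
```

The raw product loses the evidence entirely, while the log-space sum stays
finite and the early crossing survives the later retreat.
*/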
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct EProcess {
    /// log E_t — the running log-evidence. Starts at 0 (E_0 = 1).
    log_e: f64,
    /// Number of absorbed batches (the ledger length).
    steps: usize,
    /// Running maximum of log E_t — Ville's inequality applies to the
    /// supremum, so a claim once proven at level α stays proven even if
    /// later evidence retreats (evidence is not p-hackable in reverse).
    log_e_max: f64,
}

impl EProcess {
    pub fn new() -> Self {
        Self {
            log_e: 0.0,
            steps: 0,
            log_e_max: 0.0,
        }
    }

    /// Absorb one conditionally-valid e-value (NOT in log space; must be
    /// ≥ 0; `E[e | past] ≤ 1` under H0 is the caller's contract — e.g. a
    /// universal-inference batch ratio or a fixed-prior Bayes factor).
    pub fn absorb(&mut self, e_value: f64) -> Result<(), String> {
        if e_value.is_nan() || e_value < 0.0 {
            return Err(format!("e-value must be in [0, ∞], got {e_value}"));
        }
        self.absorb_log(e_value.ln())
    }

    /// Absorb a batch e-value supplied in log space (the only numerically
    /// honest interface for long streams).
    pub fn absorb_log(&mut self, log_e_value: f64) -> Result<(), String> {
        let next_log_e = checked_log_e_sum(self.log_e, log_e_value)?;
        self.log_e = next_log_e;
        self.steps += 1;
        if self.log_e > self.log_e_max {
            self.log_e_max = self.log_e;
        }
        Ok(())
    }

    pub fn log_evidence(&self) -> f64 {
        self.log_e
    }

    pub fn steps(&self) -> usize {
        self.steps
    }

    /// Anytime-valid rejection at level α: by Ville,
    /// `P_{H0}(sup_t E_t ≥ 1/α) ≤ α`, so crossing 1/α at ANY time —
    /// including data-dependent stopping times — proves the claim with
    /// type-I error ≤ α. Uses the running supremum: once crossed, always
    /// rejected.
    pub fn rejects_at(&self, alpha: f64) -> bool {
        alpha > 0.0 && self.log_e_max >= -(alpha.ln())
    }

    /// The e-value to hand to [`e_benjamini_hochberg`] for the
    /// dictionary-level FDR certificate (current evidence, not the sup —
    /// e-BH's guarantee is stated for e-values at the chosen stopping
    /// time).
    pub fn current_e_value_log(&self) -> f64 {
        self.log_e
    }
}

impl Default for EProcess {
    fn default() -> Self {
        Self::new()
    }
}

fn checked_log_e_sum(current: f64, increment: f64) -> Result<f64, String> {
    if current.is_nan() {
        return Err("EProcess invariant violation: current log evidence is NaN".to_string());
    }
    if increment.is_nan() {
        return Err("log e-value must not be NaN".to_string());
    }
    if current.is_infinite()
        && increment.is_infinite()
        && current.is_sign_positive() != increment.is_sign_positive()
    {
        return Err(format!(
            "cannot combine opposing infinite log e-values: current {current}, increment {increment}"
        ));
    }
    Ok(current + increment)
}

/// One universal-inference (split-likelihood-ratio) e-value: finite-sample
/// valid with NO regularity conditions — the correct instrument for atom
/// birth (K vs K+1 components, boundary/Davies regime where χ² fails).
///
/// `log_lik_alternative_on_eval`: log-likelihood of the EVALUATION fold
/// under the alternative fitted on the ESTIMATION fold (any fitter — the
/// production manifold/SAE fit, warm-started, GPU; zero conditions on it).
/// `log_lik_null_sup_on_eval`: the SUPREMUM of the evaluation-fold
/// log-likelihood over the NULL model class (the honest side: a real
/// constrained fit on the eval fold, e.g. the K-atom dictionary refit on
/// D₀). Then `log E = ℓ_alt(D₀) − sup_{H0} ℓ(D₀)` and `E_{H0}[E] ≤ 1`
/// exactly.
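/**
Example (a standalone sketch under toy assumptions, not the production
likelihoods): a unit-variance Gaussian mean with the simple null
`H0: mu = 0`, so the sup over the null is just the null density. The helper
names `gaussian_log_lik` and `split_lr_log_e` and the data are hypothetical.

```rust
/// Log-likelihood of `xs` under N(mu, 1) (toy model, assumed for illustration).
fn gaussian_log_lik(xs: &[f64], mu: f64) -> f64 {
    let ln_2pi = (2.0 * std::f64::consts::PI).ln();
    xs.iter().map(|x| -0.5 * ln_2pi - 0.5 * (x - mu).powi(2)).sum()
}

/// Split-likelihood-ratio log e-value for H0: mu = 0 against a free mean.
fn split_lr_log_e(estimation: &[f64], evaluation: &[f64]) -> f64 {
    // Alternative: fit on the ESTIMATION fold only (here, the sample mean).
    let mu_hat = estimation.iter().sum::<f64>() / estimation.len() as f64;
    // Null: simple, so the sup over H0 is the density at mu = 0.
    gaussian_log_lik(evaluation, mu_hat) - gaussian_log_lik(evaluation, 0.0)
}

fn main() {
    let d1 = [1.0, 1.2, 0.8, 1.0]; // estimation fold
    let d0 = [0.9, 1.1, 1.0, 1.0]; // evaluation fold
    let log_e = split_lr_log_e(&d1, &d0);
    // With mu_hat = 1 each evaluation point x contributes x - 0.5,
    // so log E is approximately 2 for this data.
    println!("log E = {log_e:.3}, E = {:.2}", log_e.exp());
}
```

The evaluation fold never influences `mu_hat`; that separation is the whole
validity argument. A composite null would replace the second term with a
genuine refit on the evaluation fold.
*/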
pub fn split_likelihood_log_e_value(
    log_lik_alternative_on_eval: f64,
    log_lik_null_sup_on_eval: f64,
) -> f64 {
    // Degenerate halves yield a well-defined, conservative e-value of 1
    // (`log E = 0`, no evidence for the alternative) rather than a NaN that
    // would poison the e-process product and the downstream FDR certificate:
    //   * Either log-likelihood is NaN — the model could not be evaluated on
    //     this shard (a numerically degenerate fit). The honest reading is
    //     "no information", i.e. `E = 1`.
    //   * Both halves are the SAME signed infinity — e.g. a shard/outcome
    //     with zero density under both the alternative and the null
    //     (`−∞ − (−∞)`), or both `+∞`. The ratio is undefined; again `E = 1`.
    // This keeps `log E` free of NaN (it can still be ±∞ when exactly one
    // half is infinite), so `absorb_log` never banks a NaN and the e-BH
    // certificate sees a contributing-nothing claim instead of panicking.
    if log_lik_alternative_on_eval.is_nan() || log_lik_null_sup_on_eval.is_nan() {
        return 0.0;
    }
    if log_lik_alternative_on_eval.is_infinite()
        && log_lik_null_sup_on_eval.is_infinite()
        && log_lik_alternative_on_eval.is_sign_positive()
            == log_lik_null_sup_on_eval.is_sign_positive()
    {
        return 0.0;
    }
    log_lik_alternative_on_eval - log_lik_null_sup_on_eval
}

/// Sequential universal inference over a stream of batches with a
/// PREDICTABLE plug-in: at batch t the alternative parameters were fit
/// using only data before t, the null sup is taken on batch t, and the
/// per-batch ratios compound into an e-process. This is the streaming /
/// optional-stopping form the corpus-scale pipeline (#973 shards) needs —
/// evidence is resumable: serialize `EProcess`, keep absorbing on the next
/// shard.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct PredictablePluginEProcess {
    pub process: EProcess,
}

impl PredictablePluginEProcess {
    pub fn new() -> Self {
        Self {
            process: EProcess::new(),
        }
    }

    /// Absorb one batch. The caller guarantees the alternative was fit
    /// WITHOUT batch-t data (predictability — this is what makes the
    /// product a supermartingale; violating it voids the guarantee, which
    /// is why the SAE integration must hand this function the PREVIOUS
    /// shard's fitted dictionary, never the current one).
    pub fn try_absorb_batch(
        &mut self,
        log_lik_alternative_prefit: f64,
        log_lik_null_sup_on_batch: f64,
    ) -> Result<(), String> {
        self.process.absorb_log(split_likelihood_log_e_value(
            log_lik_alternative_prefit,
            log_lik_null_sup_on_batch,
        ))
    }
}

impl Default for PredictablePluginEProcess {
    fn default() -> Self {
        Self::new()
    }
}

/// The anytime-valid verdict on one structural claim. Deliberately
/// two-valued — there is NO "rejected" arm. Demote-never-reject (#969
/// philosophy): an e-process that has not crossed 1/α has failed to prove
/// the claim, not disproven it; the claim stays contested, keeps its
/// evidence, and earns a design-optimal probe budget instead of being
/// silently dropped (or worse, silently accepted the way a threshold gate
/// accepts whatever clears it).
#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize)]
pub enum GateVerdict {
    /// The running supremum crossed 1/α: the claim is proven with type-I
    /// error ≤ α, permanently (Ville applies to the sup, so later evidence
    /// retreat cannot un-prove it).
    Certified { log_e: f64 },
    /// Not (yet) proven. Carries the CURRENT log-evidence — the value the
    /// dictionary certificate's e-BH consumes, and the state a probe loop
    /// resumes from.
    Contested { log_e: f64 },
}

/// The atom-birth gate (#976's threshold comparison, replaced): a
/// universal-inference e-process over corpus shards deciding "does the
/// K+1-th atom exist?", the boundary/Davies-regime question where the χ²
/// gate every dictionary paper runs is broken.
///
/// Per shard t the integration contract is exactly the work plan's:
/// - `log_lik_alternative_prefit`: the K+1-atom dictionary fit on shards
///   BEFORE t (the PREVIOUS shard's fit — predictability is the one rule;
///   handing in the current shard's fit voids the guarantee), evaluated on
///   shard t. Any fitter, warm starts, GPU — no conditions.
/// - `log_lik_null_sup_on_shard`: the K-atom dictionary REFIT on shard t
///   (the honest constrained sup on the evaluation data).
///
/// The gate never rejects: [`GateVerdict::Contested`] is the only
/// alternative to certification, and a contested atom's next move is a
/// probe plan ([`plan_probe_for_contested_claim`]), not deletion.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct AtomBirthGate {
    pub test: PredictablePluginEProcess,
    /// The level the certificate is claimed at; fixed at construction so a
    /// verdict can never be shopped across α after seeing the evidence.
    alpha: f64,
}

impl AtomBirthGate {
    pub fn new(alpha: f64) -> Result<Self, String> {
        if !(alpha > 0.0 && alpha < 1.0) {
            return Err(format!(
                "AtomBirthGate: alpha must be in (0,1), got {alpha}"
            ));
        }
        Ok(Self {
            test: PredictablePluginEProcess::new(),
            alpha,
        })
    }

    pub fn alpha(&self) -> f64 {
        self.alpha
    }

    /// Absorb one shard's split-likelihood ratio (see type-level contract).
    pub fn try_absorb_shard(
        &mut self,
        log_lik_alternative_prefit: f64,
        log_lik_null_sup_on_shard: f64,
    ) -> Result<(), String> {
        self.test
            .try_absorb_batch(log_lik_alternative_prefit, log_lik_null_sup_on_shard)
    }

    /// Panicking convenience wrapper around [`Self::try_absorb_shard`].
    ///
    /// # Panics
    /// Panics if the shard's log e-value cannot be combined with the running
    /// evidence (opposing infinities, e.g. `+∞` evidence followed by a `−∞`
    /// shard ratio).
    pub fn absorb_shard(
        &mut self,
        log_lik_alternative_prefit: f64,
        log_lik_null_sup_on_shard: f64,
    ) {
        self.try_absorb_shard(log_lik_alternative_prefit, log_lik_null_sup_on_shard)
            .expect("AtomBirthGate received invalid log evidence");
    }

    pub fn verdict(&self) -> GateVerdict {
        if self.test.process.rejects_at(self.alpha) {
            GateVerdict::Certified {
                log_e: self.test.process.log_evidence(),
            }
        } else {
            GateVerdict::Contested {
                log_e: self.test.process.log_evidence(),
            }
        }
    }
}

/// Run the atom-birth gate over a shard stream with the predictability
/// contract enforced BY CONSTRUCTION: on each shard the alternative is
/// evaluated strictly before it is refit with that shard, so the plug-in
/// is always predictable and the product is always a supermartingale
/// under H0. This is the orchestration the SAE structure search calls;
/// the closures are the only fitter-specific surface.
///
/// - `alternative_log_lik(alt, shard)`: evaluation-fold log-likelihood of
///   shard under the CURRENT alternative state (fit on prior shards only —
///   guaranteed here by call order).
/// - `null_sup_log_lik(shard)`: the honest constrained sup — the K-atom
///   null REFIT on this shard (any fitter; this side must genuinely
///   maximize over H0 on the shard, an under-maximized null inflates the
///   e-value and voids validity).
/// - `refit_alternative(alt, shard)`: fold the shard into the alternative
///   (warm-started production fit, GPU, anything — zero conditions).
///
/// `initial_alternative` is the K+1 fit from data BEFORE the stream (or a
/// prior-driven init; validity never depends on its quality — a bad init
/// only costs power). Stops absorbing early once certified (the crossing
/// is permanent; further shards only cost compute), but still folds the
/// remaining shards into the alternative so the returned state has seen
/// the whole stream. Returns the gate (verdict + resumable evidence) and
/// the final alternative state.
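/**
Example (a standalone sketch of the evaluate-then-refit call order, under a
toy model): the alternative is a running mean of a unit-variance Gaussian
and the null is the simple `mu = 0`, so no constrained refit is needed here.
`RunningMean`, `log_ratio` and `stream_log_evidence` are hypothetical names,
and the shards are made up.

```rust
/// Toy alternative state: a running sum and count, i.e. a running mean
/// (the real alternative would be a K+1-atom dictionary).
struct RunningMean {
    sum: f64,
    n: f64,
}

impl RunningMean {
    fn mean(&self) -> f64 {
        if self.n > 0.0 { self.sum / self.n } else { 0.0 }
    }
}

/// Log density ratio N(mu, 1) / N(0, 1), summed over the shard.
fn log_ratio(shard: &[f64], mu: f64) -> f64 {
    shard.iter().map(|x| mu * x - 0.5 * mu * mu).sum()
}

/// Returns the running log-evidence after each shard.
fn stream_log_evidence(shards: &[Vec<f64>]) -> Vec<f64> {
    let mut alt = RunningMean { sum: 0.0, n: 0.0 };
    let mut log_e = 0.0;
    let mut trace = Vec::new();
    for shard in shards {
        // 1. Evaluate FIRST: `alt` has seen only earlier shards.
        log_e += log_ratio(shard, alt.mean());
        trace.push(log_e);
        // 2. Refit SECOND: only now may the shard inform the alternative.
        alt.sum += shard.iter().sum::<f64>();
        alt.n += shard.len() as f64;
    }
    trace
}

fn main() {
    let shards = vec![vec![1.0, 1.0], vec![1.0, 1.0], vec![1.0, 1.0]];
    // The first shard is scored by the uninformed init (mean 0) and adds
    // nothing; later shards are scored by a mean learned from earlier data.
    println!("{:?}", stream_log_evidence(&shards));
}
```

Swapping steps 1 and 2 would score each shard with a mean that has already
seen it, which is the predictability violation the runner rules out by call
order.
*/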
pub fn run_atom_birth_gate<S, A>(
    alpha: f64,
    initial_alternative: A,
    shards: impl IntoIterator<Item = S>,
    mut alternative_log_lik: impl FnMut(&A, &S) -> f64,
    mut null_sup_log_lik: impl FnMut(&S) -> f64,
    mut refit_alternative: impl FnMut(A, &S) -> A,
) -> Result<(AtomBirthGate, A), String> {
    let mut gate = AtomBirthGate::new(alpha)?;
    let mut alt = initial_alternative;
    for shard in shards {
        if !matches!(gate.verdict(), GateVerdict::Certified { .. }) {
            let log_lik_alt = alternative_log_lik(&alt, &shard);
            let log_lik_null = null_sup_log_lik(&shard);
            gate.try_absorb_shard(log_lik_alt, log_lik_null)?;
        }
        alt = refit_alternative(alt, &shard);
    }
    Ok((gate, alt))
}

/// e-BH: FDR control over m structural claims under ARBITRARY dependence
/// (Wang–Ramdas). Input: per-claim log-e-values. Output: indices of
/// rejected (i.e. CONFIRMED-STRUCTURE) claims, FDR ≤ α.
///
/// Sort e-values descending; reject the top k* where
/// `k* = max{ k : e_(k) ≥ m/(α·k) }`.
///
/// This is the "dictionary certificate": run one e-process per claimed
/// atom (and per claimed binding edge, #975), call this at the chosen
/// stopping time, and the dictionary ships with an FDR-controlled list of
/// which of its claimed structures are statistically real. No
/// independence assumptions — atoms sharing every token is fine; that is
/// exactly the case p-value BH cannot legally handle (PRDS violation) and
/// e-BH can.
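/**
Example (a standalone sketch of the step-up rule on raw, not log,
e-values): the hypothetical helper `e_bh` omits the NaN and empty-input
handling of the real function, and the e-values are made up.

```rust
/// Standalone e-BH on raw e-values; returns confirmed indices, strongest
/// first. Hypothetical helper for illustration only.
fn e_bh(e_values: &[f64], alpha: f64) -> Vec<usize> {
    let m = e_values.len();
    let mut order: Vec<usize> = (0..m).collect();
    // Sort indices by e-value, descending.
    order.sort_by(|&a, &b| e_values[b].total_cmp(&e_values[a]));
    let mut k_star = 0;
    for (rank0, &idx) in order.iter().enumerate() {
        let k = (rank0 + 1) as f64;
        // Step-up rule: keep the LARGEST k with e_(k) >= m / (alpha * k).
        if e_values[idx] >= m as f64 / (alpha * k) {
            k_star = rank0 + 1;
        }
    }
    order.truncate(k_star);
    order
}

fn main() {
    // m = 4, alpha = 0.1: the thresholds m/(alpha k) are 40, 20, 13.33, 10.
    let e = [30.0, 25.0, 15.0, 2.0];
    println!("{:?}", e_bh(&e, 0.1));
}
```

With these numbers the strongest claim (30) misses its own rank-1 threshold
(40) but is still confirmed, because ranks 2 and 3 clear theirs and the rule
takes the largest passing k. Conversely, a lone e-value of 30 among three
weak claims is not confirmed, although it would clear `1/α = 10` on its own.
*/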
pub fn e_benjamini_hochberg(log_e_values: &[f64], alpha: f64) -> Vec<usize> {
    let m = log_e_values.len();
    if m == 0 || !(alpha.is_finite() && alpha > 0.0) {
        return Vec::new();
    }
    // Robustness: a degenerate claim can bank a non-finite log e-value (a
    // NaN from an upstream `(−∞) − (−∞)` split-LR that escaped the source
    // guard). This is the documented honest FDR surface, so it must NEVER
    // panic on such input. Treat any NaN as least-evidence (`−∞`): an
    // undefined/contested claim contributes no evidence, sorts last, and can
    // never be among the rejections. Finite/±∞ values pass through unchanged;
    // `total_cmp` then gives a total order on the sanitized keys.
    let sanitized: Vec<f64> = log_e_values
        .iter()
        .map(|&v| if v.is_nan() { f64::NEG_INFINITY } else { v })
        .collect();
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by(|&a, &b| sanitized[b].total_cmp(&sanitized[a]));
    let m_f = m as f64;
    let mut k_star = 0usize;
    for (rank0, &idx) in order.iter().enumerate() {
        let k = (rank0 + 1) as f64;
        // e_(k) ≥ m / (α k)  ⟺  log e_(k) ≥ log m − log α − log k
        if sanitized[idx] >= m_f.ln() - alpha.ln() - k.ln() {
            k_star = rank0 + 1;
        }
    }
    order.truncate(k_star);
    order
}

/// What one structural claim asserts about the dictionary. One e-process
/// runs per claim; the kinds mirror the discovery stack's claim surface:
/// atom existence (#976 birth), binding edges (#975), geometry
/// adjudication (#907). `Custom` keeps the ledger open to claim types
/// that do not exist yet without an enum churn per new discovery gate.
#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub enum ClaimKind {
    /// "Atom `atom` is statistically real" — the K vs K+1 birth claim.
    AtomExists { atom: usize },
    /// "Atoms `a` and `b` are bound" — a #975 binding edge.
    BindingEdge { a: usize, b: usize },
    /// "Atom `atom`'s latent geometry is `kind`" (e.g. "circle",
    /// "clusters", "line") — a #907 adjudication claim.
    GeometryKind { atom: usize, kind: String },
    /// Any other structural claim, labeled.
    Custom { label: String },
}

/// One claim plus its running evidence.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct StructuralClaim {
    pub kind: ClaimKind,
    pub evidence: EProcess,
}

/// The dictionary's claim ledger: every structural claim the discovery
/// stack makes, each with its own e-process. Serializable — evidence
/// resumes across corpus shards (#973) by persisting the ledger, not by
/// refitting. Calling [`StructureLedger::certify`] at ANY data-dependent
/// stopping time yields a valid certificate; that is the entire point.
#[derive(Clone, Debug, Default, Serialize, Deserialize)]
pub struct StructureLedger {
    claims: Vec<StructuralClaim>,
}

impl StructureLedger {
    pub fn new() -> Self {
        Self { claims: Vec::new() }
    }

    /// Register a claim and return its ledger index. Idempotent on the
    /// claim kind: re-registering an existing claim (a resumed shard loop
    /// re-announcing its claim surface) returns the existing index and
    /// PRESERVES its accumulated evidence — a fresh e-process here would
    /// silently discard the stream's history.
    pub fn register(&mut self, kind: ClaimKind) -> usize {
        if let Some(idx) = self.claims.iter().position(|c| c.kind == kind) {
            return idx;
        }
        self.claims.push(StructuralClaim {
            kind,
            evidence: EProcess::new(),
        });
        self.claims.len() - 1
    }

    /// Absorb one conditionally-valid log e-value for claim `idx` (a
    /// universal-inference shard ratio, a frozen-prior log-BF — the
    /// caller's contract is per-source, documented on the producing gate).
    pub fn absorb_log(&mut self, idx: usize, log_e_value: f64) -> Result<(), String> {
        let n = self.claims.len();
        let claim = self.claims.get_mut(idx).ok_or_else(|| {
            format!("StructureLedger: claim index {idx} out of range ({n} claims)")
        })?;
        claim.evidence.absorb_log(log_e_value)
    }

    pub fn claims(&self) -> &[StructuralClaim] {
        &self.claims
    }

    /// The likelihood half of the probe-design loop (work-plan step 4):
    /// after running a planned probe ([`ProbePlan`] →
    /// `crate::inference::steering::steer_delta`), evaluate the REALIZED
    /// outcomes under both hypotheses' predictive densities and absorb the
    /// log-ratio into the contested claim's e-process.
    ///
    /// Validity contract: both predictive densities must be FROZEN before the
    /// probe outcome is observed — which the design loop satisfies by
    /// construction, since both hypotheses' dictionaries were fitted before
    /// the probe was even chosen. For a composite null, the null density must
    /// be the honest constrained fit (the same rule as
    /// [`split_likelihood_log_e_value`], which this delegates to); for a
    /// simple null the predictive density is the sup. Probe outcomes are new
    /// data by construction (the model was steered to produce them), so they
    /// compound validly with the claim's prior shard evidence.
    pub fn absorb_probe_outcome(
        &mut self,
        idx: usize,
        log_lik_alt_on_outcome: f64,
        log_lik_null_on_outcome: f64,
    ) -> Result<(), String> {
        self.absorb_log(
            idx,
            split_likelihood_log_e_value(log_lik_alt_on_outcome, log_lik_null_on_outcome),
        )
    }

    /// The dictionary certificate: e-BH over the ledger's CURRENT
    /// e-values at level α. FDR ≤ α over the confirmed set under arbitrary
    /// dependence — atoms sharing every token is fine — and valid at any
    /// stopping time because each entry is an e-process. Claims not
    /// confirmed are CONTESTED, never rejected (demote-never-reject); they
    /// keep their evidence and are the inputs to the probe-design loop.
    pub fn certify(&self, alpha: f64) -> StructureCertificate {
        let log_e: Vec<f64> = self
            .claims
            .iter()
            .map(|c| c.evidence.current_e_value_log())
            .collect();
        let confirmed_idx = e_benjamini_hochberg(&log_e, alpha);
        let mut entries: Vec<CertificateEntry> = self
            .claims
            .iter()
            .zip(&log_e)
            .map(|(c, &le)| CertificateEntry {
                kind: c.kind.clone(),
                log_e: le,
                steps: c.evidence.steps(),
                confirmed: false,
            })
            .collect();
        for &i in &confirmed_idx {
            entries[i].confirmed = true;
        }
        StructureCertificate { alpha, entries }
    }
}

/// One line of the certificate's e-value ledger: the claim, its
/// log-evidence at the stop, how many batches produced it, and the e-BH
/// outcome. The full entry list IS the reproducibility artifact: anyone
/// holding it can re-run [`e_benjamini_hochberg`] and re-derive the
/// confirmed set.
650#[derive(Clone, Debug, Serialize, Deserialize)]
651pub struct CertificateEntry {
652    pub kind: ClaimKind,
653    pub log_e: f64,
654    pub steps: usize,
655    pub confirmed: bool,
656}

/// The deliverable: "we found N structures at FDR ≤ α, certificate
/// attached". Ships next to the identifiability certificate
/// ([`crate::terms::sae::identifiability::residual_gauge`], #981) — that one says
/// what the GAUGE cannot distinguish, this one says what the DATA can.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct StructureCertificate {
    pub alpha: f64,
    pub entries: Vec<CertificateEntry>,
}

impl StructureCertificate {
    pub fn confirmed(&self) -> impl Iterator<Item = &CertificateEntry> {
        self.entries.iter().filter(|e| e.confirmed)
    }

    pub fn contested(&self) -> impl Iterator<Item = &CertificateEntry> {
        self.entries.iter().filter(|e| !e.confirmed)
    }
}

/// Calibrate one (super)uniform p-value into a single e-value, in log
/// space: `e(p) = ½ p^{−1/2}` (the κ = ½ member of the calibrator family
/// `e_κ(p) = κ p^{κ−1}`; `∫₀¹ e_κ(p) dp = 1`, so `E_{H0}[e(P)] ≤ 1` for
/// any valid p — superuniformity only, no other conditions).
///
/// This is the bridge from p-value-shaped instruments into the ledger —
/// e.g. the feature-binding Wald test (`terms::structure::anova_atom::carve`'s
/// `edge_p_value` → a [`ClaimKind::BindingEdge`] entry). It spends
/// calibration slack (a p of 0.01 becomes e = 5, not 100), which is the
/// honest price of converting a fixed-sample test into anytime-valid
/// currency; instruments that can produce e-values natively should.
/// CONTRACT: one calibrated e-value per INDEPENDENT data batch — feeding
/// repeated tests of the same accumulating data into one e-process is the
/// p-hacking this module exists to kill.
pub fn log_e_from_p_calibrator(p_value: f64) -> Result<f64, String> {
    if !(p_value > 0.0) || p_value > 1.0 {
        return Err(format!("p-value must be in (0, 1], got {p_value}"));
    }
    Ok(0.5f64.ln() - 0.5 * p_value.ln())
}
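/*
The calibrator family named in the docs above, `e_κ(p) = κ p^{κ−1}`, can be
sketched for a general κ. This is an illustration under stated assumptions:
`log_e_kappa` and `null_expectation` are hypothetical names, and restricting
κ to the open interval (0, 1) is a choice made here, not something the crate
exposes.

```rust
/// Hypothetical generalization: log of e_kappa(p) = kappa * p^(kappa - 1).
fn log_e_kappa(p: f64, kappa: f64) -> Option<f64> {
    // Reject NaN and out-of-range inputs in one pass.
    if !(p > 0.0 && p <= 1.0) || !(kappa > 0.0 && kappa < 1.0) {
        return None;
    }
    Some(kappa.ln() + (kappa - 1.0) * p.ln())
}

/// Midpoint-rule estimate of the null expectation, i.e. the integral of
/// e_kappa over (0, 1). The integrand is convex, so the estimate sits
/// slightly below the analytic value 1.
fn null_expectation(kappa: f64, n: usize) -> f64 {
    let h = 1.0 / n as f64;
    (0..n)
        .map(|i| {
            let p = (i as f64 + 0.5) * h; // midpoint of cell i, inside (0, 1)
            log_e_kappa(p, kappa).map(f64::exp).unwrap_or(0.0) * h
        })
        .sum()
}
```

With κ = ½ and p = 0.01 the sketch gives e = 5, matching the figure quoted in
the docs, and `null_expectation(0.5, 100_000)` lands just under 1.
*/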

/// A candidate steering probe for resolving one contested structural
/// claim: the intervention direction (in the steering primitive's
/// coordinates), and the two hypotheses' PREDICTED output-mean responses
/// to it.
pub struct CandidateProbe {
    /// Steering displacement δ, to be applied via
    /// `crate::inference::steering` (which enforces its own validity
    /// radius and reports realized dosimetry).
    pub delta: Array1<f64>,
    /// Predicted output-mean response under the null structure, μ₀(δ).
    pub predicted_mean_null: Array1<f64>,
    /// Predicted output-mean response under the alternative, μ₁(δ).
    pub predicted_mean_alt: Array1<f64>,
}

/// Greedy KL-optimal experimental design under the local Gaussian
/// output-Fisher model: pick the probe maximizing
/// `½ (μ₁(δ) − μ₀(δ))ᵀ F (μ₁(δ) − μ₀(δ))` — the expected per-observation
/// log-growth of the deciding e-process under the alternative.
///
/// `fisher` is the output-Fisher metric at the operating point (#980
/// harvest; the same object steering dosimetry contracts against). Probes
/// whose hypotheses predict the SAME response score zero no matter how
/// large their raw effect — the design rule selects for DISCRIMINATION,
/// not impact, which is the entire point: a maximally-steered output that
/// both hypotheses predict identically teaches nothing.
///
/// Returns the index of the best probe and its expected log-growth (nats
/// per observation), or `None` if no probe discriminates. Probes whose
/// predicted means do not match the metric's dimension are skipped, and a
/// non-square `fisher` yields `None`.
pub fn select_probe_by_expected_evidence(
    probes: &[CandidateProbe],
    fisher: &Array2<f64>,
) -> Option<(usize, f64)> {
    let dim = fisher.nrows();
    // A non-square metric cannot define the quadratic form for any probe.
    if fisher.ncols() != dim {
        return None;
    }
    let mut best: Option<(usize, f64)> = None;
    for (idx, probe) in probes.iter().enumerate() {
        // Check shapes BEFORE any arithmetic: ndarray panics on incompatible
        // shapes and may broadcast a length-1 array without complaint.
        if probe.predicted_mean_alt.len() != dim || probe.predicted_mean_null.len() != dim {
            continue;
        }
        let diff = &probe.predicted_mean_alt - &probe.predicted_mean_null;
        let f_diff = fisher.dot(&diff);
        let growth = 0.5 * diff.dot(&f_diff);
        if growth.is_finite() && growth > 0.0 {
            match best {
                Some((_, g)) if g >= growth => {}
                _ => best = Some((idx, growth)),
            }
        }
    }
    best
}
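/*
The design score `½ (μ₁ − μ₀)ᵀ F (μ₁ − μ₀)` can be worked by hand on plain
slices. This is a dependency-free sketch, not the crate's implementation:
`expected_log_growth` is a hypothetical helper, and representing the metric
as `&[Vec<f64>]` is an assumption made so the example needs no `ndarray`.

```rust
/// Hypothetical slice version of the score: 0.5 * d^T F d with d = mu1 - mu0.
fn expected_log_growth(mu0: &[f64], mu1: &[f64], fisher: &[Vec<f64>]) -> Option<f64> {
    let n = mu0.len();
    // All shapes must agree before any arithmetic happens.
    if mu1.len() != n || fisher.len() != n || fisher.iter().any(|row| row.len() != n) {
        return None;
    }
    // d = mu1 - mu0: the disagreement between the hypotheses' predictions.
    let d: Vec<f64> = mu1.iter().zip(mu0).map(|(a, b)| a - b).collect();
    // Quadratic form: sum_i d_i * (F d)_i.
    let quad: f64 = fisher
        .iter()
        .zip(&d)
        .map(|(row, di)| di * row.iter().zip(&d).map(|(f, dj)| f * dj).sum::<f64>())
        .sum();
    Some(0.5 * quad)
}
```

For F = diag(2, 0.5), μ₀ = (0, 0), μ₁ = (1, 0.2) the score is
½·(2 + 0.02) = 1.01 nats per observation, while two hypotheses that both
predict (10, 10) score exactly 0 however large the response is.
*/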

/// Expected number of observations for the chosen probe to push a claim's
/// e-process across the 1/α Ville threshold, under the alternative: the
/// design-time budget `log(1/α) / growth_rate`. This is what turns the
/// abstract guarantee into an experiment plan ("this probe should resolve
/// the claim in ~N tokens; if it hasn't, the alternative is weaker than
/// hypothesized — itself evidence").
pub fn expected_resolution_budget(alpha: f64, growth_nats_per_obs: f64) -> Option<f64> {
    if alpha <= 0.0 || alpha >= 1.0 || growth_nats_per_obs <= 0.0 {
        return None;
    }
    Some(-(alpha.ln()) / growth_nats_per_obs)
}

/// The experiment plan for one contested claim: which probe to run, the
/// expected per-observation evidence growth under the alternative, and the
/// design-time resolution budget. This is the loop's actionable output —
/// hand `probes[probe]`'s δ to `crate::inference::steering::steer_delta`
/// (which enforces the validity radius and reports realized dosimetry),
/// evaluate both hypotheses' likelihoods on the realized outputs, absorb
/// the log-ratio into the claim's e-process, re-certify.
#[derive(Clone, Debug, PartialEq)]
pub struct ProbePlan {
    /// Index into the candidate probe list.
    pub probe: usize,
    /// Expected log-growth of the deciding e-process, nats/observation,
    /// under the alternative (the KL of the hypotheses' predicted
    /// responses in the output-Fisher metric).
    pub expected_log_growth: f64,
    /// Expected observations to cross 1/α from ZERO evidence — the
    /// conservative from-scratch budget.
    pub budget_from_scratch: f64,
    /// Expected observations to cross 1/α from the claim's CURRENT
    /// log-evidence — the remaining budget; 0 when already across.
    pub budget_remaining: f64,
}

/// Close the design loop for one contested claim: pick the probe whose
/// predicted hypothesis-disagreement (not raw effect) buys evidence
/// fastest, and convert the claim's current evidence into a remaining
/// budget — "this probe should resolve the claim in ~N more observations
/// at level α; if it does not, the alternative is weaker than
/// hypothesized, which is itself evidence."
///
/// `current_log_e` is the contested claim's running log-evidence (from its
/// [`StructuralClaim`] / [`GateVerdict::Contested`]). Returns `None` when
/// no probe discriminates (all candidates score zero growth: the
/// hypotheses agree on everything reachable inside the validity radius —
/// the claim is undecidable by steering and needs a different instrument,
/// which is a finding, not a failure).
pub fn plan_probe_for_contested_claim(
    probes: &[CandidateProbe],
    fisher: &Array2<f64>,
    alpha: f64,
    current_log_e: f64,
) -> Option<ProbePlan> {
    let (probe, expected_log_growth) = select_probe_by_expected_evidence(probes, fisher)?;
    let budget_from_scratch = expected_resolution_budget(alpha, expected_log_growth)?;
    let nats_remaining = (-(alpha.ln()) - current_log_e).max(0.0);
    Some(ProbePlan {
        probe,
        expected_log_growth,
        budget_from_scratch,
        budget_remaining: nats_remaining / expected_log_growth,
    })
}
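/*
The budget arithmetic above reduces to one line: the nats still needed to
reach ln(1/α), divided by the expected nats per observation. The sketch
below is a standalone illustration; `remaining_budget` is a hypothetical
helper, and returning `None` for a NaN `current_log_e` is an assumption of
this sketch (`f64::max` would otherwise turn the NaN into a zero budget).

```rust
/// Hypothetical helper: expected observations left to reach ln(1/alpha).
fn remaining_budget(alpha: f64, growth: f64, current_log_e: f64) -> Option<f64> {
    // Reject invalid levels, non-positive growth, and undefined evidence.
    if !(alpha > 0.0 && alpha < 1.0) || !(growth > 0.0) || current_log_e.is_nan() {
        return None;
    }
    // Nats still missing; floored at zero once the claim is across the line.
    let nats_needed = (-alpha.ln() - current_log_e).max(0.0);
    Some(nats_needed / growth)
}
```

At α = 0.05 the target is ln 20 ≈ 3.0 nats, so a probe worth 0.5 nats per
observation needs about 6 observations from zero evidence, and a claim that
already holds 10 nats needs none.
*/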

#[cfg(test)]
mod tests {
    use super::*;
    use ndarray::array;

    /// e-BH on a hand-checkable configuration.
    #[test]
    fn e_bh_rejects_exactly_the_qualifying_prefix() {
        // m = 4, α = 0.1 → thresholds m/(αk) = 40, 20, 13.33, 10.
        let log_e = [45.0f64.ln(), 21.0f64.ln(), 12.0f64.ln(), 1.0f64.ln()];
        let rejected = e_benjamini_hochberg(&log_e, 0.1);
        // e_(1)=45 ≥ 40 ✓, e_(2)=21 ≥ 20 ✓, e_(3)=12 < 13.33 ✗ → k* = 2.
        assert_eq!(rejected, vec![0, 1]);

        // Weak tail: only e_(1)=45 ≥ 40 qualifies (5 < 20, 2 < 13.33,
        // 1 < 10) → k* = 1, and the strong prefix is still rejected.
        let log_e2 = [45.0f64.ln(), 5.0f64.ln(), 2.0f64.ln(), 1.0f64.ln()];
        assert_eq!(e_benjamini_hochberg(&log_e2, 0.1), vec![0]);
    }

    #[test]
    fn split_likelihood_equal_impossibility_is_neutral_log_evidence() {
        let log_e = split_likelihood_log_e_value(f64::NEG_INFINITY, f64::NEG_INFINITY);
        assert_eq!(log_e, 0.0);
        assert!(log_e.is_finite());

        let mut proc = EProcess::new();
        proc.absorb_log(log_e).unwrap();
        assert_eq!(proc.log_evidence(), 0.0);
        assert_eq!(proc.steps(), 1);
    }

    #[test]
    fn e_bh_orders_infinite_log_e_values_without_comparator_panic() {
        let log_e = [f64::NEG_INFINITY, f64::INFINITY, 45.0f64.ln(), 1.0f64.ln()];
        assert_eq!(e_benjamini_hochberg(&log_e, 0.1), vec![1, 2]);
    }

    /// A NaN log e-value (a degenerate claim whose `(−∞) − (−∞)` split-LR
    /// escaped the source guard) must NOT panic the e-BH comparator — the
    /// honest FDR surface stays robust. The NaN claim is treated as
    /// least-evidence (`−∞`): it is never rejected, and it cannot disturb the
    /// rejection of a genuinely strong claim.
    #[test]
    fn e_bh_treats_nan_as_least_evidence_without_panicking() {
        // m = 2, α = 0.1 → threshold for the top claim is m/(αk) = 2/0.1 = 20.
        // Claim 0 has e = 45 ≥ 20 (rejected); claim 1 is NaN (no evidence).
        let log_e = [45.0f64.ln(), f64::NAN];
        let rejected = e_benjamini_hochberg(&log_e, 0.1);
        assert_eq!(
            rejected,
            vec![0],
            "strong claim survives; NaN claim never rejected"
        );

        // An all-NaN ledger yields an empty (no-rejection) certificate, not a
        // panic.
        let all_nan = [f64::NAN, f64::NAN, f64::NAN];
        assert!(e_benjamini_hochberg(&all_nan, 0.1).is_empty());
    }

    /// The full source→consumer chain: a shard with zero density under both
    /// the alternative and the null produces `(−∞) − (−∞)`, which the split-LR
    /// resolves to neutral `log E = 0` rather than NaN; banking it and
    /// certifying must not panic. A genuinely-NaN log-likelihood is likewise
    /// resolved to neutral evidence at the source.
    #[test]
    fn degenerate_split_lr_flows_through_certify_without_nan_panic() {
        // (−∞) − (−∞): zero density under both hypotheses → neutral.
        let neutral = split_likelihood_log_e_value(f64::NEG_INFINITY, f64::NEG_INFINITY);
        assert_eq!(neutral, 0.0);
        // NaN log-likelihood (un-evaluable degenerate fit) → neutral, finite.
        let from_nan = split_likelihood_log_e_value(f64::NAN, -3.0);
        assert!(from_nan.is_finite());
        assert_eq!(from_nan, 0.0);

        let mut ledger = StructureLedger::new();
        let degenerate = ledger.register(ClaimKind::AtomExists { atom: 0 });
        let strong = ledger.register(ClaimKind::AtomExists { atom: 1 });
        // Bank the neutral split-LR on the degenerate claim — no NaN reaches
        // the e-process.
        ledger.absorb_log(degenerate, neutral).unwrap();
        ledger.absorb_log(strong, 45.0f64.ln()).unwrap();
        // certify() runs e_benjamini_hochberg internally; must not panic.
        let certificate = ledger.certify(0.1);
        let degenerate_entry = certificate
            .entries
            .iter()
            .find(|e| e.kind == ClaimKind::AtomExists { atom: 0 })
            .expect("degenerate claim present");
        // Neutral evidence (log_e = 0) never qualifies → contested, not confirmed.
        assert!(!degenerate_entry.confirmed);
        assert_eq!(degenerate_entry.log_e, 0.0);
    }

    #[test]
    fn e_process_absorb_log_rejects_undefined_log_products() {
        let mut proc = EProcess::new();
        assert!(proc.absorb_log(f64::NAN).is_err());

        proc.absorb_log(f64::INFINITY).unwrap();
        assert!(proc.absorb_log(f64::NEG_INFINITY).is_err());
        assert_eq!(proc.log_evidence(), f64::INFINITY);
        assert_eq!(proc.steps(), 1);
    }

    /// Ville-style sanity on deterministic surrogate streams scored with the
    /// N(μ,1)-vs-N(0,1) log-likelihood ratio: a stream centered at 0 (the
    /// null) never reaches 1/α here, while a stream centered at μ (the
    /// alternative) crosses fast and the crossing is PERMANENT (running-sup
    /// semantics).
    #[test]
    fn e_process_crossing_is_permanent_and_directional() {
        // Deterministic "stream": per-batch log-LR of N(μ,1) vs N(0,1)
        // evaluated at x drawn from the alternative: log e = μ x − μ²/2.
        // Use a fixed quasi-random sequence; no RNG state needed.
        let mu = 0.6f64;
        let mut proc_alt = EProcess::new();
        let mut crossed_at: Option<usize> = None;
        for t in 0..200 {
            // x_t ~ alternative-ish deterministic surrogate around μ
            let x = mu + 0.9 * ((t as f64 * 0.7321).sin());
            proc_alt.absorb_log(mu * x - 0.5 * mu * mu).unwrap();
            if proc_alt.rejects_at(0.05) && crossed_at.is_none() {
                crossed_at = Some(t);
            }
        }
        let t_cross = crossed_at.expect("true alternative must cross 1/α");
        assert!(t_cross < 100, "evidence should accumulate quickly");
        // Permanence: rejection holds at the end even if late evidence dips.
        assert!(proc_alt.rejects_at(0.05));

        // Null stream: x centered at 0 → expected log e = −μ²/2 < 0.
        let mut proc_null = EProcess::new();
        for t in 0..200 {
            let x = 0.9 * ((t as f64 * 0.7321).sin());
            proc_null.absorb_log(mu * x - 0.5 * mu * mu).unwrap();
        }
        assert!(
            !proc_null.rejects_at(0.05),
            "null stream must not accumulate evidence (log E = {:.3})",
            proc_null.log_evidence()
        );
    }

    /// The design rule selects discrimination, not raw effect.
    #[test]
    fn probe_selection_prefers_discrimination_over_impact() {
        let fisher = array![[2.0, 0.0], [0.0, 0.5]];
        let probes = vec![
            // Huge effect, but both hypotheses predict it identically.
            CandidateProbe {
                delta: array![1.0, 0.0],
                predicted_mean_null: array![10.0, 10.0],
                predicted_mean_alt: array![10.0, 10.0],
            },
            // Modest effect, hypotheses disagree along the informative axis.
            CandidateProbe {
                delta: array![0.0, 1.0],
                predicted_mean_null: array![0.0, 0.0],
                predicted_mean_alt: array![1.0, 0.2],
            },
        ];
        let (idx, growth) =
            select_probe_by_expected_evidence(&probes, &fisher).expect("a probe discriminates");
        assert_eq!(idx, 1);
        // ½·(1,0.2)ᵀ diag(2,0.5) (1,0.2) = ½·(2 + 0.02) = 1.01 nats/obs.
        assert!((growth - 1.01).abs() < 1e-12);
        // Budget: ~3 observations to certify at α=0.05.
        let budget = expected_resolution_budget(0.05, growth).expect("budget");
        assert!(budget > 2.0 && budget < 4.0);
    }

    /// The birth gate certifies under a true alternative, stays contested
    /// under the null, and never emits anything but those two verdicts.
    #[test]
    fn birth_gate_certifies_alternative_and_demotes_never_rejects() {
        let mut gate = AtomBirthGate::new(0.05).expect("valid alpha");
        // Strong shards: alternative beats the honest null sup by 1 nat each.
        for _ in 0..5 {
            gate.absorb_shard(-100.0, -101.0);
        }
        match gate.verdict() {
            GateVerdict::Certified { log_e } => assert!((log_e - 5.0).abs() < 1e-12),
            v => panic!("5 nats must certify at α=0.05, got {v:?}"),
        }
        // Permanence: a later evidence retreat cannot un-certify.
        gate.absorb_shard(-110.0, -100.0);
        assert!(matches!(gate.verdict(), GateVerdict::Certified { .. }));

        // Null-ish stream: the prefit alternative loses to the on-shard sup
        // (it must, on average — the sup is fit on the eval shard itself).
        let mut null_gate = AtomBirthGate::new(0.05).expect("valid alpha");
        for _ in 0..50 {
            null_gate.absorb_shard(-100.3, -100.0);
        }
        match null_gate.verdict() {
            GateVerdict::Contested { log_e } => assert!(log_e < 0.0),
            v => panic!("null stream must stay contested, got {v:?}"),
        }
        assert!(AtomBirthGate::new(0.0).is_err());
        assert!(AtomBirthGate::new(1.0).is_err());
    }

    /// Ledger: idempotent registration preserves evidence; the certificate
    /// splits confirmed/contested by e-BH and the entry list reproduces it.
    #[test]
    fn ledger_certificate_splits_confirmed_and_contested() {
        let mut ledger = StructureLedger::new();
        let a0 = ledger.register(ClaimKind::AtomExists { atom: 0 });
        let a1 = ledger.register(ClaimKind::AtomExists { atom: 1 });
        let edge = ledger.register(ClaimKind::BindingEdge { a: 0, b: 1 });

        // m = 3, α = 0.1 → e-BH thresholds m/(αk) = 30, 15, 10.
        ledger.absorb_log(a0, 40.0f64.ln()).unwrap();
        ledger.absorb_log(a1, 20.0f64.ln()).unwrap();
        ledger.absorb_log(edge, 2.0f64.ln()).unwrap();

        // Re-registering must return the same slot with evidence intact.
        let a0_again = ledger.register(ClaimKind::AtomExists { atom: 0 });
        assert_eq!(a0_again, a0);
        assert_eq!(ledger.claims()[a0].evidence.steps(), 1);

        let cert = ledger.certify(0.1);
        // e_(1)=40 ≥ 30 ✓, e_(2)=20 ≥ 15 ✓, e_(3)=2 < 10 ✗ → atoms confirmed,
        // the binding edge stays contested.
        let confirmed: Vec<&ClaimKind> = cert.confirmed().map(|e| &e.kind).collect();
        assert_eq!(confirmed.len(), 2);
        assert!(confirmed.contains(&&ClaimKind::AtomExists { atom: 0 }));
        assert!(confirmed.contains(&&ClaimKind::AtomExists { atom: 1 }));
        let contested: Vec<&CertificateEntry> = cert.contested().collect();
        assert_eq!(contested.len(), 1);
        assert_eq!(contested[0].kind, ClaimKind::BindingEdge { a: 0, b: 1 });

        assert!(ledger.absorb_log(99, 0.0).is_err());
    }

    /// Resumability: a serialized ledger reloads with its evidence and
    /// keeps absorbing — the #973 shard contract.
    #[test]
    fn ledger_evidence_resumes_across_serialization() {
        let mut ledger = StructureLedger::new();
        let idx = ledger.register(ClaimKind::GeometryKind {
            atom: 3,
            kind: "circle".to_string(),
        });
        ledger.absorb_log(idx, 1.25).unwrap();

        let persisted = serde_json::to_string(&ledger).expect("serialize ledger");
        let mut resumed: StructureLedger =
            serde_json::from_str(&persisted).expect("deserialize ledger");
        assert_eq!(resumed.claims()[idx].evidence.steps(), 1);

        resumed.absorb_log(idx, 0.75).unwrap();
        let log_e = resumed.claims()[idx].evidence.log_evidence();
        assert!((log_e - 2.0).abs() < 1e-12);
    }

    /// The probe plan discounts the remaining budget by evidence already
    /// banked, and floors at zero once the claim is across the line.
    #[test]
    fn probe_plan_discounts_remaining_budget_by_current_evidence() {
        let fisher = array![[2.0, 0.0], [0.0, 0.5]];
        let probes = vec![CandidateProbe {
            delta: array![0.0, 1.0],
            predicted_mean_null: array![0.0, 0.0],
            predicted_mean_alt: array![1.0, 0.2],
        }];
        // growth = 1.01 nats/obs (checked above); α=0.05 → need ln(20) ≈ 3.0 nats.
        let from_zero = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 0.0).expect("plan");
        assert_eq!(from_zero.probe, 0);
        assert!((from_zero.budget_remaining - from_zero.budget_from_scratch).abs() < 1e-12);

        let halfway = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 1.5).expect("plan");
        assert!(halfway.budget_remaining < from_zero.budget_remaining);
        assert!((halfway.budget_remaining - (-(0.05f64.ln()) - 1.5) / 1.01).abs() < 1e-12);

        let across = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 10.0).expect("plan");
        assert_eq!(across.budget_remaining, 0.0);

        // No discriminating probe → no plan (undecidable by steering).
        let blind = vec![CandidateProbe {
            delta: array![1.0, 0.0],
            predicted_mean_null: array![5.0, 5.0],
            predicted_mean_alt: array![5.0, 5.0],
        }];
        assert!(plan_probe_for_contested_claim(&blind, &fisher, 0.05, 0.0).is_none());
    }

    /// The p→e calibrator on hand-checkable values, including its edges.
    #[test]
    fn p_to_e_calibrator_hand_values() {
        // e(p) = ½ p^{−1/2}: p = 1 → e = 0.5; p = 0.04 → e = 2.5; p = 1e-4 → e = 50.
        assert!((log_e_from_p_calibrator(1.0).unwrap() - 0.5f64.ln()).abs() < 1e-12);
        assert!((log_e_from_p_calibrator(0.04).unwrap() - 2.5f64.ln()).abs() < 1e-12);
        assert!((log_e_from_p_calibrator(1e-4).unwrap() - 50.0f64.ln()).abs() < 1e-12);
        assert!(log_e_from_p_calibrator(0.0).is_err());
        assert!(log_e_from_p_calibrator(1.5).is_err());
        assert!(log_e_from_p_calibrator(f64::NAN).is_err());
    }

    /// The e-value validity condition: under the null `P ~ Uniform(0, 1]`,
    /// the calibrated e-value must satisfy `E_{H0}[e(P)] = ∫₀¹ e(p) dp ≤ 1`
    /// (Wang–Ramdas e-BH controls FDR ONLY for genuine e-values). The κ = ½
    /// member `e(p) = ½ p^{−1/2}` integrates to exactly 1, the boundary of
    /// admissibility. We verify this numerically with a midpoint Riemann sum
    /// over the unit interval (which UNDER-estimates `∫ p^{−1/2}` slightly
    /// because the integrand is convex, so the analytic value 1 sits just
    /// above the quadrature estimate — both safely ≤ 1 + tolerance).
    ///
    /// This is the property `e = 1/p` VIOLATES — `∫₀¹ (1/p) dp = ∞` — which
    /// is why the behavioral-head and anova-atom paths route through this
    /// calibrator rather than banking `log e = −ln p` directly.
    #[test]
    fn p_to_e_calibrator_null_expectation_at_most_one() {
        let n = 2_000_000usize;
        let h = 1.0 / n as f64;
        let mut mean_e = 0.0_f64;
        for i in 0..n {
            // Midpoint of cell i: p = (i + 0.5)/n, always in (0, 1).
            let p = (i as f64 + 0.5) * h;
            let e = log_e_from_p_calibrator(p).unwrap().exp();
            mean_e += e * h;
        }
        // Analytic E_{H0}[e(P)] = 1; allow a small quadrature tolerance, but
        // it MUST NOT exceed 1 by more than that (an invalid calibrator like
        // 1/p would diverge here, not land near 1).
        assert!(
            mean_e <= 1.0 + 1e-3,
            "calibrated e-value null expectation {mean_e} exceeds 1 — not a valid e-value"
        );
        assert!(
            mean_e > 0.99,
            "calibrated e-value null expectation {mean_e} far below the analytic 1.0"
        );
    }

    /// POWER STUDY, null side: the heuristic gate every dictionary paper
    /// runs — "accept the K+1-th atom the first time the cumulative
    /// likelihood ratio shows improvement" — versus the e-gate, on a
    /// family of NULL streams peeked at after every shard. The per-shard
    /// log-LR is `μ x_t − μ²/2` with `x_t = A sin(ω t + φ)` (a
    /// deterministic null surrogate: mean drift −μ²/2 < 0, bounded
    /// fluctuation). The naive gate's false-accept mechanism is exactly
    /// optional stopping: any phase whose partial sums wander above zero
    /// at ANY peek accepts a nonexistent atom. The e-gate needs
    /// log(1/α) ≈ 3.0 nats, and the partial-sum fluctuation is bounded by
    /// `μ·A/sin(ω/2) ≈ 1.51` nats (Dirichlet-kernel bound) BEFORE the
    /// negative drift — so it can never certify on any phase, which is
    /// Ville's inequality made concrete.
    #[test]
    fn power_study_null_naive_peeking_gate_false_accepts_e_gate_never() {
        let mu = 0.6f64;
        let amp = 0.9f64;
        let omega = 0.7321f64;
        let n_phases = 60usize;
        let n_shards = 200usize;

        let mut naive_false_accepts = 0usize;
        let mut e_gate_false_accepts = 0usize;
        for k in 0..n_phases {
            let phase = 2.0 * std::f64::consts::PI * (k as f64) / (n_phases as f64);
            let mut gate = AtomBirthGate::new(0.05).expect("alpha");
            let mut cum_log_lr = 0.0f64;
            let mut naive_accepted = false;
            for t in 0..n_shards {
                let x = amp * ((t as f64) * omega + phase).sin();
                let log_lr = mu * x - 0.5 * mu * mu;
                cum_log_lr += log_lr;
                // The broken test: peek, accept on any improvement.
                if cum_log_lr > 0.0 {
                    naive_accepted = true;
                }
                gate.absorb_shard(log_lr, 0.0);
            }
            if naive_accepted {
                naive_false_accepts += 1;
            }
            if matches!(gate.verdict(), GateVerdict::Certified { .. }) {
                e_gate_false_accepts += 1;
            }
        }
        // The naive gate false-accepts on a large fraction of null phases
        // (any phase with early-positive partial sums); the e-gate on none.
        assert!(
            naive_false_accepts >= n_phases / 3,
            "the peeking gate should false-accept often under the null \
             (got {naive_false_accepts}/{n_phases})"
        );
        assert_eq!(
            e_gate_false_accepts, 0,
            "the e-gate must never certify under the null"
        );
    }

    /// POWER STUDY, alternative side, through the orchestration harness:
    /// a planted K+1-th atom worth 0.5 nats/shard certifies in
    /// ⌈log(1/α)/0.5⌉ = 6 shards — matching the design-time
    /// `expected_resolution_budget` — after which the gate stops absorbing
    /// (the crossing is permanent) while the alternative keeps refitting
    /// on the remaining shards.
    #[test]
    fn power_study_planted_atom_certifies_at_the_predicted_budget() {
        let growth = 0.5f64;
        let (gate, alt_state) = run_atom_birth_gate(
            0.05,
            0usize, // alt state = number of shards folded into the fit
            0..20usize,
            |_, _| -99.5, // prefit alternative log-lik on the shard
            |_| -100.0,   // honest null sup on the shard
            |folded, _| folded + 1,
        )
        .expect("valid alpha");

        match gate.verdict() {
            GateVerdict::Certified { log_e } => assert!((log_e - 3.0).abs() < 1e-12),
            v => panic!("planted atom must certify, got {v:?}"),
        }
        // Realized time-to-certification == the design-time budget, rounded up.
        let budget = expected_resolution_budget(0.05, growth).expect("budget");
        assert_eq!(gate.test.process.steps(), budget.ceil() as usize);
        assert_eq!(gate.test.process.steps(), 6);
        // The alternative state saw the whole stream despite early stopping.
        assert_eq!(alt_state, 20);
    }

    /// Work-plan step 4, closed end-to-end: a contested claim gets a probe
    /// plan, the probe's realized outcomes are scored under both FROZEN
    /// hypotheses via [`StructureLedger::absorb_probe_outcome`], and the
    /// banked evidence flips the claim to confirmed within a small multiple
    /// of the plan's predicted resolution budget. Outcome noise is a
    /// deterministic bounded surrogate (zero-mean sinusoid), so under the
    /// true alternative each probe's expected log-growth is exactly the
    /// design value.
    #[test]
    fn design_loop_resolves_contested_claim_within_predicted_budget() {
        let mut ledger = StructureLedger::new();
        let idx = ledger.register(ClaimKind::GeometryKind {
            atom: 0,
            kind: "circle".to_string(),
        });

        // Local Gaussian output model, unit-isotropic noise in the
        // Fisher-whitened coordinates: per-observation expected log-growth
        // under H1 is exactly the planned ½‖μ₁−μ₀‖²_F.
        let fisher = array![[1.0, 0.0], [0.0, 1.0]];
        let mu0 = array![0.0, 0.0];
        let mu1 = array![1.2, 0.5];
        let probes = vec![CandidateProbe {
            delta: array![0.0, 1.0],
            predicted_mean_null: mu0.clone(),
            predicted_mean_alt: mu1.clone(),
        }];
        let alpha = 0.05;
        let plan = plan_probe_for_contested_claim(&probes, &fisher, alpha, 0.0).expect("plan");
        assert_eq!(plan.probe, 0);
        // ½‖μ₁−μ₀‖² = ½(1.44 + 0.25) = 0.845 nats/obs; ln 20 ≈ 3.0 ⇒ ~3.5 obs.
        assert!((plan.expected_log_growth - 0.845).abs() < 1e-12);
        let budget = plan.budget_remaining.ceil().max(1.0) as usize;

        // Run the probe loop: outcomes realized under the TRUE alternative
        // (mean μ₁ plus bounded zero-mean fluctuation); both hypotheses'
        // densities were frozen above, before any outcome existed.
        let mut observations = 0usize;
        while !ledger.claims()[idx].evidence.rejects_at(alpha) {
            observations += 1;
            assert!(
                observations <= 4 * budget,
                "claim must resolve within a small multiple of the predicted \
                 budget {budget}; still contested after {observations} probes"
            );
            let t = observations as f64;
            let eps0 = 0.8 * (t * 0.7321).sin();
            let eps1 = 0.8 * (t * 1.1173).cos();
            let y = array![mu1[0] + eps0, mu1[1] + eps1];
            // Unit-Gaussian log-densities under each frozen hypothesis; the
            // shared normalizer cancels in the ratio.
            let d1 = &y - &mu1;
            let d0 = &y - &mu0;
            ledger
                .absorb_probe_outcome(idx, -0.5 * d1.dot(&d1), -0.5 * d0.dot(&d0))
                .expect("absorb");
        }
        let cert = ledger.certify(alpha);
        assert!(
            cert.confirmed()
                .any(|e| matches!(e.kind, ClaimKind::GeometryKind { atom: 0, .. }))
        );
    }
}
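/*
The null-side power study in the tests rests on two claims: a running-sup
e-process never un-rejects, and on the sinusoidal null surrogate its partial
sums stay below μ·A/sin(ω/2). Both can be checked in isolation. The sketch
below is a standalone illustration under stated assumptions: `MiniEProcess`
and `max_null_sup` are hypothetical names, and the running-sup bookkeeping
is a guess at the semantics the tests describe, not the crate's `EProcess`.

```rust
/// Hypothetical log-space e-process that remembers its running supremum.
struct MiniEProcess {
    log_e: f64,
    sup_log_e: f64,
}

impl MiniEProcess {
    fn new() -> Self {
        Self { log_e: 0.0, sup_log_e: 0.0 }
    }

    /// Multiply in one e-value, i.e. add its log.
    fn absorb_log(&mut self, log_factor: f64) {
        self.log_e += log_factor;
        self.sup_log_e = self.sup_log_e.max(self.log_e);
    }

    /// Ville: reject once the running sup has reached ln(1/alpha).
    fn rejects_at(&self, alpha: f64) -> bool {
        self.sup_log_e >= -alpha.ln()
    }
}

/// Largest running sup over `n_phases` null streams x_t = A sin(omega t + phase),
/// each scored with the per-shard log-LR mu * x_t - mu^2 / 2.
fn max_null_sup(mu: f64, amp: f64, omega: f64, n_phases: usize, n_shards: usize) -> f64 {
    let mut worst = 0.0_f64;
    for k in 0..n_phases {
        let phase = 2.0 * std::f64::consts::PI * k as f64 / n_phases as f64;
        let mut proc = MiniEProcess::new();
        for t in 0..n_shards {
            let x = amp * (t as f64 * omega + phase).sin();
            proc.absorb_log(mu * x - 0.5 * mu * mu);
        }
        worst = worst.max(proc.sup_log_e);
    }
    worst
}
```

With μ = 0.6, A = 0.9, ω = 0.7321 the bound μ·A/sin(ω/2) is about 1.51 nats,
below the ln 20 ≈ 3.0 nats needed at α = 0.05, so no phase can certify.
*/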