gam_terms/inference/
structure_evidence.rs

1//! Anytime-valid structure discovery: e-process gates, universal-inference
2//! atom tests, e-BH error control, and KL-optimal steering probes.
3//!
4//! Interpretability as sequential experimental design.
5//!
6//! # The thesis
7//!
8//! The dictionary-learning stack (#974–#981) DISCOVERS structure: atom
9//! birth/death/fission/fusion (#976), geometry adjudication — circle vs
10//! clusters vs line (#907), feature binding (#975). Today those decisions
11//! are made by evidence heuristics (likelihood-ratio ladders, BIC-flavored
12//! gates). Three facts, never previously combined, say that is not merely
13//! informal but WRONG in a specific, fixable way — and that fixing it
14//! upgrades the whole capstone from observational description to
15//! error-controlled experimental science:
16//!
17//! 1. **Atom existence is a NON-REGULAR testing problem.** "Does a K+1-th
18//!    dictionary atom exist?" is testing K vs K+1 mixture components — the
19//!    textbook boundary/loss-of-identifiability case where the classical
20//!    likelihood-ratio χ² asymptotics FAIL (the null sits on the boundary
21//!    of the alternative; the nuisance parameters of the new atom vanish
22//!    under the null — Davies' problem). Every SAE/dictionary paper that
23//!    thresholds a likelihood improvement is running this broken test.
24//!    **Universal inference** (Wasserman–Ramdas–Balakrishnan 2020) is the
25//!    modern resolution: a split-likelihood-ratio e-value that is valid in
26//!    finite samples with NO regularity conditions whatsoever — exactly
27//!    the irregular regime atom birth lives in.
28//!
29//! 2. **Discovery happens on streams with optional stopping.** Dictionaries
30//!    are trained until the features "look right" — data-dependent
31//!    stopping that p-hacks any fixed-sample test by construction.
32//!    **E-processes** (nonnegative supermartingales under the null,
33//!    `E[E_τ] ≤ 1` at every stopping time) are immune: by Ville's
34//!    inequality `P(sup_t E_t ≥ 1/α) ≤ α`, the guarantee survives stopping
35//!    whenever you like, peeking included, streaming corpora (#973)
36//!    included. Evidence is a RUNNING PRODUCT, resumable across shards.
37//!
38//! 3. **This laboratory is INTERVENTIONAL.** The steering primitive with
39//!    output dosimetry and a validity radius is landed
40//!    (`crate::inference::steering`); the per-token output-Fisher harvest
41//!    (#980) gives the local information geometry of the model's output.
42//!    So "which probe next?" is not a vibe — it is OPTIMAL EXPERIMENTAL
43//!    DESIGN: choose the steering intervention that maximizes the expected
44//!    log-growth of the e-process deciding a contested structural claim.
45//!    Under the local Gaussian output-Fisher model, that growth rate IS a
46//!    KL divergence with a closed form (below). The model chooses its own
47//!    next experiment, optimally, inside its certified validity radius.
48//!
49//! The deliverable shape: **every discovered atom ships an e-value; every
50//! dictionary ships an e-BH FDR certificate over its claimed structure;
51//! contested claims get design-optimal probes until evidence resolves.**
52//! No other interpretability stack has finite-sample, optional-stopping-
53//! safe error control over its discovered structure. This module is the
54//! statistical substrate; the SAE structure search plugs its gates in.
55//!
56//! The instruments, bottom-up: [`EProcess`] (running evidence, Ville
57//! semantics) → [`PredictablePluginEProcess`] (streaming universal
58//! inference) → [`AtomBirthGate`] + [`run_atom_birth_gate`] (the K vs K+1
59//! gate with demote-never-reject [`GateVerdict`]s; the runner enforces
60//! the predictability contract by call order) → [`StructureLedger`] (one
61//! e-process per claim, serializable across #973 shards) →
62//! [`StructureLedger::certify`] (the e-BH [`StructureCertificate`],
63//! shipped beside the gauge report via
64//! `crate::terms::sae::identifiability::dictionary_report`) →
65//! [`plan_probe_for_contested_claim`] (the design loop: contested claims
66//! get a [`ProbePlan`] whose δ runs through
67//! `crate::inference::steering::steer_delta` and whose per-hypothesis
68//! μ₀/μ₁ come from `crate::inference::steering::predicted_response`).
69//!
70//! # The math, fixed here so implementations cannot drift
71//!
72//! **E-value / e-process.** A nonnegative statistic `E` with `E_{H0}[E] ≤ 1`.
73//! Calibration to tests: reject at level α when `E ≥ 1/α` (Markov). An
74//! e-PROCESS compounds multiplicatively: `E_t = Π_{s≤t} e_s` where each
75//! `e_s` is conditionally valid given the past (`E[e_s | F_{s−1}] ≤ 1`
76//! under H0). Ville: `P_{H0}(∃t: E_t ≥ 1/α) ≤ α` — anytime validity.
77//! Always accumulate in log space; evidence products underflow doubles.
78//!
79//! **Universal inference (the atom-birth test).** Split the data (or the
80//! token stream) into D₀ (evaluation) and D₁ (estimation). Fit the
81//! K+1-atom alternative on D₁ by ANY method (the production fitter, warm
82//! starts, GPU, anything — no conditions). Fit the K-atom null by
83//! CONSTRAINED MLE ON D₀ (this is the one side that must be honest). Then
84//!
85//! ```text
86//!   E = L(θ̂₁ ; D₀) / sup_{θ ∈ H0} L(θ ; D₀)
87//! ```
88//!
89//! satisfies `E_{H0}[E] ≤ 1` in finite samples, mixtures and boundaries
90//! and all (the proof is three lines of Markov + tower; no asymptotics).
91//! Sequential version: at each batch t, the alternative plug-in is fit on
92//! data BEFORE t (predictable), the null sup is over the batch; the
93//! product is an e-process. This is `SplitLikelihoodEValue` /
94//! `PredictablePluginEProcess` below, generic over log-likelihood
95//! closures so the SAE stack passes its own (manifold likelihoods,
96//! superposition-aware residual models #974, whatever exists then).
97//!
98//! **Bayes factors are e-values (the #907 bridge).** A Bayes factor
99//! `BF = ∫ L(θ) dΠ₁(θ) / L_{H0}` with a FIXED (data-independent) prior Π₁
100//! and a SIMPLE (or sup-dominated) null has `E_{H0}[BF] ≤ 1` — the #907
101//! geometry-adjudication harness (circle vs clusters vs line, with its
102//! discrete-mixture null) is therefore ONE PRIOR-FREEZE away from anytime
103//! validity. The integration contract: route its per-batch BFs through
104//! [`EProcess::absorb`] instead of comparing a final BF to a threshold,
105//! and geometry claims inherit optional-stopping safety for free.
106//!
107//! **e-BH (the dictionary certificate).** Given e-values e_1..e_m for m
108//! structural claims (one per atom/edge/binding), sort descending and
109//! reject the top k* where `k* = max{ k : e_(k) ≥ m/(α·k) }`. This
110//! controls FDR ≤ α under ARBITRARY dependence between the e-values
111//! (Wang–Ramdas 2022) — no independence assumptions about atoms sharing
112//! tokens, which is good because they all share every token. That
113//! arbitrary-dependence robustness is WHY the certificate uses e-BH and
114//! not a p-value BH: the p-version needs PRDS, which atom statistics
115//! flagrantly violate.
116//!
117//! **Design-optimal probing.** For a contested claim with competing
118//! structural hypotheses H₀/H₁ (e.g. "feature f is one curved atom" vs
119//! "two flat atoms"), a candidate steering intervention δ (within its
120//! certified validity radius, `steering.rs`) produces predicted output
121//! distributions P₀^δ, P₁^δ. The expected per-observation log-growth of
122//! the likelihood-ratio e-process under H₁ is exactly `KL(P₁^δ ‖ P₀^δ)`,
123//! so the optimal next experiment is
124//!
125//! ```text
126//!   δ* = argmax_{δ : ‖δ‖ ≤ r_valid}  KL(P₁^δ ‖ P₀^δ)
127//!       ≈ argmax_δ  ½ (μ₁(δ) − μ₀(δ))ᵀ F (μ₁(δ) − μ₀(δ))
128//! ```
129//!
130//! under the local Gaussian model with output-Fisher metric F (#980
131//! harvest) — the SAME quadratic form the steering dosimetry already
132//! computes, repurposed: dosimetry measures nats delivered, design
133//! maximizes nats of DISCRIMINATION. Probes that maximize raw effect are
134//! not optimal; probes that maximize the *disagreement between the
135//! hypotheses' predictions*, weighted by output information, are. (Full
136//! KL-optimal design over the probe manifold is a research arc; the greedy
137//! quadratic rule below is the correct first instrument and is exact in
138//! the local model.)
139//!
140//! # What this kills
141//!
142//! - Birth/death gates that are threshold heuristics → replaced by
143//!   anytime-valid tests with declared error rates (#976's "detectable,
144//!   correctable misspecification" becomes a literal hypothesis test).
145//! - "We trained until the features looked interpretable" → optional
146//!   stopping is SAFE; the certificate survives it.
147//! - "We found N features" → "we found N features at FDR ≤ α, certificate
148//!   attached, reproducible from the e-value ledger."
149//! - Probe selection by intuition → probe selection by information.
150
151use ndarray::{Array1, Array2};
152use serde::{Deserialize, Serialize};
153
154/// Running anytime-valid evidence against one null hypothesis, in log
155/// space. Multiplicative absorption of conditionally-valid e-values;
156/// Ville's inequality converts the running product into a sequential test
157/// that survives optional stopping. Serializable so evidence is resumable
158/// across corpus shards (#973): persist, reload, keep absorbing.
159#[derive(Clone, Debug, Serialize, Deserialize)]
160pub struct EProcess {
161    /// log E_t — the running log-evidence. Starts at 0 (E_0 = 1).
162    log_e: f64,
163    /// Number of absorbed batches (the ledger length).
164    steps: usize,
165    /// Running maximum of log E_t — Ville's inequality applies to the
166    /// supremum, so a claim once proven at level α stays proven even if
167    /// later evidence retreats (evidence is not p-hackable in reverse).
168    log_e_max: f64,
169}
170
171impl EProcess {
172    pub fn new() -> Self {
173        Self {
174            log_e: 0.0,
175            steps: 0,
176            log_e_max: 0.0,
177        }
178    }
179
180    /// Absorb one conditionally-valid e-value (NOT in log space; must be
181    /// ≥ 0; `E[e | past] ≤ 1` under H0 is the caller's contract — e.g. a
182    /// universal-inference batch ratio or a fixed-prior Bayes factor).
183    pub fn absorb(&mut self, e_value: f64) -> Result<(), String> {
184        if e_value.is_nan() || e_value < 0.0 {
185            return Err(format!("e-value must be in [0, ∞], got {e_value}"));
186        }
187        self.absorb_log(e_value.ln())
188    }
189
190    /// Absorb a batch e-value supplied in log space (the only numerically
191    /// honest interface for long streams).
192    pub fn absorb_log(&mut self, log_e_value: f64) -> Result<(), String> {
193        let next_log_e = checked_log_e_sum(self.log_e, log_e_value)?;
194        self.log_e = next_log_e;
195        self.steps += 1;
196        if self.log_e > self.log_e_max {
197            self.log_e_max = self.log_e;
198        }
199        Ok(())
200    }
201
202    pub fn log_evidence(&self) -> f64 {
203        self.log_e
204    }
205
206    pub fn steps(&self) -> usize {
207        self.steps
208    }
209
210    /// Anytime-valid rejection at level α: by Ville,
211    /// `P_{H0}(sup_t E_t ≥ 1/α) ≤ α`, so crossing 1/α at ANY time —
212    /// including data-dependent stopping times — proves the claim with
213    /// type-I error ≤ α. Uses the running supremum: once crossed, always
214    /// rejected.
215    pub fn rejects_at(&self, alpha: f64) -> bool {
216        alpha > 0.0 && self.log_e_max >= -(alpha.ln())
217    }
218
219    /// The e-value to hand to [`e_benjamini_hochberg`] for the
220    /// dictionary-level FDR certificate (current evidence, not the sup —
221    /// e-BH's guarantee is stated for e-values at the chosen stopping
222    /// time).
223    pub fn current_e_value_log(&self) -> f64 {
224        self.log_e
225    }
226}
227
228impl Default for EProcess {
229    fn default() -> Self {
230        Self::new()
231    }
232}
233
234fn checked_log_e_sum(current: f64, increment: f64) -> Result<f64, String> {
235    if current.is_nan() {
236        return Err("EProcess invariant violation: current log evidence is NaN".to_string());
237    }
238    if increment.is_nan() {
239        return Err("log e-value must not be NaN".to_string());
240    }
241    if current.is_infinite()
242        && increment.is_infinite()
243        && current.is_sign_positive() != increment.is_sign_positive()
244    {
245        return Err(format!(
246            "cannot combine opposing infinite log e-values: current {current}, increment {increment}"
247        ));
248    }
249    Ok(current + increment)
250}
251
252/// One universal-inference (split-likelihood-ratio) e-value: finite-sample
253/// valid with NO regularity conditions — the correct instrument for atom
254/// birth (K vs K+1 components, boundary/Davies regime where χ² fails).
255///
256/// `log_lik_alternative_on_eval`: log-likelihood of the EVALUATION fold
257/// under the alternative fitted on the ESTIMATION fold (any fitter — the
258/// production manifold/SAE fit, warm-started, GPU; zero conditions on it).
259/// `log_lik_null_sup_on_eval`: the SUPREMUM of the evaluation-fold
260/// log-likelihood over the NULL model class (the honest side: a real
261/// constrained fit on the eval fold, e.g. the K-atom dictionary refit on
262/// D₀). Then `log E = ℓ_alt(D₀) − sup_{H0} ℓ(D₀)` and `E_{H0}[E] ≤ 1`
263/// exactly.
264pub fn split_likelihood_log_e_value(
265    log_lik_alternative_on_eval: f64,
266    log_lik_null_sup_on_eval: f64,
267) -> f64 {
268    // Degenerate halves yield a well-defined, conservative e-value of 1
269    // (`log E = 0`, no evidence for the alternative) rather than a NaN that
270    // would poison the e-process product and the downstream FDR certificate:
271    //   * Either log-likelihood is NaN — the model could not be evaluated on
272    //     this shard (a numerically degenerate fit). The honest reading is
273    //     "no information", i.e. `E = 1`.
274    //   * Both halves are the SAME signed infinity — e.g. a shard/outcome
275    //     with zero density under both the alternative and the null
276    //     (`−∞ − (−∞)`), or both `+∞`. The ratio is undefined; again `E = 1`.
277    // This keeps `log E` finite so `absorb_log` never banks a NaN and the
278    // e-BH certificate sees a contributing-nothing claim instead of panicking.
279    if log_lik_alternative_on_eval.is_nan() || log_lik_null_sup_on_eval.is_nan() {
280        return 0.0;
281    }
282    if log_lik_alternative_on_eval.is_infinite()
283        && log_lik_null_sup_on_eval.is_infinite()
284        && log_lik_alternative_on_eval.is_sign_positive()
285            == log_lik_null_sup_on_eval.is_sign_positive()
286    {
287        return 0.0;
288    }
289    log_lik_alternative_on_eval - log_lik_null_sup_on_eval
290}
291
292/// Sequential universal inference over a stream of batches with a
293/// PREDICTABLE plug-in: at batch t the alternative parameters were fit
294/// using only data before t, the null sup is taken on batch t, and the
295/// per-batch ratios compound into an e-process. This is the streaming /
296/// optional-stopping form the corpus-scale pipeline (#973 shards) needs —
297/// evidence is resumable: serialize `EProcess`, keep absorbing on the next
298/// shard.
299#[derive(Clone, Debug, Serialize, Deserialize)]
300pub struct PredictablePluginEProcess {
301    pub process: EProcess,
302}
303
304impl PredictablePluginEProcess {
305    pub fn new() -> Self {
306        Self {
307            process: EProcess::new(),
308        }
309    }
310
311    /// Absorb one batch. The caller guarantees the alternative was fit
312    /// WITHOUT batch-t data (predictability — this is what makes the
313    /// product a supermartingale; violating it voids the guarantee, which
314    /// is why the SAE integration must hand this function the PREVIOUS
315    /// shard's fitted dictionary, never the current one).
316    pub fn try_absorb_batch(
317        &mut self,
318        log_lik_alternative_prefit: f64,
319        log_lik_null_sup_on_batch: f64,
320    ) -> Result<(), String> {
321        self.process.absorb_log(split_likelihood_log_e_value(
322            log_lik_alternative_prefit,
323            log_lik_null_sup_on_batch,
324        ))
325    }
326}
327
328impl Default for PredictablePluginEProcess {
329    fn default() -> Self {
330        Self::new()
331    }
332}
333
334/// The anytime-valid verdict on one structural claim. Deliberately
335/// two-valued — there is NO "rejected" arm. Demote-never-reject (#969
336/// philosophy): an e-process that has not crossed 1/α has failed to prove
337/// the claim, not disproven it; the claim stays contested, keeps its
338/// evidence, and earns a design-optimal probe budget instead of being
339/// silently dropped (or worse, silently accepted the way a threshold gate
340/// accepts whatever clears it).
341#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize)]
342pub enum GateVerdict {
343    /// The running supremum crossed 1/α: the claim is proven with type-I
344    /// error ≤ α, permanently (Ville applies to the sup, so later evidence
345    /// retreat cannot un-prove it).
346    Certified { log_e: f64 },
347    /// Not (yet) proven. Carries the CURRENT log-evidence — the value the
348    /// dictionary certificate's e-BH consumes, and the state a probe loop
349    /// resumes from.
350    Contested { log_e: f64 },
351}
352
353/// The atom-birth gate (#976's threshold comparison, replaced): a
354/// universal-inference e-process over corpus shards deciding "does the
355/// K+1-th atom exist?", the boundary/Davies-regime question where the χ²
356/// gate every dictionary paper runs is broken.
357///
358/// Per shard t the integration contract is exactly the work plan's:
359/// - `log_lik_alternative_prefit`: the K+1-atom dictionary fit on shards
360///   BEFORE t (the PREVIOUS shard's fit — predictability is the one rule;
361///   handing in the current shard's fit voids the guarantee), evaluated on
362///   shard t. Any fitter, warm starts, GPU — no conditions.
363/// - `log_lik_null_sup_on_shard`: the K-atom dictionary REFIT on shard t
364///   (the honest constrained sup on the evaluation data).
365///
366/// The gate never rejects: [`GateVerdict::Contested`] is the only
367/// alternative to certification, and a contested atom's next move is a
368/// probe plan ([`plan_probe_for_contested_claim`]), not deletion.
369#[derive(Clone, Debug, Serialize, Deserialize)]
370pub struct AtomBirthGate {
371    pub test: PredictablePluginEProcess,
372    /// The level the certificate is claimed at; fixed at construction so a
373    /// verdict can never be shopped across α after seeing the evidence.
374    alpha: f64,
375    /// First-passage time: the shard count at which the running supremum
376    /// first crossed the single-hypothesis Ville bar `ln(1/α)`. `None` until
377    /// the crossing happens. This is the realized *time-to-certification* —
378    /// the design-relevant quantity (`expected_resolution_budget`) — recorded
379    /// SEPARATELY from the running evidence so that absorption can continue
380    /// past the crossing without losing the crossing time. By Ville the
381    /// crossing is permanent (it is a property of the running sup), so this is
382    /// monotone: once `Some`, never reset. `#[serde(default)]` keeps gates
383    /// persisted before this field was added (#973 resumability) loadable.
384    #[serde(default)]
385    certified_at_step: Option<usize>,
386}
387
388impl AtomBirthGate {
389    pub fn new(alpha: f64) -> Result<Self, String> {
390        if !(alpha > 0.0 && alpha < 1.0) {
391            return Err(format!(
392                "AtomBirthGate: alpha must be in (0,1), got {alpha}"
393            ));
394        }
395        Ok(Self {
396            test: PredictablePluginEProcess::new(),
397            alpha,
398            certified_at_step: None,
399        })
400    }
401
402    pub fn alpha(&self) -> f64 {
403        self.alpha
404    }
405
406    /// The realized time-to-certification: the shard count at which the
407    /// running supremum first crossed `ln(1/α)`, or `None` if it never did.
408    /// This is the first-passage time the design budget
409    /// ([`expected_resolution_budget`]) predicts — distinct from
410    /// [`EProcess::steps`] (total shards absorbed), which keeps growing after
411    /// the crossing because absorption does not stop (continuing to accumulate
412    /// is what lets the dictionary-level e-BH certificate clear its higher
413    /// multiplicity bar `ln(m/(α·k))`).
414    pub fn certified_at_step(&self) -> Option<usize> {
415        self.certified_at_step
416    }
417
418    /// Absorb one shard's split-likelihood ratio (see type-level contract).
419    pub fn try_absorb_shard(
420        &mut self,
421        log_lik_alternative_prefit: f64,
422        log_lik_null_sup_on_shard: f64,
423    ) -> Result<(), String> {
424        self.test
425            .try_absorb_batch(log_lik_alternative_prefit, log_lik_null_sup_on_shard)?;
426        // Record the first-passage time once, the step the running sup first
427        // crosses the single-hypothesis bar. `rejects_at` reads the running
428        // maximum, so this latches permanently even if later evidence retreats.
429        if self.certified_at_step.is_none() && self.test.process.rejects_at(self.alpha) {
430            self.certified_at_step = Some(self.test.process.steps());
431        }
432        Ok(())
433    }
434
435    pub fn absorb_shard(
436        &mut self,
437        log_lik_alternative_prefit: f64,
438        log_lik_null_sup_on_shard: f64,
439    ) {
440        self.try_absorb_shard(log_lik_alternative_prefit, log_lik_null_sup_on_shard)
441            .expect("AtomBirthGate received invalid log evidence");
442    }
443
444    pub fn verdict(&self) -> GateVerdict {
445        if self.test.process.rejects_at(self.alpha) {
446            GateVerdict::Certified {
447                log_e: self.test.process.log_evidence(),
448            }
449        } else {
450            GateVerdict::Contested {
451                log_e: self.test.process.log_evidence(),
452            }
453        }
454    }
455}
456
457/// Run the atom-birth gate over a shard stream with the predictability
458/// contract enforced BY CONSTRUCTION: on each shard the alternative is
459/// evaluated strictly before it is refit with that shard, so the plug-in
460/// is always predictable and the product is always a supermartingale
461/// under H0. This is the orchestration the SAE structure search calls;
462/// the closures are the only fitter-specific surface.
463///
464/// - `alternative_log_lik(alt, shard)`: evaluation-fold log-likelihood of
465///   shard under the CURRENT alternative state (fit on prior shards only —
466///   guaranteed here by call order).
467/// - `null_sup_log_lik(shard)`: the honest constrained sup — the K-atom
468///   null REFIT on this shard (any fitter; this side must genuinely
469///   maximize over H0 on the shard, an under-maximized null inflates the
470///   e-value and voids validity).
471/// - `refit_alternative(alt, shard)`: fold the shard into the alternative
472///   (warm-started production fit, GPU, anything — zero conditions).
473///
474/// `initial_alternative` is the K+1 fit from data BEFORE the stream (or a
475/// prior-driven init; validity never depends on its quality — a bad init
476/// only costs power). Absorbs EVERY shard's evidence into the e-process — it
477/// does NOT stop at the single-hypothesis Ville crossing. An earlier
478/// early-stop ("the crossing is permanent; further shards only cost compute")
479/// conflated two distinct bars:
480///   * the per-move VERDICT ([`AtomBirthGate::verdict`] via
481///     [`EProcess::rejects_at`]) tests the running SUPREMUM against the
482///     single-hypothesis bar `ln(1/α)`, where the crossing is indeed
483///     permanent and the realized crossing step is captured separately in
484///     [`AtomBirthGate::certified_at_step`]; but
485///   * the dictionary-level CERTIFICATE ([`StructureLedger::certify`] →
486///     [`e_benjamini_hochberg`]) consumes the CURRENT log e-value
487///     ([`EProcess::current_e_value_log`]) under a multiplicity correction
488///     whose bar `ln(m/(α·k))` is strictly higher for m > 1 claims.
489/// Stopping at the single-hypothesis bar banked just enough evidence for the
490/// move's own verdict but starved the FDR certificate, so a genuinely real
491/// atom (e.g. 0.8 nats/shard × 10 shards = 8 nats) failed e-BH confirmation
492/// against a second claim (bar `ln(2/0.05) ≈ 3.69`) even though it cleared
493/// its own `ln(1/0.05) ≈ 3.00` bar — it carried ample evidence but the early
494/// stop threw the surplus away. Continuing to absorb is always sound: the
495/// e-process is a supermartingale under H0, so additional
496/// conditionally-valid increments preserve validity, and the predictability
497/// contract is untouched (the alternative is still evaluated before it is
498/// refit with each shard). Returns the gate (verdict + resumable evidence,
499/// with `certified_at_step` holding the realized time-to-certification) and
500/// the final alternative state, which has seen the whole stream.
501pub fn run_atom_birth_gate<S, A>(
502    alpha: f64,
503    initial_alternative: A,
504    shards: impl IntoIterator<Item = S>,
505    mut alternative_log_lik: impl FnMut(&A, &S) -> f64,
506    mut null_sup_log_lik: impl FnMut(&S) -> f64,
507    mut refit_alternative: impl FnMut(A, &S) -> A,
508) -> Result<(AtomBirthGate, A), String> {
509    let mut gate = AtomBirthGate::new(alpha)?;
510    let mut alt = initial_alternative;
511    for shard in shards {
512        let log_lik_alt = alternative_log_lik(&alt, &shard);
513        let log_lik_null = null_sup_log_lik(&shard);
514        gate.try_absorb_shard(log_lik_alt, log_lik_null)?;
515        alt = refit_alternative(alt, &shard);
516    }
517    Ok((gate, alt))
518}
519
520/// e-BH: FDR control over m structural claims under ARBITRARY dependence
521/// (Wang–Ramdas). Input: per-claim log-e-values. Output: indices of
522/// rejected (i.e. CONFIRMED-STRUCTURE) claims, FDR ≤ α.
523///
524/// Sort e-values descending; reject the top k* where
525/// `k* = max{ k : e_(k) ≥ m/(α·k) }`.
526///
527/// This is the "dictionary certificate": run one e-process per claimed
528/// atom (and per claimed binding edge, #975), call this at the chosen
529/// stopping time, and the dictionary ships with an FDR-controlled list of
530/// which of its claimed structures are statistically real. No
531/// independence assumptions — atoms sharing every token is fine; that is
532/// exactly the case p-value BH cannot legally handle (PRDS violation) and
533/// e-BH can.
534pub fn e_benjamini_hochberg(log_e_values: &[f64], alpha: f64) -> Vec<usize> {
535    let m = log_e_values.len();
536    if m == 0 || !(alpha.is_finite() && alpha > 0.0) {
537        return Vec::new();
538    }
539    // Robustness: a degenerate claim can bank a non-finite log e-value (a
540    // NaN from an upstream `(−∞) − (−∞)` split-LR that escaped the source
541    // guard). This is the documented honest FDR surface, so it must NEVER
542    // panic on such input. Treat any NaN as least-evidence (`−∞`): an
543    // undefined/contested claim contributes no evidence, sorts last, and can
544    // never be among the rejections. Finite/±∞ values pass through unchanged;
545    // `total_cmp` then gives a total order on the sanitized keys.
546    let sanitized: Vec<f64> = log_e_values
547        .iter()
548        .map(|&v| if v.is_nan() { f64::NEG_INFINITY } else { v })
549        .collect();
550    let mut order: Vec<usize> = (0..m).collect();
551    order.sort_by(|&a, &b| sanitized[b].total_cmp(&sanitized[a]));
552    let m_f = m as f64;
553    let mut k_star = 0usize;
554    for (rank0, &idx) in order.iter().enumerate() {
555        let k = (rank0 + 1) as f64;
556        // e_(k) ≥ m / (α k)  ⟺  log e_(k) ≥ log m − log α − log k
557        if sanitized[idx] >= m_f.ln() - alpha.ln() - k.ln() {
558            k_star = rank0 + 1;
559        }
560    }
561    order.truncate(k_star);
562    order
563}
564
565/// What one structural claim asserts about the dictionary. One e-process
566/// runs per claim; the kinds mirror the discovery stack's claim surface:
567/// atom existence (#976 birth), binding edges (#975), geometry
568/// adjudication (#907). `Custom` keeps the ledger open to claim types
569/// that do not exist yet without an enum churn per new discovery gate.
570#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize)]
571pub enum ClaimKind {
572    /// "Atom `atom` is statistically real" — the K vs K+1 birth claim.
573    AtomExists { atom: usize },
574    /// "Atoms `a` and `b` are bound" — a #975 binding edge.
575    BindingEdge { a: usize, b: usize },
576    /// "Atom `atom`'s latent geometry is `kind`" (e.g. "circle",
577    /// "clusters", "line") — a #907 adjudication claim.
578    GeometryKind { atom: usize, kind: String },
579    /// Any other structural claim, labeled.
580    Custom { label: String },
581}
582
583/// One claim plus its running evidence.
584#[derive(Clone, Debug, Serialize, Deserialize)]
585pub struct StructuralClaim {
586    pub kind: ClaimKind,
587    pub evidence: EProcess,
588}
589
590/// The dictionary's claim ledger: every structural claim the discovery
591/// stack makes, each with its own e-process. Serializable — evidence
592/// resumes across corpus shards (#973) by persisting the ledger, not by
593/// refitting. Calling [`StructureLedger::certify`] at ANY data-dependent
594/// stopping time yields a valid certificate; that is the entire point.
595#[derive(Clone, Debug, Default, Serialize, Deserialize)]
596pub struct StructureLedger {
597    claims: Vec<StructuralClaim>,
598}
599
600impl StructureLedger {
601    pub fn new() -> Self {
602        Self { claims: Vec::new() }
603    }
604
605    /// Register a claim and return its ledger index. Idempotent on the
606    /// claim kind: re-registering an existing claim (a resumed shard loop
607    /// re-announcing its claim surface) returns the existing index and
608    /// PRESERVES its accumulated evidence — a fresh e-process here would
609    /// silently discard the stream's history.
610    pub fn register(&mut self, kind: ClaimKind) -> usize {
611        if let Some(idx) = self.claims.iter().position(|c| c.kind == kind) {
612            return idx;
613        }
614        self.claims.push(StructuralClaim {
615            kind,
616            evidence: EProcess::new(),
617        });
618        self.claims.len() - 1
619    }
620
621    /// Absorb one conditionally-valid log e-value for claim `idx` (a
622    /// universal-inference shard ratio, a frozen-prior log-BF — the
623    /// caller's contract is per-source, documented on the producing gate).
624    pub fn absorb_log(&mut self, idx: usize, log_e_value: f64) -> Result<(), String> {
625        let n = self.claims.len();
626        let claim = self.claims.get_mut(idx).ok_or_else(|| {
627            format!("StructureLedger: claim index {idx} out of range ({n} claims)")
628        })?;
629        claim.evidence.absorb_log(log_e_value)
630    }
631
632    pub fn claims(&self) -> &[StructuralClaim] {
633        &self.claims
634    }
635
636    /// The likelihood half of the probe-design loop (work-plan step 4):
637    /// after running a planned probe ([`ProbePlan`] →
638    /// `crate::inference::steering::steer_delta`), evaluate the REALIZED
639    /// outcomes under both hypotheses' predictive densities and absorb the
640    /// log-ratio into the contested claim's e-process.
641    ///
642    /// Validity contract: both predictive densities must be FROZEN before the
643    /// probe outcome is observed — which the design loop satisfies by
644    /// construction, since both hypotheses' dictionaries were fitted before
645    /// the probe was even chosen. For a composite null, the null density must
646    /// be the honest constrained fit (the same rule as
647    /// [`split_likelihood_log_e_value`], which this delegates to); for a
648    /// simple null the predictive density is the sup. Probe outcomes are new
649    /// data by construction (the model was steered to produce them), so they
650    /// compound validly with the claim's prior shard evidence.
651    pub fn absorb_probe_outcome(
652        &mut self,
653        idx: usize,
654        log_lik_alt_on_outcome: f64,
655        log_lik_null_on_outcome: f64,
656    ) -> Result<(), String> {
657        self.absorb_log(
658            idx,
659            split_likelihood_log_e_value(log_lik_alt_on_outcome, log_lik_null_on_outcome),
660        )
661    }
662
663    /// The dictionary certificate: e-BH over the ledger's CURRENT
664    /// e-values at level α. FDR ≤ α over the confirmed set under arbitrary
665    /// dependence — atoms sharing every token is fine — and valid at any
666    /// stopping time because each entry is an e-process. Claims not
667    /// confirmed are CONTESTED, never rejected (demote-never-reject); they
668    /// keep their evidence and are the inputs to the probe-design loop.
669    pub fn certify(&self, alpha: f64) -> StructureCertificate {
670        let log_e: Vec<f64> = self
671            .claims
672            .iter()
673            .map(|c| c.evidence.current_e_value_log())
674            .collect();
675        let confirmed_idx = e_benjamini_hochberg(&log_e, alpha);
676        let mut entries: Vec<CertificateEntry> = self
677            .claims
678            .iter()
679            .zip(&log_e)
680            .map(|(c, &le)| CertificateEntry {
681                kind: c.kind.clone(),
682                log_e: le,
683                steps: c.evidence.steps(),
684                confirmed: false,
685            })
686            .collect();
687        for &i in &confirmed_idx {
688            entries[i].confirmed = true;
689        }
690        StructureCertificate { alpha, entries }
691    }
692}
693
694/// One line of the certificate's e-value ledger: the claim, its
695/// log-evidence at the stop, how many batches produced it, and the e-BH
696/// outcome. The full entry list IS the reproducibility artifact: anyone
697/// holding it can re-run [`e_benjamini_hochberg`] and re-derive the
698/// confirmed set.
699#[derive(Clone, Debug, Serialize, Deserialize)]
700pub struct CertificateEntry {
701    pub kind: ClaimKind,
702    pub log_e: f64,
703    pub steps: usize,
704    pub confirmed: bool,
705}
706
707/// The deliverable: "we found N structures at FDR ≤ α, certificate
708/// attached". Ships next to the identifiability certificate
709/// ([`crate::terms::sae::identifiability::residual_gauge`], #981) — that one says
710/// what the GAUGE cannot distinguish, this one says what the DATA can.
711#[derive(Clone, Debug, Serialize, Deserialize)]
712pub struct StructureCertificate {
713    pub alpha: f64,
714    pub entries: Vec<CertificateEntry>,
715}
716
717impl StructureCertificate {
718    pub fn confirmed(&self) -> impl Iterator<Item = &CertificateEntry> {
719        self.entries.iter().filter(|e| e.confirmed)
720    }
721
722    pub fn contested(&self) -> impl Iterator<Item = &CertificateEntry> {
723        self.entries.iter().filter(|e| !e.confirmed)
724    }
725}
726
727/// Calibrate one (super)uniform p-value into a single e-value, in log
728/// space: `e(p) = ½ p^{−1/2}` (the κ = ½ member of the calibrator family
729/// `e_κ(p) = κ p^{κ−1}`; `∫₀¹ e_κ(p) dp = 1`, so `E_{H0}[e(P)] ≤ 1` for
730/// any valid p — superuniformity only, no other conditions).
731///
732/// This is the bridge from p-value-shaped instruments into the ledger —
733/// e.g. the feature-binding Wald test (`terms::structure::anova_atom::carve`'s
734/// `edge_p_value` → a [`ClaimKind::BindingEdge`] entry). It spends
735/// calibration slack (a p of 0.01 becomes e = 5, not 100), which is the
736/// honest price of converting a fixed-sample test into anytime-valid
737/// currency; instruments that can produce e-values natively should.
738/// CONTRACT: one calibrated e-value per INDEPENDENT data batch — feeding
739/// repeated tests of the same accumulating data into one e-process is the
740/// p-hacking this module exists to kill.
741pub fn log_e_from_p_calibrator(p_value: f64) -> Result<f64, String> {
742    if !(p_value > 0.0) || p_value > 1.0 {
743        return Err(format!("p-value must be in (0, 1], got {p_value}"));
744    }
745    Ok(0.5f64.ln() - 0.5 * p_value.ln())
746}
747
748/// A candidate steering probe for resolving one contested structural
749/// claim: the intervention direction (in the steering primitive's
750/// coordinates), and the two hypotheses' PREDICTED output-mean responses
751/// to it.
752pub struct CandidateProbe {
753    /// Steering displacement δ, to be applied via
754    /// `crate::inference::steering` (which enforces its own validity
755    /// radius and reports realized dosimetry).
756    pub delta: Array1<f64>,
757    /// Predicted output-mean response under the null structure, μ₀(δ).
758    pub predicted_mean_null: Array1<f64>,
759    /// Predicted output-mean response under the alternative, μ₁(δ).
760    pub predicted_mean_alt: Array1<f64>,
761}
762
763/// Greedy KL-optimal experimental design under the local Gaussian
764/// output-Fisher model: pick the probe maximizing
765/// `½ (μ₁(δ) − μ₀(δ))ᵀ F (μ₁(δ) − μ₀(δ))` — the expected per-observation
766/// log-growth of the deciding e-process under the alternative.
767///
768/// `fisher` is the output-Fisher metric at the operating point (#980
769/// harvest; the same object steering dosimetry contracts against). Probes
770/// whose hypotheses predict the SAME response score zero no matter how
771/// large their raw effect — the design rule selects for DISCRIMINATION,
772/// not impact, which is the entire point: a maximally-steered output that
773/// both hypotheses predict identically teaches nothing.
774///
775/// Returns the index of the best probe and its expected log-growth (nats
776/// per observation), or None if no probe discriminates.
777pub fn select_probe_by_expected_evidence(
778    probes: &[CandidateProbe],
779    fisher: &Array2<f64>,
780) -> Option<(usize, f64)> {
781    let mut best: Option<(usize, f64)> = None;
782    for (idx, probe) in probes.iter().enumerate() {
783        let diff = &probe.predicted_mean_alt - &probe.predicted_mean_null;
784        if diff.len() != fisher.nrows() {
785            continue;
786        }
787        let f_diff = fisher.dot(&diff);
788        let growth = 0.5 * diff.dot(&f_diff);
789        if growth.is_finite() && growth > 0.0 {
790            match best {
791                Some((_, g)) if g >= growth => {}
792                _ => best = Some((idx, growth)),
793            }
794        }
795    }
796    best
797}
798
799/// Expected number of observations for the chosen probe to push a claim's
800/// e-process across the 1/α Ville threshold, under the alternative: the
801/// design-time budget `log(1/α) / growth_rate`. This is what turns the
802/// abstract guarantee into an experiment plan ("this probe should resolve
803/// the claim in ~N tokens; if it hasn't, the alternative is weaker than
804/// hypothesized — itself evidence").
805pub fn expected_resolution_budget(alpha: f64, growth_nats_per_obs: f64) -> Option<f64> {
806    if alpha <= 0.0 || alpha >= 1.0 || growth_nats_per_obs <= 0.0 {
807        return None;
808    }
809    Some(-(alpha.ln()) / growth_nats_per_obs)
810}
811
812/// The experiment plan for one contested claim: which probe to run, the
813/// expected per-observation evidence growth under the alternative, and the
814/// design-time resolution budget. This is the loop's actionable output —
815/// hand `probes[probe]`'s δ to `crate::inference::steering::steer_delta`
816/// (which enforces the validity radius and reports realized dosimetry),
817/// evaluate both hypotheses' likelihoods on the realized outputs, absorb
818/// the log-ratio into the claim's e-process, re-certify.
819#[derive(Clone, Debug, PartialEq)]
820pub struct ProbePlan {
821    /// Index into the candidate probe list.
822    pub probe: usize,
823    /// Expected log-growth of the deciding e-process, nats/observation,
824    /// under the alternative (the KL of the hypotheses' predicted
825    /// responses in the output-Fisher metric).
826    pub expected_log_growth: f64,
827    /// Expected observations to cross 1/α from ZERO evidence — the
828    /// conservative from-scratch budget.
829    pub budget_from_scratch: f64,
830    /// Expected observations to cross 1/α from the claim's CURRENT
831    /// log-evidence — the remaining budget; 0 when already across.
832    pub budget_remaining: f64,
833}
834
835/// Close the design loop for one contested claim: pick the probe whose
836/// predicted hypothesis-disagreement (not raw effect) buys evidence
837/// fastest, and convert the claim's current evidence into a remaining
838/// budget — "this probe should resolve the claim in ~N more observations
839/// at level α; if it does not, the alternative is weaker than
840/// hypothesized, which is itself evidence."
841///
842/// `current_log_e` is the contested claim's running log-evidence (from its
843/// [`StructuralClaim`] / [`GateVerdict::Contested`]). Returns `None` when
844/// no probe discriminates (all candidates score zero growth: the
845/// hypotheses agree on everything reachable inside the validity radius —
846/// the claim is undecidable by steering and needs a different instrument,
847/// which is a finding, not a failure).
848pub fn plan_probe_for_contested_claim(
849    probes: &[CandidateProbe],
850    fisher: &Array2<f64>,
851    alpha: f64,
852    current_log_e: f64,
853) -> Option<ProbePlan> {
854    let (probe, expected_log_growth) = select_probe_by_expected_evidence(probes, fisher)?;
855    let budget_from_scratch = expected_resolution_budget(alpha, expected_log_growth)?;
856    let nats_remaining = (-(alpha.ln()) - current_log_e).max(0.0);
857    Some(ProbePlan {
858        probe,
859        expected_log_growth,
860        budget_from_scratch,
861        budget_remaining: nats_remaining / expected_log_growth,
862    })
863}
864
865#[cfg(test)]
866mod tests {
867    use super::*;
868    use ndarray::array;
869
870    /// e-BH on a hand-checkable configuration.
871    #[test]
872    fn e_bh_rejects_exactly_the_qualifying_prefix() {
873        // m = 4, α = 0.1 → thresholds m/(αk) = 40, 20, 13.33, 10.
874        let log_e = [45.0f64.ln(), 21.0f64.ln(), 12.0f64.ln(), 1.0f64.ln()];
875        let rejected = e_benjamini_hochberg(&log_e, 0.1);
876        // e_(1)=45 ≥ 40 ✓, e_(2)=21 ≥ 20 ✓, e_(3)=12 < 13.33 ✗ → k* = 2.
877        assert_eq!(rejected, vec![0, 1]);
878
879        // A weaker tail cannot drag in a stronger prefix decision.
880        let log_e2 = [45.0f64.ln(), 5.0f64.ln(), 2.0f64.ln(), 1.0f64.ln()];
881        assert_eq!(e_benjamini_hochberg(&log_e2, 0.1), vec![0]);
882    }
883
884    #[test]
885    fn split_likelihood_equal_impossibility_is_neutral_log_evidence() {
886        let log_e = split_likelihood_log_e_value(f64::NEG_INFINITY, f64::NEG_INFINITY);
887        assert_eq!(log_e, 0.0);
888        assert!(log_e.is_finite());
889
890        let mut proc = EProcess::new();
891        proc.absorb_log(log_e).unwrap();
892        assert_eq!(proc.log_evidence(), 0.0);
893        assert_eq!(proc.steps(), 1);
894    }
895
896    #[test]
897    fn e_bh_orders_infinite_log_e_values_without_comparator_panic() {
898        let log_e = [f64::NEG_INFINITY, f64::INFINITY, 45.0f64.ln(), 1.0f64.ln()];
899        assert_eq!(e_benjamini_hochberg(&log_e, 0.1), vec![1, 2]);
900    }
901
902    /// A NaN log e-value (a degenerate claim whose `(−∞) − (−∞)` split-LR
903    /// escaped the source guard) must NOT panic the e-BH comparator — the
904    /// honest FDR surface stays robust. The NaN claim is treated as
905    /// least-evidence (`−∞`): it is never rejected, and it cannot disturb the
906    /// rejection of a genuinely strong claim.
907    #[test]
908    fn e_bh_treats_nan_as_least_evidence_without_panicking() {
909        // m = 2, α = 0.1 → threshold for the top claim is m/(αk) = 2/0.1 = 20.
910        // Claim 0 has e = 45 ≥ 20 (rejected); claim 1 is NaN (no evidence).
911        let log_e = [45.0f64.ln(), f64::NAN];
912        let rejected = e_benjamini_hochberg(&log_e, 0.1);
913        assert_eq!(
914            rejected,
915            vec![0],
916            "strong claim survives; NaN claim never rejected"
917        );
918
919        // An all-NaN ledger yields an empty (no-rejection) certificate, not a
920        // panic.
921        let all_nan = [f64::NAN, f64::NAN, f64::NAN];
922        assert!(e_benjamini_hochberg(&all_nan, 0.1).is_empty());
923    }
924
925    /// The full source→consumer chain: a shard with zero density under both
926    /// the alternative and the null produces `(−∞) − (−∞)`, which the split-LR
927    /// resolves to neutral `log E = 0` rather than NaN; banking it and
928    /// certifying must not panic. A genuinely-NaN log-likelihood is likewise
929    /// resolved to neutral evidence at the source.
930    #[test]
931    fn degenerate_split_lr_flows_through_certify_without_nan_panic() {
932        // (−∞) − (−∞): zero density under both hypotheses → neutral.
933        let neutral = split_likelihood_log_e_value(f64::NEG_INFINITY, f64::NEG_INFINITY);
934        assert_eq!(neutral, 0.0);
935        // NaN log-likelihood (un-evaluable degenerate fit) → neutral, finite.
936        let from_nan = split_likelihood_log_e_value(f64::NAN, -3.0);
937        assert!(from_nan.is_finite());
938        assert_eq!(from_nan, 0.0);
939
940        let mut ledger = StructureLedger::new();
941        let degenerate = ledger.register(ClaimKind::AtomExists { atom: 0 });
942        let strong = ledger.register(ClaimKind::AtomExists { atom: 1 });
943        // Bank the neutral split-LR on the degenerate claim — no NaN reaches
944        // the e-process.
945        ledger.absorb_log(degenerate, neutral).unwrap();
946        ledger.absorb_log(strong, 45.0f64.ln()).unwrap();
947        // certify() runs e_benjamini_hochberg internally; must not panic.
948        let certificate = ledger.certify(0.1);
949        let degenerate_entry = certificate
950            .entries
951            .iter()
952            .find(|e| e.kind == ClaimKind::AtomExists { atom: 0 })
953            .expect("degenerate claim present");
954        // Neutral evidence (log_e = 0) never qualifies → contested, not confirmed.
955        assert!(!degenerate_entry.confirmed);
956        assert_eq!(degenerate_entry.log_e, 0.0);
957    }
958
959    #[test]
960    fn e_process_absorb_log_rejects_undefined_log_products() {
961        let mut proc = EProcess::new();
962        assert!(proc.absorb_log(f64::NAN).is_err());
963
964        proc.absorb_log(f64::INFINITY).unwrap();
965        assert!(proc.absorb_log(f64::NEG_INFINITY).is_err());
966        assert_eq!(proc.log_evidence(), f64::INFINITY);
967        assert_eq!(proc.steps(), 1);
968    }
969
970    /// Ville-style sanity: under H0 (simulated fair e-values from a
971    /// likelihood ratio of identical Gaussians), the e-process crosses
972    /// 1/α rarely; under a true alternative it crosses fast and the
973    /// crossing is PERMANENT (running-sup semantics).
974    #[test]
975    fn e_process_crossing_is_permanent_and_directional() {
976        // Deterministic "stream": per-batch log-LR of N(μ,1) vs N(0,1)
977        // evaluated at x drawn from the alternative: log e = μ x − μ²/2.
978        // Use a fixed quasi-random sequence; no RNG state needed.
979        let mu = 0.6f64;
980        let mut proc_alt = EProcess::new();
981        let mut crossed_at: Option<usize> = None;
982        for t in 0..200 {
983            // x_t ~ alternative-ish deterministic surrogate around μ
984            let x = mu + 0.9 * ((t as f64 * 0.7321).sin());
985            proc_alt.absorb_log(mu * x - 0.5 * mu * mu).unwrap();
986            if proc_alt.rejects_at(0.05) && crossed_at.is_none() {
987                crossed_at = Some(t);
988            }
989        }
990        let t_cross = crossed_at.expect("true alternative must cross 1/α");
991        assert!(t_cross < 100, "evidence should accumulate quickly");
992        // Permanence: rejection holds at the end even if late evidence dips.
993        assert!(proc_alt.rejects_at(0.05));
994
995        // Null stream: x centered at 0 → expected log e = −μ²/2 < 0.
996        let mut proc_null = EProcess::new();
997        for t in 0..200 {
998            let x = 0.9 * ((t as f64 * 0.7321).sin());
999            proc_null.absorb_log(mu * x - 0.5 * mu * mu).unwrap();
1000        }
1001        assert!(
1002            !proc_null.rejects_at(0.05),
1003            "null stream must not accumulate evidence (log E = {:.3})",
1004            proc_null.log_evidence()
1005        );
1006    }
1007
1008    /// The design rule selects discrimination, not raw effect.
1009    #[test]
1010    fn probe_selection_prefers_discrimination_over_impact() {
1011        let fisher = array![[2.0, 0.0], [0.0, 0.5]];
1012        let probes = vec![
1013            // Huge effect, but both hypotheses predict it identically.
1014            CandidateProbe {
1015                delta: array![1.0, 0.0],
1016                predicted_mean_null: array![10.0, 10.0],
1017                predicted_mean_alt: array![10.0, 10.0],
1018            },
1019            // Modest effect, hypotheses disagree along the informative axis.
1020            CandidateProbe {
1021                delta: array![0.0, 1.0],
1022                predicted_mean_null: array![0.0, 0.0],
1023                predicted_mean_alt: array![1.0, 0.2],
1024            },
1025        ];
1026        let (idx, growth) =
1027            select_probe_by_expected_evidence(&probes, &fisher).expect("a probe discriminates");
1028        assert_eq!(idx, 1);
1029        // ½·(1,0.2)ᵀ diag(2,0.5) (1,0.2) = ½·(2 + 0.02) = 1.01 nats/obs.
1030        assert!((growth - 1.01).abs() < 1e-12);
1031        // Budget: ~3 observations to certify at α=0.05.
1032        let budget = expected_resolution_budget(0.05, growth).expect("budget");
1033        assert!(budget > 2.0 && budget < 4.0);
1034    }
1035
1036    /// The birth gate certifies under a true alternative, stays contested
1037    /// under the null, and never emits anything but those two verdicts.
1038    #[test]
1039    fn birth_gate_certifies_alternative_and_demotes_never_rejects() {
1040        let mut gate = AtomBirthGate::new(0.05).expect("valid alpha");
1041        // Strong shards: alternative beats the honest null sup by 1 nat each.
1042        for _ in 0..5 {
1043            gate.absorb_shard(-100.0, -101.0);
1044        }
1045        match gate.verdict() {
1046            GateVerdict::Certified { log_e } => assert!((log_e - 5.0).abs() < 1e-12),
1047            v => panic!("5 nats must certify at α=0.05, got {v:?}"),
1048        }
1049        // Permanence: a later evidence retreat cannot un-certify.
1050        gate.absorb_shard(-110.0, -100.0);
1051        assert!(matches!(gate.verdict(), GateVerdict::Certified { .. }));
1052
1053        // Null-ish stream: the prefit alternative loses to the on-shard sup
1054        // (it must, on average — the sup is fit on the eval shard itself).
1055        let mut null_gate = AtomBirthGate::new(0.05).expect("valid alpha");
1056        for _ in 0..50 {
1057            null_gate.absorb_shard(-100.3, -100.0);
1058        }
1059        match null_gate.verdict() {
1060            GateVerdict::Contested { log_e } => assert!(log_e < 0.0),
1061            v => panic!("null stream must stay contested, got {v:?}"),
1062        }
1063        assert!(AtomBirthGate::new(0.0).is_err());
1064        assert!(AtomBirthGate::new(1.0).is_err());
1065    }
1066
1067    /// Ledger: idempotent registration preserves evidence; the certificate
1068    /// splits confirmed/contested by e-BH and the entry list reproduces it.
1069    #[test]
1070    fn ledger_certificate_splits_confirmed_and_contested() {
1071        let mut ledger = StructureLedger::new();
1072        let a0 = ledger.register(ClaimKind::AtomExists { atom: 0 });
1073        let a1 = ledger.register(ClaimKind::AtomExists { atom: 1 });
1074        let edge = ledger.register(ClaimKind::BindingEdge { a: 0, b: 1 });
1075
1076        // m = 3, α = 0.1 → e-BH thresholds m/(αk) = 30, 15, 10.
1077        ledger.absorb_log(a0, 40.0f64.ln()).unwrap();
1078        ledger.absorb_log(a1, 20.0f64.ln()).unwrap();
1079        ledger.absorb_log(edge, 2.0f64.ln()).unwrap();
1080
1081        // Re-registering must return the same slot with evidence intact.
1082        let a0_again = ledger.register(ClaimKind::AtomExists { atom: 0 });
1083        assert_eq!(a0_again, a0);
1084        assert_eq!(ledger.claims()[a0].evidence.steps(), 1);
1085
1086        let cert = ledger.certify(0.1);
1087        // e_(1)=40 ≥ 30 ✓, e_(2)=20 ≥ 15 ✓, e_(3)=2 < 10 ✗ → atoms confirmed,
1088        // the binding edge stays contested.
1089        let confirmed: Vec<&ClaimKind> = cert.confirmed().map(|e| &e.kind).collect();
1090        assert_eq!(confirmed.len(), 2);
1091        assert!(confirmed.contains(&&ClaimKind::AtomExists { atom: 0 }));
1092        assert!(confirmed.contains(&&ClaimKind::AtomExists { atom: 1 }));
1093        let contested: Vec<&CertificateEntry> = cert.contested().collect();
1094        assert_eq!(contested.len(), 1);
1095        assert_eq!(contested[0].kind, ClaimKind::BindingEdge { a: 0, b: 1 });
1096
1097        assert!(ledger.absorb_log(99, 0.0).is_err());
1098    }
1099
1100    /// Resumability: a serialized ledger reloads with its evidence and
1101    /// keeps absorbing — the #973 shard contract.
1102    #[test]
1103    fn ledger_evidence_resumes_across_serialization() {
1104        let mut ledger = StructureLedger::new();
1105        let idx = ledger.register(ClaimKind::GeometryKind {
1106            atom: 3,
1107            kind: "circle".to_string(),
1108        });
1109        ledger.absorb_log(idx, 1.25).unwrap();
1110
1111        let persisted = serde_json::to_string(&ledger).expect("serialize ledger");
1112        let mut resumed: StructureLedger =
1113            serde_json::from_str(&persisted).expect("deserialize ledger");
1114        assert_eq!(resumed.claims()[idx].evidence.steps(), 1);
1115
1116        resumed.absorb_log(idx, 0.75).unwrap();
1117        let log_e = resumed.claims()[idx].evidence.log_evidence();
1118        assert!((log_e - 2.0).abs() < 1e-12);
1119    }
1120
1121    /// The probe plan discounts the remaining budget by evidence already
1122    /// banked, and floors at zero once the claim is across the line.
1123    #[test]
1124    fn probe_plan_discounts_remaining_budget_by_current_evidence() {
1125        let fisher = array![[2.0, 0.0], [0.0, 0.5]];
1126        let probes = vec![CandidateProbe {
1127            delta: array![0.0, 1.0],
1128            predicted_mean_null: array![0.0, 0.0],
1129            predicted_mean_alt: array![1.0, 0.2],
1130        }];
1131        // growth = 1.01 nats/obs (checked above); α=0.05 → need ln(20) ≈ 3.0 nats.
1132        let from_zero = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 0.0).expect("plan");
1133        assert_eq!(from_zero.probe, 0);
1134        assert!((from_zero.budget_remaining - from_zero.budget_from_scratch).abs() < 1e-12);
1135
1136        let halfway = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 1.5).expect("plan");
1137        assert!(halfway.budget_remaining < from_zero.budget_remaining);
1138        assert!((halfway.budget_remaining - (-(0.05f64.ln()) - 1.5) / 1.01).abs() < 1e-12);
1139
1140        let across = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 10.0).expect("plan");
1141        assert_eq!(across.budget_remaining, 0.0);
1142
1143        // No discriminating probe → no plan (undecidable by steering).
1144        let blind = vec![CandidateProbe {
1145            delta: array![1.0, 0.0],
1146            predicted_mean_null: array![5.0, 5.0],
1147            predicted_mean_alt: array![5.0, 5.0],
1148        }];
1149        assert!(plan_probe_for_contested_claim(&blind, &fisher, 0.05, 0.0).is_none());
1150    }
1151
1152    /// The p→e calibrator on hand-checkable values, including its edges.
1153    #[test]
1154    fn p_to_e_calibrator_hand_values() {
1155        // e(p) = ½ p^{−1/2}: p = 1 → e = 0.5; p = 0.04 → e = 2.5; p = 1e-4 → e = 50.
1156        assert!((log_e_from_p_calibrator(1.0).unwrap() - 0.5f64.ln()).abs() < 1e-12);
1157        assert!((log_e_from_p_calibrator(0.04).unwrap() - 2.5f64.ln()).abs() < 1e-12);
1158        assert!((log_e_from_p_calibrator(1e-4).unwrap() - 50.0f64.ln()).abs() < 1e-12);
1159        assert!(log_e_from_p_calibrator(0.0).is_err());
1160        assert!(log_e_from_p_calibrator(1.5).is_err());
1161        assert!(log_e_from_p_calibrator(f64::NAN).is_err());
1162    }
1163
1164    /// The e-value validity condition: under the null `P ~ Uniform(0, 1]`,
1165    /// the calibrated e-value must satisfy `E_{H0}[e(P)] = ∫₀¹ e(p) dp ≤ 1`
1166    /// (Wang–Ramdas e-BH controls FDR ONLY for genuine e-values). The κ = ½
1167    /// member `e(p) = ½ p^{−1/2}` integrates to exactly 1, the boundary of
1168    /// admissibility. We verify this numerically with a midpoint Riemann sum
1169    /// over the unit interval (which UNDER-estimates `∫ p^{−1/2}` slightly
1170    /// because the integrand is convex, so the analytic value 1 sits just
1171    /// above the quadrature estimate — both safely ≤ 1 + tolerance).
1172    ///
1173    /// This is the property `e = 1/p` VIOLATES — `∫₀¹ (1/p) dp = ∞` — which
1174    /// is why the behavioral-head and anova-atom paths route through this
1175    /// calibrator rather than `−ln p`.
1176    #[test]
1177    fn p_to_e_calibrator_null_expectation_at_most_one() {
1178        let n = 2_000_000usize;
1179        let h = 1.0 / n as f64;
1180        let mut mean_e = 0.0_f64;
1181        for i in 0..n {
1182            // Midpoint of cell i: p = (i + 0.5)/n, always in (0, 1).
1183            let p = (i as f64 + 0.5) * h;
1184            let e = log_e_from_p_calibrator(p).unwrap().exp();
1185            mean_e += e * h;
1186        }
1187        // Analytic E_{H0}[e(P)] = 1; allow a small quadrature tolerance, but
1188        // it MUST NOT exceed 1 by more than that (an invalid calibrator like
1189        // 1/p would diverge here, not land near 1).
1190        assert!(
1191            mean_e <= 1.0 + 1e-3,
1192            "calibrated e-value null expectation {mean_e} exceeds 1 — not a valid e-value"
1193        );
1194        assert!(
1195            mean_e > 0.99,
1196            "calibrated e-value null expectation {mean_e} far below the analytic 1.0"
1197        );
1198    }
1199
1200    /// POWER STUDY, null side: the heuristic gate every dictionary paper
1201    /// runs — "accept the K+1-th atom the first time the cumulative
1202    /// likelihood ratio shows improvement" — versus the e-gate, on a
1203    /// family of NULL streams peeked at after every shard. The per-shard
1204    /// log-LR is `μ x_t − μ²/2` with `x_t = A sin(ω t + φ)` (a
1205    /// deterministic null surrogate: mean drift −μ²/2 < 0, bounded
1206    /// fluctuation). The naive gate's false-accept mechanism is exactly
1207    /// optional stopping: any phase whose partial sums wander above zero
1208    /// at ANY peek accepts a nonexistent atom. The e-gate needs
1209    /// log(1/α) ≈ 3.0 nats, and the partial-sum fluctuation is bounded by
1210    /// `μ·A/sin(ω/2) ≈ 1.51` nats (Dirichlet-kernel bound) BEFORE the
1211    /// negative drift — so it can never certify on any phase, which is
1212    /// Ville's inequality made concrete.
1213    #[test]
1214    fn power_study_null_naive_peeking_gate_false_accepts_e_gate_never() {
1215        let mu = 0.6f64;
1216        let amp = 0.9f64;
1217        let omega = 0.7321f64;
1218        let n_phases = 60usize;
1219        let n_shards = 200usize;
1220
1221        let mut naive_false_accepts = 0usize;
1222        let mut e_gate_false_accepts = 0usize;
1223        for k in 0..n_phases {
1224            let phase = 2.0 * std::f64::consts::PI * (k as f64) / (n_phases as f64);
1225            let mut gate = AtomBirthGate::new(0.05).expect("alpha");
1226            let mut cum_log_lr = 0.0f64;
1227            let mut naive_accepted = false;
1228            for t in 0..n_shards {
1229                let x = amp * ((t as f64) * omega + phase).sin();
1230                let log_lr = mu * x - 0.5 * mu * mu;
1231                cum_log_lr += log_lr;
1232                // The broken test: peek, accept on any improvement.
1233                if cum_log_lr > 0.0 {
1234                    naive_accepted = true;
1235                }
1236                gate.absorb_shard(log_lr, 0.0);
1237            }
1238            if naive_accepted {
1239                naive_false_accepts += 1;
1240            }
1241            if matches!(gate.verdict(), GateVerdict::Certified { .. }) {
1242                e_gate_false_accepts += 1;
1243            }
1244        }
1245        // The naive gate false-accepts on a large fraction of null phases
1246        // (any phase with early-positive partial sums); the e-gate on none.
1247        assert!(
1248            naive_false_accepts >= n_phases / 3,
1249            "the peeking gate should false-accept often under the null \
1250             (got {naive_false_accepts}/{n_phases})"
1251        );
1252        assert_eq!(
1253            e_gate_false_accepts, 0,
1254            "the e-gate must never certify under the null"
1255        );
1256    }
1257
1258    /// POWER STUDY, alternative side, through the orchestration harness:
1259    /// a planted K+1-th atom worth 0.5 nats/shard CROSSES the single-hypothesis
1260    /// Ville bar in ⌈log(1/α)/0.5⌉ = 6 shards — the realized
1261    /// time-to-certification matching the design-time
1262    /// `expected_resolution_budget`. The gate keeps absorbing past the
1263    /// crossing (the crossing time is latched in `certified_at_step`, so it is
1264    /// not lost), banking the full stream's evidence — which is exactly what
1265    /// the dictionary-level e-BH certificate needs to clear its higher
1266    /// multiplicity bar. The alternative refits on every shard regardless.
1267    #[test]
1268    fn power_study_planted_atom_certifies_at_the_predicted_budget() {
1269        let growth = 0.5f64;
1270        let n_shards = 20usize;
1271        let (gate, alt_state) = run_atom_birth_gate(
1272            0.05,
1273            0usize, // alt state = number of shards folded into the fit
1274            0..n_shards,
1275            |_, _| -99.5, // prefit alternative log-lik on the shard
1276            |_| -100.0,   // honest null sup on the shard
1277            |folded, _| folded + 1,
1278        )
1279        .expect("valid alpha");
1280
1281        // Full-stream evidence is banked: 0.5 nats/shard × 20 shards = 10 nats,
1282        // far past the single-hypothesis bar — this surplus is what the e-BH
1283        // dictionary certificate consumes against its `ln(m/(α·k))` bar.
1284        match gate.verdict() {
1285            GateVerdict::Certified { log_e } => {
1286                assert!((log_e - growth * n_shards as f64).abs() < 1e-12)
1287            }
1288            v => panic!("planted atom must certify, got {v:?}"),
1289        }
1290        // Realized time-to-certification (first-passage of the running sup over
1291        // `ln(1/α)`) == the design-time budget, rounded up.
1292        let budget = expected_resolution_budget(0.05, growth).expect("budget");
1293        assert_eq!(gate.certified_at_step(), Some(budget.ceil() as usize));
1294        assert_eq!(gate.certified_at_step(), Some(6));
1295        // Absorption does NOT stop at the crossing: every shard is banked.
1296        assert_eq!(gate.test.process.steps(), n_shards);
1297        // The alternative state saw the whole stream.
1298        assert_eq!(alt_state, n_shards);
1299    }
1300
1301    /// Work-plan step 4, closed end-to-end: a contested claim gets a probe
1302    /// plan, the probe's realized outcomes are scored under both FROZEN
1303    /// hypotheses via [`StructureLedger::absorb_probe_outcome`], and the
1304    /// banked evidence flips the claim to confirmed within a small multiple
1305    /// of the plan's predicted resolution budget. Outcome noise is a
1306    /// deterministic bounded surrogate (zero-mean sinusoid), so under the
1307    /// true alternative each probe's expected log-growth is exactly the
1308    /// design value.
1309    #[test]
1310    fn design_loop_resolves_contested_claim_within_predicted_budget() {
1311        let mut ledger = StructureLedger::new();
1312        let idx = ledger.register(ClaimKind::GeometryKind {
1313            atom: 0,
1314            kind: "circle".to_string(),
1315        });
1316
1317        // Local Gaussian output model, unit-isotropic noise in the
1318        // Fisher-whitened coordinates: per-observation expected log-growth
1319        // under H1 is exactly the planned ½‖μ₁−μ₀‖²_F.
1320        let fisher = array![[1.0, 0.0], [0.0, 1.0]];
1321        let mu0 = array![0.0, 0.0];
1322        let mu1 = array![1.2, 0.5];
1323        let probes = vec![CandidateProbe {
1324            delta: array![0.0, 1.0],
1325            predicted_mean_null: mu0.clone(),
1326            predicted_mean_alt: mu1.clone(),
1327        }];
1328        let alpha = 0.05;
1329        let plan = plan_probe_for_contested_claim(&probes, &fisher, alpha, 0.0).expect("plan");
1330        assert_eq!(plan.probe, 0);
1331        // ½‖μ₁−μ₀‖² = ½(1.44 + 0.25) = 0.845 nats/obs; ln 20 ≈ 3.0 ⇒ ~3.6 obs.
1332        assert!((plan.expected_log_growth - 0.845).abs() < 1e-12);
1333        let budget = plan.budget_remaining.ceil().max(1.0) as usize;
1334
1335        // Run the probe loop: outcomes realized under the TRUE alternative
1336        // (mean μ₁ plus bounded zero-mean fluctuation); both hypotheses'
1337        // densities were frozen above, before any outcome existed.
1338        let mut observations = 0usize;
1339        while !ledger.claims()[idx].evidence.rejects_at(alpha) {
1340            observations += 1;
1341            assert!(
1342                observations <= 4 * budget,
1343                "claim must resolve within a small multiple of the predicted \
1344                 budget {budget}; still contested after {observations} probes"
1345            );
1346            let t = observations as f64;
1347            let eps0 = 0.8 * (t * 0.7321).sin();
1348            let eps1 = 0.8 * (t * 1.1173).cos();
1349            let y = array![mu1[0] + eps0, mu1[1] + eps1];
1350            // Unit-Gaussian log-densities under each frozen hypothesis; the
1351            // shared normalizer cancels in the ratio.
1352            let d1 = &y - &mu1;
1353            let d0 = &y - &mu0;
1354            ledger
1355                .absorb_probe_outcome(idx, -0.5 * d1.dot(&d1), -0.5 * d0.dot(&d0))
1356                .expect("absorb");
1357        }
1358        let cert = ledger.certify(alpha);
1359        assert!(
1360            cert.confirmed()
1361                .any(|e| matches!(e.kind, ClaimKind::GeometryKind { atom: 0, .. }))
1362        );
1363    }
1364}
gam_terms/inference/structure_evidence.rs

gam_terms/inference/
structure_evidence.rs