gam_terms/inference/structure_evidence.rs
1//! Anytime-valid structure discovery: e-process gates, universal-inference
2//! atom tests, e-BH error control, and KL-optimal steering probes.
3//!
4//! Interpretability as sequential experimental design.
5//!
6//! # The thesis
7//!
8//! The dictionary-learning stack (#974–#981) DISCOVERS structure: atom
9//! birth/death/fission/fusion (#976), geometry adjudication — circle vs
10//! clusters vs line (#907), feature binding (#975). Today those decisions
11//! are made by evidence heuristics (likelihood-ratio ladders, BIC-flavored
12//! gates). Three facts, never previously combined, say that is not merely
13//! informal but WRONG in a specific, fixable way — and that fixing it
14//! upgrades the whole capstone from observational description to
15//! error-controlled experimental science:
16//!
17//! 1. **Atom existence is a NON-REGULAR testing problem.** "Does a K+1-th
18//! dictionary atom exist?" is testing K vs K+1 mixture components — the
19//! textbook boundary/loss-of-identifiability case where the classical
20//! likelihood-ratio χ² asymptotics FAIL (the null sits on the boundary
21//! of the alternative; the nuisance parameters of the new atom vanish
22//! under the null — Davies' problem). Every SAE/dictionary paper that
23//! thresholds a likelihood improvement is running this broken test.
24//! **Universal inference** (Wasserman–Ramdas–Balakrishnan 2020) is the
25//! modern resolution: a split-likelihood-ratio e-value that is valid in
26//! finite samples with NO regularity conditions whatsoever — exactly
27//! the irregular regime atom birth lives in.
28//!
29//! 2. **Discovery happens on streams with optional stopping.** Dictionaries
30//! are trained until the features "look right" — data-dependent
31//! stopping that p-hacks any fixed-sample test by construction.
32//! **E-processes** (nonnegative supermartingales under the null,
33//! `E[E_τ] ≤ 1` at every stopping time) are immune: by Ville's
34//! inequality `P(sup_t E_t ≥ 1/α) ≤ α`, the guarantee survives stopping
35//! whenever you like, peeking included, streaming corpora (#973)
36//! included. Evidence is a RUNNING PRODUCT, resumable across shards.
37//!
38//! 3. **This laboratory is INTERVENTIONAL.** The steering primitive with
39//! output dosimetry and a validity radius is landed
40//! (`crate::inference::steering`); the per-token output-Fisher harvest
41//! (#980) gives the local information geometry of the model's output.
42//! So "which probe next?" is not a vibe — it is OPTIMAL EXPERIMENTAL
43//! DESIGN: choose the steering intervention that maximizes the expected
44//! log-growth of the e-process deciding a contested structural claim.
45//! Under the local Gaussian output-Fisher model, that growth rate IS a
46//! KL divergence with a closed form (below). The model chooses its own
47//! next experiment, optimally, inside its certified validity radius.
48//!
49//! The deliverable shape: **every discovered atom ships an e-value; every
50//! dictionary ships an e-BH FDR certificate over its claimed structure;
51//! contested claims get design-optimal probes until evidence resolves.**
52//! No other interpretability stack has finite-sample, optional-stopping-
53//! safe error control over its discovered structure. This module is the
54//! statistical substrate; the SAE structure search plugs its gates in.
55//!
56//! The instruments, bottom-up: [`EProcess`] (running evidence, Ville
57//! semantics) → [`PredictablePluginEProcess`] (streaming universal
58//! inference) → [`AtomBirthGate`] + [`run_atom_birth_gate`] (the K vs K+1
59//! gate with demote-never-reject [`GateVerdict`]s; the runner enforces
60//! the predictability contract by call order) → [`StructureLedger`] (one
61//! e-process per claim, serializable across #973 shards) →
62//! [`StructureLedger::certify`] (the e-BH [`StructureCertificate`],
63//! shipped beside the gauge report via
64//! `crate::terms::sae::identifiability::dictionary_report`) →
65//! [`plan_probe_for_contested_claim`] (the design loop: contested claims
66//! get a [`ProbePlan`] whose δ runs through
67//! `crate::inference::steering::steer_delta` and whose per-hypothesis
68//! μ₀/μ₁ come from `crate::inference::steering::predicted_response`).
69//!
70//! # The math, fixed here so implementations cannot drift
71//!
72//! **E-value / e-process.** A nonnegative statistic `E` with `E_{H0}[E] ≤ 1`.
73//! Calibration to tests: reject at level α when `E ≥ 1/α` (Markov). An
74//! e-PROCESS compounds multiplicatively: `E_t = Π_{s≤t} e_s` where each
75//! `e_s` is conditionally valid given the past (`E[e_s | F_{s−1}] ≤ 1`
76//! under H0). Ville: `P_{H0}(∃t: E_t ≥ 1/α) ≤ α` — anytime validity.
77//! Always accumulate in log space; evidence products underflow doubles.
78//!
79//! **Universal inference (the atom-birth test).** Split the data (or the
80//! token stream) into D₀ (evaluation) and D₁ (estimation). Fit the
81//! K+1-atom alternative on D₁ by ANY method (the production fitter, warm
82//! starts, GPU, anything — no conditions). Fit the K-atom null by
83//! CONSTRAINED MLE ON D₀ (this is the one side that must be honest). Then
84//!
85//! ```text
86//! E = L(θ̂₁ ; D₀) / sup_{θ ∈ H0} L(θ ; D₀)
87//! ```
88//!
89//! satisfies `E_{H0}[E] ≤ 1` in finite samples, mixtures and boundaries
90//! and all (the proof is three lines of Markov + tower; no asymptotics).
91//! Sequential version: at each batch t, the alternative plug-in is fit on
92//! data BEFORE t (predictable), the null sup is over the batch; the
93//! product is an e-process. This is `SplitLikelihoodEValue` /
94//! `PredictablePluginEProcess` below, generic over log-likelihood
95//! closures so the SAE stack passes its own (manifold likelihoods,
96//! superposition-aware residual models #974, whatever exists then).
97//!
98//! **Bayes factors are e-values (the #907 bridge).** A Bayes factor
99//! `BF = ∫ L(θ) dΠ₁(θ) / L_{H0}` with a FIXED (data-independent) prior Π₁
100//! and a SIMPLE (or sup-dominated) null has `E_{H0}[BF] ≤ 1` — the #907
101//! geometry-adjudication harness (circle vs clusters vs line, with its
102//! discrete-mixture null) is therefore ONE PRIOR-FREEZE away from anytime
103//! validity. The integration contract: route its per-batch BFs through
104//! [`EProcess::absorb`] instead of comparing a final BF to a threshold,
105//! and geometry claims inherit optional-stopping safety for free.
106//!
107//! **e-BH (the dictionary certificate).** Given e-values e_1..e_m for m
108//! structural claims (one per atom/edge/binding), sort descending and
109//! reject the top k* where `k* = max{ k : e_(k) ≥ m/(α·k) }`. This
110//! controls FDR ≤ α under ARBITRARY dependence between the e-values
111//! (Wang–Ramdas 2022) — no independence assumptions about atoms sharing
112//! tokens, which is good because they all share every token. That
113//! arbitrary-dependence robustness is WHY the certificate uses e-BH and
114//! not a p-value BH: the p-version needs PRDS, which atom statistics
115//! flagrantly violate.
116//!
117//! **Design-optimal probing.** For a contested claim with competing
118//! structural hypotheses H₀/H₁ (e.g. "feature f is one curved atom" vs
119//! "two flat atoms"), a candidate steering intervention δ (within its
120//! certified validity radius, `steering.rs`) produces predicted output
121//! distributions P₀^δ, P₁^δ. The expected per-observation log-growth of
122//! the likelihood-ratio e-process under H₁ is exactly `KL(P₁^δ ‖ P₀^δ)`,
123//! so the optimal next experiment is
124//!
125//! ```text
126//! δ* = argmax_{δ : ‖δ‖ ≤ r_valid} KL(P₁^δ ‖ P₀^δ)
127//! ≈ argmax_δ ½ (μ₁(δ) − μ₀(δ))ᵀ F (μ₁(δ) − μ₀(δ))
128//! ```
129//!
130//! under the local Gaussian model with output-Fisher metric F (#980
131//! harvest) — the SAME quadratic form the steering dosimetry already
132//! computes, repurposed: dosimetry measures nats delivered, design
133//! maximizes nats of DISCRIMINATION. Probes that maximize raw effect are
134//! not optimal; probes that maximize the *disagreement between the
135//! hypotheses' predictions*, weighted by output information, are. (Full
136//! KL-optimal design over the probe manifold is a research arc; the greedy
137//! quadratic rule below is the correct first instrument and is exact in
138//! the local model.)
139//!
140//! # What this kills
141//!
142//! - Birth/death gates that are threshold heuristics → replaced by
143//! anytime-valid tests with declared error rates (#976's "detectable,
144//! correctable misspecification" becomes a literal hypothesis test).
145//! - "We trained until the features looked interpretable" → optional
146//! stopping is SAFE; the certificate survives it.
147//! - "We found N features" → "we found N features at FDR ≤ α, certificate
148//! attached, reproducible from the e-value ledger."
149//! - Probe selection by intuition → probe selection by information.
150
151use ndarray::{Array1, Array2};
152use serde::{Deserialize, Serialize};
153
154/// Running anytime-valid evidence against one null hypothesis, in log
155/// space. Multiplicative absorption of conditionally-valid e-values;
156/// Ville's inequality converts the running product into a sequential test
157/// that survives optional stopping. Serializable so evidence is resumable
158/// across corpus shards (#973): persist, reload, keep absorbing.
159#[derive(Clone, Debug, Serialize, Deserialize)]
160pub struct EProcess {
161 /// log E_t — the running log-evidence. Starts at 0 (E_0 = 1).
162 log_e: f64,
163 /// Number of absorbed batches (the ledger length).
164 steps: usize,
165 /// Running maximum of log E_t — Ville's inequality applies to the
166 /// supremum, so a claim once proven at level α stays proven even if
167 /// later evidence retreats (evidence is not p-hackable in reverse).
168 log_e_max: f64,
169}
170
171impl EProcess {
172 pub fn new() -> Self {
173 Self {
174 log_e: 0.0,
175 steps: 0,
176 log_e_max: 0.0,
177 }
178 }
179
180 /// Absorb one conditionally-valid e-value (NOT in log space; must be
181 /// ≥ 0; `E[e | past] ≤ 1` under H0 is the caller's contract — e.g. a
182 /// universal-inference batch ratio or a fixed-prior Bayes factor).
183 pub fn absorb(&mut self, e_value: f64) -> Result<(), String> {
184 if e_value.is_nan() || e_value < 0.0 {
185 return Err(format!("e-value must be in [0, ∞], got {e_value}"));
186 }
187 self.absorb_log(e_value.ln())
188 }
189
190 /// Absorb a batch e-value supplied in log space (the only numerically
191 /// honest interface for long streams).
192 pub fn absorb_log(&mut self, log_e_value: f64) -> Result<(), String> {
193 let next_log_e = checked_log_e_sum(self.log_e, log_e_value)?;
194 self.log_e = next_log_e;
195 self.steps += 1;
196 if self.log_e > self.log_e_max {
197 self.log_e_max = self.log_e;
198 }
199 Ok(())
200 }
201
202 pub fn log_evidence(&self) -> f64 {
203 self.log_e
204 }
205
206 pub fn steps(&self) -> usize {
207 self.steps
208 }
209
210 /// Anytime-valid rejection at level α: by Ville,
211 /// `P_{H0}(sup_t E_t ≥ 1/α) ≤ α`, so crossing 1/α at ANY time —
212 /// including data-dependent stopping times — proves the claim with
213 /// type-I error ≤ α. Uses the running supremum: once crossed, always
214 /// rejected.
215 pub fn rejects_at(&self, alpha: f64) -> bool {
216 alpha > 0.0 && self.log_e_max >= -(alpha.ln())
217 }
218
219 /// The e-value to hand to [`e_benjamini_hochberg`] for the
220 /// dictionary-level FDR certificate (current evidence, not the sup —
221 /// e-BH's guarantee is stated for e-values at the chosen stopping
222 /// time).
223 pub fn current_e_value_log(&self) -> f64 {
224 self.log_e
225 }
226}
227
228impl Default for EProcess {
229 fn default() -> Self {
230 Self::new()
231 }
232}
233
234fn checked_log_e_sum(current: f64, increment: f64) -> Result<f64, String> {
235 if current.is_nan() {
236 return Err("EProcess invariant violation: current log evidence is NaN".to_string());
237 }
238 if increment.is_nan() {
239 return Err("log e-value must not be NaN".to_string());
240 }
241 if current.is_infinite()
242 && increment.is_infinite()
243 && current.is_sign_positive() != increment.is_sign_positive()
244 {
245 return Err(format!(
246 "cannot combine opposing infinite log e-values: current {current}, increment {increment}"
247 ));
248 }
249 Ok(current + increment)
250}
251
252/// One universal-inference (split-likelihood-ratio) e-value: finite-sample
253/// valid with NO regularity conditions — the correct instrument for atom
254/// birth (K vs K+1 components, boundary/Davies regime where χ² fails).
255///
256/// `log_lik_alternative_on_eval`: log-likelihood of the EVALUATION fold
257/// under the alternative fitted on the ESTIMATION fold (any fitter — the
258/// production manifold/SAE fit, warm-started, GPU; zero conditions on it).
259/// `log_lik_null_sup_on_eval`: the SUPREMUM of the evaluation-fold
260/// log-likelihood over the NULL model class (the honest side: a real
261/// constrained fit on the eval fold, e.g. the K-atom dictionary refit on
262/// D₀). Then `log E = ℓ_alt(D₀) − sup_{H0} ℓ(D₀)` and `E_{H0}[E] ≤ 1`
263/// exactly.
264pub fn split_likelihood_log_e_value(
265 log_lik_alternative_on_eval: f64,
266 log_lik_null_sup_on_eval: f64,
267) -> f64 {
268 // Degenerate halves yield a well-defined, conservative e-value of 1
269 // (`log E = 0`, no evidence for the alternative) rather than a NaN that
270 // would poison the e-process product and the downstream FDR certificate:
271 // * Either log-likelihood is NaN — the model could not be evaluated on
272 // this shard (a numerically degenerate fit). The honest reading is
273 // "no information", i.e. `E = 1`.
274 // * Both halves are the SAME signed infinity — e.g. a shard/outcome
275 // with zero density under both the alternative and the null
276 // (`−∞ − (−∞)`), or both `+∞`. The ratio is undefined; again `E = 1`.
277 // This keeps `log E` finite so `absorb_log` never banks a NaN and the
278 // e-BH certificate sees a contributing-nothing claim instead of panicking.
279 if log_lik_alternative_on_eval.is_nan() || log_lik_null_sup_on_eval.is_nan() {
280 return 0.0;
281 }
282 if log_lik_alternative_on_eval.is_infinite()
283 && log_lik_null_sup_on_eval.is_infinite()
284 && log_lik_alternative_on_eval.is_sign_positive()
285 == log_lik_null_sup_on_eval.is_sign_positive()
286 {
287 return 0.0;
288 }
289 log_lik_alternative_on_eval - log_lik_null_sup_on_eval
290}
291
292/// Sequential universal inference over a stream of batches with a
293/// PREDICTABLE plug-in: at batch t the alternative parameters were fit
294/// using only data before t, the null sup is taken on batch t, and the
295/// per-batch ratios compound into an e-process. This is the streaming /
296/// optional-stopping form the corpus-scale pipeline (#973 shards) needs —
297/// evidence is resumable: serialize `EProcess`, keep absorbing on the next
298/// shard.
299#[derive(Clone, Debug, Serialize, Deserialize)]
300pub struct PredictablePluginEProcess {
301 pub process: EProcess,
302}
303
304impl PredictablePluginEProcess {
305 pub fn new() -> Self {
306 Self {
307 process: EProcess::new(),
308 }
309 }
310
311 /// Absorb one batch. The caller guarantees the alternative was fit
312 /// WITHOUT batch-t data (predictability — this is what makes the
313 /// product a supermartingale; violating it voids the guarantee, which
314 /// is why the SAE integration must hand this function the PREVIOUS
315 /// shard's fitted dictionary, never the current one).
316 pub fn try_absorb_batch(
317 &mut self,
318 log_lik_alternative_prefit: f64,
319 log_lik_null_sup_on_batch: f64,
320 ) -> Result<(), String> {
321 self.process.absorb_log(split_likelihood_log_e_value(
322 log_lik_alternative_prefit,
323 log_lik_null_sup_on_batch,
324 ))
325 }
326}
327
328impl Default for PredictablePluginEProcess {
329 fn default() -> Self {
330 Self::new()
331 }
332}
333
334/// The anytime-valid verdict on one structural claim. Deliberately
335/// two-valued — there is NO "rejected" arm. Demote-never-reject (#969
336/// philosophy): an e-process that has not crossed 1/α has failed to prove
337/// the claim, not disproven it; the claim stays contested, keeps its
338/// evidence, and earns a design-optimal probe budget instead of being
339/// silently dropped (or worse, silently accepted the way a threshold gate
340/// accepts whatever clears it).
341#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize)]
342pub enum GateVerdict {
343 /// The running supremum crossed 1/α: the claim is proven with type-I
344 /// error ≤ α, permanently (Ville applies to the sup, so later evidence
345 /// retreat cannot un-prove it).
346 Certified { log_e: f64 },
347 /// Not (yet) proven. Carries the CURRENT log-evidence — the value the
348 /// dictionary certificate's e-BH consumes, and the state a probe loop
349 /// resumes from.
350 Contested { log_e: f64 },
351}
352
353/// The atom-birth gate (#976's threshold comparison, replaced): a
354/// universal-inference e-process over corpus shards deciding "does the
355/// K+1-th atom exist?", the boundary/Davies-regime question where the χ²
356/// gate every dictionary paper runs is broken.
357///
358/// Per shard t the integration contract is exactly the work plan's:
359/// - `log_lik_alternative_prefit`: the K+1-atom dictionary fit on shards
360/// BEFORE t (the PREVIOUS shard's fit — predictability is the one rule;
361/// handing in the current shard's fit voids the guarantee), evaluated on
362/// shard t. Any fitter, warm starts, GPU — no conditions.
363/// - `log_lik_null_sup_on_shard`: the K-atom dictionary REFIT on shard t
364/// (the honest constrained sup on the evaluation data).
365///
366/// The gate never rejects: [`GateVerdict::Contested`] is the only
367/// alternative to certification, and a contested atom's next move is a
368/// probe plan ([`plan_probe_for_contested_claim`]), not deletion.
369#[derive(Clone, Debug, Serialize, Deserialize)]
370pub struct AtomBirthGate {
371 pub test: PredictablePluginEProcess,
372 /// The level the certificate is claimed at; fixed at construction so a
373 /// verdict can never be shopped across α after seeing the evidence.
374 alpha: f64,
375 /// First-passage time: the shard count at which the running supremum
376 /// first crossed the single-hypothesis Ville bar `ln(1/α)`. `None` until
377 /// the crossing happens. This is the realized *time-to-certification* —
378 /// the design-relevant quantity (`expected_resolution_budget`) — recorded
379 /// SEPARATELY from the running evidence so that absorption can continue
380 /// past the crossing without losing the crossing time. By Ville the
381 /// crossing is permanent (it is a property of the running sup), so this is
382 /// monotone: once `Some`, never reset. `#[serde(default)]` keeps gates
383 /// persisted before this field was added (#973 resumability) loadable.
384 #[serde(default)]
385 certified_at_step: Option<usize>,
386}
387
388impl AtomBirthGate {
389 pub fn new(alpha: f64) -> Result<Self, String> {
390 if !(alpha > 0.0 && alpha < 1.0) {
391 return Err(format!(
392 "AtomBirthGate: alpha must be in (0,1), got {alpha}"
393 ));
394 }
395 Ok(Self {
396 test: PredictablePluginEProcess::new(),
397 alpha,
398 certified_at_step: None,
399 })
400 }
401
402 pub fn alpha(&self) -> f64 {
403 self.alpha
404 }
405
406 /// The realized time-to-certification: the shard count at which the
407 /// running supremum first crossed `ln(1/α)`, or `None` if it never did.
408 /// This is the first-passage time the design budget
409 /// ([`expected_resolution_budget`]) predicts — distinct from
410 /// [`EProcess::steps`] (total shards absorbed), which keeps growing after
411 /// the crossing because absorption does not stop (continuing to accumulate
412 /// is what lets the dictionary-level e-BH certificate clear its higher
413 /// multiplicity bar `ln(m/(α·k))`).
414 pub fn certified_at_step(&self) -> Option<usize> {
415 self.certified_at_step
416 }
417
418 /// Absorb one shard's split-likelihood ratio (see type-level contract).
419 pub fn try_absorb_shard(
420 &mut self,
421 log_lik_alternative_prefit: f64,
422 log_lik_null_sup_on_shard: f64,
423 ) -> Result<(), String> {
424 self.test
425 .try_absorb_batch(log_lik_alternative_prefit, log_lik_null_sup_on_shard)?;
426 // Record the first-passage time once, the step the running sup first
427 // crosses the single-hypothesis bar. `rejects_at` reads the running
428 // maximum, so this latches permanently even if later evidence retreats.
429 if self.certified_at_step.is_none() && self.test.process.rejects_at(self.alpha) {
430 self.certified_at_step = Some(self.test.process.steps());
431 }
432 Ok(())
433 }
434
435 pub fn absorb_shard(
436 &mut self,
437 log_lik_alternative_prefit: f64,
438 log_lik_null_sup_on_shard: f64,
439 ) {
440 self.try_absorb_shard(log_lik_alternative_prefit, log_lik_null_sup_on_shard)
441 .expect("AtomBirthGate received invalid log evidence");
442 }
443
444 pub fn verdict(&self) -> GateVerdict {
445 if self.test.process.rejects_at(self.alpha) {
446 GateVerdict::Certified {
447 log_e: self.test.process.log_evidence(),
448 }
449 } else {
450 GateVerdict::Contested {
451 log_e: self.test.process.log_evidence(),
452 }
453 }
454 }
455}
456
457/// Run the atom-birth gate over a shard stream with the predictability
458/// contract enforced BY CONSTRUCTION: on each shard the alternative is
459/// evaluated strictly before it is refit with that shard, so the plug-in
460/// is always predictable and the product is always a supermartingale
461/// under H0. This is the orchestration the SAE structure search calls;
462/// the closures are the only fitter-specific surface.
463///
464/// - `alternative_log_lik(alt, shard)`: evaluation-fold log-likelihood of
465/// shard under the CURRENT alternative state (fit on prior shards only —
466/// guaranteed here by call order).
467/// - `null_sup_log_lik(shard)`: the honest constrained sup — the K-atom
468/// null REFIT on this shard (any fitter; this side must genuinely
469/// maximize over H0 on the shard, an under-maximized null inflates the
470/// e-value and voids validity).
471/// - `refit_alternative(alt, shard)`: fold the shard into the alternative
472/// (warm-started production fit, GPU, anything — zero conditions).
473///
474/// `initial_alternative` is the K+1 fit from data BEFORE the stream (or a
475/// prior-driven init; validity never depends on its quality — a bad init
476/// only costs power). Absorbs EVERY shard's evidence into the e-process — it
477/// does NOT stop at the single-hypothesis Ville crossing. An earlier
478/// early-stop ("the crossing is permanent; further shards only cost compute")
479/// conflated two distinct bars:
480/// * the per-move VERDICT ([`AtomBirthGate::verdict`] via
481/// [`EProcess::rejects_at`]) tests the running SUPREMUM against the
482/// single-hypothesis bar `ln(1/α)`, where the crossing is indeed
483/// permanent and the realized crossing step is captured separately in
484/// [`AtomBirthGate::certified_at_step`]; but
485/// * the dictionary-level CERTIFICATE ([`StructureLedger::certify`] →
486/// [`e_benjamini_hochberg`]) consumes the CURRENT log e-value
487/// ([`EProcess::current_e_value_log`]) under a multiplicity correction
488/// whose bar `ln(m/(α·k))` is strictly higher for m > 1 claims.
489/// Stopping at the single-hypothesis bar banked just enough evidence for the
490/// move's own verdict but starved the FDR certificate, so a genuinely real
491/// atom (e.g. 0.8 nats/shard × 10 shards = 8 nats) failed e-BH confirmation
492/// against a second claim (bar `ln(2/0.05) ≈ 3.69`) even though it cleared
493/// its own `ln(1/0.05) ≈ 3.00` bar — it carried ample evidence but the early
494/// stop threw the surplus away. Continuing to absorb is always sound: the
495/// e-process is a supermartingale under H0, so additional
496/// conditionally-valid increments preserve validity, and the predictability
497/// contract is untouched (the alternative is still evaluated before it is
498/// refit with each shard). Returns the gate (verdict + resumable evidence,
499/// with `certified_at_step` holding the realized time-to-certification) and
500/// the final alternative state, which has seen the whole stream.
501pub fn run_atom_birth_gate<S, A>(
502 alpha: f64,
503 initial_alternative: A,
504 shards: impl IntoIterator<Item = S>,
505 mut alternative_log_lik: impl FnMut(&A, &S) -> f64,
506 mut null_sup_log_lik: impl FnMut(&S) -> f64,
507 mut refit_alternative: impl FnMut(A, &S) -> A,
508) -> Result<(AtomBirthGate, A), String> {
509 let mut gate = AtomBirthGate::new(alpha)?;
510 let mut alt = initial_alternative;
511 for shard in shards {
512 let log_lik_alt = alternative_log_lik(&alt, &shard);
513 let log_lik_null = null_sup_log_lik(&shard);
514 gate.try_absorb_shard(log_lik_alt, log_lik_null)?;
515 alt = refit_alternative(alt, &shard);
516 }
517 Ok((gate, alt))
518}
519
520/// e-BH: FDR control over m structural claims under ARBITRARY dependence
521/// (Wang–Ramdas). Input: per-claim log-e-values. Output: indices of
522/// rejected (i.e. CONFIRMED-STRUCTURE) claims, FDR ≤ α.
523///
524/// Sort e-values descending; reject the top k* where
525/// `k* = max{ k : e_(k) ≥ m/(α·k) }`.
526///
527/// This is the "dictionary certificate": run one e-process per claimed
528/// atom (and per claimed binding edge, #975), call this at the chosen
529/// stopping time, and the dictionary ships with an FDR-controlled list of
530/// which of its claimed structures are statistically real. No
531/// independence assumptions — atoms sharing every token is fine; that is
532/// exactly the case p-value BH cannot legally handle (PRDS violation) and
533/// e-BH can.
534pub fn e_benjamini_hochberg(log_e_values: &[f64], alpha: f64) -> Vec<usize> {
535 let m = log_e_values.len();
536 if m == 0 || !(alpha.is_finite() && alpha > 0.0) {
537 return Vec::new();
538 }
539 // Robustness: a degenerate claim can bank a non-finite log e-value (a
540 // NaN from an upstream `(−∞) − (−∞)` split-LR that escaped the source
541 // guard). This is the documented honest FDR surface, so it must NEVER
542 // panic on such input. Treat any NaN as least-evidence (`−∞`): an
543 // undefined/contested claim contributes no evidence, sorts last, and can
544 // never be among the rejections. Finite/±∞ values pass through unchanged;
545 // `total_cmp` then gives a total order on the sanitized keys.
546 let sanitized: Vec<f64> = log_e_values
547 .iter()
548 .map(|&v| if v.is_nan() { f64::NEG_INFINITY } else { v })
549 .collect();
550 let mut order: Vec<usize> = (0..m).collect();
551 order.sort_by(|&a, &b| sanitized[b].total_cmp(&sanitized[a]));
552 let m_f = m as f64;
553 let mut k_star = 0usize;
554 for (rank0, &idx) in order.iter().enumerate() {
555 let k = (rank0 + 1) as f64;
556 // e_(k) ≥ m / (α k) ⟺ log e_(k) ≥ log m − log α − log k
557 if sanitized[idx] >= m_f.ln() - alpha.ln() - k.ln() {
558 k_star = rank0 + 1;
559 }
560 }
561 order.truncate(k_star);
562 order
563}
564
565/// What one structural claim asserts about the dictionary. One e-process
566/// runs per claim; the kinds mirror the discovery stack's claim surface:
567/// atom existence (#976 birth), binding edges (#975), geometry
568/// adjudication (#907). `Custom` keeps the ledger open to claim types
569/// that do not exist yet without an enum churn per new discovery gate.
570#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize)]
571pub enum ClaimKind {
572 /// "Atom `atom` is statistically real" — the K vs K+1 birth claim.
573 AtomExists { atom: usize },
574 /// "Atoms `a` and `b` are bound" — a #975 binding edge.
575 BindingEdge { a: usize, b: usize },
576 /// "Atom `atom`'s latent geometry is `kind`" (e.g. "circle",
577 /// "clusters", "line") — a #907 adjudication claim.
578 GeometryKind { atom: usize, kind: String },
579 /// Any other structural claim, labeled.
580 Custom { label: String },
581}
582
583/// One claim plus its running evidence.
584#[derive(Clone, Debug, Serialize, Deserialize)]
585pub struct StructuralClaim {
586 pub kind: ClaimKind,
587 pub evidence: EProcess,
588}
589
590/// The dictionary's claim ledger: every structural claim the discovery
591/// stack makes, each with its own e-process. Serializable — evidence
592/// resumes across corpus shards (#973) by persisting the ledger, not by
593/// refitting. Calling [`StructureLedger::certify`] at ANY data-dependent
594/// stopping time yields a valid certificate; that is the entire point.
595#[derive(Clone, Debug, Default, Serialize, Deserialize)]
596pub struct StructureLedger {
597 claims: Vec<StructuralClaim>,
598}
599
600impl StructureLedger {
601 pub fn new() -> Self {
602 Self { claims: Vec::new() }
603 }
604
605 /// Register a claim and return its ledger index. Idempotent on the
606 /// claim kind: re-registering an existing claim (a resumed shard loop
607 /// re-announcing its claim surface) returns the existing index and
608 /// PRESERVES its accumulated evidence — a fresh e-process here would
609 /// silently discard the stream's history.
610 pub fn register(&mut self, kind: ClaimKind) -> usize {
611 if let Some(idx) = self.claims.iter().position(|c| c.kind == kind) {
612 return idx;
613 }
614 self.claims.push(StructuralClaim {
615 kind,
616 evidence: EProcess::new(),
617 });
618 self.claims.len() - 1
619 }
620
621 /// Absorb one conditionally-valid log e-value for claim `idx` (a
622 /// universal-inference shard ratio, a frozen-prior log-BF — the
623 /// caller's contract is per-source, documented on the producing gate).
624 pub fn absorb_log(&mut self, idx: usize, log_e_value: f64) -> Result<(), String> {
625 let n = self.claims.len();
626 let claim = self.claims.get_mut(idx).ok_or_else(|| {
627 format!("StructureLedger: claim index {idx} out of range ({n} claims)")
628 })?;
629 claim.evidence.absorb_log(log_e_value)
630 }
631
632 pub fn claims(&self) -> &[StructuralClaim] {
633 &self.claims
634 }
635
636 /// The likelihood half of the probe-design loop (work-plan step 4):
637 /// after running a planned probe ([`ProbePlan`] →
638 /// `crate::inference::steering::steer_delta`), evaluate the REALIZED
639 /// outcomes under both hypotheses' predictive densities and absorb the
640 /// log-ratio into the contested claim's e-process.
641 ///
642 /// Validity contract: both predictive densities must be FROZEN before the
643 /// probe outcome is observed — which the design loop satisfies by
644 /// construction, since both hypotheses' dictionaries were fitted before
645 /// the probe was even chosen. For a composite null, the null density must
646 /// be the honest constrained fit (the same rule as
647 /// [`split_likelihood_log_e_value`], which this delegates to); for a
648 /// simple null the predictive density is the sup. Probe outcomes are new
649 /// data by construction (the model was steered to produce them), so they
650 /// compound validly with the claim's prior shard evidence.
651 pub fn absorb_probe_outcome(
652 &mut self,
653 idx: usize,
654 log_lik_alt_on_outcome: f64,
655 log_lik_null_on_outcome: f64,
656 ) -> Result<(), String> {
657 self.absorb_log(
658 idx,
659 split_likelihood_log_e_value(log_lik_alt_on_outcome, log_lik_null_on_outcome),
660 )
661 }
662
663 /// The dictionary certificate: e-BH over the ledger's CURRENT
664 /// e-values at level α. FDR ≤ α over the confirmed set under arbitrary
665 /// dependence — atoms sharing every token is fine — and valid at any
666 /// stopping time because each entry is an e-process. Claims not
667 /// confirmed are CONTESTED, never rejected (demote-never-reject); they
668 /// keep their evidence and are the inputs to the probe-design loop.
669 pub fn certify(&self, alpha: f64) -> StructureCertificate {
670 let log_e: Vec<f64> = self
671 .claims
672 .iter()
673 .map(|c| c.evidence.current_e_value_log())
674 .collect();
675 let confirmed_idx = e_benjamini_hochberg(&log_e, alpha);
676 let mut entries: Vec<CertificateEntry> = self
677 .claims
678 .iter()
679 .zip(&log_e)
680 .map(|(c, &le)| CertificateEntry {
681 kind: c.kind.clone(),
682 log_e: le,
683 steps: c.evidence.steps(),
684 confirmed: false,
685 })
686 .collect();
687 for &i in &confirmed_idx {
688 entries[i].confirmed = true;
689 }
690 StructureCertificate { alpha, entries }
691 }
692}
693
694/// One line of the certificate's e-value ledger: the claim, its
695/// log-evidence at the stop, how many batches produced it, and the e-BH
696/// outcome. The full entry list IS the reproducibility artifact: anyone
697/// holding it can re-run [`e_benjamini_hochberg`] and re-derive the
698/// confirmed set.
699#[derive(Clone, Debug, Serialize, Deserialize)]
700pub struct CertificateEntry {
701 pub kind: ClaimKind,
702 pub log_e: f64,
703 pub steps: usize,
704 pub confirmed: bool,
705}
706
707/// The deliverable: "we found N structures at FDR ≤ α, certificate
708/// attached". Ships next to the identifiability certificate
709/// ([`crate::terms::sae::identifiability::residual_gauge`], #981) — that one says
710/// what the GAUGE cannot distinguish, this one says what the DATA can.
711#[derive(Clone, Debug, Serialize, Deserialize)]
712pub struct StructureCertificate {
713 pub alpha: f64,
714 pub entries: Vec<CertificateEntry>,
715}
716
717impl StructureCertificate {
718 pub fn confirmed(&self) -> impl Iterator<Item = &CertificateEntry> {
719 self.entries.iter().filter(|e| e.confirmed)
720 }
721
722 pub fn contested(&self) -> impl Iterator<Item = &CertificateEntry> {
723 self.entries.iter().filter(|e| !e.confirmed)
724 }
725}
726
727/// Calibrate one (super)uniform p-value into a single e-value, in log
728/// space: `e(p) = ½ p^{−1/2}` (the κ = ½ member of the calibrator family
729/// `e_κ(p) = κ p^{κ−1}`; `∫₀¹ e_κ(p) dp = 1`, so `E_{H0}[e(P)] ≤ 1` for
730/// any valid p — superuniformity only, no other conditions).
731///
732/// This is the bridge from p-value-shaped instruments into the ledger —
733/// e.g. the feature-binding Wald test (`terms::structure::anova_atom::carve`'s
734/// `edge_p_value` → a [`ClaimKind::BindingEdge`] entry). It spends
735/// calibration slack (a p of 0.01 becomes e = 5, not 100), which is the
736/// honest price of converting a fixed-sample test into anytime-valid
737/// currency; instruments that can produce e-values natively should.
738/// CONTRACT: one calibrated e-value per INDEPENDENT data batch — feeding
739/// repeated tests of the same accumulating data into one e-process is the
740/// p-hacking this module exists to kill.
741pub fn log_e_from_p_calibrator(p_value: f64) -> Result<f64, String> {
742 if !(p_value > 0.0) || p_value > 1.0 {
743 return Err(format!("p-value must be in (0, 1], got {p_value}"));
744 }
745 Ok(0.5f64.ln() - 0.5 * p_value.ln())
746}
747
748/// A candidate steering probe for resolving one contested structural
749/// claim: the intervention direction (in the steering primitive's
750/// coordinates), and the two hypotheses' PREDICTED output-mean responses
751/// to it.
752pub struct CandidateProbe {
753 /// Steering displacement δ, to be applied via
754 /// `crate::inference::steering` (which enforces its own validity
755 /// radius and reports realized dosimetry).
756 pub delta: Array1<f64>,
757 /// Predicted output-mean response under the null structure, μ₀(δ).
758 pub predicted_mean_null: Array1<f64>,
759 /// Predicted output-mean response under the alternative, μ₁(δ).
760 pub predicted_mean_alt: Array1<f64>,
761}
762
763/// Greedy KL-optimal experimental design under the local Gaussian
764/// output-Fisher model: pick the probe maximizing
765/// `½ (μ₁(δ) − μ₀(δ))ᵀ F (μ₁(δ) − μ₀(δ))` — the expected per-observation
766/// log-growth of the deciding e-process under the alternative.
767///
768/// `fisher` is the output-Fisher metric at the operating point (#980
769/// harvest; the same object steering dosimetry contracts against). Probes
770/// whose hypotheses predict the SAME response score zero no matter how
771/// large their raw effect — the design rule selects for DISCRIMINATION,
772/// not impact, which is the entire point: a maximally-steered output that
773/// both hypotheses predict identically teaches nothing.
774///
775/// Returns the index of the best probe and its expected log-growth (nats
776/// per observation), or None if no probe discriminates.
777pub fn select_probe_by_expected_evidence(
778 probes: &[CandidateProbe],
779 fisher: &Array2<f64>,
780) -> Option<(usize, f64)> {
781 let mut best: Option<(usize, f64)> = None;
782 for (idx, probe) in probes.iter().enumerate() {
783 let diff = &probe.predicted_mean_alt - &probe.predicted_mean_null;
784 if diff.len() != fisher.nrows() {
785 continue;
786 }
787 let f_diff = fisher.dot(&diff);
788 let growth = 0.5 * diff.dot(&f_diff);
789 if growth.is_finite() && growth > 0.0 {
790 match best {
791 Some((_, g)) if g >= growth => {}
792 _ => best = Some((idx, growth)),
793 }
794 }
795 }
796 best
797}
798
799/// Expected number of observations for the chosen probe to push a claim's
800/// e-process across the 1/α Ville threshold, under the alternative: the
801/// design-time budget `log(1/α) / growth_rate`. This is what turns the
802/// abstract guarantee into an experiment plan ("this probe should resolve
803/// the claim in ~N tokens; if it hasn't, the alternative is weaker than
804/// hypothesized — itself evidence").
805pub fn expected_resolution_budget(alpha: f64, growth_nats_per_obs: f64) -> Option<f64> {
806 if alpha <= 0.0 || alpha >= 1.0 || growth_nats_per_obs <= 0.0 {
807 return None;
808 }
809 Some(-(alpha.ln()) / growth_nats_per_obs)
810}
811
812/// The experiment plan for one contested claim: which probe to run, the
813/// expected per-observation evidence growth under the alternative, and the
814/// design-time resolution budget. This is the loop's actionable output —
815/// hand `probes[probe]`'s δ to `crate::inference::steering::steer_delta`
816/// (which enforces the validity radius and reports realized dosimetry),
817/// evaluate both hypotheses' likelihoods on the realized outputs, absorb
818/// the log-ratio into the claim's e-process, re-certify.
819#[derive(Clone, Debug, PartialEq)]
820pub struct ProbePlan {
821 /// Index into the candidate probe list.
822 pub probe: usize,
823 /// Expected log-growth of the deciding e-process, nats/observation,
824 /// under the alternative (the KL of the hypotheses' predicted
825 /// responses in the output-Fisher metric).
826 pub expected_log_growth: f64,
827 /// Expected observations to cross 1/α from ZERO evidence — the
828 /// conservative from-scratch budget.
829 pub budget_from_scratch: f64,
830 /// Expected observations to cross 1/α from the claim's CURRENT
831 /// log-evidence — the remaining budget; 0 when already across.
832 pub budget_remaining: f64,
833}
834
835/// Close the design loop for one contested claim: pick the probe whose
836/// predicted hypothesis-disagreement (not raw effect) buys evidence
837/// fastest, and convert the claim's current evidence into a remaining
838/// budget — "this probe should resolve the claim in ~N more observations
839/// at level α; if it does not, the alternative is weaker than
840/// hypothesized, which is itself evidence."
841///
842/// `current_log_e` is the contested claim's running log-evidence (from its
843/// [`StructuralClaim`] / [`GateVerdict::Contested`]). Returns `None` when
844/// no probe discriminates (all candidates score zero growth: the
845/// hypotheses agree on everything reachable inside the validity radius —
846/// the claim is undecidable by steering and needs a different instrument,
847/// which is a finding, not a failure).
848pub fn plan_probe_for_contested_claim(
849 probes: &[CandidateProbe],
850 fisher: &Array2<f64>,
851 alpha: f64,
852 current_log_e: f64,
853) -> Option<ProbePlan> {
854 let (probe, expected_log_growth) = select_probe_by_expected_evidence(probes, fisher)?;
855 let budget_from_scratch = expected_resolution_budget(alpha, expected_log_growth)?;
856 let nats_remaining = (-(alpha.ln()) - current_log_e).max(0.0);
857 Some(ProbePlan {
858 probe,
859 expected_log_growth,
860 budget_from_scratch,
861 budget_remaining: nats_remaining / expected_log_growth,
862 })
863}
864
865#[cfg(test)]
866mod tests {
867 use super::*;
868 use ndarray::array;
869
870 /// e-BH on a hand-checkable configuration.
871 #[test]
872 fn e_bh_rejects_exactly_the_qualifying_prefix() {
873 // m = 4, α = 0.1 → thresholds m/(αk) = 40, 20, 13.33, 10.
874 let log_e = [45.0f64.ln(), 21.0f64.ln(), 12.0f64.ln(), 1.0f64.ln()];
875 let rejected = e_benjamini_hochberg(&log_e, 0.1);
876 // e_(1)=45 ≥ 40 ✓, e_(2)=21 ≥ 20 ✓, e_(3)=12 < 13.33 ✗ → k* = 2.
877 assert_eq!(rejected, vec![0, 1]);
878
879 // A weaker tail cannot drag in a stronger prefix decision.
880 let log_e2 = [45.0f64.ln(), 5.0f64.ln(), 2.0f64.ln(), 1.0f64.ln()];
881 assert_eq!(e_benjamini_hochberg(&log_e2, 0.1), vec![0]);
882 }
883
884 #[test]
885 fn split_likelihood_equal_impossibility_is_neutral_log_evidence() {
886 let log_e = split_likelihood_log_e_value(f64::NEG_INFINITY, f64::NEG_INFINITY);
887 assert_eq!(log_e, 0.0);
888 assert!(log_e.is_finite());
889
890 let mut proc = EProcess::new();
891 proc.absorb_log(log_e).unwrap();
892 assert_eq!(proc.log_evidence(), 0.0);
893 assert_eq!(proc.steps(), 1);
894 }
895
896 #[test]
897 fn e_bh_orders_infinite_log_e_values_without_comparator_panic() {
898 let log_e = [f64::NEG_INFINITY, f64::INFINITY, 45.0f64.ln(), 1.0f64.ln()];
899 assert_eq!(e_benjamini_hochberg(&log_e, 0.1), vec![1, 2]);
900 }
901
902 /// A NaN log e-value (a degenerate claim whose `(−∞) − (−∞)` split-LR
903 /// escaped the source guard) must NOT panic the e-BH comparator — the
904 /// honest FDR surface stays robust. The NaN claim is treated as
905 /// least-evidence (`−∞`): it is never rejected, and it cannot disturb the
906 /// rejection of a genuinely strong claim.
907 #[test]
908 fn e_bh_treats_nan_as_least_evidence_without_panicking() {
909 // m = 2, α = 0.1 → threshold for the top claim is m/(αk) = 2/0.1 = 20.
910 // Claim 0 has e = 45 ≥ 20 (rejected); claim 1 is NaN (no evidence).
911 let log_e = [45.0f64.ln(), f64::NAN];
912 let rejected = e_benjamini_hochberg(&log_e, 0.1);
913 assert_eq!(
914 rejected,
915 vec![0],
916 "strong claim survives; NaN claim never rejected"
917 );
918
919 // An all-NaN ledger yields an empty (no-rejection) certificate, not a
920 // panic.
921 let all_nan = [f64::NAN, f64::NAN, f64::NAN];
922 assert!(e_benjamini_hochberg(&all_nan, 0.1).is_empty());
923 }
924
925 /// The full source→consumer chain: a shard with zero density under both
926 /// the alternative and the null produces `(−∞) − (−∞)`, which the split-LR
927 /// resolves to neutral `log E = 0` rather than NaN; banking it and
928 /// certifying must not panic. A genuinely-NaN log-likelihood is likewise
929 /// resolved to neutral evidence at the source.
930 #[test]
931 fn degenerate_split_lr_flows_through_certify_without_nan_panic() {
932 // (−∞) − (−∞): zero density under both hypotheses → neutral.
933 let neutral = split_likelihood_log_e_value(f64::NEG_INFINITY, f64::NEG_INFINITY);
934 assert_eq!(neutral, 0.0);
935 // NaN log-likelihood (un-evaluable degenerate fit) → neutral, finite.
936 let from_nan = split_likelihood_log_e_value(f64::NAN, -3.0);
937 assert!(from_nan.is_finite());
938 assert_eq!(from_nan, 0.0);
939
940 let mut ledger = StructureLedger::new();
941 let degenerate = ledger.register(ClaimKind::AtomExists { atom: 0 });
942 let strong = ledger.register(ClaimKind::AtomExists { atom: 1 });
943 // Bank the neutral split-LR on the degenerate claim — no NaN reaches
944 // the e-process.
945 ledger.absorb_log(degenerate, neutral).unwrap();
946 ledger.absorb_log(strong, 45.0f64.ln()).unwrap();
947 // certify() runs e_benjamini_hochberg internally; must not panic.
948 let certificate = ledger.certify(0.1);
949 let degenerate_entry = certificate
950 .entries
951 .iter()
952 .find(|e| e.kind == ClaimKind::AtomExists { atom: 0 })
953 .expect("degenerate claim present");
954 // Neutral evidence (log_e = 0) never qualifies → contested, not confirmed.
955 assert!(!degenerate_entry.confirmed);
956 assert_eq!(degenerate_entry.log_e, 0.0);
957 }
958
959 #[test]
960 fn e_process_absorb_log_rejects_undefined_log_products() {
961 let mut proc = EProcess::new();
962 assert!(proc.absorb_log(f64::NAN).is_err());
963
964 proc.absorb_log(f64::INFINITY).unwrap();
965 assert!(proc.absorb_log(f64::NEG_INFINITY).is_err());
966 assert_eq!(proc.log_evidence(), f64::INFINITY);
967 assert_eq!(proc.steps(), 1);
968 }
969
970 /// Ville-style sanity: under H0 (simulated fair e-values from a
971 /// likelihood ratio of identical Gaussians), the e-process crosses
972 /// 1/α rarely; under a true alternative it crosses fast and the
973 /// crossing is PERMANENT (running-sup semantics).
974 #[test]
975 fn e_process_crossing_is_permanent_and_directional() {
976 // Deterministic "stream": per-batch log-LR of N(μ,1) vs N(0,1)
977 // evaluated at x drawn from the alternative: log e = μ x − μ²/2.
978 // Use a fixed quasi-random sequence; no RNG state needed.
979 let mu = 0.6f64;
980 let mut proc_alt = EProcess::new();
981 let mut crossed_at: Option<usize> = None;
982 for t in 0..200 {
983 // x_t ~ alternative-ish deterministic surrogate around μ
984 let x = mu + 0.9 * ((t as f64 * 0.7321).sin());
985 proc_alt.absorb_log(mu * x - 0.5 * mu * mu).unwrap();
986 if proc_alt.rejects_at(0.05) && crossed_at.is_none() {
987 crossed_at = Some(t);
988 }
989 }
990 let t_cross = crossed_at.expect("true alternative must cross 1/α");
991 assert!(t_cross < 100, "evidence should accumulate quickly");
992 // Permanence: rejection holds at the end even if late evidence dips.
993 assert!(proc_alt.rejects_at(0.05));
994
995 // Null stream: x centered at 0 → expected log e = −μ²/2 < 0.
996 let mut proc_null = EProcess::new();
997 for t in 0..200 {
998 let x = 0.9 * ((t as f64 * 0.7321).sin());
999 proc_null.absorb_log(mu * x - 0.5 * mu * mu).unwrap();
1000 }
1001 assert!(
1002 !proc_null.rejects_at(0.05),
1003 "null stream must not accumulate evidence (log E = {:.3})",
1004 proc_null.log_evidence()
1005 );
1006 }
1007
1008 /// The design rule selects discrimination, not raw effect.
1009 #[test]
1010 fn probe_selection_prefers_discrimination_over_impact() {
1011 let fisher = array![[2.0, 0.0], [0.0, 0.5]];
1012 let probes = vec![
1013 // Huge effect, but both hypotheses predict it identically.
1014 CandidateProbe {
1015 delta: array![1.0, 0.0],
1016 predicted_mean_null: array![10.0, 10.0],
1017 predicted_mean_alt: array![10.0, 10.0],
1018 },
1019 // Modest effect, hypotheses disagree along the informative axis.
1020 CandidateProbe {
1021 delta: array![0.0, 1.0],
1022 predicted_mean_null: array![0.0, 0.0],
1023 predicted_mean_alt: array![1.0, 0.2],
1024 },
1025 ];
1026 let (idx, growth) =
1027 select_probe_by_expected_evidence(&probes, &fisher).expect("a probe discriminates");
1028 assert_eq!(idx, 1);
1029 // ½·(1,0.2)ᵀ diag(2,0.5) (1,0.2) = ½·(2 + 0.02) = 1.01 nats/obs.
1030 assert!((growth - 1.01).abs() < 1e-12);
1031 // Budget: ~3 observations to certify at α=0.05.
1032 let budget = expected_resolution_budget(0.05, growth).expect("budget");
1033 assert!(budget > 2.0 && budget < 4.0);
1034 }
1035
1036 /// The birth gate certifies under a true alternative, stays contested
1037 /// under the null, and never emits anything but those two verdicts.
1038 #[test]
1039 fn birth_gate_certifies_alternative_and_demotes_never_rejects() {
1040 let mut gate = AtomBirthGate::new(0.05).expect("valid alpha");
1041 // Strong shards: alternative beats the honest null sup by 1 nat each.
1042 for _ in 0..5 {
1043 gate.absorb_shard(-100.0, -101.0);
1044 }
1045 match gate.verdict() {
1046 GateVerdict::Certified { log_e } => assert!((log_e - 5.0).abs() < 1e-12),
1047 v => panic!("5 nats must certify at α=0.05, got {v:?}"),
1048 }
1049 // Permanence: a later evidence retreat cannot un-certify.
1050 gate.absorb_shard(-110.0, -100.0);
1051 assert!(matches!(gate.verdict(), GateVerdict::Certified { .. }));
1052
1053 // Null-ish stream: the prefit alternative loses to the on-shard sup
1054 // (it must, on average — the sup is fit on the eval shard itself).
1055 let mut null_gate = AtomBirthGate::new(0.05).expect("valid alpha");
1056 for _ in 0..50 {
1057 null_gate.absorb_shard(-100.3, -100.0);
1058 }
1059 match null_gate.verdict() {
1060 GateVerdict::Contested { log_e } => assert!(log_e < 0.0),
1061 v => panic!("null stream must stay contested, got {v:?}"),
1062 }
1063 assert!(AtomBirthGate::new(0.0).is_err());
1064 assert!(AtomBirthGate::new(1.0).is_err());
1065 }
1066
1067 /// Ledger: idempotent registration preserves evidence; the certificate
1068 /// splits confirmed/contested by e-BH and the entry list reproduces it.
1069 #[test]
1070 fn ledger_certificate_splits_confirmed_and_contested() {
1071 let mut ledger = StructureLedger::new();
1072 let a0 = ledger.register(ClaimKind::AtomExists { atom: 0 });
1073 let a1 = ledger.register(ClaimKind::AtomExists { atom: 1 });
1074 let edge = ledger.register(ClaimKind::BindingEdge { a: 0, b: 1 });
1075
1076 // m = 3, α = 0.1 → e-BH thresholds m/(αk) = 30, 15, 10.
1077 ledger.absorb_log(a0, 40.0f64.ln()).unwrap();
1078 ledger.absorb_log(a1, 20.0f64.ln()).unwrap();
1079 ledger.absorb_log(edge, 2.0f64.ln()).unwrap();
1080
1081 // Re-registering must return the same slot with evidence intact.
1082 let a0_again = ledger.register(ClaimKind::AtomExists { atom: 0 });
1083 assert_eq!(a0_again, a0);
1084 assert_eq!(ledger.claims()[a0].evidence.steps(), 1);
1085
1086 let cert = ledger.certify(0.1);
1087 // e_(1)=40 ≥ 30 ✓, e_(2)=20 ≥ 15 ✓, e_(3)=2 < 10 ✗ → atoms confirmed,
1088 // the binding edge stays contested.
1089 let confirmed: Vec<&ClaimKind> = cert.confirmed().map(|e| &e.kind).collect();
1090 assert_eq!(confirmed.len(), 2);
1091 assert!(confirmed.contains(&&ClaimKind::AtomExists { atom: 0 }));
1092 assert!(confirmed.contains(&&ClaimKind::AtomExists { atom: 1 }));
1093 let contested: Vec<&CertificateEntry> = cert.contested().collect();
1094 assert_eq!(contested.len(), 1);
1095 assert_eq!(contested[0].kind, ClaimKind::BindingEdge { a: 0, b: 1 });
1096
1097 assert!(ledger.absorb_log(99, 0.0).is_err());
1098 }
1099
1100 /// Resumability: a serialized ledger reloads with its evidence and
1101 /// keeps absorbing — the #973 shard contract.
1102 #[test]
1103 fn ledger_evidence_resumes_across_serialization() {
1104 let mut ledger = StructureLedger::new();
1105 let idx = ledger.register(ClaimKind::GeometryKind {
1106 atom: 3,
1107 kind: "circle".to_string(),
1108 });
1109 ledger.absorb_log(idx, 1.25).unwrap();
1110
1111 let persisted = serde_json::to_string(&ledger).expect("serialize ledger");
1112 let mut resumed: StructureLedger =
1113 serde_json::from_str(&persisted).expect("deserialize ledger");
1114 assert_eq!(resumed.claims()[idx].evidence.steps(), 1);
1115
1116 resumed.absorb_log(idx, 0.75).unwrap();
1117 let log_e = resumed.claims()[idx].evidence.log_evidence();
1118 assert!((log_e - 2.0).abs() < 1e-12);
1119 }
1120
1121 /// The probe plan discounts the remaining budget by evidence already
1122 /// banked, and floors at zero once the claim is across the line.
1123 #[test]
1124 fn probe_plan_discounts_remaining_budget_by_current_evidence() {
1125 let fisher = array![[2.0, 0.0], [0.0, 0.5]];
1126 let probes = vec![CandidateProbe {
1127 delta: array![0.0, 1.0],
1128 predicted_mean_null: array![0.0, 0.0],
1129 predicted_mean_alt: array![1.0, 0.2],
1130 }];
1131 // growth = 1.01 nats/obs (checked above); α=0.05 → need ln(20) ≈ 3.0 nats.
1132 let from_zero = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 0.0).expect("plan");
1133 assert_eq!(from_zero.probe, 0);
1134 assert!((from_zero.budget_remaining - from_zero.budget_from_scratch).abs() < 1e-12);
1135
1136 let halfway = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 1.5).expect("plan");
1137 assert!(halfway.budget_remaining < from_zero.budget_remaining);
1138 assert!((halfway.budget_remaining - (-(0.05f64.ln()) - 1.5) / 1.01).abs() < 1e-12);
1139
1140 let across = plan_probe_for_contested_claim(&probes, &fisher, 0.05, 10.0).expect("plan");
1141 assert_eq!(across.budget_remaining, 0.0);
1142
1143 // No discriminating probe → no plan (undecidable by steering).
1144 let blind = vec![CandidateProbe {
1145 delta: array![1.0, 0.0],
1146 predicted_mean_null: array![5.0, 5.0],
1147 predicted_mean_alt: array![5.0, 5.0],
1148 }];
1149 assert!(plan_probe_for_contested_claim(&blind, &fisher, 0.05, 0.0).is_none());
1150 }
1151
1152 /// The p→e calibrator on hand-checkable values, including its edges.
1153 #[test]
1154 fn p_to_e_calibrator_hand_values() {
1155 // e(p) = ½ p^{−1/2}: p = 1 → e = 0.5; p = 0.04 → e = 2.5; p = 1e-4 → e = 50.
1156 assert!((log_e_from_p_calibrator(1.0).unwrap() - 0.5f64.ln()).abs() < 1e-12);
1157 assert!((log_e_from_p_calibrator(0.04).unwrap() - 2.5f64.ln()).abs() < 1e-12);
1158 assert!((log_e_from_p_calibrator(1e-4).unwrap() - 50.0f64.ln()).abs() < 1e-12);
1159 assert!(log_e_from_p_calibrator(0.0).is_err());
1160 assert!(log_e_from_p_calibrator(1.5).is_err());
1161 assert!(log_e_from_p_calibrator(f64::NAN).is_err());
1162 }
1163
1164 /// The e-value validity condition: under the null `P ~ Uniform(0, 1]`,
1165 /// the calibrated e-value must satisfy `E_{H0}[e(P)] = ∫₀¹ e(p) dp ≤ 1`
1166 /// (Wang–Ramdas e-BH controls FDR ONLY for genuine e-values). The κ = ½
1167 /// member `e(p) = ½ p^{−1/2}` integrates to exactly 1, the boundary of
1168 /// admissibility. We verify this numerically with a midpoint Riemann sum
1169 /// over the unit interval (which UNDER-estimates `∫ p^{−1/2}` slightly
1170 /// because the integrand is convex, so the analytic value 1 sits just
1171 /// above the quadrature estimate — both safely ≤ 1 + tolerance).
1172 ///
1173 /// This is the property `e = 1/p` VIOLATES — `∫₀¹ (1/p) dp = ∞` — which
1174 /// is why the behavioral-head and anova-atom paths route through this
1175 /// calibrator rather than `−ln p`.
1176 #[test]
1177 fn p_to_e_calibrator_null_expectation_at_most_one() {
1178 let n = 2_000_000usize;
1179 let h = 1.0 / n as f64;
1180 let mut mean_e = 0.0_f64;
1181 for i in 0..n {
1182 // Midpoint of cell i: p = (i + 0.5)/n, always in (0, 1).
1183 let p = (i as f64 + 0.5) * h;
1184 let e = log_e_from_p_calibrator(p).unwrap().exp();
1185 mean_e += e * h;
1186 }
1187 // Analytic E_{H0}[e(P)] = 1; allow a small quadrature tolerance, but
1188 // it MUST NOT exceed 1 by more than that (an invalid calibrator like
1189 // 1/p would diverge here, not land near 1).
1190 assert!(
1191 mean_e <= 1.0 + 1e-3,
1192 "calibrated e-value null expectation {mean_e} exceeds 1 — not a valid e-value"
1193 );
1194 assert!(
1195 mean_e > 0.99,
1196 "calibrated e-value null expectation {mean_e} far below the analytic 1.0"
1197 );
1198 }
1199
1200 /// POWER STUDY, null side: the heuristic gate every dictionary paper
1201 /// runs — "accept the K+1-th atom the first time the cumulative
1202 /// likelihood ratio shows improvement" — versus the e-gate, on a
1203 /// family of NULL streams peeked at after every shard. The per-shard
1204 /// log-LR is `μ x_t − μ²/2` with `x_t = A sin(ω t + φ)` (a
1205 /// deterministic null surrogate: mean drift −μ²/2 < 0, bounded
1206 /// fluctuation). The naive gate's false-accept mechanism is exactly
1207 /// optional stopping: any phase whose partial sums wander above zero
1208 /// at ANY peek accepts a nonexistent atom. The e-gate needs
1209 /// log(1/α) ≈ 3.0 nats, and the partial-sum fluctuation is bounded by
1210 /// `μ·A/sin(ω/2) ≈ 1.51` nats (Dirichlet-kernel bound) BEFORE the
1211 /// negative drift — so it can never certify on any phase, which is
1212 /// Ville's inequality made concrete.
1213 #[test]
1214 fn power_study_null_naive_peeking_gate_false_accepts_e_gate_never() {
1215 let mu = 0.6f64;
1216 let amp = 0.9f64;
1217 let omega = 0.7321f64;
1218 let n_phases = 60usize;
1219 let n_shards = 200usize;
1220
1221 let mut naive_false_accepts = 0usize;
1222 let mut e_gate_false_accepts = 0usize;
1223 for k in 0..n_phases {
1224 let phase = 2.0 * std::f64::consts::PI * (k as f64) / (n_phases as f64);
1225 let mut gate = AtomBirthGate::new(0.05).expect("alpha");
1226 let mut cum_log_lr = 0.0f64;
1227 let mut naive_accepted = false;
1228 for t in 0..n_shards {
1229 let x = amp * ((t as f64) * omega + phase).sin();
1230 let log_lr = mu * x - 0.5 * mu * mu;
1231 cum_log_lr += log_lr;
1232 // The broken test: peek, accept on any improvement.
1233 if cum_log_lr > 0.0 {
1234 naive_accepted = true;
1235 }
1236 gate.absorb_shard(log_lr, 0.0);
1237 }
1238 if naive_accepted {
1239 naive_false_accepts += 1;
1240 }
1241 if matches!(gate.verdict(), GateVerdict::Certified { .. }) {
1242 e_gate_false_accepts += 1;
1243 }
1244 }
1245 // The naive gate false-accepts on a large fraction of null phases
1246 // (any phase with early-positive partial sums); the e-gate on none.
1247 assert!(
1248 naive_false_accepts >= n_phases / 3,
1249 "the peeking gate should false-accept often under the null \
1250 (got {naive_false_accepts}/{n_phases})"
1251 );
1252 assert_eq!(
1253 e_gate_false_accepts, 0,
1254 "the e-gate must never certify under the null"
1255 );
1256 }
1257
1258 /// POWER STUDY, alternative side, through the orchestration harness:
1259 /// a planted K+1-th atom worth 0.5 nats/shard CROSSES the single-hypothesis
1260 /// Ville bar in ⌈log(1/α)/0.5⌉ = 6 shards — the realized
1261 /// time-to-certification matching the design-time
1262 /// `expected_resolution_budget`. The gate keeps absorbing past the
1263 /// crossing (the crossing time is latched in `certified_at_step`, so it is
1264 /// not lost), banking the full stream's evidence — which is exactly what
1265 /// the dictionary-level e-BH certificate needs to clear its higher
1266 /// multiplicity bar. The alternative refits on every shard regardless.
1267 #[test]
1268 fn power_study_planted_atom_certifies_at_the_predicted_budget() {
1269 let growth = 0.5f64;
1270 let n_shards = 20usize;
1271 let (gate, alt_state) = run_atom_birth_gate(
1272 0.05,
1273 0usize, // alt state = number of shards folded into the fit
1274 0..n_shards,
1275 |_, _| -99.5, // prefit alternative log-lik on the shard
1276 |_| -100.0, // honest null sup on the shard
1277 |folded, _| folded + 1,
1278 )
1279 .expect("valid alpha");
1280
1281 // Full-stream evidence is banked: 0.5 nats/shard × 20 shards = 10 nats,
1282 // far past the single-hypothesis bar — this surplus is what the e-BH
1283 // dictionary certificate consumes against its `ln(m/(α·k))` bar.
1284 match gate.verdict() {
1285 GateVerdict::Certified { log_e } => {
1286 assert!((log_e - growth * n_shards as f64).abs() < 1e-12)
1287 }
1288 v => panic!("planted atom must certify, got {v:?}"),
1289 }
1290 // Realized time-to-certification (first-passage of the running sup over
1291 // `ln(1/α)`) == the design-time budget, rounded up.
1292 let budget = expected_resolution_budget(0.05, growth).expect("budget");
1293 assert_eq!(gate.certified_at_step(), Some(budget.ceil() as usize));
1294 assert_eq!(gate.certified_at_step(), Some(6));
1295 // Absorption does NOT stop at the crossing: every shard is banked.
1296 assert_eq!(gate.test.process.steps(), n_shards);
1297 // The alternative state saw the whole stream.
1298 assert_eq!(alt_state, n_shards);
1299 }
1300
1301 /// Work-plan step 4, closed end-to-end: a contested claim gets a probe
1302 /// plan, the probe's realized outcomes are scored under both FROZEN
1303 /// hypotheses via [`StructureLedger::absorb_probe_outcome`], and the
1304 /// banked evidence flips the claim to confirmed within a small multiple
1305 /// of the plan's predicted resolution budget. Outcome noise is a
1306 /// deterministic bounded surrogate (zero-mean sinusoid), so under the
1307 /// true alternative each probe's expected log-growth is exactly the
1308 /// design value.
1309 #[test]
1310 fn design_loop_resolves_contested_claim_within_predicted_budget() {
1311 let mut ledger = StructureLedger::new();
1312 let idx = ledger.register(ClaimKind::GeometryKind {
1313 atom: 0,
1314 kind: "circle".to_string(),
1315 });
1316
1317 // Local Gaussian output model, unit-isotropic noise in the
1318 // Fisher-whitened coordinates: per-observation expected log-growth
1319 // under H1 is exactly the planned ½‖μ₁−μ₀‖²_F.
1320 let fisher = array![[1.0, 0.0], [0.0, 1.0]];
1321 let mu0 = array![0.0, 0.0];
1322 let mu1 = array![1.2, 0.5];
1323 let probes = vec![CandidateProbe {
1324 delta: array![0.0, 1.0],
1325 predicted_mean_null: mu0.clone(),
1326 predicted_mean_alt: mu1.clone(),
1327 }];
1328 let alpha = 0.05;
1329 let plan = plan_probe_for_contested_claim(&probes, &fisher, alpha, 0.0).expect("plan");
1330 assert_eq!(plan.probe, 0);
1331 // ½‖μ₁−μ₀‖² = ½(1.44 + 0.25) = 0.845 nats/obs; ln 20 ≈ 3.0 ⇒ ~3.6 obs.
1332 assert!((plan.expected_log_growth - 0.845).abs() < 1e-12);
1333 let budget = plan.budget_remaining.ceil().max(1.0) as usize;
1334
1335 // Run the probe loop: outcomes realized under the TRUE alternative
1336 // (mean μ₁ plus bounded zero-mean fluctuation); both hypotheses'
1337 // densities were frozen above, before any outcome existed.
1338 let mut observations = 0usize;
1339 while !ledger.claims()[idx].evidence.rejects_at(alpha) {
1340 observations += 1;
1341 assert!(
1342 observations <= 4 * budget,
1343 "claim must resolve within a small multiple of the predicted \
1344 budget {budget}; still contested after {observations} probes"
1345 );
1346 let t = observations as f64;
1347 let eps0 = 0.8 * (t * 0.7321).sin();
1348 let eps1 = 0.8 * (t * 1.1173).cos();
1349 let y = array![mu1[0] + eps0, mu1[1] + eps1];
1350 // Unit-Gaussian log-densities under each frozen hypothesis; the
1351 // shared normalizer cancels in the ratio.
1352 let d1 = &y - &mu1;
1353 let d0 = &y - &mu0;
1354 ledger
1355 .absorb_probe_outcome(idx, -0.5 * d1.dot(&d1), -0.5 * d0.dot(&d0))
1356 .expect("absorb");
1357 }
1358 let cert = ledger.certify(alpha);
1359 assert!(
1360 cert.confirmed()
1361 .any(|e| matches!(e.kind, ClaimKind::GeometryKind { atom: 0, .. }))
1362 );
1363 }
1364}