pub struct Tower4<const K: usize> {
pub v: f64,
pub g: [f64; K],
pub h: [[f64; K]; K],
pub t3: [[[f64; K]; K]; K],
pub t4: [[[[f64; K]; K]; K]; K],
}Expand description
Truncated fourth-order multivariate Taylor scalar in K variables.
See the module documentation for semantics and conventions. Copy is
intentional despite the size (2 KiB at K=4): towers are per-row
temporaries that live entirely in registers/stack during a row program,
and value semantics keep program code readable (a * b + c).
Fields§
§v: f64Value ℓ.
g: [f64; K]Gradient ∂ℓ/∂p_a.
h: [[f64; K]; K]Hessian ∂²ℓ/∂p_a∂p_b (symmetric).
t3: [[[f64; K]; K]; K]Third derivatives ∂³ℓ/∂p_a∂p_b∂p_c (fully symmetric).
t4: [[[[f64; K]; K]; K]; K]Fourth derivatives ∂⁴ℓ/∂p_a∂p_b∂p_c∂p_d (fully symmetric).
Implementations§
Source§impl<const K: usize> Tower4<K>
impl<const K: usize> Tower4<K>
Sourcepub fn variable(value: f64, idx: usize) -> Self
pub fn variable(value: f64, idx: usize) -> Self
The seeded variable p_idx with current value value:
unit first derivative in slot idx, zero elsewhere and above.
Sourcepub fn mul(&self, o: &Self) -> Self
pub fn mul(&self, o: &Self) -> Self
Exact truncated Leibniz product D_S(ab) = Σ_{T ⊆ S} D_T(a) · D_{S∖T}(b).
§Codegen
Each output entry’s 2^m subset sum is written as a compact straight-line
expression instead of the shared [jet_algebra::leibniz_product] subset
walker (which, per entry, builds SlotBufs and match-dispatches the
deriv closure across all 2^m subsets). The loop nest over (i,j,k,l)
is unchanged — only the inner per-entry sum is unrolled — so this does NOT
unroll over K and does NOT bloat code: on a Tower4<9> mul-and-read
consumer the new form is faster AND smaller (asm: 34 outlined walker bl
calls → 0, 21.1 KiB → 14.3 KiB, +100 NEON .2d ops).
BIT-IDENTICAL to the walker: each entry’s terms are in the walker’s exact
subset-enumeration order (subset bit b ↔ position b, sub = 0..2^m),
and the per-entry acc accumulator mirrors the walker’s total = 0.0
start so a signed-zero leading product collapses to +0.0 identically —
which matters because real jets carry exact-0.0 channels
(constant/variable towers). Proven to_bits-identical on
v/g/h/t3/t4 across K ∈ {2,3,4,9}, 5000 inputs each with ~30 %
exact-0.0 channels and signed values (a no-leading-0.0 form fails this
stress — the accumulator start is load-bearing).
Sourcepub fn add(&self, o: &Self) -> Self
pub fn add(&self, o: &Self) -> Self
Ref-taking elementwise sum, the by-ref twin of the std::ops::Add
operator (which consumes by value). Mirrors the inherent mul/scale
API so a chain like a.mul(&b).add(&c) reads uniformly without moving
out of the borrowed operands.
Sourcepub fn sub(&self, o: &Self) -> Self
pub fn sub(&self, o: &Self) -> Self
Ref-taking elementwise difference, the by-ref twin of std::ops::Sub.
Sourcepub fn compose_unary(&self, d: [f64; 5]) -> Self
pub fn compose_unary(&self, d: [f64; 5]) -> Self
Exact multivariate Faà di Bruno composition f ∘ self.
d = [f(u), f′(u), f″(u), f‴(u), f⁗(u)] evaluated at u = self.v —
the SAME [f64; 5] stack shape the families’ existing
unary_derivatives_* helpers produce, so those special-function
stacks (Φ, log-Φ, normal pdf, …) plug in directly.
The order-m output sums over the set partitions of the m indices
(Bell(3) = 5 terms at order 3, Bell(4) = 15 at order 4), grouped by
block count: each partition into r blocks contributes
f⁽ʳ⁾ · Π_blocks D_block(u).
§Codegen
Evaluated as a compact closed form (the Bell(4)=15 set-partitions of
t4, Bell(3)=5 of t3, …) instead of routing through the recursive
jet_algebra::faa_di_bruno walker (per-output for_each_partition
recursion + per-block SlotBuf + closure dispatch). The loop nest is
identical to the walker’s (for i,j,k,l); only the per-entry partition
sum is straight-line, so this does NOT unroll over K and does NOT
bloat code — measured on a Tower4<9> compose-and-read consumer the new
form is both faster and SMALLER (asm: 94 outlined walker bl calls → 0,
47.5 KiB → 16.7 KiB, +197 NEON .2d ops).
BIT-IDENTICAL to the walker: each channel’s terms are emitted in the
walker’s exact partition-enumeration order, each term’s block products
are left-associated exactly as the walker’s prod *= block, and the
per-channel acc accumulator mirrors the walker’s total = 0.0 start
(so signed-zero products collapse to +0.0 identically). The order-4
term sequence was generated from the walker’s own enumeration. Proven
to_bits-identical on v/g/h/t3/t4 across K ∈ {2,3,4,9},
5000 random inputs each (zeroed / sign-varied stacks included).
Sourcepub fn compose_unary_with(&self, stack_fn: impl Fn(f64) -> [f64; 5]) -> Self
pub fn compose_unary_with(&self, stack_fn: impl Fn(f64) -> [f64; 5]) -> Self
Compose with a unary special-function whose [f64; 5] derivative STACK is
built from the base value through stack_fn — the scalar arm of the
generic-over-Lane compose seam (see
Tower4Lane::compose_unary_with). Evaluates stack_fn(self.v) ONCE and
forwards to Self::compose_unary, so it is BIT-IDENTICAL to the explicit
self.compose_unary(stack_fn(self.v)). Writing a program against this seam
lets it re-instantiate, unchanged, at Tower4Lane (where each of the four
lanes carries a DISTINCT base value and stack_fn is re-run per lane).
Sourcepub fn compose_unary_single_slot(&self, d: [f64; 5], slot: usize) -> Self
pub fn compose_unary_single_slot(&self, d: [f64; 5], slot: usize) -> Self
Single-active-slot fast path for Self::compose_unary.
When the inner jet self has derivative support ONLY on the all-slot
diagonal channels — i.e. it is a univariate jet in primary slot
scattered into the K-wide layout (g[a] = 0, h[a][b] = 0,
t3 = 0, t4 = 0 for any axis ≠ slot) — the multivariate Faà di
Bruno walk collapses. Every output channel whose axis tuple contains an
axis ≠ slot is structurally 0: each set-partition has a block
covering that axis, that block reads an off-slot derivative of self
(which is 0), so the block product and the whole partition vanish, and
the channel sums to the walker’s total = 0.0 start, i.e. +0.0. Only
the five diagonal channels (v, g[slot], h[slot][slot],
t3[slot]³, t4[slot]⁴) survive.
This computes exactly those five as STRAIGHT-LINE accumulations, each in
the EXACT term order of Self::compose_unary’s diagonal
(i = j = k = l = slot) case — so they are BIT-IDENTICAL to
Self::compose_unary on the diagonal — and leaves every other channel
at the zero-init +0.0, which the full walk also produces (the
off-slot collapse is to_bits-+0.0, signed-zero products included;
proven across K ∈ {2,3,4,9}, 5000 single-slot inputs each). At any
K ≥ 2 this is far fewer floating-point operations than materialising
the full 1 + K + K² + K³ + K⁴ channel set whose off-diagonal entries
are all zero, and far cheaper than the recursive set-partition walker the
diagonal channels previously routed through (a measured ~9.5× speedup vs
the full compose_unary, recovering a 5.9× walker regression at the
K ∈ {2,3} BMS tower widths).
#[inline] so an adopting consumer pays no bl call (uninlined, the
five-channel build does not amortise the call/spill overhead).
§Precondition
The caller guarantees the single-active-slot structure. If it does not
hold, the off-slot channels would be wrongly zeroed; use the full
Self::compose_unary in that case.
Sourcepub fn powf(&self, a: f64) -> Self
pub fn powf(&self, a: f64) -> Self
self^a for real exponent a. Caller guarantees a positive base.
Trait Implementations§
impl<const K: usize> Copy for Tower4<K>
Source§impl<const K: usize> JetScalar<K> for Tower4<K>
The full dense crate::jet_tower::Tower4 is itself a JetScalar: it
carries EVERY channel, so a row expression written ONCE against JetScalar
can be evaluated at Tower4 to obtain the full (v, g, H, t3, t4) in one
pass. This is BOTH the #932 oracle ground truth the packed Order2 /
OneSeed / TwoSeed scalars are pinned against, AND a production scalar:
a family whose uncontracted third / fourth derivative tensors are needed
(the BMS rigid third_full / fourth_full caches) evaluates the SAME
generic row-NLL expression at Tower4 and reads .t3 / .t4 off the
result — so the dense tensors come from the single source of truth, not a
separately hand-written jet. The packed scalars serve the consumers that
need only (v, g, H) (Order2) or one / two contractions
(OneSeed / TwoSeed) without paying for the dense tensors.
impl<const K: usize> JetScalar<K> for Tower4<K>
The full dense crate::jet_tower::Tower4 is itself a JetScalar: it
carries EVERY channel, so a row expression written ONCE against JetScalar
can be evaluated at Tower4 to obtain the full (v, g, H, t3, t4) in one
pass. This is BOTH the #932 oracle ground truth the packed Order2 /
OneSeed / TwoSeed scalars are pinned against, AND a production scalar:
a family whose uncontracted third / fourth derivative tensors are needed
(the BMS rigid third_full / fourth_full caches) evaluates the SAME
generic row-NLL expression at Tower4 and reads .t3 / .t4 off the
result — so the dense tensors come from the single source of truth, not a
separately hand-written jet. The packed scalars serve the consumers that
need only (v, g, H) (Order2) or one / two contractions
(OneSeed / TwoSeed) without paying for the dense tensors.
Source§fn variable(x: f64, axis: usize) -> Self
fn variable(x: f64, axis: usize) -> Self
p_axis at value x: unit first derivative in slot
axis, all higher channels zero. (The nilpotent / cross channels of the
directional scalars are seeded zero — callers set ε/δ directions through
the scalar-specific OneSeed::seed_direction / TwoSeed::seed.)Source§fn compose_unary(&self, d: [f64; 5]) -> Self
fn compose_unary(&self, d: [f64; 5]) -> Self
f ∘ self, given the outer
derivative stack d = [f(u), f′(u), f″(u), f‴(u), f⁗(u)] at
u = self.value(). Read moreSource§fn compose_unary_with(&self, stack_fn: impl Fn(f64) -> [f64; 5]) -> Self
fn compose_unary_with(&self, stack_fn: impl Fn(f64) -> [f64; 5]) -> Self
stack_fn — the generic-over-Lane
seam that lets a single-sourced row program instantiate at BOTH the scalar
f64 jets and the SIMD f64x4 batch towers from ONE expression. Read moreSource§fn ln(&self) -> Self
fn ln(&self) -> Self
ln(self). Caller guarantees positivity. Same derivative stack
crate::jet_tower::Tower4::ln uses, so any program written over both
matches term-for-term.Source§fn powf(&self, a: f64) -> Self
fn powf(&self, a: f64) -> Self
self^a for real exponent a. Caller guarantees a positive base.
Mirrors crate::jet_tower::Tower4::powf (falling-factorial stack).Source§fn ln_gamma(&self) -> Self
fn ln_gamma(&self) -> Self
ln Γ(self). Caller guarantees a positive argument. Uses the SAME
hand-certified derivative stack crate::jet_tower::Tower4::ln_gamma
consumes (crate::jet_tower::ln_gamma_derivative_stack), so any
program written over both matches term-for-term.Source§fn digamma(&self) -> Self
fn digamma(&self) -> Self
ψ(self) = d/dx ln Γ(x) (digamma). Caller guarantees a positive
argument. Same hand-certified stack
crate::jet_tower::digamma_derivative_stack.Auto Trait Implementations§
impl<const K: usize> Freeze for Tower4<K>
impl<const K: usize> RefUnwindSafe for Tower4<K>
impl<const K: usize> Send for Tower4<K>
impl<const K: usize> Sync for Tower4<K>
impl<const K: usize> Unpin for Tower4<K>
impl<const K: usize> UnsafeUnpin for Tower4<K>
impl<const K: usize> UnwindSafe for Tower4<K>
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
impl<ST, DT> CastableFrom<ST, Initialized, Initialized> for DT
impl<ST, DT> CastableFrom<ST, Uninit, Uninit> for DT
Source§impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> ClosedNeg for Twhere
T: Neg<Output = T>,
impl<T> Read<Exclusive, BecauseExclusive> for Twhere
T: ?Sized,
Source§impl<const K: usize, S> RowJet<K> for Swhere
S: JetScalar<K>,
impl<const K: usize, S> RowJet<K> for Swhere
S: JetScalar<K>,
Source§fn variable(x: f64, slot: usize) -> S
fn variable(x: f64, slot: usize) -> S
slot at value x (unit first derivative in slot),
broadcast to every lane. Per-lane-DISTINCT seeding for the batch path is
done by the lane instantiators (generic_batched_fourth_tower /
generic_batched_third_tower), which build the tower variables directly
from each row’s primaries; this method is for any row-invariant auxiliary
variable a body introduces.Source§fn neg(&self) -> S
fn neg(&self) -> S
scale(-1.0); the blanket overrides it
to delegate to crate::jet_scalar::JetScalar::neg.Source§fn compose_unary_with<const N: usize>(
&self,
stack_fn: impl Fn(f64) -> [f64; N],
) -> S
fn compose_unary_with<const N: usize>( &self, stack_fn: impl Fn(f64) -> [f64; N], ) -> S
[f64; N]
derivative stack is built from the running base value PER LANE through
stack_fn. This is the SHARED-TRAIT version of the compose_unary_with
inherent method that already exists on both the scalar towers and the lane
towers: on a scalar jet stack_fn is run once at the value; on an f64x4
lane tower it is re-run per lane (the four rows carry four distinct base
values), so lane i is to_bits-identical to the scalar result on row i.
Making it a trait method is precisely what lets a body written once over
R: RowJet<K> instantiate at the batch towers. N is widened/narrowed to
the tower’s native width by [resize_stack] (N == 5 is a verbatim copy).Source§fn guard(&self, pred: impl Fn(f64) -> bool) -> GuardVerdict
fn guard(&self, pred: impl Fn(f64) -> bool) -> GuardVerdict
pred on each active lane’s value channel
and report which lanes failed (see GuardVerdict). A scalar jet checks
its one value; a lane tower checks all four. Lets a batched program detect
an out-of-domain row in a 4-group and bail that group to the scalar tail.Source§fn scale_rows(&self, s: f64) -> S
fn scale_rows(&self, s: f64) -> S
s
(Self::Value). On a scalar jet Self::Value = f64, so this is exactly
scale and the scalar call sites stay BIT-IDENTICAL when
.scale(x) is rewritten to .scale_rows(x); on an f64x4 lane tower
Self::Value = [f64; 4] and lane i is multiplied by s[i]. This is the
primitive that lets a batched body carry CONTINUOUS per-row data — the
survival covariance_ones / z_sum / observation-weight wi factors that
enter the jet algebra as .scale(per_row_value) and that the single-f64
scale would broadcast wrongly across the four rows. Build
s from the lane→row map with pack_rows.Source§fn pack_rows(rows: &[usize], value_of: impl Fn(usize) -> f64) -> f64
fn pack_rows(rows: &[usize], value_of: impl Fn(usize) -> f64) -> f64
rows: value_of(r)
is evaluated for each active lane’s row and packed into Self::Value (a
single f64 on a scalar jet, [f64; 4] on an f64x4 lane tower). This is
how a body written once over RowJet feeds per-row CONTINUOUS data (the
arguments to scale_rows) into the batch path without
knowing the concrete representation: the program holds the per-row data and
the caller threads rows (length 1 scalar, length 4 batch) into
RowNllProgramRowJet::row_nll, so the body writes
x.scale_rows(R::pack_rows(rows, |r| self.cov(r))). A multiplicative weight
buried in a compose_unary_with stack is pulled out the same way:
x.compose_unary_with(|u| stack(u, 1.0)).scale_rows(R::pack_rows(rows, |r| self.wi(r))).
(Binary per-row branches such as the event indicator di are kept
lane-uniform by grouping and the guard bail, not packed.)Source§impl<SS, SP> SupersetOf<SS> for SPwhere
SS: SubsetOf<SP>,
impl<SS, SP> SupersetOf<SS> for SPwhere
SS: SubsetOf<SP>,
Source§fn to_subset(&self) -> Option<SS>
fn to_subset(&self) -> Option<SS>
self from the equivalent element of its
superset. Read moreSource§fn is_in_subset(&self) -> bool
fn is_in_subset(&self) -> bool
self is actually part of its subset T (and can be converted to it).Source§fn to_subset_unchecked(&self) -> SS
fn to_subset_unchecked(&self) -> SS
self.to_subset but without any property checks. Always succeeds.Source§fn from_subset(element: &SS) -> SP
fn from_subset(element: &SS) -> SP
self to the equivalent element of its superset.