Module jet_partitions

Expand description

Bitmask-coefficient multi-directional jets used by marginal-slope and latent-survival row kernels.

The layout stores one coefficient per direction mask. The calculus itself lives in crate::jet_algebra: that module owns the layout-agnostic Leibniz / Faà di Bruno combinatorics once, and the scalar (n_dirs <= 1) path here still routes through it so a fix to the rule is a fix to both representations.

§Why this layout is special (and how the hot path exploits it)

Each direction is seeded linearly (one first-derivative slot), so every direction variable squares to zero. The coefficients therefore form the commutative multilinear / set-function algebra: coeffs[mask] is the coefficient of Π_{i ∈ mask} ε_i. In that algebra two facts collapse the generic combinatorial walkers into tight branch-free arithmetic:

mul is the subset (zeta-style) convolution out[mask] = Σ_{sub ⊆ mask} a[sub] · b[mask \ sub]. The shared leibniz_product walker rebuilds two SlotBufs and folds bit lists back into masks (mask_of) per subset; here we enumerate the submasks of mask directly — mask \ sub == mask ^ sub because sub ⊆ mask — in the same ascending order the walker used, so the floating-point accumulation is bit-for-bit identical while every SlotBuf/closure/mask_of allocation and indirection disappears (3^K pure FMAs, no heap, no dyn).
compose_unary is the truncated Faà di Bruno composition, computed here by the exact truncated-Taylor reassociation rather than a direct set-partition sum. Let v be the non-constant part of self (v[0] = 0, v[mask] = self[mask]) and let v^{⊛k} be the k-fold subset convolution (the multilinear power). The ordered-tuple identity v^{⊛k}[mask] = k! · Σ_{π ⊢ mask, |π| = k} Π_{B ∈ π} v[B] turns the set-partition sum into a degree-4 polynomial in v:
```
f(self)[mask] = Σ_{k=0}^{4} (f^{(k)} / k!) · v^{⊛k}[mask]      (mask ≠ 0)
f(self)[0]    = f^{(0)}
```
so a composition is just three subset convolutions (v², v³=v²⊛v, v⁴=v²⊛v² — the Motzkin floor for a quartic) plus a five-term combine. That is ~3× fewer FLOPs than the per-mask partition gather; each convolution is a four-lane compensated dot product (Ogita–Rump–Oishi Dot2, FMA-split products + TwoSum carry) so the result is computed in ~double the working precision and the rounding of v² cannot compound through v³/v⁴; the final per-mask combine is Neumaier-compensated and wide::f64x4-vectorised; and the whole call runs on reused thread-local scratch with no per-call heap traffic. The reassociation is algebraically exact; accuracy-vs-truth (a double-double oracle) is the test gate and is strictly ≤ the old partition sum’s error (see tests).

Structs§

MultiDirJet

Statics§

COMPOSE_UNARY_CALLS
MUL_CALLS