Expand description
Bitmask-coefficient multi-directional jets used by marginal-slope and latent-survival row kernels.
The layout stores one coefficient per direction mask. The calculus itself
lives in crate::jet_algebra: that module owns the layout-agnostic
Leibniz / Faà di Bruno combinatorics once, and the scalar (n_dirs <= 1)
path here still routes through it so a fix to the rule is a fix to both
representations.
§Why this layout is special (and how the hot path exploits it)
Each direction is seeded linearly (one first-derivative slot), so every
direction variable squares to zero. The coefficients therefore form the
commutative multilinear / set-function algebra: coeffs[mask] is the
coefficient of Π_{i ∈ mask} ε_i. In that algebra two facts collapse the
generic combinatorial walkers into tight branch-free arithmetic:
-
mulis the subset (zeta-style) convolutionout[mask] = Σ_{sub ⊆ mask} a[sub] · b[mask \ sub]. The sharedleibniz_productwalker rebuilds twoSlotBufs and folds bit lists back into masks (mask_of) per subset; here we enumerate the submasks ofmaskdirectly —mask \ sub == mask ^ subbecausesub ⊆ mask— in the same ascending order the walker used, so the floating-point accumulation is bit-for-bit identical while everySlotBuf/closure/mask_ofallocation and indirection disappears (3^Kpure FMAs, no heap, nodyn). -
compose_unaryis the truncated Faà di Bruno composition, computed here by the exact truncated-Taylor reassociation rather than a direct set-partition sum. Letvbe the non-constant part ofself(v[0] = 0,v[mask] = self[mask]) and letv^{⊛k}be thek-fold subset convolution (the multilinear power). The ordered-tuple identityv^{⊛k}[mask] = k! · Σ_{π ⊢ mask, |π| = k} Π_{B ∈ π} v[B]turns the set-partition sum into a degree-4 polynomial inv:f(self)[mask] = Σ_{k=0}^{4} (f^{(k)} / k!) · v^{⊛k}[mask] (mask ≠ 0) f(self)[0] = f^{(0)}so a composition is just three subset convolutions (
v²,v³=v²⊛v,v⁴=v²⊛v²— the Motzkin floor for a quartic) plus a five-term combine. That is ~3× fewer FLOPs than the per-mask partition gather; each convolution is a four-lane compensated dot product (Ogita–Rump–Oishi Dot2, FMA-split products + TwoSum carry) so the result is computed in ~double the working precision and the rounding ofv²cannot compound throughv³/v⁴; the final per-mask combine is Neumaier-compensated andwide::f64x4-vectorised; and the whole call runs on reused thread-local scratch with no per-call heap traffic. The reassociation is algebraically exact; accuracy-vs-truth (a double-double oracle) is the test gate and is strictly ≤ the old partition sum’s error (seetests).