pub fn try_fast_spectral_leverage_diagonal(
x: &DesignMatrix,
g: ArrayView2<'_, f64>,
) -> Option<Array1<f64>>

GPU-offloaded spectral leverage diagonal h[i] = ‖(X G)_{i,:}‖².
G is the (p × rank) spectral factor with G_ε(H) = G Gᵀ; the per-row
leverage is the squared norm of the i-th row of X G. This is the dominant
n-dependent cost of every REML outer evaluation at large scale (issue
#922), and historically ran only on the CPU while the device pool idled.
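
As a point of reference, the quantity being computed can be sketched on the CPU with plain slices. This is only an illustration of the formula h[i] = ‖(X G)_{i,:}‖², not the crate's faer or GPU path; the function name `leverage_diagonal_reference` and the row-major flat-slice layout are assumptions made for this example.

```rust
/// Hypothetical reference: h[i] = squared norm of row i of X * G.
/// `x` is n x p and `g` is p x rank, both row-major flat slices
/// (a layout assumed for this sketch only).
fn leverage_diagonal_reference(
    x: &[f64],
    n: usize,
    p: usize,
    g: &[f64],
    rank: usize,
) -> Vec<f64> {
    assert_eq!(x.len(), n * p, "x must hold n * p entries");
    assert_eq!(g.len(), p * rank, "g must hold p * rank entries");
    (0..n)
        .map(|i| {
            let row = &x[i * p..(i + 1) * p];
            // Sum over columns k of G of (row . G[:, k])^2.
            (0..rank)
                .map(|k| {
                    let dot = (0..p).map(|j| row[j] * g[j * rank + k]).sum::<f64>();
                    dot * dot
                })
                .sum::<f64>()
        })
        .collect()
}
```

The product X G is never materialised here; each row contributes `rank` dot products, so the work is O(n · p · rank), which is why the cost scales with the row count n.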
The row dimension is split into byte-balanced chunks scattered across the
whole device pool via super::pool::scatter_batched — the same
whole-solve row-block granularity as Arrow-Schur — and each tile runs one
cuBLAS GEMM X_chunk · G on its bound ordinal before reducing row-wise
sum-of-squares. The arithmetic is the same f64 computation as the CPU faer
path (modulo IEEE-754 reduction order). If no device is available, the shape
is below the offload threshold, or any tile fails, the function returns None
and the caller runs its deterministic CPU stream.
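
The chunk-then-fall-back contract described above can be sketched in plain Rust. This is a CPU-only illustration under stated assumptions: `tile_row_norms` stands in for one device tile (the real tiles run a cuBLAS GEMM), chunks are fixed-size row blocks rather than byte-balanced ones, and `rows_per_chunk` and `min_rows` are hypothetical parameters, not the crate's actual thresholds.

```rust
/// Hypothetical stand-in for one tile: row-wise sum of squares of
/// X_chunk * G, with `x_chunk` and `g` as row-major flat slices.
/// Returns None on a shape mismatch to model a tile failure.
fn tile_row_norms(x_chunk: &[f64], p: usize, g: &[f64], rank: usize) -> Option<Vec<f64>> {
    if p == 0 || x_chunk.len() % p != 0 || g.len() != p * rank {
        return None;
    }
    let mut out = Vec::with_capacity(x_chunk.len() / p);
    for row in x_chunk.chunks_exact(p) {
        let mut h = 0.0;
        for k in 0..rank {
            let mut dot = 0.0;
            for j in 0..p {
                dot += row[j] * g[j * rank + k];
            }
            h += dot * dot;
        }
        out.push(h);
    }
    Some(out)
}

/// Hypothetical sketch of the Option contract: None when the shape is
/// below `min_rows` or any tile fails, otherwise the full diagonal.
fn try_chunked_leverage(
    x: &[f64],
    n: usize,
    p: usize,
    g: &[f64],
    rank: usize,
    rows_per_chunk: usize,
    min_rows: usize,
) -> Option<Vec<f64>> {
    if n < min_rows || p == 0 || rows_per_chunk == 0 || x.len() != n * p {
        return None;
    }
    let mut h = Vec::with_capacity(n);
    // Each row block is processed independently; `?` aborts on the first failure.
    for chunk in x.chunks(rows_per_chunk * p) {
        h.extend(tile_row_norms(chunk, p, g, rank)?);
    }
    Some(h)
}
```

A caller would typically write something like `try_chunked_leverage(..).unwrap_or_else(|| cpu_path(..))`, where `cpu_path` is a placeholder for its own CPU routine, so that a `None` never surfaces as an error.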