Expand description
GPU Hutchinson stochastic trace estimator for the REML/LAML logdet gradient, per math team block 2 (sections 12–18 of the V100 design).
Public entry point: evidence_derivatives_hutchinson_gpu. For each
derivative Hessian H_j (j = 1..D) and a single penalized Hessian H
held resident on device, returns the unbiased Hutchinson estimate of
t_j = tr(H^{-1} H_j)plus the sample standard error of each estimate, computed from K
Rademacher probe vectors z_k ∈ {±1}^p whose entries are drawn from a
stateless SplitMix64 counter hash (no cuRAND state). The math
identity used on device is
z^T H^{-1} H_j z = z^T H_j w where H w = zso we factor H once with cusolverDnDpotrf, batch-solve H W = Z
with one cusolverDnDpotrs of nrhs = K, and then evaluate the
quadratic forms with a custom NVRTC reduction kernel. The REML logdet
gradient is g_j = (1/2) · mean_k(q_{j,k}).
Two assembly variants for H_j are supported:
- Dense — caller passes
H_jas ap × pdevice or host matrix. GEMM formsY_j = H_j W, then a custom reduction sumsz_k^T y_{j,k}per (j, k). Cost:DGEMMs of sizep × p × K. - Weighted-Gram structural — caller provides the design
X(n × p), weight vectorsA_j(n, one per derivative — the diagonal of the design’s row weights thatH_jadds), and the per-derivative penalty contributionQ_pen[j,k]if any. The kernel formsR_Z = X ZandR_W = X Wonce via GEMM and then sumssum_i a_j[i] · R_Z[i,k] · R_W[i,k]per (j, k) without ever materialising thep × pH_jmatrix. Cost: 2 GEMMs of sizen × p × Kshared across allDderivatives.
The structural path is the high-value route for large-scale models
where p is hundreds and there are many derivatives.
§Stateless probe RNG
The probe entries are produced on device by a SplitMix64 finalizer over
(seed, probe_index k, coordinate i). This has three consequences:
- No cuRAND state — the kernel is fully stateless, threads write into
Z[i + k·p]independently. - Common random numbers: the first
K1probes of a run withK2 > K1are bit-identical to aK = K1run with the same seed. This is the property that lets the adaptiveKschedule build on earlier probes without re-running them, and lets CPU and GPU implementations of Hutchinson compare estimator-by-estimator (the same probes produce the sameq_{j,k}to round-off). - Reproducibility — a probe at
(seed, k, i)is the same call after call regardless of how the grid was scheduled.
§Gating
The companion helper should_use_gpu_hutchinson mirrors the CPU
gate (prefers_stochastic_trace_estimation + matching kernel +
plain-SPD logdet path) and adds the GPU-specific minima from the math
team’s section 18:
p ≥ 512K ∈ [8, 128]- Hessian and design held resident or about to be uploaded
- The projected penalty-subspace trace is inactive (otherwise the CPU path projects through the IFT kernel — that route is required for marginal-slope ρ-saturated rows)
Structs§
- Adaptive
Trace Evidence - Adaptive-K Hutchinson trace schedule with common random numbers (CRN).
- Probe
Seed - Stateless seed for the SplitMix64 Rademacher probe RNG.
- Reml
Trace Hutchinson Evidence - Output of
evidence_derivatives_hutchinson_gpu. - Reml
Trace Hutchinson Input - Inputs to
evidence_derivatives_hutchinson_gpu.
Enums§
- Derivative
Hessian - Description of one derivative-Hessian contribution
H_j.
Constants§
- HUTCHINSON_
ADAPTIVE_ REL_ TOL - Default relative-error target for the adaptive-K stopping rule.
Matches
StochasticTraceConfig::default().relative_tol. - HUTCHINSON_
ADAPTIVE_ TAU_ REL - Default near-zero-trace protection floor. Matches
StochasticTraceConfig::default().tau_rel. - HUTCHINSON_
GPU_ MAX_ K - HUTCHINSON_
GPU_ MIN_ K - Minimum and maximum probe counts the GPU path accepts (math section 18).
- HUTCHINSON_
GPU_ MIN_ P - Minimum joint-dimension at which the GPU Hutchinson path is enabled.
- PCG_
HVP_ MAX_ ITERS - Maximum CG iterations per probe before we stop and accept the partial
solve. Capped so a poorly conditioned
Hcannot make a single REML step pay unbounded time — the Hutchinson estimator is statistically robust to a few stalew_kvalues (it inflates SE, which the adaptive stopping rule then catches by extending the schedule). - PCG_
HVP_ REL_ TOL - CG convergence tolerance for the per-probe solve
H w = z. The outer adaptive-K loop already drives Hutchinson variance to ~1%; a per-probe relative residual of 1e-6 keeps the CG round-off well below the stochastic SE without paying for double-machine convergence.
Functions§
- evidence_
derivatives_ hutchinson_ cpu - Run the Hutchinson estimator on CPU using the exact same probe bits the device kernel uses. Returns the same evidence struct.
- evidence_
derivatives_ hutchinson_ gpu - Compute
log |H|and the Hutchinson estimate of(1/2) tr(H^{-1} H_j)for every derivative. Dispatches to the device-resident path when the CUDA runtime is up and probes the GPU successfully; otherwise runs the CPU reference. Either way the probe bits are identical (stateless SplitMix), so callers see the same estimator value to round-off. - evidence_
traces_ adaptive - evidence_
traces_ adaptive_ hvp - Adaptive Hutchinson variant that consumes
Has a matrix-free HVP closure rather than a denseArrayView2. Used by call sites where the penalized Hessian is implicit (operator-only) and forming it densely would blow the memory budget — e.g. the device-resident PCG path ingpu/bms_flex_row.rsor the large-scale BMS Schur operator. - fill_
rademacher_ host - Host-side reference: fill a column-major
(p, K)Rademacher matrix. Used by tests to verify the GPU kernel produces the same bits. - rademacher_
entry - Stateless Rademacher entry at probe index
k(0-based), coordinatei(0-based), seeds. Returns+1.0or-1.0. - should_
bypass_ cpu_ with_ gpu_ adaptive - Composite gate predicate for the outer REML logdet-gradient bypass:
when this returns
true, the unified evaluator should replace its CPU stochastic-trace call withevidence_traces_adaptive. - should_
use_ gpu_ hutchinson - True when the GPU Hutchinson path is eligible at the current shape and
configuration. Caller still has to satisfy the CPU-side gate
(
prefers_stochastic_trace_estimation, matching kernel, plain-SPD logdet, projected penalty subspace inactive) — the parametersprefers_stochastic,kernel_matches_hinv,plain_spd_logdet, andprojected_penalty_subspace_activecarry those CPU-side gate booleans into the dispatch decision. - splitmix64_
mix - SplitMix64 finalizer (Sebastiano Vigna, 2015). Thin wrapper over the
canonical implementation in
gam_linalg::utils::splitmix64_hash.