Expand description
Exact GPU REML evidence + derivative gradient.
Refactor (Block 2.1, math team section 18): the penalized Hessian H is
Cholesky-factored exactly once on device, the factor is held resident,
and every derivative Hessian H_j is solved through the cached factor
with a single batched potrs call (nrhs = d_rho · p). Previously each
derivative re-issued the full cholesky_solve_gpu path, which uploaded
H, allocated and ran potrf, and downloaded the factor again — turning
a p^3 + d·p^3 workload into a (d+1)·p^3 one and serializing d_rho
factor passes onto the device.
On the non-Linux fallback the same cholesky_solve_gpu path is exercised
via pirls_gpu::cholesky_solve_gpu, so behaviour outside Linux is
numerically identical (with the same per-derivative overhead) — the
optimisation is Linux-only because that is where CUDA actually runs.