Structs
- `PirlsGpuInput`
- `PirlsGpuSharedData` - Shared, batch-wide GPU state for stream-pool sigma-cubature PIRLS.
- `PirlsGpuStep`
- `PirlsStepStreamDeviceInput` - Stage 3.2 device-input variant of `PirlsStepStreamInput`.
- `PirlsStepStreamInput` - Per-step inputs for `solve_pirls_step_on_stream`.
- `SigmaPirlsGpuWorkspace` - Per-stream workspace for `solve_pirls_step_on_stream`.
Functions
- `allocate_pirls_loop_workspace` - Allocate a Stage 3.3 PIRLS loop workspace bound to the same stream as `ws`, against the shared device-resident design matrix.
- `allocate_sigma_pirls_workspace` - Allocate a per-stream workspace bound to a fresh non-default CUDA stream on `shared`'s context. The cuBLAS and cuSOLVER handles are bound to the workspace stream so that peer workspaces can overlap execution.
- `cholesky_lower_gpu`
- `cholesky_solve_gpu`
- `cholesky_solve_only_gpu` - Solution-only mixed-precision solve (logdet discarded). Skips the redundant fp64 POTRF so the PIRLS Newton direction solve gets the full fp32-factor speedup; the solution is fp64-accurate via iterative refinement.
- `pirls_loop_on_stream` - Stage 3.3 device-resident PIRLS loop driver. See `cuda::pirls_loop` for the full per-iteration contract. Only a few scalar f64 values cross the host boundary per Newton iteration; β and the final penalised Hessian are downloaded once at loop exit.
- `solve_gaussian_pls_gpu` - GPU exact penalised least-squares for Gaussian-identity models.
- `solve_pirls_step_gpu`
- `solve_pirls_step_on_stream` - Drive one PIRLS Newton step on the workspace's CUDA stream against the device-resident shared design matrix. The math is bit-identical to the one-shot `solve_pirls_step_gpu`; this entry differs only by amortising the design upload and the cuBLAS / cuSOLVER handle creation across many sigma fits.
- `solve_pirls_step_on_stream_device` - Stage 3.2 device-input PIRLS step. Reads `w_solver` and `grad_eta` from caller-supplied device buffers (typically populated by `crate::gpu_kernels::pirls_row::launch_row_reweight_on_stream`) instead of uploading them from host arrays. The math is bit-identical to `solve_pirls_step_on_stream`; this entry differs only by skipping the per-iteration `weights` and `gradient` host-to-device transfers, so only the small p×p penalty matrix still crosses the host boundary.
- `upload_qs_identity_pirls` - Upload an identity `Qs` for the current ρ / σ point. Equivalent to `upload_qs_pirls` with an identity matrix, but avoids the host allocation.
- `upload_qs_pirls` - Upload the reparameterisation matrix `Qs` (p×p) for the current ρ / σ point. Call once per ρ / σ point before calling `pirls_loop_on_stream`. When no reparameterisation is active, pass an identity matrix.
- `upload_shared_pirls_gpu` - Upload `X_original`, `y`, `prior_w`, and `offset` once per model and return a shared device-resident handle reused across all ρ / σ points. All four arrays must have the same row count `n`. The shared handle keeps the cached per-ordinal `CudaContext` alive so that all peer workspaces bind to the same context and can interleave on its asynchronous engines.
- `weighted_crossprod_gpu`
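
The `cholesky_solve_only_gpu` entry describes a low-precision factorisation followed by iterative refinement to recover fp64 accuracy. The general technique can be sketched on the CPU as below. This is a minimal illustration in plain Rust, not this module's GPU code: the function names (`cholesky_lower_f32`, `solve_with_factor`, `solve_refined`), the row-major layout, and the stopping rule are all assumptions made for the example, and the small matrix merely stands in for a penalised Hessian.

```rust
/// Lower Cholesky factor of a symmetric positive-definite matrix, computed in f32.
/// `a` is row-major n×n. Returns None if a non-positive pivot is encountered.
/// (Hypothetical helper for illustration only.)
fn cholesky_lower_f32(a: &[f32], n: usize) -> Option<Vec<f32>> {
    let mut l = vec![0.0f32; n * n];
    for j in 0..n {
        let mut d = a[j * n + j];
        for k in 0..j {
            d -= l[j * n + k] * l[j * n + k];
        }
        if d <= 0.0 {
            return None;
        }
        let d = d.sqrt();
        l[j * n + j] = d;
        for i in (j + 1)..n {
            let mut s = a[i * n + j];
            for k in 0..j {
                s -= l[i * n + k] * l[j * n + k];
            }
            l[i * n + j] = s / d;
        }
    }
    Some(l)
}

/// Solve L Lᵀ x = b with the f32 factor, accumulating in f64.
fn solve_with_factor(l: &[f32], n: usize, b: &[f64]) -> Vec<f64> {
    let mut x = b.to_vec();
    // Forward substitution: L y = b.
    for i in 0..n {
        for k in 0..i {
            x[i] -= l[i * n + k] as f64 * x[k];
        }
        x[i] /= l[i * n + i] as f64;
    }
    // Back substitution: Lᵀ x = y.
    for i in (0..n).rev() {
        for k in (i + 1)..n {
            x[i] -= l[k * n + i] as f64 * x[k];
        }
        x[i] /= l[i * n + i] as f64;
    }
    x
}

/// Residual r = b - A x, evaluated entirely in f64.
fn residual(a: &[f64], n: usize, x: &[f64], b: &[f64]) -> Vec<f64> {
    (0..n)
        .map(|i| b[i] - (0..n).map(|j| a[i * n + j] * x[j]).sum::<f64>())
        .collect()
}

fn max_abs(v: &[f64]) -> f64 {
    v.iter().fold(0.0_f64, |m: f64, &x| m.max(x.abs()))
}

/// Mixed-precision solve: factor once in f32, then refine the solution in f64.
fn solve_refined(a: &[f64], n: usize, b: &[f64], max_iters: usize, tol: f64) -> Option<Vec<f64>> {
    let a32: Vec<f32> = a.iter().map(|&v| v as f32).collect();
    let l = cholesky_lower_f32(&a32, n)?;
    let mut x = solve_with_factor(&l, n, b);
    for _ in 0..max_iters {
        let r = residual(a, n, &x, b);
        if max_abs(&r) <= tol * max_abs(b) {
            break;
        }
        // Correct x by solving A dx = r with the same cheap f32 factor.
        let dx = solve_with_factor(&l, n, &r);
        for i in 0..n {
            x[i] += dx[i];
        }
    }
    Some(x)
}

fn main() {
    // Small SPD system standing in for a penalised Hessian and a gradient.
    let a = [4.0, 1.0, 0.5, 1.0, 3.0, 0.2, 0.5, 0.2, 2.0];
    let b = [1.0, 2.0, 3.0];
    let x = solve_refined(&a, 3, &b, 10, 1e-14).expect("matrix must be SPD");
    let r = residual(&a, 3, &x, &b);
    println!("x = {:?}, max residual = {:e}", x, max_abs(&r));
}
```

The factorisation, which dominates the cost, runs once in f32. Each refinement pass needs only an f64 residual and two triangular solves, which is why a solution-only path can skip a second fp64 factorisation when the log-determinant is not required.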