Module pirls_gpu

Structs§

PirlsGpuInput
PirlsGpuSharedData
Shared, batch-wide GPU state for stream-pool sigma-cubature PIRLS.
PirlsGpuStep
PirlsStepStreamDeviceInput
Stage 3.2 device-input variant of PirlsStepStreamInput.
PirlsStepStreamInput
Per-step inputs for solve_pirls_step_on_stream.
SigmaPirlsGpuWorkspace
Per-stream workspace for solve_pirls_step_on_stream.

Functions§

allocate_pirls_loop_workspace
Allocate a Stage 3.3 PIRLS loop workspace, bound to the same stream as ws, for use against the shared device-resident design matrix.
allocate_sigma_pirls_workspace
Allocate a per-stream workspace bound to a fresh non-default CUDA stream on shared’s context. The cuBLAS and cuSOLVER handles are bound to the workspace stream so that work issued by peer workspaces can overlap.
cholesky_lower_gpu
cholesky_solve_gpu
cholesky_solve_only_gpu
Solution-only mixed-precision solve (logdet discarded). Skips the redundant fp64 POTRF so the PIRLS Newton direction solve gets the full fp32-factor speedup; the solution is fp64-accurate via iterative refinement.
pirls_loop_on_stream
Stage 3.3 device-resident PIRLS loop driver. See [cuda::pirls_loop] for the full per-iteration contract. Only a few scalar f64 values cross the host boundary per Newton iteration; β and the final penalised Hessian are downloaded once at loop exit.
solve_gaussian_pls_gpu
GPU exact penalised least-squares for Gaussian-identity models.
solve_pirls_step_gpu
solve_pirls_step_on_stream
Drive one PIRLS Newton step on the workspace’s CUDA stream against the device-resident shared design matrix. The math is bit-identical to the one-shot solve_pirls_step_gpu; this entry differs only by amortising the design upload and the cuBLAS / cuSOLVER handle creation across many sigma fits.
solve_pirls_step_on_stream_device
Stage 3.2 device-input PIRLS step. Reads w_solver and grad_eta from caller-supplied device buffers (typically populated by crate::gpu_kernels::pirls_row::launch_row_reweight_on_stream) instead of uploading them from host arrays. The math is bit-identical to solve_pirls_step_on_stream; this entry differs only by skipping the per-iteration host-to-device transfers of the weights and gradient. Only the small p×p penalty matrix still crosses the host boundary.
upload_qs_identity_pirls
Upload an identity Qs for the current ρ / σ point. Equivalent to upload_qs_pirls with an identity matrix; avoids host allocation.
upload_qs_pirls
Upload the reparameterisation matrix Qs (p×p) for the current ρ / σ point. Call once per ρ / σ point before calling pirls_loop_on_stream. When no reparameterisation is active, pass an identity matrix.
upload_shared_pirls_gpu
Upload X_original, y, prior_w, and offset once per model and return a shared device-resident handle that is reused across all ρ / σ points. All four arrays must have the same row count n. The shared handle keeps the cached per-ordinal CudaContext alive so that all peer workspaces bind to the same context and can interleave on its asynchronous engines.
weighted_crossprod_gpu
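
Example: mixed-precision solve with iterative refinement (CPU sketch)

The summary for cholesky_solve_only_gpu describes an fp32 factorisation whose solution is brought to fp64 accuracy by iterative refinement. The following is a minimal host-side sketch of that general technique in plain Rust, intended only to illustrate the idea. It is not this module's implementation and makes no claim about its signatures: the names cholesky_lower_f32, solve_with_factor, and refine_solve, the row-major layout, and the max_iter / tol parameters are all assumptions made for the example.

```rust
/// Hypothetical helper: lower Cholesky factor of a row-major n×n matrix,
/// computed entirely in f32. Returns None if a pivot is not positive.
fn cholesky_lower_f32(a: &[f64], n: usize) -> Option<Vec<f32>> {
    let mut l = vec![0.0f32; n * n];
    for j in 0..n {
        let mut d = a[j * n + j] as f32;
        for k in 0..j {
            d -= l[j * n + k] * l[j * n + k];
        }
        if d <= 0.0 || d.is_nan() {
            return None; // not numerically positive definite
        }
        let djj = d.sqrt();
        l[j * n + j] = djj;
        for i in (j + 1)..n {
            let mut s = a[i * n + j] as f32;
            for k in 0..j {
                s -= l[i * n + k] * l[j * n + k];
            }
            l[i * n + j] = s / djj;
        }
    }
    Some(l)
}

/// Hypothetical helper: solve L Lᵀ x = b using the f32 factor.
/// The triangular solves run in f32; the result is widened to f64.
fn solve_with_factor(l: &[f32], n: usize, b: &[f64]) -> Vec<f64> {
    // Forward substitution: L y = b.
    let mut y = vec![0.0f32; n];
    for i in 0..n {
        let mut s = b[i] as f32;
        for k in 0..i {
            s -= l[i * n + k] * y[k];
        }
        y[i] = s / l[i * n + i];
    }
    // Back substitution: Lᵀ x = y.
    let mut x = vec![0.0f32; n];
    for i in (0..n).rev() {
        let mut s = y[i];
        for k in (i + 1)..n {
            s -= l[k * n + i] * x[k];
        }
        x[i] = s / l[i * n + i];
    }
    x.iter().map(|&v| v as f64).collect()
}

/// Mixed-precision solve of A x = b for symmetric positive definite A:
/// factor once in f32, then correct x using residuals computed in f64.
fn refine_solve(a: &[f64], n: usize, b: &[f64], max_iter: usize, tol: f64) -> Option<Vec<f64>> {
    let l = cholesky_lower_f32(a, n)?;
    let mut x = solve_with_factor(&l, n, b);
    for _ in 0..max_iter {
        // Residual r = b - A x, accumulated in f64.
        let r: Vec<f64> = (0..n)
            .map(|i| b[i] - (0..n).map(|j| a[i * n + j] * x[j]).sum::<f64>())
            .collect();
        let r_max = r.iter().fold(0.0f64, |m, v| m.max(v.abs()));
        if r_max <= tol {
            break;
        }
        // Correction from the cheap f32 factor, applied in f64.
        let dx = solve_with_factor(&l, n, &r);
        for i in 0..n {
            x[i] += dx[i];
        }
    }
    Some(x)
}
```

Calling refine_solve(&a, n, &b, 10, 1e-12) on a well-conditioned symmetric positive definite system should recover the solution to roughly f64 accuracy after a few refinement passes, even though the factor itself only carries f32 precision. The factorisation is the O(n³) part, so keeping it in f32 is where the speedup comes from; each refinement pass costs only O(n²).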