Module linalg_dispatch

Automatic GPU dispatch shim for dense linear algebra hot kernels.

Every try_* entry point in this module is invoked unconditionally from gam_linalg::faer_ndarray before the CPU fast path runs. Whether a kernel is sent to a device is decided automatically and never requires a user-facing flag. The decision depends only on:

  1. GpuRuntime::global() returning Some(_) (a device was probed at process startup).
  2. The kernel being large enough to amortize launch/PCIe overhead, per the thresholds in policy::GpuDispatchPolicy.
  3. cudarc loading libcuda dynamically at process startup via its fallback-dynamic-loading feature. When the loader fails (no driver, no toolkit installed), GpuRuntime::probe() returns Ok(None) and every try_* returns None, so the caller falls through to the existing faer CPU kernel.

The wiring lives here so solver/pirls.rs and the family Hessian assemblers can stay backend-agnostic: they call gam_linalg::faer_ndarray::fast_* and get GPU acceleration automatically whenever it is profitable.
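
The fall-through contract described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `Mat`, `try_fast_ab_sketch`, `cpu_ab`, and `fast_ab_sketch` are stand-ins invented for illustration and are not the real signatures of this module or of gam_linalg::faer_ndarray. The stand-in device path always reports None, as if no device had been probed, so the sketch runs on any machine.

```rust
// Hypothetical stand-ins for illustration only; not the crate's real API.

/// Row-major dense matrix used only by this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Mat {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f64>,
}

impl Mat {
    pub fn zeros(rows: usize, cols: usize) -> Self {
        Mat { rows, cols, data: vec![0.0; rows * cols] }
    }

    pub fn at(&self, i: usize, j: usize) -> f64 {
        self.data[i * self.cols + j]
    }
}

/// Stand-in for a `try_*` entry point. `None` means "not handled on a device".
/// Assumption: the device is treated as absent, so this always returns `None`.
pub fn try_fast_ab_sketch(_a: &Mat, _b: &Mat) -> Option<Mat> {
    None
}

/// Plain triple-loop product standing in for the CPU kernel.
pub fn cpu_ab(a: &Mat, b: &Mat) -> Mat {
    assert_eq!(a.cols, b.rows, "inner dimensions must agree");
    let mut out = Mat::zeros(a.rows, b.cols);
    for i in 0..a.rows {
        for k in 0..a.cols {
            let aik = a.at(i, k);
            for j in 0..b.cols {
                out.data[i * b.cols + j] += aik * b.at(k, j);
            }
        }
    }
    out
}

/// Backend-agnostic entry point: attempt the device path first, and run the
/// CPU kernel only when the attempt reports `None`.
pub fn fast_ab_sketch(a: &Mat, b: &Mat) -> Mat {
    try_fast_ab_sketch(a, b).unwrap_or_else(|| cpu_ab(a, b))
}
```

Returning an Option from the device attempt keeps the caller free of any GPU-specific branching: the caller never needs to know why the device path declined.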

Structs§

CudaGemmDispatch
ResidentDesignGram
#1017 Phase 3: a device-resident design matrix for repeated Xᵀ·diag(w)·X Gram evaluations that uploads X to the device ONCE.
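
As a rough CPU-only illustration of the "upload once, evaluate many times" idea, the sketch below stores X a single time and recomputes Xᵀ·diag(w)·X for each new weight vector. `ResidentGramSketch` and its methods are hypothetical names; the real ResidentDesignGram keeps X in device memory, and its constructor and method signatures are not shown in this page.

```rust
/// Hypothetical CPU stand-in: holds a row-major n×p design matrix once and
/// evaluates the weighted Gram matrix for many weight vectors.
pub struct ResidentGramSketch {
    n: usize,
    p: usize,
    x: Vec<f64>,
}

impl ResidentGramSketch {
    /// Stores X once. In the real struct this is where the single upload
    /// to the device would happen (assumption based on the description).
    pub fn new(n: usize, p: usize, x: Vec<f64>) -> Self {
        assert_eq!(x.len(), n * p, "x must hold n*p entries");
        ResidentGramSketch { n, p, x }
    }

    /// Returns the row-major p×p matrix Xᵀ·diag(w)·X.
    pub fn gram(&self, w: &[f64]) -> Vec<f64> {
        assert_eq!(w.len(), self.n, "one weight per row of X");
        let p = self.p;
        let mut g = vec![0.0; p * p];
        for i in 0..self.n {
            let row = &self.x[i * p..(i + 1) * p];
            for a in 0..p {
                // Scale once per (row, column) pair, then accumulate the outer product.
                let wa = w[i] * row[a];
                for b in 0..p {
                    g[a * p + b] += wa * row[b];
                }
            }
        }
        g
    }
}
```

Only w changes between calls, which is why keeping X resident pays off when the Gram matrix is re-evaluated repeatedly.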

Enums§

DispatchOp
Discriminator used by route_through_gpu to apply the right size threshold from super::policy::GpuDispatchPolicy.
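
A minimal sketch of how an operation discriminator can select a size threshold is given below. All names (`DispatchOpSketch`, `PolicySketch`, `RuntimeSketch`, `route_sketch`), the two variants, and the threshold fields are assumptions made for illustration; the real variants of DispatchOp and the fields of GpuDispatchPolicy are not listed in this page.

```rust
/// Hypothetical discriminator; the real `DispatchOp` variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DispatchOpSketch {
    Gemm,
    Cholesky,
}

/// Hypothetical policy; field names and meanings are invented for this sketch.
pub struct PolicySketch {
    pub min_gemm_flops: u64,
    pub min_cholesky_dim: usize,
}

/// Hypothetical handle to a probed device.
pub struct RuntimeSketch {
    pub ordinal: usize,
}

/// Returns the runtime only when a device exists AND the workload clears
/// the threshold that applies to `op`.
pub fn route_sketch<'a>(
    runtime: Option<&'a RuntimeSketch>,
    policy: &PolicySketch,
    op: DispatchOpSketch,
    m: usize,
    n: usize,
    k: usize,
) -> Option<&'a RuntimeSketch> {
    // No device probed: decline immediately.
    let rt = runtime?;
    let large_enough = match op {
        DispatchOpSketch::Gemm => {
            // Approximate multiply-add count for an m×k by k×n product.
            let flops = 2u64
                .saturating_mul(m as u64)
                .saturating_mul(n as u64)
                .saturating_mul(k as u64);
            flops >= policy.min_gemm_flops
        }
        DispatchOpSketch::Cholesky => n >= policy.min_cholesky_dim,
    };
    if large_enough { Some(rt) } else { None }
}
```

Keeping the thresholds in one policy value means the size gate can be tuned per operation without touching the callers.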

Functions§

route_through_gpu
Returns Some(runtime) when a device is available and the workload is large enough per policy. The caller can then attempt the actual device kernel; a backend failure is expected to surface as None from the lower layer, after which the CPU fast path resumes.
try_cholesky_batched_lower_inplace
try_cholesky_lower_inplace
try_fast_ab
try_fast_ab_broadcast_b_batched
try_fast_abt_strided_batched
try_fast_atb
try_fast_atb_on_ordinal
Aᵀ·B on a specific device ordinal, for pool-tiled callers that already own the ordinal (the worker thread has bound that ordinal’s context). Semantics are identical to try_fast_atb (a is m×k, b is m×n, output is the k×n product aᵀ·b), but the kernel is pinned to ordinal instead of the probe-selected primary device. Returns None when CUDA is unavailable, the shape is below policy threshold, or the backend reports a transient failure, so the caller runs its CPU fallback. f64 only.
try_fast_atv
try_fast_av
try_fast_joint_hessian_2x2
try_fast_spectral_leverage_diagonal
GPU-offloaded spectral leverage diagonal h[i] = ‖(X G)_{i,:}‖².
try_fast_xt_diag_x
try_fast_xt_diag_y
try_solve_lower_triangular_matrix
try_solve_upper_triangular_matrix
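
Two entries in the list above state their math explicitly: the Aᵀ·B product (a is m×k, b is m×n, output is k×n) and the spectral leverage diagonal h[i] = ‖(X G)_{i,:}‖². The following CPU reference sketch spells out both shapes. The function names, the row-major slice layout, and the treatment of G as an arbitrary p×r matrix are assumptions for illustration; the real entry points take the crate's own array types and return an Option.

```rust
/// Reference aᵀ·b for row-major `a` (m×k) and `b` (m×n).
/// Returns the row-major k×n result.
pub fn atb_reference(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    assert_eq!(a.len(), m * k, "a must be m×k");
    assert_eq!(b.len(), m * n, "b must be m×n");
    let mut out = vec![0.0; k * n];
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                // out[p][j] accumulates a[i][p] * b[i][j] over the shared row index i.
                out[p * n + j] += aip * b[i * n + j];
            }
        }
    }
    out
}

/// Reference leverage diagonal: h[i] is the squared Euclidean norm of row i
/// of X·G, with row-major `x` (n×p) and `g` (p×r).
pub fn leverage_diagonal_reference(
    x: &[f64],
    g: &[f64],
    n: usize,
    p: usize,
    r: usize,
) -> Vec<f64> {
    assert_eq!(x.len(), n * p, "x must be n×p");
    assert_eq!(g.len(), p * r, "g must be p×r");
    let mut h = vec![0.0; n];
    for i in 0..n {
        for c in 0..r {
            // Entry (i, c) of X·G, formed without materializing the full product.
            let mut s = 0.0;
            for q in 0..p {
                s += x[i * p + q] * g[q * r + c];
            }
            h[i] += s * s;
        }
    }
    h
}
```

Reference kernels of this kind are mainly useful for checking a device result against a known-good CPU answer on small inputs.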