Expand description
Automatic GPU dispatch shim for dense linear algebra hot kernels.
Every try_* entry point in this module is invoked unconditionally from
gam_linalg::faer_ndarray before the CPU fast-path runs. The decision to send
the kernel to a device is fully automatic and never requires a user-facing
flag — it depends only on:
GpuRuntime::global()returningSome(_)(a device was probed at process startup).- The kernel being large enough to amortize launch/PCIe overhead, per
the thresholds in
policy::GpuDispatchPolicy. - cudarc successfully dynamically loading
libcudaat process startup via itsfallback-dynamic-loadingfeature. When the loader fails (no driver, no toolkit installed),GpuRuntime::probe()returnsOk(None)and everytry_*returnsNoneso the caller falls through to the existing faer CPU kernel.
The wiring lives here so solver/pirls.rs and the family Hessian
assemblers can stay backend-agnostic: they call gam_linalg::faer_ndarray::fast_*
and get GPU acceleration automatically whenever it is profitable.
Structs§
- Cuda
Gemm Dispatch - Resident
Design Gram - #1017 Phase 3: a device-resident design matrix for repeated
Xᵀ·diag(w)·XGram evaluations that uploadsXto the device ONCE.
Enums§
- Dispatch
Op - Discriminator used by
route_through_gputo apply the right size threshold fromsuper::policy::GpuDispatchPolicy.
Functions§
- route_
through_ gpu - Returns
Some(runtime)when both a device is available and the workload is large enough per policy. The caller can then attempt the actual device kernel; any backend failure is expected to returnNonefrom the lower layer and the CPU fast path resumes. - try_
cholesky_ batched_ lower_ inplace - try_
cholesky_ lower_ inplace - try_
fast_ ab - try_
fast_ ab_ broadcast_ b_ batched - try_
fast_ abt_ strided_ batched - try_
fast_ atb - try_
fast_ atb_ on_ ordinal Aᵀ·Bon a specific device ordinal, for pool-tiled callers that already own the ordinal (the worker thread has bound that ordinal’s context). Semantics are identical totry_fast_atb—aism×k,bism×n, output is thek×nproductaᵀ·b— but the kernel is pinned toordinalinstead of the probe-selected primary device. ReturnsNonewhen CUDA is unavailable, the shape is below policy threshold, or the backend reports a transient failure, so the caller runs its CPU fallback. f64 only.- try_
fast_ atv - try_
fast_ av - try_
fast_ joint_ hessian_ 2x2 - try_
fast_ spectral_ leverage_ diagonal - GPU-offloaded spectral leverage diagonal
h[i] = ‖(X G)_{i,:}‖². - try_
fast_ xt_ diag_ x - try_
fast_ xt_ diag_ y - try_
solve_ lower_ triangular_ matrix - try_
solve_ upper_ triangular_ matrix