Module backend_probe

Expand description

Shared CUDA backend-probe contract for every cudarc-backed module under src/gpu/*.

Before this module existed, every GPU backend (bms_flex, survival_flex, cubic_bspline_moments, cubic_cell, pirls_row, sphere, …) carried its own near-identical probe_linux prologue:

Fetch the process-wide [GpuRuntime] or fail with a DriverLibraryUnavailable { reason: "<module> backend: no CUDA runtime available" }.
Read the runtime’s selected device ordinal.
Create (or reuse) the per-ordinal [CudaContext] or fail with a DriverCallFailed { reason: "<module> backend: failed to create CUDA context for device N" }.
Open the context’s default [CudaStream].
Carry the device’s compute capability alongside the handles.

Those five steps are identical apart from the per-module label that gets woven into the two error messages. Drift between copies meant error wording, capability handling, context reuse, and stream choice could diverge module to module. This module hosts the single contract: each backend now calls probe_cuda_backend with its label and keeps only its module caches and optional eager-compilation step.

The migration is atomic: no backend re-implements the prologue, and there is no transitional shim.

Structs§

CudaBackendContext: The process-wide device handles every cudarc backend stores after a successful probe: the CudaContext, its default CudaStream, the lazily NVRTC-compiled PtxModuleCache, and the bucketed DeviceArena of reusable f64 device buffers (held under a Mutex because large-scale fits dispatch from multiple rayon worker threads; the mutex is only held during alloc / release, not across kernel launches). Module-specific backends (bms_flex, survival_flex, …) wrap one of these as their inner context so the host-side scaffolding (arena pooling, module cache, mutex around alloc) is uniform instead of duplicated per backend.
CudaBackendParts: The handles every cudarc backend shares once the probe succeeds: a context on the runtime’s selected device, that context’s default stream, and the device’s compute capability. Module-specific backends layer their own caches and optional eager compilation on top of these.

Functions§

probe_backend_with_compile: Probe the CUDA backend for label and run a backend-specific build step on the resolved handles.
probe_cuda_backend: Probe the process-wide CUDA backend for the calling module.