Global cache for compiled CUDA modules and kernel functions.
Without caching, every call to a GPU kernel (e.g. gpu_add, gpu_conv2d_f32,
gpu_flash_attention_f32) recompiles its PTX source into a CUBIN via
CudaContext::load_module(Ptx::from_src(...)). This compilation takes
roughly 1700 µs per call, far longer than the actual kernel execution.
This module provides get_or_compile, which compiles the PTX only on
first use and returns a cached CudaFunction on subsequent calls. The
cache is keyed by the static kernel name string, which is unique per
kernel entry point in this crate.
§Thread safety
The cache uses a global Mutex-protected HashMap. The critical
section is short (a hash lookup + optional insert), so contention is
negligible in practice.
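The pattern above can be sketched in plain Rust. This is a hedged simplification: the `Function` struct and `compile` function below are hypothetical stand-ins for cudarc's `CudaFunction` and the real PTX-to-CUBIN compilation, so the sketch runs without a GPU while preserving the cache logic.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Stand-in for the compiled-kernel handle; the real module would cache
// a CudaFunction here. (Hypothetical type for illustration.)
#[derive(Clone, Debug, PartialEq)]
struct Function(String);

// Global cache keyed by the static kernel name string.
static CACHE: OnceLock<Mutex<HashMap<&'static str, Function>>> = OnceLock::new();

// Stand-in for the expensive PTX -> CUBIN compilation step.
fn compile(name: &str) -> Function {
    Function(format!("compiled:{name}"))
}

// Compile on first use; return the cached handle on later calls.
// The critical section is a hash lookup plus an optional insert.
fn get_or_compile(name: &'static str) -> Function {
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(name).or_insert_with(|| compile(name)).clone()
}

fn main() {
    let first = get_or_compile("gpu_add");
    let second = get_or_compile("gpu_add"); // cache hit: no recompilation
    assert_eq!(first, second);
    println!("{first:?}");
}
```

Cloning the handle out of the map keeps the lock held only for the lookup, so kernel launches never contend on the mutex.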
Functions§
- get_or_compile — Get a compiled kernel function, compiling the PTX only on first use.