Module pool

Multi-GPU device pool.

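The runtime probe (super::device_runtime) already discovers every usable CUDA device into GpuRuntime::devices (sorted by GpuDeviceInfo::score in descending order), but every dispatch path historically pinned its work to the single primary GpuRuntime::device. This module turns that pool into usable parallelism: balanced_partition splits independent work items across the usable devices in proportion to their scores, and scatter_batched runs independent work on all devices concurrently.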

Concurrency model

Per-device fan-out uses std::thread::scope, not rayon. A rayon par_iter worker that reaches a OnceLock::get_or_init whose closure itself does into_par_iter deadlocks the whole process (team-known hazard), and the cudarc context cache (cuda_context_for) is exactly such a lazily-initialized global. Scoped OS threads sidestep that entirely: each worker calls ctx.bind_to_thread() for its ordinal before issuing any CUDA work, so the thread-local current context is correct for every kernel launched on it.
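The fan-out pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: `run_on_device` and `fan_out` are hypothetical names, the work is a plain sum on the CPU, and the context bind is only indicated by a comment because no CUDA calls are made here.

```rust
use std::thread;

/// Hypothetical stand-in for a per-device job. No CUDA work happens here.
fn run_on_device(ordinal: usize, items: &[u32]) -> u64 {
    // In the real module, this is where the worker would bind the context
    // for `ordinal` to the current thread before launching any kernel.
    let _ = ordinal;
    items.iter().map(|&x| u64::from(x)).sum()
}

/// Spawn one scoped OS thread per chunk and return results in device order.
fn fan_out(chunks: &[&[u32]]) -> Vec<u64> {
    thread::scope(|s| {
        // Spawn every worker first so they all run concurrently.
        let handles: Vec<_> = chunks
            .iter()
            .enumerate()
            .map(|(ordinal, chunk)| s.spawn(move || run_on_device(ordinal, chunk)))
            .collect();
        // Join in spawn order, so index i holds the result for device i.
        handles
            .into_iter()
            .map(|h| h.join().expect("device worker panicked"))
            .collect()
    })
}
```

Because the threads are scoped, the workers can borrow the input slices directly, and `thread::scope` does not return until every worker has been joined. No rayon worker is involved, so a lazily initialized global touched inside a worker cannot re-enter a rayon pool.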

Functions

balanced_partition
Partition n_units independent work items across all usable devices, weighted by GpuDeviceInfo::score.
scatter_batched
Run independent work across ALL devices concurrently.
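One way to split work in proportion to device scores is the largest-remainder method, sketched below. The name `weighted_partition` and its signature are assumptions for illustration. The actual `balanced_partition` signature and rounding rule are not shown in this page and may differ. Scores are assumed to be non-negative.

```rust
use std::cmp::Ordering;

/// Split `n_units` into one count per device, proportional to `scores`.
/// Hypothetical helper; assumes non-negative scores.
fn weighted_partition(n_units: usize, scores: &[f64]) -> Vec<usize> {
    let total: f64 = scores.iter().sum();
    if scores.is_empty() || total <= 0.0 {
        // No usable weighting: assign nothing rather than guess.
        return vec![0; scores.len()];
    }
    // Ideal (fractional) share for each device.
    let ideal: Vec<f64> = scores
        .iter()
        .map(|s| n_units as f64 * s / total)
        .collect();
    // Start from the floor of each share.
    let mut counts: Vec<usize> = ideal.iter().map(|x| x.floor() as usize).collect();
    let assigned: usize = counts.iter().sum();
    // Rank devices by fractional remainder, largest first (stable on ties).
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| {
        let ra = ideal[a] - ideal[a].floor();
        let rb = ideal[b] - ideal[b].floor();
        rb.partial_cmp(&ra).unwrap_or(Ordering::Equal)
    });
    // Give each leftover unit to the next device in that ranking.
    for &i in order.iter().take(n_units.saturating_sub(assigned)) {
        counts[i] += 1;
    }
    counts
}
```

For example, 8 units across two devices with scores 3.0 and 1.0 yields counts of 6 and 2. The per-device counts can then be turned into contiguous index ranges and handed to one worker thread per device.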