Multi-GPU device pool.
The runtime probe (super::device_runtime) already discovers every usable CUDA
device into GpuRuntime::devices (sorted by GpuDeviceInfo::score, descending),
but every dispatch path historically pinned its work to the single primary
GpuRuntime::device. This module turns that pool into usable parallelism:
- GpuRuntime::device_ordinals / GpuRuntime::device_count expose the full set of usable ordinals (highest-score first).
- GpuRuntime::memory_budget_for gives a per-device byte budget so each tile can size its device buffers against the device it actually runs on.
- balanced_partition splits n independent work items across the pool, weighted by each device’s GpuDeviceInfo::score.
- scatter_batched runs an independent-per-item closure across every device concurrently, binding each ordinal’s context on its own worker thread.
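To make the score-weighted split concrete, here is a minimal sketch of one way such a partition could work. The function name partition_by_score, its signature, and the largest-remainder rounding are all assumptions made for illustration; the real balanced_partition may take different arguments and round differently.

```rust
/// Hypothetical stand-in for a score-weighted partition (not the module's
/// actual `balanced_partition`): split `n` items across devices in
/// proportion to `scores`, returning one item count per device.
fn partition_by_score(n: usize, scores: &[f64]) -> Vec<usize> {
    if scores.is_empty() {
        return Vec::new();
    }
    let total: f64 = scores.iter().sum();
    if total <= 0.0 {
        // Degenerate scores: give everything to the first device.
        let mut counts = vec![0; scores.len()];
        counts[0] = n;
        return counts;
    }
    // Ideal (fractional) share of each device.
    let ideal: Vec<f64> = scores.iter().map(|s| n as f64 * s / total).collect();
    // Start from the rounded-down shares.
    let mut counts: Vec<usize> = ideal.iter().map(|x| x.floor() as usize).collect();
    let assigned: usize = counts.iter().sum();
    let remaining = n.saturating_sub(assigned);
    // Hand leftover items to the devices with the largest fractional remainder.
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| {
        let ra = ideal[a] - ideal[a].floor();
        let rb = ideal[b] - ideal[b].floor();
        rb.partial_cmp(&ra).unwrap_or(std::cmp::Ordering::Equal)
    });
    for k in 0..remaining {
        counts[order[k % order.len()]] += 1;
    }
    counts
}
```

Under these assumptions, ten items over devices scored 3.0, 1.0, and 1.0 would be split 6, 2, and 2, and the counts always sum to n.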
§Concurrency model
Per-device fan-out uses std::thread::scope, not rayon. A rayon
par_iter worker that reaches a OnceLock::get_or_init whose closure itself
calls into_par_iter deadlocks the whole process (a known hazard on the team), and the
cudarc context cache (cuda_context_for) is exactly such a lazily-initialized
global. Scoped OS threads sidestep that entirely: each worker calls
ctx.bind_to_thread() for its ordinal before issuing any CUDA work, so the
thread-local current context is correct for every kernel launched on it.
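The fan-out pattern described above can be sketched with the standard library alone. The function scatter_sketch, its signature, and the pre-chunked input are hypothetical and only illustrate the shape of the approach; the point where a real worker would bind its CUDA context is marked with a comment rather than an actual cudarc call.

```rust
use std::thread;

/// Hypothetical sketch of per-device fan-out (not the module's actual
/// `scatter_batched`): one scoped OS thread per device ordinal, each
/// processing its own chunk of items.
fn scatter_sketch<T, R, F>(ordinals: &[usize], chunks: &[Vec<T>], work: F) -> Vec<Vec<R>>
where
    T: Sync,
    R: Send,
    F: Fn(usize, &T) -> R + Sync,
{
    assert_eq!(ordinals.len(), chunks.len(), "one chunk per device");
    let work = &work;
    thread::scope(|s| {
        // Spawn every worker first so all devices run concurrently.
        let handles: Vec<_> = ordinals
            .iter()
            .zip(chunks)
            .map(|(&ordinal, chunk)| {
                s.spawn(move || {
                    // A real worker would bind the context for `ordinal` to
                    // this thread here, before issuing any CUDA work.
                    chunk
                        .iter()
                        .map(|item| work(ordinal, item))
                        .collect::<Vec<R>>()
                })
            })
            .collect();
        // Join in device order so results line up with `ordinals`.
        handles
            .into_iter()
            .map(|h| h.join().expect("device worker panicked"))
            .collect()
    })
}
```

Because std::thread::scope joins every worker before returning, the closure can borrow the chunks and the work function directly, with no Arc or 'static bounds, and no rayon pool is involved at any point.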
Functions§
- balanced_partition: Partition n_units independent work items across all usable devices, weighted by GpuDeviceInfo::score.
- scatter_batched: Run independent work across ALL devices concurrently.