Module pool

Multi-GPU device pool.

The runtime probe (super::device_runtime) already discovers every usable CUDA device and stores the set in GpuRuntime::devices, sorted by descending GpuDeviceInfo::score. Historically, however, every dispatch path pinned its work to the single primary GpuRuntime::device. This module turns the discovered pool into usable parallelism:

GpuRuntime::device_ordinals / GpuRuntime::device_count expose the full set of usable ordinals (highest-score first).
GpuRuntime::memory_budget_for gives a per-device byte budget so each tile can size its device buffers against the device it actually runs on.
balanced_partition splits n independent work items across the pool weighted by each device’s GpuDeviceInfo::score.
scatter_batched runs an independent-per-item closure across every device concurrently, binding each ordinal’s context on its own worker thread.
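
To make the score-weighted split concrete, here is a minimal sketch of one way such a partition could be computed. It is not the actual balanced_partition implementation: the function name partition_by_score, its signature, the use of plain u64 scores, and the even-split fallback for all-zero scores are all assumptions made for illustration. It uses cumulative rounding so that the ranges always cover 0..n_units exactly.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `balanced_partition` (name and signature are
/// assumptions): splits `0..n_units` into one contiguous range per device,
/// sized in proportion to that device's score.
fn partition_by_score(n_units: usize, scores: &[u64]) -> Vec<Range<usize>> {
    if scores.is_empty() {
        return Vec::new();
    }
    // Widen to u128 so `n_units * cumulative_score` cannot overflow.
    let total: u128 = scores.iter().map(|&s| s as u128).sum();
    let mut ranges = Vec::with_capacity(scores.len());
    let mut start = 0usize;
    let mut cumulative: u128 = 0;
    for (i, &score) in scores.iter().enumerate() {
        cumulative += score as u128;
        let end = if total == 0 {
            // Assumed fallback: all scores are zero, so split evenly.
            n_units * (i + 1) / scores.len()
        } else {
            // Boundary i sits at the cumulative share of the total score.
            (n_units as u128 * cumulative / total) as usize
        };
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

With scores [3, 1, 1] and 10 work items, this sketch assigns 0..6 to the first device and two items to each of the others. Because each boundary is derived from the running total, the last range always ends at n_units and no item is dropped or duplicated.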

§Concurrency model

Per-device fan-out uses std::thread::scope, not rayon. A rayon par_iter worker that reaches a OnceLock::get_or_init whose closure itself calls into_par_iter deadlocks the whole process (a hazard already known to the team), and the cudarc context cache (cuda_context_for) is exactly such a lazily initialized global. Scoped OS threads sidestep the problem entirely: each worker calls ctx.bind_to_thread() for its ordinal before issuing any CUDA work, so the thread-local current context is correct for every kernel launched from that thread.
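
The fan-out pattern described above can be sketched with the standard library alone. This is an illustration, not the real scatter_batched: the name scatter_sketch, its signature, and the even chunking are assumptions, and the context binding appears only as a comment because no CUDA types are available in a standalone example.

```rust
use std::thread;

/// Hypothetical sketch of a per-device fan-out (not the real
/// `scatter_batched`): one scoped OS thread per device ordinal, each
/// processing its own contiguous chunk of `items`.
fn scatter_sketch<T, R, F>(ordinals: &[usize], items: &[T], work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(usize, &T) -> R + Sync,
{
    if ordinals.is_empty() || items.is_empty() {
        return Vec::new();
    }
    // Even split for simplicity; a score-weighted split would go here.
    let chunk_len = (items.len() + ordinals.len() - 1) / ordinals.len();
    let work = &work;
    thread::scope(|s| {
        // Spawn every worker before joining any, so devices run concurrently.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .zip(ordinals.iter().copied())
            .map(|(chunk, ordinal)| {
                s.spawn(move || {
                    // Real code would bind this ordinal's CUDA context to the
                    // thread here, before issuing any CUDA work.
                    chunk
                        .iter()
                        .map(|item| work(ordinal, item))
                        .collect::<Vec<R>>()
                })
            })
            .collect();
        // Joining in spawn order keeps results in input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<R>>()
    })
}
```

Because std::thread::scope joins every worker before returning, the closure can borrow items and work without Arc or 'static bounds, and no rayon pool is involved at any point.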

Functions§

balanced_partition: Partition n_units independent work items across all usable devices, weighted by GpuDeviceInfo::score.
scatter_batched: Run independent work across ALL devices concurrently.
