Atomic types for soundly modifying numbers from multiple threads.
Core Interop
Every type in this module works on the CPU (targets other than nvptx). However, core::sync::atomic types do NOT currently work on the GPU. This is because CUDA atomics have some fundamental differences that make it impossible to represent them fully with existing core types:
- CUDA has block-scoped, device-scoped, and system-scoped atomics; core makes no such distinction.
- CUDA trivially supports relaxed/acquire/release orderings on most architectures. SeqCst and other orderings use specialized instructions on compute capabilities 7.x+, and have to be emulated with fences/membars below 7.x. This makes it difficult to hide such details in the codegen.
- CUDA has hardware atomic floats; core does not.
- CUDA makes the distinction between “fetch, do operation, read” (atom) and “do operation” (red).
- Core thinks CUDA supports 8 and 16 bit atomics; this is a bug in the nvptx target, but it is nevertheless an annoying detail to silently trap on.
Therefore, we chose to implement all atomics inside cuda_std. In the future, we may support a subset of core atomics, but for now, you will have to use cuda_std atomics.
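To make the atomic-float and atom/red points concrete, here is a minimal CPU-side sketch of how a float atomic has to be built on top of core's integer atomics, since core has no float atomics of its own. The type name CpuAtomicF32 and its methods are hypothetical and exist only for this illustration; this is not the cuda_std implementation, which maps to hardware instructions on the GPU.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical CPU-side stand-in for an atomic f32 (NOT a cuda_std type).
/// The float is stored as its raw bit pattern in an `AtomicU32`.
pub struct CpuAtomicF32 {
    bits: AtomicU32,
}

impl CpuAtomicF32 {
    pub fn new(value: f32) -> Self {
        Self { bits: AtomicU32::new(value.to_bits()) }
    }

    pub fn load(&self, order: Ordering) -> f32 {
        f32::from_bits(self.bits.load(order))
    }

    /// "atom"-style operation: adds `rhs` and returns the previous value.
    /// Implemented as a compare-exchange loop because core offers no
    /// float `fetch_add`.
    pub fn fetch_add(&self, rhs: f32, order: Ordering) -> f32 {
        let mut current = self.bits.load(Ordering::Relaxed);
        loop {
            let new = (f32::from_bits(current) + rhs).to_bits();
            match self
                .bits
                .compare_exchange_weak(current, new, order, Ordering::Relaxed)
            {
                Ok(previous) => return f32::from_bits(previous),
                // Another thread changed the value (or the weak CAS failed
                // spuriously); retry with the freshly observed bits.
                Err(actual) => current = actual,
            }
        }
    }

    /// "red"-style operation: performs the addition but returns nothing.
    /// On the CPU this simply discards the result; on the GPU the
    /// distinction selects a different instruction.
    pub fn add(&self, rhs: f32, order: Ordering) {
        let _ = self.fetch_add(rhs, order);
    }
}
```

The compare-exchange loop is the usual software fallback. A hardware atomic float add avoids the retry loop entirely, which is part of why the two models do not map onto each other cleanly.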
Modules
Raw CUDA-specific atomic functions that map to PTX instructions.
Mid-level intrinsics that take an ordering parameter and emulate specialized instructions when not available (on lower compute capabilities).
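As a rough illustration of what "take an ordering parameter and emulate" can mean, the sketch below builds stronger orderings out of a relaxed operation bracketed by fences. It is a CPU-side analogy written against core atomics; the function name fetch_add_emulated is hypothetical, and the actual cuda_std intrinsics may be structured differently.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Hypothetical helper: performs a relaxed `fetch_add` and uses fences to
/// supply the requested ordering, mirroring how a stronger ordering can be
/// emulated when no specialized instruction is available.
pub fn fetch_add_emulated(target: &AtomicU32, value: u32, order: Ordering) -> u32 {
    match order {
        Ordering::Relaxed => target.fetch_add(value, Ordering::Relaxed),
        Ordering::Acquire => {
            let old = target.fetch_add(value, Ordering::Relaxed);
            // Fence after the operation: later accesses cannot move before it.
            fence(Ordering::Acquire);
            old
        }
        Ordering::Release => {
            // Fence before the operation: earlier accesses cannot move after it.
            fence(Ordering::Release);
            target.fetch_add(value, Ordering::Relaxed)
        }
        Ordering::AcqRel => {
            fence(Ordering::Release);
            let old = target.fetch_add(value, Ordering::Relaxed);
            fence(Ordering::Acquire);
            old
        }
        // SeqCst (and any ordering added in the future) gets the
        // strongest treatment: a full fence on both sides.
        _ => {
            fence(Ordering::SeqCst);
            let old = target.fetch_add(value, Ordering::Relaxed);
            fence(Ordering::SeqCst);
            old
        }
    }
}
```

Dispatching on the ordering at the call site keeps the raw operation itself simple, at the cost of extra fences on the paths that need them.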
Structs
A 32-bit float type which can be safely shared between threads and synchronizes across a single device (GPU).
A 64-bit float type which can be safely shared between threads and synchronizes across a single device (GPU).
A 32-bit float type which can be safely shared between threads and synchronizes across a single thread block (also called a CTA, cooperative thread array).
A 64-bit float type which can be safely shared between threads and synchronizes across a single thread block (also called a CTA, cooperative thread array).
A 32-bit float type which can be safely shared between threads and synchronizes across a single device (GPU).
A 64-bit float type which can be safely shared between threads and synchronizes across a single device (GPU).