Functions for dealing with the parallel thread execution model employed by CUDA.

CUDA Thread model

The CUDA thread model is based on 3 main structures:

  • Threads
  • Thread Blocks
  • Grids

Threads

Threads are the fundamental element of GPU computing. Every thread in a launch executes the same kernel, and each thread decides which part of the work is its own by retrieving its corresponding global thread ID.
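As an illustration of how a global thread ID can be derived from the block and thread indices, here is a minimal host-side sketch in plain Rust. `Dim3`, `flatten`, and `global_thread_id` are hypothetical names used only for this example, and the x-fastest ordering is an assumption, not necessarily the exact linearization this module uses.

```rust
/// A 3D extent or index, mirroring the x/y/z triples CUDA uses.
/// (Hypothetical helper type for this sketch.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

/// Flattens a 3D position inside a 3D extent, with x varying fastest.
/// The ordering is an assumption made for illustration.
pub fn flatten(pos: Dim3, extent: Dim3) -> u64 {
    let (x, y, z) = (pos.x as u64, pos.y as u64, pos.z as u64);
    let (ex, ey) = (extent.x as u64, extent.y as u64);
    x + y * ex + z * ex * ey
}

/// One possible global thread ID: the flattened block index times the
/// number of threads per block, plus the flattened thread index.
pub fn global_thread_id(
    block_idx: Dim3,
    thread_idx: Dim3,
    grid_dim: Dim3,
    block_dim: Dim3,
) -> u64 {
    let threads_per_block =
        block_dim.x as u64 * block_dim.y as u64 * block_dim.z as u64;
    flatten(block_idx, grid_dim) * threads_per_block + flatten(thread_idx, block_dim)
}
```

For a 1d launch this reduces to the familiar `block_idx.x * block_dim.x + thread_idx.x`. Because every (block, thread) pair maps to a distinct value, each thread can use the result to pick its own element of the data.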

Thread Blocks

The most important structure after threads, thread blocks arrange

Functions

Gets the 3d layout of the thread blocks executing this kernel. In other words, how many threads exist in each thread block in every direction.

Gets the 3d index of the block that the thread currently executing the kernel is located in.

Acts as a memory fence at the device level.

Returns whether the calling thread is the first thread of the launch (not the first thread to start executing). Among all threads that invoke this function, exactly one is guaranteed to receive true, which makes it useful for doing something only once.

Gets the 3d layout of the grid executing this kernel. In other words, how many thread blocks exist in the grid in every direction.

Acts as a memory fence at the grid level (all threads inside of a kernel execution).

Gets the overall thread index, accounting for 1d/2d/3d block/grid dimensions. This value is most commonly used for indexing into data and this index is guaranteed to be unique for every single thread executing this kernel no matter the launch configuration.

Suspends the calling thread for approximately nanos nanoseconds.

Waits until all threads in the thread block have reached this point. This guarantees that any global or shared memory accesses made by those threads before the call are visible to every thread in the block after the call.

Identical to sync_threads, except that it also evaluates the predicate for every thread in the block and returns a non-zero integer if the predicate is non-zero in all of them.

Identical to sync_threads, except that it also evaluates the predicate for every thread in the block and returns the number of threads in which it is non-zero.

Identical to sync_threads, except that it also evaluates the predicate for every thread in the block and returns a non-zero integer if the predicate is non-zero in at least one of them.
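The three predicate variants differ only in how they combine the per-thread predicate values. The following host-side model in plain Rust shows the three reductions over a slice holding one predicate per thread. The function names `block_and`, `block_count`, and `block_or` are hypothetical, and the return types are chosen for illustration rather than taken from this module's signatures. No actual synchronization happens here.

```rust
/// Models the "all" variant: 1 if every predicate is non-zero, else 0.
pub fn block_and(predicates: &[i32]) -> i32 {
    predicates.iter().all(|&p| p != 0) as i32
}

/// Models the "count" variant: how many predicates are non-zero.
pub fn block_count(predicates: &[i32]) -> u32 {
    predicates.iter().filter(|&&p| p != 0).count() as u32
}

/// Models the "any" variant: 1 if at least one predicate is non-zero, else 0.
pub fn block_or(predicates: &[i32]) -> i32 {
    predicates.iter().any(|&p| p != 0) as i32
}
```

On the device, every thread in the block receives the same combined result, so the whole block can branch on it uniformly.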

Acts as a memory fence at the system level.

Gets the 3d index of the thread currently executing the kernel.

Gets the number of threads inside of a warp. Currently 32 threads on every GPU architecture.
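Given a warp size of 32, a thread's position within its block determines which warp it belongs to and which lane it occupies in that warp, since CUDA groups consecutive threads of a block into warps. A minimal host-side sketch in plain Rust follows. `WARP_SIZE`, `warp_and_lane`, and `warps_per_block` are hypothetical names for this example, and the input is assumed to be the thread's flattened index within its block.

```rust
/// Warp size assumed for this sketch (32 threads).
pub const WARP_SIZE: u32 = 32;

/// Splits a flattened in-block thread index into (warp index, lane index).
pub fn warp_and_lane(thread_in_block: u32) -> (u32, u32) {
    (thread_in_block / WARP_SIZE, thread_in_block % WARP_SIZE)
}

/// Number of warps needed to cover a block, rounding up so that a
/// partially filled final warp is still counted.
pub fn warps_per_block(threads_per_block: u32) -> u32 {
    (threads_per_block + WARP_SIZE - 1) / WARP_SIZE
}
```

For example, thread 70 of a block falls in warp 2 at lane 6, and a block of 100 threads occupies 4 warps, the last of which is only partially filled.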