Module thread

Functions for dealing with the parallel thread execution model employed by CUDA.

§CUDA Thread model

The CUDA thread model is based on 3 main structures:

  • Threads
  • Thread Blocks
  • Grids

§Threads

Threads are the fundamental element of GPU computing. Every thread in a launch executes the same kernel, and each one determines which part of the task it handles by retrieving its global thread ID.
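To make the role of the global thread ID concrete, here is a minimal host-side sketch in plain Rust. It does not call this module; it only emulates a 1D launch on the CPU. The names `LaunchPoint`, `global_id_1d`, and `emulate_double` are hypothetical, and the formula shown is the conventional 1D one (block index times block size, plus thread index), assumed here for illustration.

```rust
/// Host-side stand-in for the per-thread values a kernel would read on the GPU.
/// (Hypothetical type, used only for this illustration.)
#[derive(Clone, Copy, Debug, PartialEq)]
struct LaunchPoint {
    block_idx_x: u32,
    block_dim_x: u32,
    thread_idx_x: u32,
}

/// Conventional global thread ID for a 1D launch.
fn global_id_1d(p: LaunchPoint) -> u32 {
    p.block_idx_x * p.block_dim_x + p.thread_idx_x
}

/// Emulates a 1D launch on the CPU: every "thread" runs the same body and
/// uses its global ID to pick the single element it is responsible for.
fn emulate_double(data: &mut [u32], grid_dim_x: u32, block_dim_x: u32) {
    for block_idx_x in 0..grid_dim_x {
        for thread_idx_x in 0..block_dim_x {
            let id = global_id_1d(LaunchPoint {
                block_idx_x,
                block_dim_x,
                thread_idx_x,
            }) as usize;
            // Bounds guard: a launch may contain more threads than elements.
            if let Some(x) = data.get_mut(id) {
                *x *= 2;
            }
        }
    }
}
```

Calling `emulate_double` on a five-element slice with 2 blocks of 4 threads runs 8 emulated threads; the three whose ID falls past the end of the slice do nothing, which mirrors the bounds check real kernels usually need.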

§Thread Blocks

The most important structure after threads, thread blocks arrange

Functions§

block_dim
Gets the 3d layout of the thread blocks executing this kernel. In other words, how many threads exist in each thread block in every direction.
block_dim_x
block_dim_y
block_dim_z
block_idx
Gets the 3d index of the block that the thread currently executing the kernel is located in.
block_idx_x
block_idx_y
block_idx_z
device_fence
Acts as a memory fence at the device level.
first
Whether this is the first thread in the launch (not necessarily the first thread to start executing). This function is guaranteed to return true in exactly one of the threads that invoke it, which is useful for doing something only once.
grid_dim
Gets the 3d layout of the grid executing this kernel. In other words, how many thread blocks exist in the grid in every direction.
grid_dim_x
grid_dim_y
grid_dim_z
grid_fence
Acts as a memory fence at the grid level (all threads inside of a kernel execution).
index
Gets the overall thread index, accounting for 1d/2d/3d block/grid dimensions. This value is most commonly used for indexing into data and this index is guaranteed to be unique for every single thread executing this kernel no matter the launch configuration.
index_1d
index_2d
index_3d
nanosleep
Suspends the calling thread for approximately nanos nanoseconds.
sync_threads
Waits until all threads in the thread block have reached this point. This guarantees that any global or shared memory accesses made by those threads before the call are visible to every thread in the block after it.
sync_threads_and
Identical to sync_threads but with the additional feature that it evaluates the predicate for every thread in the block and returns a non-zero integer if and only if the predicate evaluates to non-zero for all of them.
sync_threads_count
Identical to sync_threads but with the additional feature that it evaluates the predicate for every thread in the block and returns the number of threads for which it evaluated to non-zero.
sync_threads_or
Identical to sync_threads but with the additional feature that it evaluates the predicate for every thread in the block and returns a non-zero integer if and only if the predicate evaluates to non-zero for at least one of them.
system_fence
Acts as a memory fence at the system level.
thread_idx
Gets the 3d index of the thread currently executing the kernel.
thread_idx_x
thread_idx_y
thread_idx_z
warp_size
Gets the number of threads inside of a warp. Currently 32 threads on every GPU architecture.
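
The description of index above says the overall thread index accounts for 1d/2d/3d block and grid dimensions and is unique per thread. The following host-side sketch in plain Rust shows one common way such a flattening can be computed and checks its uniqueness by enumerating a small launch. It is an assumption for illustration, not necessarily the formula this module uses; `Dim3`, `flat_index`, and `all_indices` are hypothetical names.

```rust
/// Hypothetical 3-component value standing in for the 3d values
/// (grid_dim, block_dim, block_idx, thread_idx) described above.
#[derive(Clone, Copy)]
struct Dim3 {
    x: u32,
    y: u32,
    z: u32,
}

/// One common flattening of a 3d launch position into a single index:
/// linearize the block within the grid, then the thread within the block.
fn flat_index(grid_dim: Dim3, block_dim: Dim3, block_idx: Dim3, thread_idx: Dim3) -> u32 {
    let threads_per_block = block_dim.x * block_dim.y * block_dim.z;
    let block_id = block_idx.x
        + block_idx.y * grid_dim.x
        + block_idx.z * grid_dim.x * grid_dim.y;
    let thread_in_block = thread_idx.x
        + thread_idx.y * block_dim.x
        + thread_idx.z * block_dim.x * block_dim.y;
    block_id * threads_per_block + thread_in_block
}

/// Enumerates every (block, thread) position of a launch on the CPU and
/// collects the flattened index computed for each one.
fn all_indices(grid_dim: Dim3, block_dim: Dim3) -> Vec<u32> {
    let mut out = Vec::new();
    for bz in 0..grid_dim.z {
        for by in 0..grid_dim.y {
            for bx in 0..grid_dim.x {
                for tz in 0..block_dim.z {
                    for ty in 0..block_dim.y {
                        for tx in 0..block_dim.x {
                            let block_idx = Dim3 { x: bx, y: by, z: bz };
                            let thread_idx = Dim3 { x: tx, y: ty, z: tz };
                            out.push(flat_index(grid_dim, block_dim, block_idx, thread_idx));
                        }
                    }
                }
            }
        }
    }
    out
}
```

For a grid of 2x3x1 blocks with 4x2x2 threads each, `all_indices` yields every value from 0 to 95 exactly once, so each emulated thread can safely use its index to address a distinct element.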