Module thread


Functions for dealing with the parallel thread execution model employed by CUDA.

§CUDA Thread model

The CUDA thread model is based on 3 main structures:

Threads
Thread Blocks
Grids

§Threads

Threads are the fundamental element of GPU computing. Every thread in a launch executes the same kernel in parallel, and each one selects its share of the work by retrieving its global thread ID.
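To make "retrieving the global thread ID" concrete, here is a minimal host-side sketch in plain Rust that models the usual one-dimensional calculation. It does not call into this module and does not run on a GPU. The names `global_id_1d` and `simulate_launch` are hypothetical, and the formula `block_idx * block_dim + thread_idx` is the conventional CUDA layout, assumed here rather than taken from this crate's source.

```rust
// Hypothetical host-side model; nothing here is part of this module's API.

/// Models a 1D global thread ID under the conventional CUDA layout
/// (an assumption): blocks are placed one after another, and each block
/// contains `block_dim` threads.
pub fn global_id_1d(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    block_idx * block_dim + thread_idx
}

/// Simulates a launch of `grid_dim` blocks with `block_dim` threads each.
/// Every simulated thread doubles the single element selected by its
/// global ID. Threads whose ID falls past the end of `data` do nothing.
pub fn simulate_launch(data: &mut [f32], grid_dim: u32, block_dim: u32) {
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            let id = global_id_1d(block_idx, block_dim, thread_idx) as usize;
            // Guard against surplus threads in the last block.
            if id < data.len() {
                data[id] *= 2.0;
            }
        }
    }
}
```

The bounds check matters because a launch is usually rounded up to a whole number of blocks, so some threads may have no element to work on.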

§Thread Blocks

The most important structure after threads, thread blocks arrange

Functions§

block_dim: Gets the 3d layout of the thread blocks executing this kernel. In other words, how many threads exist in each thread block in every direction.
block_dim_x
block_dim_y
block_dim_z
block_idx: Gets the 3d index of the block that the thread currently executing the kernel is located in.
block_idx_x
block_idx_y
block_idx_z
device_fence: Acts as a memory fence at the device level.
first: Whether this is the first thread in the launch (not the first thread to begin executing). This function is guaranteed to return true in only one of the threads that invoke it, which is useful for doing something exactly once.
grid_dim: Gets the 3d layout of the block grids executing this kernel. In other words, how many thread blocks exist in each grid in every direction.
grid_dim_x
grid_dim_y
grid_dim_z
grid_fence: Acts as a memory fence at the grid level (all threads inside of a kernel execution).
index: Gets the overall thread index, accounting for 1d/2d/3d block/grid dimensions. This value is most commonly used for indexing into data and this index is guaranteed to be unique for every single thread executing this kernel no matter the launch configuration.
index_1d
index_2d
index_3d
nanosleep: Suspends the calling thread for approximately nanos nanoseconds.
sync_threads: Waits until all threads in the thread block have reached this point. This guarantees that global and shared memory accesses made by those threads before the call are visible to every thread in the block after it.
sync_threads_and: Identical to sync_threads, but additionally evaluates the predicate for every thread and returns a non-zero integer only if the predicate is non-zero for all of them.
sync_threads_count: Identical to sync_threads, but additionally evaluates the predicate for every thread and returns the number of threads for which it was non-zero.
sync_threads_or: Identical to sync_threads, but additionally evaluates the predicate for every thread and returns a non-zero integer if the predicate is non-zero for at least one of them.
system_fence: Acts as a memory fence at the system level.
thread_idx: Gets the 3d index of the thread currently executing the kernel.
thread_idx_x
thread_idx_y
thread_idx_z
warp_size: Gets the number of threads inside of a warp. Currently 32 threads on every GPU architecture.
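
The three predicate-reducing barriers (sync_threads_and, sync_threads_or, sync_threads_count) can be thought of as reductions over one predicate value per thread in the block. The following is a host-side sketch in plain Rust of that semantics only. It performs no synchronization, and the names `model_and`, `model_or`, and `model_count` are hypothetical. The real functions are only described as returning "a non-zero integer"; returning exactly 1 here is an assumption of the model.

```rust
// Hypothetical host-side model of the block-wide predicate reductions.
// `predicates` holds one value per thread in a single thread block.

/// Models `sync_threads_and`: non-zero only if every predicate is non-zero.
pub fn model_and(predicates: &[u32]) -> u32 {
    predicates.iter().all(|&p| p != 0) as u32
}

/// Models `sync_threads_or`: non-zero if at least one predicate is non-zero.
pub fn model_or(predicates: &[u32]) -> u32 {
    predicates.iter().any(|&p| p != 0) as u32
}

/// Models `sync_threads_count`: how many threads had a non-zero predicate.
pub fn model_count(predicates: &[u32]) -> u32 {
    predicates.iter().filter(|&&p| p != 0).count() as u32
}
```

On the device, every thread in the block receives the same result, which makes these calls convenient for block-wide decisions such as "has any thread found work left to do".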
