Function rcudnn::cudaLaunchCooperativeKernelMultiDevice

pub unsafe extern "C" fn cudaLaunchCooperativeKernelMultiDevice(
    launchParamsList: *mut cudaLaunchParams,
    numDevices: u32,
    flags: u32
) -> cudaError

\brief Launches device functions on multiple devices where thread blocks can cooperate and synchronize as they execute

\deprecated This function is deprecated as of CUDA 11.3.

Invokes kernels as specified in the \p launchParamsList array where each element of the array specifies all the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as they execute. The size of the array is specified by \p numDevices.

No two kernels can be launched on the same device. All the devices targeted by this multi-device launch must be identical. All devices must have a non-zero value for the device attribute ::cudaDevAttrCooperativeMultiDeviceLaunch.
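As a sketch (not part of the binding itself), the required device attribute can be checked with ::cudaDeviceGetAttribute before building the launch list; the helper name below is illustrative:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hedged sketch: verify every target device supports cooperative
// multi-device launch before attempting the launch.
int all_devices_support_coop(int num_devices) {
    for (int dev = 0; dev < num_devices; ++dev) {
        int supported = 0;
        cudaDeviceGetAttribute(&supported,
                               cudaDevAttrCooperativeMultiDeviceLaunch, dev);
        if (!supported) {
            fprintf(stderr,
                    "device %d lacks cooperative multi-device launch\n", dev);
            return 0;
        }
    }
    return 1;
}
```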

The same kernel must be launched on all devices. Note that any device or constant variables are independently instantiated on every device. It is the application’s responsibility to ensure these variables are initialized and used appropriately.

The grid dimensions (in blocks), the block dimensions, and the amount of dynamic shared memory used by each thread block must also match across all launched kernels.

The streams used to launch these kernels must have been created via ::cudaStreamCreate, ::cudaStreamCreateWithFlags, or ::cudaStreamCreateWithPriority. The NULL stream, ::cudaStreamLegacy, and ::cudaStreamPerThread cannot be used.

The total number of blocks launched per kernel cannot exceed the maximum number of blocks per multiprocessor as returned by ::cudaOccupancyMaxActiveBlocksPerMultiprocessor (or ::cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute ::cudaDevAttrMultiProcessorCount. Since the total number of blocks launched per device has to match across all devices, the maximum number of blocks that can be launched per device will be limited by the device with the least number of multiprocessors.
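The per-device cap described above can be computed with the occupancy API; a hedged sketch (the kernel symbol, block size, and shared-memory size are placeholders, not from the source):

```cuda
#include <cuda_runtime.h>

extern __global__ void my_kernel(float *); // hypothetical kernel

// Sketch: the maximum cooperative grid size for one device is
// (max resident blocks per SM) * (number of SMs).
int max_coop_blocks(int device, int block_size, size_t smem_bytes) {
    cudaSetDevice(device);
    int blocks_per_sm = 0, sm_count = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, (const void *)my_kernel, block_size, smem_bytes);
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);
    return blocks_per_sm * sm_count;
}
```

Because the grid size must match on every device, the usable limit is the minimum of this value over all participating devices, as the paragraph above notes.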

The kernel cannot make use of CUDA dynamic parallelism.

The ::cudaLaunchParams structure is defined as:
\code
struct cudaLaunchParams {
    void *func;
    dim3 gridDim;
    dim3 blockDim;
    void **args;
    size_t sharedMem;
    cudaStream_t stream;
};
\endcode
where:

  • ::cudaLaunchParams::func specifies the kernel to be launched. The same function must be launched on all devices. For templated functions, pass the function symbol as follows: func_name<template_arg_0,…,template_arg_N>
  • ::cudaLaunchParams::gridDim specifies the width, height and depth of the grid in blocks. This must match across all kernels launched.
  • ::cudaLaunchParams::blockDim is the width, height and depth of each thread block. This must match across all kernels launched.
  • ::cudaLaunchParams::args specifies the arguments to the kernel. If the kernel has N parameters then ::cudaLaunchParams::args should point to an array of N pointers. Each pointer, from ::cudaLaunchParams::args[0] to ::cudaLaunchParams::args[N - 1], points to the region of memory from which the actual parameter will be copied.
  • ::cudaLaunchParams::sharedMem is the dynamic shared-memory size per thread block in bytes. This must match across all kernels launched.
  • ::cudaLaunchParams::stream is the handle to the stream to perform the launch in. This cannot be the NULL stream or ::cudaStreamLegacy or ::cudaStreamPerThread.
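Putting the fields together, a minimal hedged sketch of populating the \p launchParamsList array and launching (the kernel, grid/block dimensions, and device bound are illustrative assumptions, not from the source):

```cuda
#include <cuda_runtime.h>

__global__ void coop_kernel(float *data) { /* cooperative body elided */ }

// Sketch only: launch the same kernel on num_devices GPUs (num_devices <= 8).
cudaError_t launch_on_all(int num_devices, float **per_device_data) {
    cudaLaunchParams params[8];
    void *arg_lists[8][1];
    for (int dev = 0; dev < num_devices; ++dev) {
        cudaSetDevice(dev);
        cudaStream_t stream;
        cudaStreamCreate(&stream);        // the NULL stream is not allowed here
        arg_lists[dev][0] = &per_device_data[dev];
        params[dev].func      = (void *)coop_kernel;
        params[dev].gridDim   = dim3(32, 1, 1);   // must match on all devices
        params[dev].blockDim  = dim3(256, 1, 1);  // must match on all devices
        params[dev].args      = arg_lists[dev];
        params[dev].sharedMem = 0;                // must match on all devices
        params[dev].stream    = stream;           // one distinct stream per device
    }
    return cudaLaunchCooperativeKernelMultiDevice(params, num_devices, 0);
}
```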

By default, the kernel won’t begin execution on any GPU until all prior work in all the specified streams has completed. This behavior can be overridden by specifying the flag ::cudaCooperativeLaunchMultiDeviceNoPreSync. When this flag is specified, each kernel will only wait for prior work in the stream corresponding to that GPU to complete before it begins execution.

Similarly, by default, any subsequent work pushed in any of the specified streams will not begin execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the flag ::cudaCooperativeLaunchMultiDeviceNoPostSync. When this flag is specified, any subsequent work pushed in any of the specified streams will only wait for the kernel launched on the GPU corresponding to that stream to complete before it begins execution.
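Both flags can be combined by OR-ing them into the \p flags argument; a hedged fragment, assuming a populated \p launchParamsList array named params:

```cuda
// Sketch: opt out of both the pre-launch and post-launch synchronization.
unsigned int flags = cudaCooperativeLaunchMultiDeviceNoPreSync
                   | cudaCooperativeLaunchMultiDeviceNoPostSync;
cudaError_t err =
    cudaLaunchCooperativeKernelMultiDevice(params, num_devices, flags);
```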

\param launchParamsList - List of launch parameters, one per device
\param numDevices - Size of the \p launchParamsList array
\param flags - Flags to control launch behavior

\return ::cudaSuccess, ::cudaErrorInvalidDeviceFunction, ::cudaErrorInvalidConfiguration, ::cudaErrorLaunchFailure, ::cudaErrorLaunchTimeout, ::cudaErrorLaunchOutOfResources, ::cudaErrorCooperativeLaunchTooLarge, ::cudaErrorSharedObjectInitFailed \note_null_stream \notefnerr \note_init_rt \note_callback

\sa \ref ::cudaLaunchCooperativeKernel(const T *func, dim3 gridDim, dim3 blockDim, void **args, size_t sharedMem, cudaStream_t stream) “cudaLaunchCooperativeKernel (C++ API)”, ::cudaLaunchCooperativeKernel, ::cuLaunchCooperativeKernelMultiDevice