Function cusparseSpMM

pub unsafe extern "C" fn cusparseSpMM(
    handle: cusparseHandle_t,
    opA: cusparseOperation_t,
    opB: cusparseOperation_t,
    alpha: *const c_void,
    matA: cusparseConstSpMatDescr_t,
    matB: cusparseConstDnMatDescr_t,
    beta: *const c_void,
    matC: cusparseDnMatDescr_t,
    computeType: cudaDataType,
    alg: cusparseSpMMAlg_t,
    externalBuffer: *mut c_void,
) -> cusparseStatus_t

The function performs the multiplication of a sparse matrix matA and a dense matrix matB:

$$\mathbf{C} = \alpha \, op(\mathbf{A}) \cdot op(\mathbf{B}) + \beta \mathbf{C}$$

where

op(A) is a sparse matrix of size $m \times k$
op(B) is a dense matrix of size $k \times n$
C is a dense matrix of size $m \times n$
$\alpha$ and $\beta$ are scalars
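
As an illustration of the operation defined above, the following is a minimal CPU reference sketch for the non-transposed case with a CSR matrix and row-major dense matrices. The `Csr` type and the `spmm_reference` function are hypothetical helpers written for this example; they are not part of the bindings and say nothing about how the GPU routine is implemented.

```rust
/// Minimal CSR matrix (hypothetical helper type, not part of the bindings).
pub struct Csr {
    pub rows: usize,
    pub cols: usize,
    pub row_offsets: Vec<usize>, // length rows + 1
    pub col_indices: Vec<usize>, // length nnz
    pub values: Vec<f32>,        // length nnz
}

/// CPU reference for C = alpha * A * B + beta * C with no transposition.
/// `b` is k x n and `c` is m x n, both dense and row-major.
pub fn spmm_reference(alpha: f32, a: &Csr, b: &[f32], n: usize, beta: f32, c: &mut [f32]) {
    assert_eq!(a.row_offsets.len(), a.rows + 1);
    assert_eq!(b.len(), a.cols * n);
    assert_eq!(c.len(), a.rows * n);
    for i in 0..a.rows {
        let c_row = &mut c[i * n..(i + 1) * n];
        // Scale the existing output row by beta first.
        for x in c_row.iter_mut() {
            *x *= beta;
        }
        // Accumulate alpha * A[i, col] * B[col, :] for every stored entry of row i.
        for p in a.row_offsets[i]..a.row_offsets[i + 1] {
            let col = a.col_indices[p];
            let scaled = alpha * a.values[p];
            let b_row = &b[col * n..(col + 1) * n];
            for (x, &bv) in c_row.iter_mut().zip(b_row) {
                *x += scaled * bv;
            }
        }
    }
}
```

A reference of this kind is mainly useful for checking GPU results on small inputs, within a tolerance for the non-deterministic algorithms.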

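The routine can also be used to perform the multiplication of a dense matrix and a sparse matrix by switching the layout of the dense matrices:

$$\mathbf{C}_{C} = \alpha \, \mathbf{B}_{C} \cdot \mathbf{A} + \beta \mathbf{C}_{C} \rightarrow \mathbf{C}_{R} = \alpha \, \mathbf{A}^{T} \cdot \mathbf{B}_{R} + \beta \mathbf{C}_{R}$$

where $\mathbf{B}_{C}$, $\mathbf{C}_{C}$ indicate column-major layout, while $\mathbf{B}_{R}$, $\mathbf{C}_{R}$ refer to row-major layout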
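
One way to read the layout switch is as a reinterpretation of memory: a column-major buffer, read as row-major with the same leading dimension, is the transposed matrix, so a dense-times-sparse product can be expressed as a sparse-times-dense product on the same buffers. The sketch below only demonstrates that indexing identity; the helper names are hypothetical and not part of the bindings.

```rust
/// Element (i, j) of a column-major matrix with leading dimension `ld`.
pub fn at_col_major(buf: &[f32], ld: usize, i: usize, j: usize) -> f32 {
    buf[j * ld + i]
}

/// Element (i, j) of a row-major matrix with leading dimension `ld`.
pub fn at_row_major(buf: &[f32], ld: usize, i: usize, j: usize) -> f32 {
    buf[i * ld + j]
}

/// Checks that a `rows x cols` column-major buffer is, element for element,
/// the transposed `cols x rows` matrix when the same memory is read as row-major.
pub fn col_major_is_row_major_transpose(buf: &[f32], rows: usize, cols: usize) -> bool {
    (0..rows).all(|i| {
        (0..cols).all(|j| at_col_major(buf, rows, i, j) == at_row_major(buf, rows, j, i))
    })
}
```

Both accessors resolve to the same offset `j * rows + i`, which is why no data movement is needed to switch between the two views.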

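Also, for matrices A and B

$$op(\mathbf{A}) = \begin{cases} \mathbf{A} & \text{if } opA == \text{CUSPARSE\_OPERATION\_NON\_TRANSPOSE} \\ \mathbf{A}^{T} & \text{if } opA == \text{CUSPARSE\_OPERATION\_TRANSPOSE} \\ \mathbf{A}^{H} & \text{if } opA == \text{CUSPARSE\_OPERATION\_CONJUGATE\_TRANSPOSE} \end{cases}$$

$$op(\mathbf{B}) = \begin{cases} \mathbf{B} & \text{if } opB == \text{CUSPARSE\_OPERATION\_NON\_TRANSPOSE} \\ \mathbf{B}^{T} & \text{if } opB == \text{CUSPARSE\_OPERATION\_TRANSPOSE} \\ \mathbf{B}^{H} & \text{if } opB == \text{CUSPARSE\_OPERATION\_CONJUGATE\_TRANSPOSE} \end{cases}$$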

When using the (conjugate) transpose of the sparse matrix A, this routine may produce slightly different results during different runs with the same input parameters.

The function cusparseSpMM_bufferSize returns the size of the workspace needed by cusparseSpMM.

Calling cusparseSpMM_preprocess is optional. It may accelerate subsequent calls to cusparseSpMM, and is useful when cusparseSpMM is called multiple times with the same sparsity pattern (matA). It provides performance advantages with cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG1 or cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG3. For all other formats and algorithms, it has no effect.

Calling cusparseSpMM_preprocess with a buffer makes that buffer “active” for SpMM calls on matA. Subsequent calls to cusparseSpMM with matA and the active buffer must use the same values for all parameters as the call to cusparseSpMM_preprocess. The exceptions are alpha, beta, matB, matC, and the values (but not the indices) of matA, which may be different. Importantly, the buffer contents must be unmodified since the call to cusparseSpMM_preprocess. When cusparseSpMM is called with matA and its active buffer, it may read acceleration data from the buffer.

Calling cusparseSpMM_preprocess again with matA and a new buffer will make the new buffer active, forgetting about the previously-active buffer and making it inactive. For cusparseSpMM, there can only be one active buffer per sparse matrix at a time. To get the effect of multiple active buffers for a single sparse matrix, create multiple matrix handles that all point to the same index and value buffers, and call cusparseSpMM_preprocess once per handle with different workspace buffers.

Calling cusparseSpMM with an inactive buffer is always permitted. However, there may be no acceleration from the preprocessing in that case.

For the purposes of thread safety, cusparseSpMM_preprocess writes to the internal state of matA.
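
The active-buffer rules above can be summarized with a small model. This is a purely illustrative sketch: `SpMatModel`, its methods, and the use of an integer id in place of a device pointer are hypothetical and do not correspond to any type or function in the bindings.

```rust
/// Hypothetical stand-in for the per-matrix state that preprocessing updates.
/// Buffers are identified by an opaque id here instead of a device pointer.
#[derive(Default)]
pub struct SpMatModel {
    active_buffer: Option<u64>,
}

impl SpMatModel {
    /// Models cusparseSpMM_preprocess: the given buffer becomes the single
    /// active buffer, replacing any previously active one. Takes `&mut self`
    /// because preprocessing writes to the matrix's internal state.
    pub fn preprocess(&mut self, buffer_id: u64) {
        self.active_buffer = Some(buffer_id);
    }

    /// Models the buffer check in cusparseSpMM: the call is always permitted,
    /// but only the active buffer may carry acceleration data.
    pub fn spmm_may_be_accelerated(&self, buffer_id: u64) -> bool {
        self.active_buffer == Some(buffer_id)
    }
}
```

In this model, calling `preprocess(1)` and then `preprocess(2)` leaves only buffer 2 active, which mirrors the one-active-buffer-per-matrix rule. The `&mut self` receiver also suggests why concurrent preprocessing on the same descriptor would need external synchronization.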

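cusparseSpMM supports the following sparse matrix formats:

cusparseFormat_t::CUSPARSE_FORMAT_COO
cusparseFormat_t::CUSPARSE_FORMAT_CSR
cusparseFormat_t::CUSPARSE_FORMAT_CSC
cusparseFormat_t::CUSPARSE_FORMAT_BSR
cusparseFormat_t::CUSPARSE_FORMAT_BLOCKED_ELL
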
(1)	COO/CSR/CSC/BSR FORMATS

cusparseSpMM supports the following index type for representing the sparse matrix matA:

32-bit indices (cusparseIndexType_t::CUSPARSE_INDEX_32I)
64-bit indices (cusparseIndexType_t::CUSPARSE_INDEX_64I)

cusparseSpMM supports the following data types:

Uniform-precision computation:

`A`/`B`/`C`/`computeType`
`cudaDataType_t::CUDA_R_32F`
`cudaDataType_t::CUDA_R_64F`
`cudaDataType_t::CUDA_C_32F`
`cudaDataType_t::CUDA_C_64F`

Mixed-precision computation:

`A`/`B`	`C`	`computeType`
`cudaDataType_t::CUDA_R_8I`	`cudaDataType_t::CUDA_R_32I`	`cudaDataType_t::CUDA_R_32I`
`cudaDataType_t::CUDA_R_8I`	`cudaDataType_t::CUDA_R_32F`	`cudaDataType_t::CUDA_R_32F`
`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_32F`	`cudaDataType_t::CUDA_R_32F`
`cudaDataType_t::CUDA_R_16BF`	`cudaDataType_t::CUDA_R_32F`	`cudaDataType_t::CUDA_R_32F`
`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_32F`
`cudaDataType_t::CUDA_R_16BF`	`cudaDataType_t::CUDA_R_16BF`	`cudaDataType_t::CUDA_R_32F`
`cudaDataType_t::CUDA_C_16F`	`cudaDataType_t::CUDA_C_16F`	`cudaDataType_t::CUDA_C_32F`	[DEPRECATED]
`cudaDataType_t::CUDA_C_16BF`	`cudaDataType_t::CUDA_C_16BF`	`cudaDataType_t::CUDA_C_32F`	[DEPRECATED]

NOTE: cudaDataType_t::CUDA_R_16F, cudaDataType_t::CUDA_R_16BF, cudaDataType_t::CUDA_C_16F, and cudaDataType_t::CUDA_C_16BF data types always imply mixed-precision computation.
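
To illustrate why a wider compute type matters, the sketch below compares a dot product of 8-bit inputs accumulated in 32 bits (the arrangement in the `CUDA_R_8I` / `CUDA_R_32I` row) with the same product accumulated in 8 bits. It is a CPU illustration only; the function names are hypothetical and nothing here reflects the actual GPU kernels.

```rust
/// Dot product with 8-bit inputs and 32-bit accumulation, mirroring the
/// CUDA_R_8I inputs / CUDA_R_32I compute row of the table.
pub fn dot_i8_acc_i32(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| i32::from(x) * i32::from(y)).sum()
}

/// Same dot product accumulated in 8 bits; returns None on overflow.
pub fn dot_i8_acc_i8(a: &[i8], b: &[i8]) -> Option<i8> {
    a.iter()
        .zip(b)
        .try_fold(0i8, |acc, (&x, &y)| x.checked_mul(y).and_then(|p| acc.checked_add(p)))
}
```

Even a two-element product of moderately large 8-bit values overflows the narrow accumulator, while the 32-bit accumulator holds the exact result.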

cusparseSpMM supports the following algorithms:

Algorithm	Notes
`cusparseSpMMAlg_t::CUSPARSE_SPMM_ALG_DEFAULT`	Default algorithm for any sparse matrix format
`cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG1`	Algorithm 1 for COO sparse matrix format * May provide better performance for small number of nnz * Provides the best performance with column-major layout * It supports batched computation * May produce slightly different results during different runs with the same input parameters
`cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG2`	Algorithm 2 for COO sparse matrix format * It provides deterministic result * Provides the best performance with column-major layout * In general, slower than Algorithm 1 * It supports batched computation * It requires additional memory * If `opA != CUSPARSE_OPERATION_NON_TRANSPOSE`, it is identical to `cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG1`
`cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG3`	Algorithm 3 for COO sparse matrix format * May provide better performance for large number of nnz * May produce slightly different results during different runs with the same input parameters
`cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG4`	Algorithm 4 for COO sparse matrix format * Provides better performance with row-major layout * It supports batched computation * May produce slightly different results during different runs with the same input parameters
`cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG1`	Algorithm 1 for CSR/CSC sparse matrix format * Provides the best performance with column-major layout * It supports batched computation * It requires additional memory * May produce slightly different results during different runs with the same input parameters
`cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG2`	Algorithm 2 for CSR/CSC sparse matrix format * Provides the best performance with row-major layout * It supports batched computation * It requires additional memory * May produce slightly different results during different runs with the same input parameters
`cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG3`	Algorithm 3 for CSR sparse matrix format * It provides deterministic result * It requires additional memory * It supports only CSR matrix and `opA == CUSPARSE_OPERATION_NON_TRANSPOSE` * It does not support `opB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE` * It does not support `CUDA_C_16F` and `CUDA_C_16BF` data types
`cusparseSpMMAlg_t::CUSPARSE_SPMM_BSR_ALG1`	Algorithm 1 for BSR sparse matrix format * It provides deterministic result * It requires no additional memory * It supports only `opA == CUSPARSE_OPERATION_NON_TRANSPOSE` * It does not support `cudaDataType_t::CUDA_C_16F` and `cudaDataType_t::CUDA_C_16BF` data types * It does not support column-major blocks in `A`

NOTE: When using cusparseSpMM for mixed-precision computation on COO or CSR matrices, it defaults to algorithms cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG2 and cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG3, respectively. If the required computation isn’t supported by those algorithms, the mixed-precision operation will fail.

Performance notes:

Row-major layout provides higher performance than column-major
cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG4 and cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG2 should be used with row-major layout, while cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG1, cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG2, cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG3, and cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG1 with column-major layout
For beta != 1, most algorithms scale the output matrix before the main computation
For n == 1, the routine may use cusparseSpMV

All cusparseSpMM algorithms except cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG3 support the following batch modes:

$C_{i} = A \cdot B_{i}$
$C_{i} = A_{i} \cdot B$
$C_{i} = A_{i} \cdot B_{i}$

The number of batches and their strides can be set by using cusparseCooSetStridedBatch, cusparseCsrSetStridedBatch, and cusparseDnMatSetStridedBatch. The maximum number of batches for cusparseSpMM is 65,535.
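
As a rough picture of how strided batches address memory, the following CPU sketch implements the first batch mode, $C_{i} = A \cdot B_{i}$, with one shared sparse matrix and dense batches laid out at a fixed stride. All names are hypothetical, the COO matrix is represented as plain `(row, col, value)` triples, and the sketch overwrites each $C_{i}$ (that is, it assumes $\alpha = 1$ and $\beta = 0$).

```rust
/// Maximum number of batches stated in the documentation above.
pub const MAX_BATCHES: usize = 65_535;

/// Returns the element range of batch `i` inside a strided buffer, where each
/// batch holds `len` elements and consecutive batches start `stride` apart.
pub fn batch_range(i: usize, len: usize, stride: usize) -> std::ops::Range<usize> {
    i * stride..i * stride + len
}

/// CPU sketch of the C_i = A * B_i batch mode: one shared COO matrix A
/// (m x k, given as (row, col, value) triples) applied to every dense batch.
/// B_i is k x n and C_i is m x n, both row-major.
pub fn batched_spmm_shared_a(
    a: &[(usize, usize, f32)],
    m: usize,
    k: usize,
    n: usize,
    b: &[f32],
    b_stride: usize,
    c: &mut [f32],
    c_stride: usize,
    batch_count: usize,
) {
    assert!(batch_count <= MAX_BATCHES);
    for i in 0..batch_count {
        let b_i = &b[batch_range(i, k * n, b_stride)];
        let c_i = &mut c[batch_range(i, m * n, c_stride)];
        // Overwrite the output batch, then accumulate every stored entry of A.
        c_i.fill(0.0);
        for &(row, col, val) in a {
            for j in 0..n {
                c_i[row * n + j] += val * b_i[col * n + j];
            }
        }
    }
}
```

The other two batch modes differ only in which operand carries a per-batch offset.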

cusparseSpMM has the following properties:

The routine requires no extra storage for cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG1, cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG3, cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG4, cusparseSpMMAlg_t::CUSPARSE_SPMM_BSR_ALG1
The routine supports asynchronous execution
The routine provides deterministic (bit-wise) results for each run only for the cusparseSpMMAlg_t::CUSPARSE_SPMM_COO_ALG2, cusparseSpMMAlg_t::CUSPARSE_SPMM_CSR_ALG3, and cusparseSpMMAlg_t::CUSPARSE_SPMM_BSR_ALG1 algorithms
compute-sanitizer may report false race conditions for this routine. These are a side effect of internal optimizations and do not affect the correctness of the computation
The routine allows the indices of matA to be unsorted
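
A plausible reason why some algorithms are not bit-wise deterministic is that floating-point addition is not associative, so the order in which contributions to one output element are summed can change the rounded result. The sketch below shows this effect on the CPU with a hypothetical helper; it is an illustration of the arithmetic only and makes no claim about how the library schedules its sums.

```rust
/// Sums the values strictly left to right in single precision.
pub fn sum_in_order(values: &[f32]) -> f32 {
    values.iter().fold(0.0f32, |acc, &v| acc + v)
}
```

With the inputs `[1e8, -1e8, 1.0]` the large terms cancel before the small one is added, whereas with `[1e8, 1.0, -1e8]` the small term is absorbed by rounding first, so the two orderings of the same three values give different sums.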

cusparseSpMM supports the following optimizations:

CUDA graph capture
Hardware Memory Compression

Please visit cuSPARSE Library Samples - cusparseSpMM CSR and cusparseSpMM COO for a code example. For batched computation please visit cusparseSpMM CSR Batched and cusparseSpMM COO Batched.


(2)	BLOCKED-ELLPACK FORMAT

cusparseSpMM supports the following data types for cusparseFormat_t::CUSPARSE_FORMAT_BLOCKED_ELL format and the following GPU architectures for exploiting NVIDIA Tensor Cores:

`A`/`B`	`C`	`computeType`	`opB`	`Compute Capability`
`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_16F`	`N`, `T`	`≥ 70`
`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_32F`	`N`, `T`	`≥ 70`
`cudaDataType_t::CUDA_R_16F`	`cudaDataType_t::CUDA_R_32F`	`cudaDataType_t::CUDA_R_32F`	`N`, `T`	`≥ 70`
`cudaDataType_t::CUDA_R_8I`	`cudaDataType_t::CUDA_R_32I`	`cudaDataType_t::CUDA_R_32I`	`N` column-major, `T` row-major	`≥ 75`
`cudaDataType_t::CUDA_R_16BF`	`cudaDataType_t::CUDA_R_16BF`	`cudaDataType_t::CUDA_R_32F`	`N`, `T`	`≥ 80`
`cudaDataType_t::CUDA_R_16BF`	`cudaDataType_t::CUDA_R_32F`	`cudaDataType_t::CUDA_R_32F`	`N`, `T`	`≥ 80`
`cudaDataType_t::CUDA_R_32F`	`cudaDataType_t::CUDA_R_32F`	`cudaDataType_t::CUDA_R_32F`	`N`, `T`	`≥ 80`
`cudaDataType_t::CUDA_R_64F`	`cudaDataType_t::CUDA_R_64F`	`cudaDataType_t::CUDA_R_64F`	`N`, `T`	`≥ 80`

cusparseSpMM supports the following algorithms with cusparseFormat_t::CUSPARSE_FORMAT_BLOCKED_ELL format:

Algorithm	Notes
`cusparseSpMMAlg_t::CUSPARSE_SPMM_ALG_DEFAULT`	Default algorithm for any sparse matrix format
`cusparseSpMMAlg_t::CUSPARSE_SPMM_BLOCKED_ELL_ALG1`	Default algorithm for Blocked-ELL format

Performance notes:

Blocked-ELL SpMM provides the best performance with power-of-2 block sizes.
Large block sizes (e.g. ≥ 64) provide the best performance.

The function has the following limitations:

The pointer mode must be equal to cusparsePointerMode_t::CUSPARSE_POINTER_MODE_HOST
Only opA == CUSPARSE_OPERATION_NON_TRANSPOSE is supported.
opB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE is not supported.
Only cusparseIndexType_t::CUSPARSE_INDEX_32I is supported.

Please visit cuSPARSE Library Samples - cusparseSpMM Blocked-ELL for a code example.

§Parameters

handle: Handle to the cuSPARSE library context.
opA: Operation op(A).
opB: Operation op(B).
alpha: $\alpha$ scalar used for multiplication of type computeType.
matA: Sparse matrix A.
matB: Dense matrix B.
beta: $\beta$ scalar used for multiplication of type computeType.
matC: Dense matrix C.
computeType: Datatype in which the computation is executed.
alg: Algorithm for the computation.
externalBuffer: Pointer to a workspace buffer of at least bufferSize bytes, where bufferSize is the size returned by cusparseSpMM_bufferSize.
