pub unsafe extern "C" fn cusparseZgtsvInterleavedBatch(
    handle: cusparseHandle_t,
    algo: c_int,
    m: c_int,
    dl: *mut cuDoubleComplex,
    d: *mut cuDoubleComplex,
    du: *mut cuDoubleComplex,
    x: *mut cuDoubleComplex,
    batchCount: c_int,
    pBuffer: *mut c_void,
) -> cusparseStatus_t
This function computes the solution of multiple tridiagonal linear systems A^(i) * y^(i) = x^(i) for i = 0, …, batchCount-1.
The coefficient matrix A of each of these tridiagonal linear systems is defined by three vectors corresponding to its lower (dl), main (d), and upper (du) diagonals; the right-hand sides are stored in x (this routine takes no separate B argument). Note that the solution overwrites the right-hand side in x on exit.
Assuming A is of size m-by-m with 1-based indexing, dl, d, and du are defined by the following formulas:
dl(i) := A(i,i-1) for i=1,2,...,m
The first element of dl is out of bounds (dl(1) := A(1,0)), so dl(1) = 0.
d(i) := A(i,i) for i=1,2,...,m
du(i) := A(i,i+1) for i=1,2,...,m
The last element of du is out of bounds (du(m) := A(m,m+1)), so du(m) = 0.
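The definitions above can be sketched in plain Rust with 0-based indices. This is only an illustration: the helper name tridiagonals and the dense row-major input are assumptions made for the example, and T::default() is assumed to be the zero value (true for primitive numeric types, but something to check for a complex type such as cuDoubleComplex).

```rust
/// Hypothetical helper (not part of the bindings): extracts the three
/// diagonals of a dense, row-major m-by-m matrix `a` using 0-based indices:
///   dl[i] = A(i, i-1), d[i] = A(i, i), du[i] = A(i, i+1).
/// The out-of-bounds entries dl[0] and du[m-1] are left at T::default(),
/// which is assumed to be zero.
fn tridiagonals<T: Copy + Default>(a: &[T], m: usize) -> (Vec<T>, Vec<T>, Vec<T>) {
    assert_eq!(a.len(), m * m, "a must hold m*m elements");
    let mut dl = vec![T::default(); m];
    let mut d = vec![T::default(); m];
    let mut du = vec![T::default(); m];
    for i in 0..m {
        if i > 0 {
            dl[i] = a[i * m + i - 1]; // sub-diagonal entry of row i
        }
        d[i] = a[i * m + i]; // main diagonal entry of row i
        if i + 1 < m {
            du[i] = a[i * m + i + 1]; // super-diagonal entry of row i
        }
    }
    (dl, d, du)
}
```

All three vectors have length m, so the padding zeros in dl and du keep element i of every diagonal aligned with row i of the matrix.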
The data layout differs from that of gtsvStridedBatch, which stores all matrices one after another. Instead, gtsvInterleavedBatch stores the same element of different matrices contiguously. If dl is regarded as a 2-D array of size m-by-batchCount, dl(:,j) stores the j-th matrix. gtsvStridedBatch uses column-major order, while gtsvInterleavedBatch uses row-major order.
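As a rough host-side illustration of the two layouts, the sketch below repacks one diagonal (or the right-hand side) from the aggregated layout into the interleaved one. The function name to_interleaved is hypothetical, and the sketch assumes the aggregated data is tightly packed with a stride of exactly m elements between systems.

```rust
/// Hypothetical helper: converts `batch_count` vectors of length `m` from
/// the aggregated layout (system j occupies src[j*m .. (j+1)*m]) to the
/// interleaved layout (element i of system j is at dst[i*batch_count + j]).
/// Assumes the aggregated data is tightly packed (stride == m).
fn to_interleaved<T: Copy>(src: &[T], m: usize, batch_count: usize) -> Vec<T> {
    assert_eq!(src.len(), m * batch_count, "src must hold m*batch_count elements");
    let mut dst = Vec::with_capacity(src.len());
    for i in 0..m {
        for j in 0..batch_count {
            // Element i of every system is stored side by side.
            dst.push(src[j * m + i]);
        }
    }
    dst
}
```

For example, two systems with m = 3 stored as [1, 2, 3, 10, 20, 30] become [1, 10, 2, 20, 3, 30]. The same repacking applies to dl, d, du, and x.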
The routine provides three algorithms, selected by the parameter algo. The first is cuThomas, provided by the Barcelona Supercomputing Center. The second is LU with partial pivoting, and the third is QR. In terms of stability, cuThomas is not numerically stable because it does not pivot, whereas LU with partial pivoting and QR are stable. In terms of performance, LU with partial pivoting and QR are about 10% to 20% slower than cuThomas.
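To make the stability remark concrete, here is a textbook Thomas algorithm for a single real system on the CPU. It is a sketch for illustration only and not the cuThomas implementation; the function name thomas_solve and the use of f64 instead of a complex type are assumptions of the example. Because it never swaps rows, it gives up as soon as a pivot is exactly zero, even when the matrix itself is nonsingular.

```rust
/// Textbook Thomas algorithm (no pivoting) for one real tridiagonal system.
/// `dl`, `d`, `du` follow the convention above (dl[0] and du[m-1] unused).
/// On success `x` (the right-hand side) is overwritten with the solution.
/// Returns false if a pivot is exactly zero; `x` may then be partially
/// modified.
fn thomas_solve(dl: &[f64], d: &[f64], du: &[f64], x: &mut [f64]) -> bool {
    let m = d.len();
    assert!(dl.len() == m && du.len() == m && x.len() == m);
    if m == 0 {
        return true;
    }
    if d[0] == 0.0 {
        return false; // no pivoting: cannot proceed past a zero pivot
    }
    let mut c = vec![0.0; m]; // modified upper diagonal
    c[0] = du[0] / d[0];
    x[0] /= d[0];
    // Forward elimination.
    for i in 1..m {
        let pivot = d[i] - dl[i] * c[i - 1];
        if pivot == 0.0 {
            return false;
        }
        c[i] = du[i] / pivot;
        x[i] = (x[i] - dl[i] * x[i - 1]) / pivot;
    }
    // Back substitution.
    for i in (0..m - 1).rev() {
        x[i] -= c[i] * x[i + 1];
    }
    true
}
```

A 2-by-2 system with d = [0, 0], dl = [0, 1], du = [1, 0] is nonsingular yet fails here on the first pivot. Small but nonzero pivots do not fail outright; they amplify rounding error instead, which is the instability the pivoting variants avoid.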
This function requires a buffer whose size is returned by gtsvInterleavedBatch_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If it is not, cusparseStatus_t::CUSPARSE_STATUS_INVALID_VALUE is returned.
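The alignment requirement can be checked on the host before making the call, which is mainly useful when pBuffer is carved out of a larger allocation. The sketch below uses hypothetical helper names (is_pbuffer_aligned, round_up_to_alignment) that are not part of the bindings; it only inspects the numeric value of the address and never dereferences the pointer.

```rust
/// Alignment, in bytes, that the address of `pBuffer` must satisfy.
const PBUFFER_ALIGNMENT: usize = 128;

/// Hypothetical helper: true if the address of `p` is a multiple of
/// 128 bytes. Only the address is inspected; `p` is never dereferenced,
/// so a device pointer can be passed.
fn is_pbuffer_aligned<T>(p: *const T) -> bool {
    (p as usize) % PBUFFER_ALIGNMENT == 0
}

/// Hypothetical helper: rounds a byte offset up to the next multiple of
/// 128, e.g. to place a workspace inside a larger pooled allocation.
/// Assumes `offset + 127` does not overflow usize.
fn round_up_to_alignment(offset: usize) -> usize {
    (offset + PBUFFER_ALIGNMENT - 1) / PBUFFER_ALIGNMENT * PBUFFER_ALIGNMENT
}
```

If the pooled allocation itself starts at a 128-byte-aligned address, any offset produced by round_up_to_alignment keeps the sub-buffer aligned as well.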
If the data is prepared in the aggregate format, cublasXgeam can be used to convert it to the interleaved format. However, such a transformation takes time comparable to the solver itself. For best performance, the user should prepare the data in the interleaved format directly.
- This function requires temporary extra storage that is allocated internally.
- The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.
- The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.