Module demo_efficiency


§Efficiency Demonstration

Notes on efficiency when programming (general advice, not specific to blas-array2):

  • Row/col-major layout. One of the two dimensions of a matrix should be contiguous (the stride of that dimension is one).

    This note may seem like common sense, but it becomes useful when applying matrix multiplication to sub-tensors, or to tensor-product problems that can be reduced to matrix multiplication. For example, $A_{ijkl} \rightarrow_{ik} A_{jl}$ is row-major and $A_{ijkl} \rightarrow_{jl} A_{ik}$ is col-major, but $A_{ijkl} \rightarrow_{il} A_{jk}$ is neither row-major nor col-major. So it is important to avoid such cases when deriving formulas involving tensor products.

  • Avoid memory allocation and outplace add-assign.

    This does not affect efficiency a lot, but it can be avoided by calling the API correctly. To be more specific, it is better to call

    gemm(A, B, out = C, beta = 1)

    instead of the following form, which is more intuitive but involves an unnecessary allocation and add-assign of C:

    C += gemm(A, B)
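For the layout note in the first bullet, the check reduces to looking at the two strides of the sub-matrix. The following is a minimal, library-free sketch, not part of blas-array2: the names `Layout`, `c_strides`, `f_strides` and `classify` are hypothetical, the shape is arbitrary, and it assumes that the subscripts under the arrow are the indices held fixed, with the tensor stored contiguously in C order for the row-major case and in Fortran order for the col-major case.

```rust
/// Layout of a 2-D view, judged only by its two strides (in elements).
#[derive(Debug, PartialEq)]
enum Layout {
    RowMajor,
    ColMajor,
    Neither,
}

/// Strides of a contiguous tensor in C order (last index varies fastest).
fn c_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for d in (0..shape.len().saturating_sub(1)).rev() {
        strides[d] = strides[d + 1] * shape[d + 1];
    }
    strides
}

/// Strides of a contiguous tensor in Fortran order (first index varies fastest).
fn f_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for d in 1..shape.len() {
        strides[d] = strides[d - 1] * shape[d - 1];
    }
    strides
}

/// Classify the matrix obtained by keeping axes `row` and `col` free
/// and fixing all other axes.
fn classify(strides: &[usize], row: usize, col: usize) -> Layout {
    match (strides[row], strides[col]) {
        (_, 1) => Layout::RowMajor, // column index is contiguous
        (1, _) => Layout::ColMajor, // row index is contiguous
        _ => Layout::Neither,       // no unit stride: not directly usable as a BLAS matrix
    }
}

fn main() {
    let shape = [2, 3, 4, 5]; // extents of i, j, k, l (arbitrary)
    let c = c_strides(&shape);
    let f = f_strides(&shape);

    // Fix (i, k), keep (j, l): row-major in a C-order tensor.
    println!("A_jl: {:?}", classify(&c, 1, 3));
    // Fix (j, l), keep (i, k): col-major in a Fortran-order tensor.
    println!("A_ik: {:?}", classify(&f, 0, 2));
    // Fix (i, l), keep (j, k): neither, in both storage orders.
    println!("A_jk: {:?} / {:?}", classify(&c, 1, 2), classify(&f, 1, 2));
}
```

A sub-matrix classified as `Neither` would have to be copied into a contiguous buffer before being passed to a BLAS routine, which is the cost the note suggests avoiding at the formula-derivation stage.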

We benchmark the efficiency of blas-array2, ndarray and faer on an AMD 7945HX. The estimated theoretical performance is 1126.4 GFLOPS.

The related code is in the demo-efficiency directory of the repository.

The problem to be benchmarked is

$$ C_{ij} = \sum_{mk} A_{mik} B_{mkj} \quad (\texttt{DGEMM}, 10 \times 2560 \times 2560) $$

and

$$ C_{ij} = \sum_{mk} A_{mik} A_{mkj} \quad (\texttt{DSYRK}, 10 \times 2560 \times 2560) $$
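To relate a measured wall time to a GFLOPS figure for the DGEMM case, the arithmetic can be sketched as follows. This assumes the common convention of counting one multiply and one add per inner-loop step (2·m·n·k per product); the page does not state which convention the benchmark uses, so the function names and the timing value below are purely illustrative, and no claim is made about how the DSYRK figures are counted.

```rust
/// Floating-point operations for `batch` products of an (m x k) by a (k x n) matrix,
/// counting one multiply and one add per inner-loop step (an assumed convention).
fn gemm_flops(batch: u64, m: u64, n: u64, k: u64) -> u64 {
    2 * batch * m * n * k
}

/// Convert an operation count and a wall time in seconds to GFLOPS.
fn gflops(flops: u64, seconds: f64) -> f64 {
    flops as f64 / seconds / 1e9
}

fn main() {
    // Ten 2560 x 2560 by 2560 x 2560 products, as in the DGEMM problem above.
    let flops = gemm_flops(10, 2560, 2560, 2560);
    println!("{flops} floating-point operations");

    // Hypothetical wall time, not a measured value.
    let seconds = 0.4;
    println!("{:.1} GFLOPS", gflops(flops, seconds));
}
```

Under this convention the DGEMM problem amounts to roughly 335.5 billion floating-point operations per run.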

The main purpose of this benchmark is to show that the overhead of blas-array2 itself (manipulating and generating ndarray views, flags, and strides) is small. Efficiency is mostly determined by the BLAS backend rather than by the representation of the data.

§DGEMM

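| crate       | API call            | backend  | GFLOPS |
|-------------|---------------------|----------|--------|
| blas-array2 | inplace             | AOCL     | 964.1  |
| blas-array2 | inplace             | OpenBLAS | 947.6  |
| blas-array2 | inplace             | MKL      | 768.0  |
| blas-array2 | outplace add-assign | AOCL     | 790.3  |
| blas-array2 | outplace add-assign | OpenBLAS | 641.3  |
| blas-array2 | outplace add-assign | MKL      | 630.3  |
| ndarray     | outplace add-assign | AOCL     | 780.8  |
| ndarray     | outplace add-assign | OpenBLAS | 636.5  |
| ndarray     | outplace add-assign | MKL      | 626.6  |
| faer        | inplace             | faer     | 844.7  |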
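The "inplace" and "outplace add-assign" call styles correspond to the `gemm(A, B, out = C, beta = 1)` and `C += gemm(A, B)` forms in the notes at the top of this page. As a library-free illustration of the difference (a naive triple loop standing in for the BLAS call, with hypothetical function names, not the blas-array2 API), the in-place form updates C in a single pass, while the out-of-place form allocates a temporary and then makes a second pass over C:

```rust
/// Naive row-major product accumulated in place: C = A * B + beta * C.
/// Stands in for a BLAS call that takes an output argument; nothing is allocated.
fn gemm_inplace(a: &[f64], b: &[f64], c: &mut [f64], n: usize, beta: f64) {
    for i in 0..n {
        for j in 0..n {
            let mut acc = 0.0;
            for k in 0..n {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc + beta * c[i * n + j];
        }
    }
}

/// Out-of-place product: allocates and returns a fresh n x n buffer.
fn gemm_outplace(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut tmp = vec![0.0; n * n];
    gemm_inplace(a, b, &mut tmp, n, 0.0);
    tmp
}

fn main() {
    let n = 2;
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];

    // In-place: C is updated directly, with beta = 1 keeping its old contents.
    let mut c1 = vec![1.0; n * n];
    gemm_inplace(&a, &b, &mut c1, n, 1.0);

    // Out-of-place add-assign: a temporary is allocated, then added to C in a second pass.
    let mut c2 = vec![1.0; n * n];
    let tmp = gemm_outplace(&a, &b, n);
    for (dst, src) in c2.iter_mut().zip(&tmp) {
        *dst += *src;
    }

    // Both forms give the same result; only the memory traffic differs.
    assert_eq!(c1, c2);
    println!("{:?}", c1);
}
```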

§DSYRK

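| crate       | API call            | backend         | GFLOPS |
|-------------|---------------------|-----------------|--------|
| blas-array2 | inplace             | AOCL            | 735.6  |
| blas-array2 | inplace             | OpenBLAS        | 669.2  |
| blas-array2 | inplace             | MKL             | 594.1  |
| blas-array2 | outplace add-assign | AOCL            | 541.5  |
| blas-array2 | outplace add-assign | OpenBLAS        | 456.0  |
| blas-array2 | outplace add-assign | MKL             | 487.9  |
| ndarray     | outplace add-assign | AOCL (GEMM)     | 408.3  |
| ndarray     | outplace add-assign | OpenBLAS (GEMM) | 324.9  |
| ndarray     | outplace add-assign | MKL (GEMM)      | 340.4  |
| faer        | inplace             | faer (GEMMT)    | 670.0  |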

§Version info

  • AOCL 4.2
  • OpenBLAS 0.3.27
  • MKL 2024.0