Crate tblis


TBLIS wrapper and several minimal implementations.

§API Documentation Summary

§About safety

Most functions in this crate are marked as unsafe because they do not check the validity of tensor data and mutability.
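
Because validity is left to the caller, it can help to check a layout before handing data to an unsafe function. The following is a minimal sketch in plain Rust and is not part of this crate: `layout_fits` is a hypothetical helper, and it assumes strides are non-negative and counted in elements. It only covers bounds; it says nothing about aliasing or mutable access.

```rust
/// Hypothetical helper (not part of crate `tblis`): returns true when every
/// element addressed by `shape` and `strides` (in units of elements, assumed
/// non-negative) lies inside a buffer holding `len` elements.
fn layout_fits(len: usize, shape: &[usize], strides: &[usize]) -> bool {
    if shape.len() != strides.len() {
        return false;
    }
    // A tensor with a zero-length dimension addresses no element at all.
    if shape.iter().any(|&n| n == 0) {
        return true;
    }
    // The largest offset is reached at the last index of every dimension.
    let mut max_offset: usize = 0;
    for (&n, &s) in shape.iter().zip(strides) {
        match (n - 1).checked_mul(s).and_then(|x| max_offset.checked_add(x)) {
            Some(v) => max_offset = v,
            None => return false, // overflow: cannot fit in any buffer
        }
    }
    max_offset < len
}
```

For a row-major 2 x 3 matrix (strides 3 and 1) the largest offset is 5, so a buffer of 6 elements passes and a buffer of 5 does not.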

§About function tblis_einsum

This is the most important function of this crate.

For the parameters of this function, see also the documentation of function contract_path in crate opt-einsum-path.

§Most important functions, structs and traits

| Item | Description |
|---|---|
| fn tblis_einsum | Einstein summation |
| fn tblis_einsum_ndarray | Einstein summation with ndarray |
| struct TblisTensor | Tensor struct of this crate |
| trait TblisFloatAPI | Float trait for TBLIS operations (f32, f64, c32, c64) |

§Functions

| Item | Description |
|---|---|
| tblis_einsum | (high-level) Einstein summation |
| tblis_einsum_ndarray | (high-level) Einstein summation with ndarray |
| tblis_einsum_f | (high-level) Einstein summation (fallible) |
| tblis_einsum_ndarray_f | (high-level) Einstein summation with ndarray (fallible) |
| tblis_tensor_add | (1t-level) $B = \alpha A + \beta B$ |
| tblis_tensor_scale | (1t-level) $A = \alpha A$ |
| tblis_tensor_set | (1t-level) $A = \alpha$ |
| tblis_tensor_shift | (1t-level) $A = \alpha + \beta A$ |
| tblis_tensor_reduce | (1t-level) $\gamma = \mathrm{op}(A)$ |
| tblis_tensor_dot | (1t-level) $\gamma = A B$ |
| tblis_tensor_mult | (3t-level) $C = \alpha A B + \beta C$ |
| tblis_get_num_threads | Get the number of threads used by TBLIS |
| tblis_set_num_threads | Set the number of threads used by TBLIS |
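
To make the formulas for the 1t-level operations concrete, here is a minimal sketch of their element-wise meaning on flat slices of equal length. These helpers are hypothetical reference code in plain Rust, not the functions exported by this crate, and they ignore index permutation, strides and conjugation. Reading $\gamma = A B$ as a sum over all elements is an assumption of this sketch.

```rust
// Reference semantics only; the `*_ref` names are hypothetical and are not
// the functions exported by crate `tblis`.

/// B = alpha * A + beta * B
fn add_ref(alpha: f64, a: &[f64], beta: f64, b: &mut [f64]) {
    assert_eq!(a.len(), b.len());
    for (bi, &ai) in b.iter_mut().zip(a) {
        *bi = alpha * ai + beta * *bi;
    }
}

/// A = alpha * A
fn scale_ref(alpha: f64, a: &mut [f64]) {
    for ai in a.iter_mut() {
        *ai *= alpha;
    }
}

/// A = alpha + beta * A
fn shift_ref(alpha: f64, beta: f64, a: &mut [f64]) {
    for ai in a.iter_mut() {
        *ai = alpha + beta * *ai;
    }
}

/// gamma = sum_i A_i * B_i
fn dot_ref(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Reference loops like these are mainly useful for checking results of the wrapped functions on small inputs.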

§Traits

| Item | Description |
|---|---|
| TblisFloatAPI | Float trait for TBLIS operations (f32, f64, c32, c64) |
| ToTblisTensor | Converts a tensor view to a TBLIS (mutable) tensor. Method function: ToTblisTensor::to_tblis_tensor |
| ArrayFromTblisTensor | Converts a TBLIS tensor to an ndarray object. Method function: ArrayFromTblisTensor::into_array. Only available with crate feature ndarray |

§Configurations

| Item | Description |
|---|---|
| TblisReduceOp | Reduction operations for tblis_tensor_reduce. Supported operations: sum, sumabs/norm1, max, maxabs/norminf, min, minabs, norm2 |
| TblisZeroCfg, TblisZeroCfgBuilder | Zero parameter configuration for TBLIS operations. Used in tblis_tensor_set. Aliased as TblisSetCfg |
| TblisUniCfg, TblisUniCfgBuilder | One parameter configuration for TBLIS operations. By default $\alpha = 1$, no conjugate. Used in tblis_tensor_scale, tblis_tensor_shift, tblis_tensor_reduce. Aliased as TblisScaleCfg, TblisShiftCfg, TblisReduceCfg |
| TblisBiCfg, TblisBiCfgBuilder | Two parameter configuration for TBLIS operations. By default $\alpha = 1, \beta = 1$, no conjugate. Used in tblis_tensor_add, tblis_tensor_dot. Aliased as TblisAddCfg, TblisDotCfg |
| TblisTriCfg, TblisTriCfgBuilder | Three parameter configuration for TBLIS operations. By default $\alpha = 1, \beta = 0$, no conjugate. Used in tblis_tensor_mult. Aliased as TblisMultCfg |

§TBLIS Wrapper in Rust

This crate contains a TBLIS wrapper and several minimal implementations.

TBLIS (Tensor BLIS, The Tensor-Based Library Instantiation Software) can perform various tensor operations (multiplication, addition, reduction, transposition, etc.) efficiently on a single-node CPU. This library can serve as the underlying driver for einsum (Einstein summation).

The TBLIS (C++) source code is available on GitHub from the Devin Matthews research group.

Please note that to use crate tblis as a wrapper, you also need to provide the shared library libtblis.so, or compile it through crate tblis-src with cargo feature build_from_source. Refer to the section Installation for more information. If you compile libtblis.so with CMake, make sure to set -DCMAKE_BUILD_TYPE=Release.

This crate is not an official wrapper project. It was originally intended to serve the Rust tensor toolkit RSTSR and the Rust electronic structure toolkit REST in the research groups of Xin Xu (徐昕) and Igor Ying Zhang (张颖).

| Resources | Badges | API Document |
|---|---|---|
| Crate for Wrapper (tblis) | Crate | API Documentation |
| Crate for FFI (tblis-ffi) | Crate | API Documentation |
| Crate for Source (tblis-src) | Crate | API Documentation |
| FFI Binding | 9b95712 after v1.3.0 | |

§Example

The following example performs the contraction:

$$ G_{pqrs} = \sum_{\mu \nu \kappa \lambda} C_{\mu p} C_{\nu q} E_{\mu \nu \kappa \lambda} C_{\kappa r} C_{\lambda s} $$

This tensor contraction is used in electronic structure calculations: it transforms the electronic integrals from the atomic orbital basis ($E_{\mu \nu \kappa \lambda}$) to the molecular orbital basis ($G_{pqrs}$).

To run the following code, you may need to

  • activate crate feature ndarray to enable conversion between ndarray::{Array, ArrayView, ArrayViewMut} and tblis::TblisTensor;
  • properly link libtblis.so in your project (see also crates tblis-ffi and tblis-src for more information).

The following code snippet performs this contraction.

// Must declare crate `tblis-src` if you want to link tblis dynamically.
// You can also call the following code in `build.rs`, instead of using crate `tblis-src`:
//     println!("cargo:rustc-link-lib=tblis");
extern crate tblis_src;

use ndarray::prelude::*;
use tblis::prelude::*;

// dummy setting of matrix C and tensor E
let (nao, nmo): (usize, usize) = (3, 2);
let vec_c: Vec<f64> = (0..nao * nmo).map(|x| x as f64).collect();
let vec_e: Vec<f64> = (0..nao * nao * nao * nao).map(|x| x as f64).collect();

let arr_c = ArrayView2::from_shape((nao, nmo), &vec_c).unwrap();
let arr_e = ArrayView4::from_shape((nao, nao, nao, nao), &vec_e).unwrap();

/// # Parameters
/// - `arr_c`: coefficient matrix $C_{\mu p}$
/// - `arr_e`: electronic integral $E_{\mu \nu \kappa \lambda}$ (in atomic orbital basis)
///
/// # Returns
/// - `arr_g`: electronic integral $G_{pqrs}$ (in molecular orbital basis)
fn ao2mo(arr_c: ArrayView2<f64>, arr_e: ArrayView4<f64>) -> Array4<f64> {
    let view_c = arr_c.view().into_dyn();
    let view_e = arr_e.view().into_dyn();
    let operands = [&view_c, &view_c, &view_e, &view_c, &view_c];
    let arr_g = tblis_einsum_ndarray(
        "μi,νa,μνκλ,κj,λb->iajb", // einsum subscripts
        &operands,                // tensors to be contracted
        "optimal",                // contraction strategy (see crate opt-einsum-path)
        None,                     // memory limit (None means no limit, see crate opt-einsum-path)
        true,                     // row-major (true) or col-major (false)
        None,                     // pre-allocated output tensor (None to allocate internally)
    )
    .unwrap();

    // transform to 4-dimensional array
    arr_g.into_dimensionality().unwrap()
}

let arr_g = ao2mo(arr_c, arr_e);
println!("{:?}", arr_g);
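
For checking the result on small inputs, the same contraction can be written as plain nested loops that follow the equation term by term. This is a sketch in standard Rust only; `ao2mo_naive` is a hypothetical name, and the row-major flat layout is an assumption chosen to match the example above.

```rust
/// Naive reference for
///     G[p,q,r,s] = sum_{mu,nu,ka,la} C[mu,p] C[nu,q] E[mu,nu,ka,la] C[ka,r] C[la,s].
/// `c` is row-major with shape (nao, nmo); `e` is row-major with shape
/// (nao, nao, nao, nao). The result is row-major with shape (nmo, nmo, nmo, nmo).
/// The cost grows as nmo^4 * nao^4, so this is only meant for small test cases.
fn ao2mo_naive(c: &[f64], e: &[f64], nao: usize, nmo: usize) -> Vec<f64> {
    assert_eq!(c.len(), nao * nmo);
    assert_eq!(e.len(), nao * nao * nao * nao);
    let mut g = vec![0.0; nmo * nmo * nmo * nmo];
    for p in 0..nmo {
        for q in 0..nmo {
            for r in 0..nmo {
                for s in 0..nmo {
                    let mut acc = 0.0;
                    for mu in 0..nao {
                        for nu in 0..nao {
                            for ka in 0..nao {
                                for la in 0..nao {
                                    let e_val = e[((mu * nao + nu) * nao + ka) * nao + la];
                                    acc += c[mu * nmo + p] * c[nu * nmo + q] * e_val
                                        * c[ka * nmo + r] * c[la * nmo + s];
                                }
                            }
                        }
                    }
                    g[((p * nmo + q) * nmo + r) * nmo + s] = acc;
                }
            }
        }
    }
    g
}
```

With an identity coefficient matrix the output equals the input integrals, which gives a quick sanity check before comparing against the einsum result.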

§Cargo features

Optional features:

  • ndarray: Supports conversion from ndarray objects (Array, ArrayView, ArrayViewMut) to TblisTensor, and conversion from TblisTensor to an ndarray object (ArrayD).
  • dynamic_loading: Supports dynamic loading (for dependency crate tblis-ffi).

§Installation

If you wish to use dynamic loading (instead of dynamic or static linking), refer to the subsection “Dynamic loading” below.

You can either

  • link library tblis manually with pre-built libtblis.so
  • use cargo crate tblis-src with pre-built libtblis.so
  • use cargo crate tblis-src and build-from-source

Refer to the TBLIS repository for instructions on installing TBLIS. If you compile libtblis.so with CMake, make sure to set -DCMAKE_BUILD_TYPE=Release.

If you link libtblis.so manually, you can use cargo crate tblis or tblis-ffi directly, without tblis-src.

Dynamic linking of libtblis.so is recommended. Make sure the directory containing the library is listed in the environment variable LD_LIBRARY_PATH, then add the following to build.rs:

// build.rs
println!("cargo:rustc-link-lib=tblis");
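
If the library is not in a default linker search directory, a build script can also emit a `cargo:rustc-link-search` directive. The helper below is a sketch under assumptions: `TBLIS_LIB_DIR` is a hypothetical variable name chosen for this example and is not read by crate tblis or tblis-src.

```rust
// Sketch of a build-script helper. `TBLIS_LIB_DIR` (see the comment below) is
// a hypothetical variable name; crates `tblis` and `tblis-src` do not read it.
fn tblis_link_directives(lib_dir: Option<&str>) -> Vec<String> {
    let mut lines = Vec::new();
    if let Some(dir) = lib_dir {
        // Tell rustc where to search for the native library at link time.
        lines.push(format!("cargo:rustc-link-search=native={dir}"));
    }
    // Same directive as in the snippet above: link against libtblis.
    lines.push("cargo:rustc-link-lib=tblis".to_string());
    lines
}

// In build.rs, `main` would print the directives:
//     fn main() {
//         let dir = std::env::var("TBLIS_LIB_DIR").ok();
//         for line in tblis_link_directives(dir.as_deref()) {
//             println!("{line}");
//         }
//     }
```

The search directive only affects linking; at run time the dynamic loader still has to find libtblis.so, for example through LD_LIBRARY_PATH.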

§Use cargo crate tblis-src with pre-built libtblis.so

With this approach, add tblis-src as a dependency in Cargo.toml:

tblis-src = { version = "0.1" }

and then declare this crate in your lib.rs or main.rs:

extern crate tblis_src;

§Use cargo crate tblis-src and build-from-source

You can use cargo feature build_from_source to automatically build TBLIS with the default configuration.

Cargo crate tblis-src has the following cargo features:

  • build_from_source: This uses CMake (cmake > 3.23, C++20 standard) and the code from the git submodule to compile TBLIS. This option is developer-friendly (no other configuration is needed for the program to compile and run with cargo), but it does not support customized compilation.

    CMake configurable variables (can be defined as environment variables):

    • TBLIS_SRC: Git repository source directory or URL. All git submodules (marray, blis, tci) should be properly downloaded.
    • TBLIS_VER: Git repository version (branch or tag). Defaults to develop.
  • static: This links the static library instead of the dynamic one. Please note that static linking may require linking additional dynamic libraries, which the developer must configure manually in build.rs or through the environment variable RUSTFLAGS. Resolving symbols can be difficult with static linking, so we recommend dynamic linking in most cases.

§Dynamic loading

This crate supports dynamic loading.

If you want to use dynamic loading, enable cargo feature dynamic_loading when building with cargo.

Dynamic loading tries to find a suitable library when your program initializes.

  • This crate automatically detects a suitable library if it is located in a directory listed in the environment variable LD_LIBRARY_PATH (Linux), DYLD_LIBRARY_PATH (macOS), or PATH (Windows).
  • To override the library to be loaded, set the shell environment variable RSTSR_DYLOAD_TBLIS to the path of the dynamic library.
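
The lookup order described above can be sketched with the standard library alone. This is illustrative and hypothetical: the crate's actual implementation may differ, `candidate_libraries` is not a function of this crate, and the library file name is passed in by the caller because it is an assumption (for example `libtblis.so` on Linux).

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical sketch of the lookup order; not the crate's implementation.
/// `override_path` plays the role of `RSTSR_DYLOAD_TBLIS`, `search_path` the
/// role of `LD_LIBRARY_PATH` (or its macOS / Windows counterpart), and
/// `file_name` is an assumed library file name such as "libtblis.so".
fn candidate_libraries(
    override_path: Option<&str>,
    search_path: Option<&str>,
    file_name: &str,
) -> Vec<PathBuf> {
    // An explicit override takes precedence over any search.
    if let Some(p) = override_path {
        return vec![PathBuf::from(p)];
    }
    // Otherwise try the file name in each directory of the search path.
    match search_path {
        Some(paths) => env::split_paths(paths)
            .map(|dir| dir.join(file_name))
            .collect(),
        None => Vec::new(),
    }
}
```

A caller would typically pass `std::env::var("RSTSR_DYLOAD_TBLIS").ok().as_deref()` as the first argument and the platform's search-path variable as the second, then try the candidates in order.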

§Why TBLIS?

TBLIS can perform many types of einsum (tensor contraction), as well as tensor transposition, addition and reduction.

Some einsum tasks can be transformed into matrix multiplication (GEMM) tasks. For those tasks, TBLIS may not be the best choice (this depends on the efficiency of BLIS and some other factors).
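
As an illustration of such a transformation, consider the contraction `axyz, xyzb -> ab` on contiguous row-major tensors with every dimension of length n. Fusing (x, y, z) into one index of length n^3 turns it into an n x n^3 by n^3 x n matrix product on the same memory. The sketch below uses plain loops and hypothetical function names, and does not call BLAS or TBLIS.

```rust
/// Direct evaluation of C[a,b] = sum_{x,y,z} A[a,x,y,z] * B[x,y,z,b]
/// for row-major contiguous tensors with every dimension of length `n`.
fn contract_loops(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut c = vec![0.0; n * n];
    for ia in 0..n {
        for ib in 0..n {
            for x in 0..n {
                for y in 0..n {
                    for z in 0..n {
                        c[ia * n + ib] += a[((ia * n + x) * n + y) * n + z]
                            * b[((x * n + y) * n + z) * n + ib];
                    }
                }
            }
        }
    }
    c
}

/// Plain row-major matrix product C (m x n) = A (m x k) * B (k x n).
fn gemm(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for l in 0..k {
            for j in 0..n {
                c[i * n + j] += a[i * k + l] * b[l * n + j];
            }
        }
    }
    c
}

/// The same contraction, viewing (x, y, z) as one fused index of length n^3.
fn contract_as_gemm(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    gemm(a, b, n, n * n * n, n)
}
```

No data is moved in `contract_as_gemm`; the fused index is only a different way of reading the same buffers. When the shared indices appear in a different order in the two operands, this direct reinterpretation is no longer possible without a transposition first.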

However, TBLIS can be extremely useful if

  • The contraction is too complicated for ordinary GEMM or batched GEMM to handle;
  • The layout of your tensor is strided (not contiguous) in memory.

As an example, here are some benchmarks on my personal computer (AMD Ryzen 7945HX, estimated FP64 1.1 TFLOP/sec with 16 cores). Each input tensor has shape (96, 96, 96, 96). For the strided case, the stride of each dimension is 128.
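
One way to read the strided setting, stated here as an assumption, is that every dimension of length 96 is padded to 128 entries in memory, so the element strides are (128^3, 128^2, 128, 1) instead of (96^3, 96^2, 96, 1). The sketch below computes such strides and the resulting element offset; both function names are hypothetical.

```rust
/// Row-major element strides for a tensor whose dimensions are each padded to
/// `padded` entries in memory (padded == extent gives a contiguous tensor).
fn padded_row_major_strides(ndim: usize, padded: usize) -> Vec<usize> {
    let mut strides = vec![1; ndim];
    for k in (0..ndim.saturating_sub(1)).rev() {
        strides[k] = strides[k + 1] * padded;
    }
    strides
}

/// Offset (in elements) of the entry at `index` for the given strides.
fn offset(index: &[usize], strides: &[usize]) -> usize {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}
```

Under this reading, neighbouring elements along the last dimension stay adjacent, but the leading dimensions can no longer be fused into one long index, which is what a reshape-to-GEMM approach relies on.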

§Benchmark of contiguous case

| case | description | FLOPs | TBLIS | NumPy (MKL) | PyTorch (CPU) |
|---|---|---|---|---|---|
| abxy, xycd -> abcd | naive GEMM | $2 n^6$ | 1.90 sec (767 GFLOP/sec) | 2.13 sec (683 GFLOP/sec) | 1.98 sec (736 GFLOP/sec) |
| axyz, xyzb -> ab | naive GEMM | $2 n^5$ | 132.3 msec (112 GFLOP/sec) | 63.1 msec (241 GFLOP/sec) | 63.4 msec (240 GFLOP/sec) |
| axyz, bxyz -> ab | naive SYRK | $n^5$ | 96.9 msec (77 GFLOP/sec) | 293.2 msec (26 GFLOP/sec) | 37.4 msec (203 GFLOP/sec) |
| axyz, ybzx -> ab | complicated GEMM | $2 n^5$ | 120.7 msec (126 GFLOP/sec) | 207.7 msec (73 GFLOP/sec) | 211.1 msec (72 GFLOP/sec) |
| axby, yacx -> abc | batched complicated GEMM | $2 n^5$ | 124.1 msec (122 GFLOP/sec) | 29.7 sec (0.5 GFLOP/sec) | 179.2 msec (85 GFLOP/sec) |
| xpay, aybx -> ab | trace then complicated GEMM | $2 n^4$ | 36.4 msec (4.3 GFLOP/sec) | 33.9 sec (0.0 GFLOP/sec) | 106.9 msec (1.5 GFLOP/sec) |

§Benchmark of strided case

| case | description | FLOPs | TBLIS | NumPy (MKL) | PyTorch (CPU) |
|---|---|---|---|---|---|
| abxy, xycd -> abcd | naive GEMM | $2 n^6$ | 2.02 sec (722 GFLOP/sec) | 7.30 sec (200 GFLOP/sec) | 2.10 sec (694 GFLOP/sec) |
| axyz, xyzb -> ab | naive GEMM | $2 n^5$ | 133.1 msec (114 GFLOP/sec) | 776.8 msec (20 GFLOP/sec) | 204.4 msec (74 GFLOP/sec) |
| axyz, bxyz -> ab | naive SYRK | $n^5$ | 98.3 msec (77 GFLOP/sec) | 455.5 msec (17 GFLOP/sec) | 211.4 msec (36 GFLOP/sec) |
| axyz, ybzx -> ab | complicated GEMM | $2 n^5$ | 144.7 msec (105 GFLOP/sec) | 725.0 msec (21 GFLOP/sec) | 406.7 msec (37 GFLOP/sec) |
| axby, yacx -> abc | batched complicated GEMM | $2 n^5$ | 142.7 msec (106 GFLOP/sec) | 27.1 sec (0.6 GFLOP/sec) | 263.6 msec (58 GFLOP/sec) |
| xpay, aybx -> ab | trace then complicated GEMM | $2 n^4$ | 232.3 msec (0.7 GFLOP/sec) | 248.5 sec (0.0 GFLOP/sec) | 147.3 msec (1.1 GFLOP/sec) |

§Citation

TBLIS for Rust is not the original work of TBLIS.

Please cite TBLIS as:

Matthews, D. A. High-Performance Tensor Contraction without Transposition. SIAM J. Sci. Comput. 2018, 40 (1), C1–C24. DOI: 10.1137/16M108968X. arXiv: 1607.00291.

Related work is:

Huang, J.; Matthews, D. A.; van de Geijn, R. A. Strassen’s Algorithm for Tensor Contraction. SIAM J. Sci. Comput. 2018, 40 (3), C305–C326. DOI: 10.1137/17M1135578. arXiv: 1704.03092.

§Miscellaneous

Integration test cases come from the Python libraries pytblis and opt_einsum.

§Modules

| Module | Description |
|---|---|
| alloc_vec | |
| char_parse | Parse rust’s char to c_char (with mapping). |
| containers | TBLIS data containers. |
| einsum_impl | Implementation of einsum operations preparations. |
| float_trait | TBLIS trait for floating point types. |
| ndarray_impl | |
| prelude | |
| tensor_ops | Operations for tensors in TBLIS. |
| threading | |