TBLIS for Rust

This repository contains wrapper, FFI bindings and build-from-source support of TBLIS.

TBLIS (Tensor BLIS, The Tensor-Based Library Instantiation Software) can perform various tensor operations (multiplication, addition, reduction, transposition, etc.) efficiently on single-node CPU. This library can be an underlying driver for performing einsum (Einstein summation).

TBLIS (C++) source code is available on github by Devin Matthews research group.

This repository is not official wrapper project. It is originally intended to serve rust tensor toolkit RSTSR and rust electronic structure toolkit REST at Xin Xu (徐昕) and Igor Ying Zhang (张颖) research groups.

Resources

We refer the readme file of crate tblis (Minimal wrapper), tblis-ffi (FFI bindings) and tblis-src (source code build) for more information. You can also see these files in API documentations at docs.rs.

Resources	Badges	API Document
Crate for Wrapper (tblis)
Crate for FFI (tblis-ffi)
Crate for Source (tblis-src)
FFI Binding	9b95712 after

Example
Cargo features
Installation
Dynamic loading
Why TBLIS?
- Benchmark of contiguous case
- Benchmark of strided case
Citation
Miscellaneous

Example

The following example is to perform contraction:

$$ G_{pqrs} = \sum_{\mu \nu \kappa \lambda} C_{\mu p} C_{\nu q} E_{\mu \nu \kappa \lambda} C_{\kappa r} C_{\lambda s} $$

This tensor contraction is utilized in electronic structure (electronic integral in atomic orbital basis $E_{\mu \nu \kappa \lambda}$ to molecular orbital basis $G_{pqrs}$).

To run the following code, you may need to

activate crate feature ndarray to make conversion between ndarray::{Array, ArrayView, ArrayViewMut} and tblis::TblisTensor;
properly link libtblis.so in your project (also see crate tblis-ffi and tblis-src for more information).

The following code snippet performs this contraction.

// Must declare crate `tblis-src` if you want link tblis dynamically.
// You can also call the following code in `build.rs`, instead of using crate `tblis-src`:
//     println!("cargo:rustc-link-lib=tblis");
extern crate tblis_src;

use ndarray::prelude::*;
use tblis::prelude::*;

// dummy setting of matrix C and tensor E
let (nao, nmo): (usize, usize) = (3, 2);
let vec_c: Vec<f64> = (0..nao * nmo).map(|x| x as f64).collect();
let vec_e: Vec<f64> = (0..nao * nao * nao * nao).map(|x| x as f64).collect();

let arr_c = ArrayView2::from_shape((nao, nmo), &vec_c).unwrap();
let arr_e = ArrayView4::from_shape((nao, nao, nao, nao), &vec_e).unwrap();

/// # Parameters
/// - `arr_c`: coefficient matrix $C_{\mu p}$
/// - `arr_s`: electronic integral $E_{\mu \nu \kappa \lambda}$ (in atomic orbital basis)
///
/// # Returns
/// - `arr_g`: electronic integral $G_{pqrs}$ (in molecular orbital basis)
fn ao2mo(arr_c: ArrayView2<f64>, arr_e: ArrayView4<f64>) -> Array4<f64> {
    let view_c = arr_c.view().into_dyn();
    let view_e = arr_e.view().into_dyn();
    let operands = [&view_c, &view_c, &view_e, &view_c, &view_c];
    let arr_g = tblis_einsum_ndarray(
        "μi,νa,μνκλ,κj,λb->iajb", // einsum subscripts
        &operands,                // tensors to be contracted
        "optimal",                // contraction strategy (see crate opt-einsum-path)
        None,                     // memory limit (None means no limit, see crate opt-einsum-path)
        true,                     // row-major (true) or col-major (false)
        None,                     // pre-allocated output tensor (None to allocate internally)
    )
    .unwrap();

    // transform to 4-dimensional array
    arr_g.into_dimensionality().unwrap()
}

let arr_g = ao2mo(arr_c, arr_e);
println!("{:?}", arr_g);

Cargo features

Optional features:

ndarray: Supports conversion from ndarray objects (Array, ArrayView, ArrayMut) to TblisTensor; conversion from TblisTensor to ndarray object (ArrayD).
dynamic_loading: Supports dynamic loading (for dependency crate tblis-ffi).

Installation

If you wish using dynamic loading (instead of dynamic/static linking), refer to the next subsection "Dynamic loading".

You can either

link library tblis manually with pre-built libtblis.so
use cargo crate tblis-src with pre-built libtblis.so
use cargo crate tblis-src and build-from-source

Refer TBLIS repository for information of installation of TBLIS. Notice that if you compile libtblis.so with CMake, please make sure -DCMAKE_BUILD_TYPE=Release.

Link library `tblis` manually

By this way, you can directly use cargo crate tblis or tblis-ffi, without using tblis-src.

It is recommended to link libtblis.so by dynamic linking. Making sure your library is in environment variable LD_LIBRARY_PATH, then

// build.rs
println!("cargo:rustc-link-lib=tblis");

Use cargo crate tblis-src with pre-built libtblis.so

By this way, you need to add tblis-src as Cargo.toml dependency:

tblis-src = { version = "0.1" }

and then export this crate in your lib.rs/main.rs:

extern crate tblis_src;

Use cargo crate tblis-src and build-from-source

You can use crago feature build_from_source to automatically build TBLIS with default configuration.

cargo crate tblis-src has the following cargo features:

build_from_source: This will use CMake (cmake > 3.23, c++20 standard), and use the code from git submodule to compile tblis. Though this option can be developer-friendly (you do not need to perform any other configurations to make program compile and run by cargo), build_from_source does not provide customized compilation.

CMake configurable variables (can be defined as environment variables):
- TBLIS_SRC: Git repository source directory or URL. All git submodules (marray, blis, tci) should be properly downloaded.
- TBLIS_VER: Git repository version (branch or tag). Default to be master.
static: This will link static libary instead of dynamic one. Please note that static linking may require additional dynamic library linking, which should be configured manually by developer in build.rs or environment variables RUSTFLAGS. Static linking can be difficult when searching symbols, and we recommend dynamic linking in most cases.

Dynamic loading

By default, the tblis shared library is loaded at runtime via libloading. The loading process will search for the library in multiple locations, by the following order:

User-defined candidates via environment variables TBLIS_DYLOAD or RSTSR_DYLOAD.
LD_LIBRARY_PATH style discovery via environment variables LD_LIBRARY_PATH (Linux), DYLD_LIBRARY_PATH and DYLD_FALLBACK_LIBRARY_PATH (macOS), PATH (Windows).
Python interpreter path discovery: if Python is at /path/bin/python, the library is expected at /path/lib/libtblis.so. This includes TBLIS_PYTHON_PATH, CONDA_PREFIX, and Python interpreters found in PATH.
Standard system candidates such as /usr/lib, /usr/local/lib, and /lib.

To disable dynamic loading and use static/dynamic linking instead, disable the dynamic_loading cargo feature.

Why TBLIS?

TBLIS can perform many types of einsum (tensor contraction), as well as tensor transposition, addition and reduction.

Some einsum tasks can transform to matrix multiplication (GEMM) tasks. For those tasks, TBLIS may probably not the best choice (this depends on efficiency of BLIS and some other factors).

However, TBLIS can be extremely useful if

Contraction is very difficult that usual GEMM or batched GEMM is not sutiable to handle;
Layout of your tensor is strided (not contiguous) in memory.

As an example, some benchmarks on my personal computer (AMD Ryzen 7945HX, estimated FP64 1.1 TFLOP/sec with 16 cores). The shape of input tensor is (96, 96, 96, 96). For the strided case, the stride of each dimension is 128.

Benchmark of contiguous case

case	description	FLOPs	TBLIS	NumPy (MKL)	PyTorch (CPU)
`abxy, xycd -> abcd`	naive GEMM	$2 n^6$	1.90 sec767 GFLOP/sec	2.13 sec683 GFLOP/sec	1.98 sec736 GFLOP/sec
`axyz, xyzb -> ab`	naive GEMM	$2 n^5$	132.3 msec112 GFLOP/sec	63.1 msec241 GFLOP/sec	63.4 msec240 GFLOP/sec
`axyz, bxyz -> ab`	naive SYRK	$n^5$	96.9 msec 77 GFLOP/sec	293.2 msec 26 GFLOP/sec	37.4 msec203 GFLOP/sec
`axyz, ybzx -> ab`	complicated GEMM	$2 n^5$	120.7 msec126 GFLOP/sec	207.7 msec 73 GFLOP/sec	211.1 msec 72 GFLOP/sec
`axby, yacx -> abc`	batched complicated GEMM	$2 n^5$	124.1 msec122 GFLOP/sec	29.7 sec0.5 GFLOP/sec	179.2 msec 85 GFLOP/sec
`xpay, aybx -> ab`	trace then complicated GEMM	$2 n^4$	36.4 msec4.3 GFLOP/sec	33.9 sec0.0 GFLOP/sec	106.9 msec1.5 GFLOP/sec

Benchmark of strided case

case	description	FLOPs	TBLIS	NumPy (MKL)	PyTorch (CPU)
`abxy, xycd -> abcd`	naive GEMM	$2 n^6$	2.02 sec722 GFLOP/sec	7.30 sec200 GFLOP/sec	2.10 sec694 GFLOP/sec
`axyz, xyzb -> ab`	naive GEMM	$2 n^5$	133.1 msec114 GFLOP/sec	776.8 msec 20 GFLOP/sec	204.4 msec 74 GFLOP/sec
`axyz, bxyz -> ab`	naive SYRK	$n^5$	98.3 msec 77 GFLOP/sec	455.5 msec 17 GFLOP/sec	211.4 msec 36 GFLOP/sec
`axyz, ybzx -> ab`	complicated GEMM	$2 n^5$	144.7 msec105 GFLOP/sec	725.0 msec 21 GFLOP/sec	406.7 msec 37 GFLOP/sec
`axby, yacx -> abc`	batched complicated GEMM	$2 n^5$	142.7 msec106 GFLOP/sec	27.1 sec0.6 GFLOP/sec	263.6 msec 58 GFLOP/sec
`xpay, aybx -> ab`	trace then complicated GEMM	$2 n^4$	232.3 msec0.7 GFLOP/sec	248.5 sec0.0 GFLOP/sec	147.3 msec1.1 GFLOP/sec

Citation

TBLIS for Rust is not the original work of TBLIS.

Please cite TBLIS as:

Matthews, D. A. High-Performance Tensor Contraction without Transposition. SIAM J. Sci. Comput. 2018, 40 (1), C1–C24. DOI: 10.1137/16M108968X. arXiv: 1607.00291.

Related work is:

Huang, J.; Matthews, D. A.; van de Geijn, R. A. Strassen’s Algorithm for Tensor Contraction. SIAM J. Sci. Comput. 2018, 40 (3), C305–C326. DOI: 10.1137/17M1135578. arXiv: 1704.03092.

Miscellaneous

Integration testing cases comes from Python libraries pytblis and opt_einsum.

tblis 0.2.6

TBLIS for Rust

Resources

Table of Contents

Example

Cargo features

Installation

Link library `tblis` manually

Use cargo crate tblis-src with pre-built libtblis.so

Use cargo crate tblis-src and build-from-source

Dynamic loading

Why TBLIS?

Benchmark of contiguous case

Benchmark of strided case

Citation

Miscellaneous

tblis 0.2.6

TBLIS for Rust

Resources

Table of Contents

Example

Cargo features

Installation

Link library tblis manually

Use cargo crate tblis-src with pre-built libtblis.so

Use cargo crate tblis-src and build-from-source

Dynamic loading

Why TBLIS?

Benchmark of contiguous case

Benchmark of strided case

Citation

Miscellaneous

Link library `tblis` manually