Crate multiversion

This crate provides the target, target_clones, and multiversion attributes for implementing function multiversioning.

Many CPU architectures have a variety of instruction set extensions that provide additional functionality. Common examples are single instruction, multiple data (SIMD) extensions such as SSE and AVX on x86/x86-64 and NEON on ARM/AArch64. When available, these extended features can provide significant speed improvements to some functions. These optional features cannot be haphazardly compiled into programs: executing an instruction that the CPU does not support will result in a crash.

Function multiversioning is the practice of compiling multiple versions of a function with various features enabled and safely detecting which version to use at runtime.
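To make the idea concrete, here is a minimal hand-written sketch of the same practice using only the standard library, without this crate's attributes. The names square_fallback, square_avx, and square are illustrative, and the code is not what the attributes actually generate; it only shows the "compile several versions, detect at runtime" pattern under the assumption that AVX is the only extension of interest.

```rust
// Baseline version: compiled with only the features enabled for the whole build.
fn square_fallback(x: &mut [f32]) {
    for v in x.iter_mut() {
        *v *= *v;
    }
}

// AVX version: the compiler is allowed to emit AVX instructions in this body,
// so it must only be called on CPUs that support AVX (hence `unsafe`).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn square_avx(x: &mut [f32]) {
    for v in x.iter_mut() {
        *v *= *v;
    }
}

// Safe entry point: detect CPU features at runtime and pick a version.
pub fn square(x: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just confirmed on the running CPU.
            unsafe { square_avx(x) };
            return;
        }
    }
    // Other architectures, or x86 CPUs without AVX, use the baseline version.
    square_fallback(x);
}
```

Note that this sketch repeats feature detection on every call; the attributes in this crate exist to remove that boilerplate and overhead.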

Getting started

If you are unsure where to start, the target_clones attribute requires no knowledge of SIMD beyond understanding the available instruction set extensions for your architecture. For more advanced usage, hand-written SIMD code can be dispatched with target and multiversion.

Capabilities

Most functions can be multiversioned. The following are notable exceptions that are unsupported:

Methods, associated functions, inner functions, or any other function not at module level. In these cases, create a multiversioned function at module level and call it from the desired location.
Functions that take or return impl Trait (other than async, which is supported).

If any other functions do not work, please file a bug report.

Target specification strings

Targets for the target, target_clones, and multiversion attributes are specified as a combination of architecture (as specified in the target_arch attribute) and feature (as specified in the target_feature attribute). A single architecture can be specified as:

"arch"
"arch+feature"
"arch+feature1+feature2"

while multiple architectures can be specified as:

"[arch1|arch2]"
"[arch1|arch2]+feature"
"[arch1|arch2]+feature1+feature2"

The following are all valid target specification strings:

"x86" (matches the "x86" architecture)
"x86_64+avx+avx2" (matches the "x86_64" architecture with the "avx" and "avx2" features)
"[mips|mips64|powerpc|powerpc64]" (matches any of the "mips", "mips64", "powerpc" or "powerpc64" architectures)
"[arm|aarch64]+neon" (matches either the "arm" or "aarch64" architectures with the "neon" feature)

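As an illustration of the grammar described above, the following sketch splits a target specification string into its architectures and features. TargetSpec and parse_target are hypothetical names introduced here and are not part of this crate's API. The set of characters accepted in names is also an assumption, chosen to admit features such as "sse4.1".

```rust
/// Parsed form of a target specification string (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub struct TargetSpec {
    pub architectures: Vec<String>,
    pub features: Vec<String>,
}

/// Parses strings such as "x86", "x86_64+avx+avx2", or "[arm|aarch64]+neon".
/// Returns `None` for malformed input (empty names, unbalanced brackets).
pub fn parse_target(spec: &str) -> Option<TargetSpec> {
    // Everything before the first '+' names the architecture(s); the rest are features.
    let mut parts = spec.split('+');
    let arch_part = parts.next()?;
    let features: Vec<String> = parts.map(str::to_string).collect();

    // "[a|b]" lists several architectures; anything else is a single architecture.
    let architectures: Vec<String> = if let Some(inner) = arch_part
        .strip_prefix('[')
        .and_then(|rest| rest.strip_suffix(']'))
    {
        inner.split('|').map(str::to_string).collect()
    } else {
        vec![arch_part.to_string()]
    };

    // Assumed naming rule: non-empty, made of ASCII alphanumerics, '_', '.', or '-'.
    let valid_name = |s: &String| {
        !s.is_empty()
            && s.chars()
                .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.' || c == '-')
    };
    if architectures.iter().all(valid_name) && features.iter().all(valid_name) {
        Some(TargetSpec { architectures, features })
    } else {
        None
    }
}
```

With this sketch, "[arm|aarch64]+neon" yields the architectures "arm" and "aarch64" with the single feature "neon", while an unbalanced string such as "[x86|x86_64" is rejected.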
Example

The following example is a good candidate for optimization with SIMD. The function square optionally uses the AVX instruction set extension on x86 or x86-64. The SSE instruction set extension is always available on x86-64 but is optional on x86, so the square function also detects SSE when compiled for x86. This is automatically implemented by the target_clones attribute.

This works by compiling multiple clones of the function with various features enabled and detecting which to use at runtime. If none of the targets match the current CPU (e.g. an older x86-64 CPU, or another architecture such as ARM), a clone without any features enabled is used.

use multiversion::target_clones;

#[target_clones("[x86|x86_64]+avx", "x86+sse")]
fn square(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

The following produces a nearly identical function, but instead of cloning the function, the implementations are manually specified. This is typically more useful when the implementations aren't identical, such as when using explicit SIMD instructions instead of relying on compiler optimizations. The multiversioned function is generated by the multiversion attribute.

use multiversion::{multiversion, target};

#[target("[x86|x86_64]+avx")]
unsafe fn square_avx(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

#[target("x86+sse")]
unsafe fn square_sse(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

#[multiversion(
    "[x86|x86_64]+avx" => square_avx,
    "x86+sse" => square_sse,
)]
fn square(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

Static dispatching

Sometimes it is useful to call multiversioned functions from other multiversioned functions. Performing feature detection again in the called function would be redundant, and the runtime detection also prevents the called function from being inlined. In this situation, the #[static_dispatch] helper attribute allows bypassing feature detection.

The #[static_dispatch] attribute may be used on use statements to bring the implementation with matching features to the current function into scope. Functions created by target_clones and multiversion are capable of being statically dispatched. Functions tagged with target may statically dispatch functions in their body, but cannot themselves be statically dispatched.

Caveats:

The caller function must exactly match an available feature set in the called function. A function compiled for x86_64+avx+avx2 cannot statically dispatch a function compiled for x86_64+avx. A function compiled for x86_64+avx may statically dispatch a function compiled for [x86|x86_64]+avx, since an exact feature match exists for that architecture.
use groups are not supported (use foo::{bar, baz}). Renames are supported, however (use bar as baz).

use multiversion::target_clones;

#[target_clones("[x86|x86_64]+avx", "x86+sse")]
fn square(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

#[target_clones("[x86|x86_64]+avx", "x86+sse")]
fn square_plus_one(x: &mut [f32]) {
    #[static_dispatch]
    use self::square; // or just `use square` with Rust 1.32.0+
    square(x); // this function call bypasses feature detection
    for v in x {
        *v += 1.0;
    }
}
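The exact-match caveat above can be modeled with a small predicate. This is only a sketch of the rule as stated in the caveat, not the crate's implementation; Version, Target, same_features, and can_static_dispatch are hypothetical names used for illustration.

```rust
/// One compiled version of a calling function: a single architecture plus its features.
/// (Hypothetical type for illustration.)
pub struct Version<'a> {
    pub arch: &'a str,
    pub features: &'a [&'a str],
}

/// One entry of the called function's target list, e.g. "[x86|x86_64]+avx".
/// (Hypothetical type for illustration.)
pub struct Target<'a> {
    pub archs: &'a [&'a str],
    pub features: &'a [&'a str],
}

/// Returns true if both slices name the same set of features, ignoring order.
fn same_features(a: &[&str], b: &[&str]) -> bool {
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_unstable();
    a.dedup();
    b.sort_unstable();
    b.dedup();
    a == b
}

/// Static dispatch is possible only if some target of the called function covers
/// the caller's architecture and has exactly the caller's feature set.
pub fn can_static_dispatch(caller: &Version, callee_targets: &[Target]) -> bool {
    callee_targets.iter().any(|t| {
        t.archs.iter().any(|a| *a == caller.arch)
            && same_features(caller.features, t.features)
    })
}
```

Under this model, a caller compiled for x86_64+avx matches a called function's [x86|x86_64]+avx target, while a caller compiled for x86_64+avx+avx2 does not, because a superset of features is not an exact match.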

Implementation details

The function version dispatcher consists of a function selector and an atomic function pointer. Initially the function pointer will point to the function selector. On invocation, this selector will then choose an implementation, store a pointer to it in the atomic function pointer for later use and then pass on control to the chosen function. On subsequent calls, the chosen function will be called without invoking the function selector.

Some comments on the benefits of this implementation:

The function selector is only invoked once. Subsequent calls are reduced to an atomic load and indirect function call (for non-generic, non-async functions). Generic and async functions cannot be stored in the atomic function pointer, which may result in additional branches.
If called in multiple threads, there is no contention. It is possible for two threads to hit the same function before function selection has completed, which results in each thread invoking the function selector, but the atomic ensures that these are synchronized correctly.
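The dispatcher described above can be sketched by hand for a single non-generic function. This is an illustration of the selector-plus-atomic-pointer scheme under stated assumptions, not the code the attributes generate: the names SQUARE_IMPL, select_square, choose, and square_avx_checked are hypothetical, and Relaxed ordering is assumed to be sufficient because the stored values are addresses of ordinary functions.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

type SquareFn = fn(&mut [f32]);

fn square_fallback(x: &mut [f32]) {
    for v in x.iter_mut() {
        *v *= *v;
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn square_avx(x: &mut [f32]) {
    for v in x.iter_mut() {
        *v *= *v;
    }
}

// Safe wrapper so the AVX version has the same type as the other versions.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn square_avx_checked(x: &mut [f32]) {
    // SAFETY: `choose` only returns this function after detecting AVX.
    unsafe { square_avx(x) }
}

// The atomic function pointer; it initially points at the selector.
static SQUARE_IMPL: AtomicPtr<()> = AtomicPtr::new(select_square as *mut ());

// Picks an implementation for the running CPU.
fn choose() -> SquareFn {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx") {
            return square_avx_checked;
        }
    }
    square_fallback
}

// The function selector: chooses an implementation, stores it for later calls,
// then passes control to the chosen function.
fn select_square(x: &mut [f32]) {
    let chosen = choose();
    SQUARE_IMPL.store(chosen as *mut (), Ordering::Relaxed);
    chosen(x)
}

// Public entry point: an atomic load followed by an indirect call.
pub fn square(x: &mut [f32]) {
    let ptr = SQUARE_IMPL.load(Ordering::Relaxed);
    // SAFETY: only pointers obtained from `SquareFn` values are stored in SQUARE_IMPL.
    let f: SquareFn = unsafe { std::mem::transmute::<*mut (), SquareFn>(ptr) };
    f(x)
}
```

If two threads call square before selection has completed, both run select_square and both store the same choice, so the race is harmless in this sketch.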

Attribute Macros

multiversion	Provides function multiversioning by explicitly specifying function versions.
target	Provides a less verbose equivalent to the `target_arch` and `target_feature` attributes.
target_clones	Provides automatic function multiversioning by compiling clones of the function for each target.