Crate multiversion

This crate provides the target_clones attribute and multiversion! macro for implementing function multiversioning.

[dependencies]
multiversion = "0.1"

Many CPU architectures have a variety of instruction set extensions that provide additional functionality. Common examples are single instruction, multiple data (SIMD) extensions such as SSE and AVX on x86/x86-64 and NEON on ARM/AArch64. When available, these extended features can provide significant speed improvements to some functions. These optional features cannot be haphazardly compiled into programs: executing an unsupported instruction will result in a crash. Function multiversioning is the practice of compiling multiple versions of a function with various features enabled and safely detecting which version to use at runtime.

Target specification strings

Targets for both the target_clones attribute and the multiversion! macro are specified as a combination of architecture (as specified in the target_arch attribute) and feature (as specified in the target_feature attribute). A single architecture can be specified as:

  • "arch"
  • "arch+feature"
  • "arch+feature1+feature2"

while multiple architectures can be specified as:

  • "[arch1|arch2]"
  • "[arch1|arch2]+feature"
  • "[arch1|arch2]+feature1+feature2"

The following are all valid target specification strings:

  • "x86" (matches the "x86" architecture)
  • "x86_64+avx+avx2" (matches the "x86_64" architecture with the "avx" and "avx2" features)
  • "[mips|mips64|powerpc|powerpc64]" (matches any of the "mips", "mips64", "powerpc" or "powerpc64" architectures)
  • "[arm|aarch64]+neon" (matches either the "arm" or "aarch64" architectures with the "neon" feature)
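To make the grammar above concrete, here is a minimal sketch of how such a string could be split into architectures and features. The TargetSpec type and parse_target function are hypothetical names used only for illustration; they are not part of this crate's API, and the crate's own parsing may differ.

```rust
/// A parsed target specification: one or more architectures plus required features.
/// Hypothetical type for illustration; not part of the multiversion crate.
#[derive(Debug, PartialEq)]
struct TargetSpec {
    architectures: Vec<String>,
    features: Vec<String>,
}

/// Parses strings such as "x86_64+avx+avx2" or "[arm|aarch64]+neon".
/// Hypothetical helper for illustration; not part of the multiversion crate.
fn parse_target(spec: &str) -> Result<TargetSpec, String> {
    // The first '+'-separated field names the architecture(s); the rest are features.
    let mut fields = spec.split('+');
    let arch_field = fields.next().unwrap_or("");

    // "[a|b]" lists several architectures; a bare name lists exactly one.
    let architectures: Vec<String> = if let Some(inner) = arch_field
        .strip_prefix('[')
        .and_then(|rest| rest.strip_suffix(']'))
    {
        inner.split('|').map(str::to_string).collect()
    } else {
        vec![arch_field.to_string()]
    };

    let features: Vec<String> = fields.map(str::to_string).collect();

    // Reject empty names (e.g. "x86+") and stray delimiters (e.g. "[x86").
    let invalid = |s: &String| {
        s.is_empty() || s.contains(|c: char| matches!(c, '[' | ']' | '|'))
    };
    if architectures.iter().chain(features.iter()).any(invalid) {
        return Err(format!("invalid target specification: {:?}", spec));
    }
    Ok(TargetSpec { architectures, features })
}
```

With this sketch, parse_target("[arm|aarch64]+neon") yields the architectures "arm" and "aarch64" and the single feature "neon", while a malformed string such as "x86+" is rejected.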

Example

The following example is a good candidate for optimization with SIMD. The function square optionally uses the AVX instruction set extension on x86 or x86-64. The SSE instruction set extension is always available on x86-64 but is optional on x86, so the square function also detects SSE on x86. The target_clones attribute implements all of this automatically.

This works by compiling multiple clones of the function with various features enabled and detecting which to use at runtime. If none of the targets match the current CPU (e.g. an older x86-64 CPU, or another architecture such as ARM), a clone without any features enabled is used.

use multiversion::target_clones;

#[target_clones("[x86|x86_64]+avx", "x86+sse")]
fn square(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}
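
For comparison, the following is a rough hand-written sketch of the same idea using only the standard library. It is not the code that target_clones generates: the names square_avx_clone, square_default_clone and square_manual are invented for this illustration, only the AVX target is covered, and the feature check runs on every call rather than once.

```rust
// Clone compiled with AVX enabled; it only exists on x86 and x86-64.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn square_avx_clone(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

// Clone compiled without any extra features; available everywhere.
fn square_default_clone(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

// Hypothetical hand-written dispatcher. Unlike the crate's dispatcher,
// this sketch repeats the feature check on every call.
fn square_manual(x: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was confirmed at runtime just above.
            return unsafe { square_avx_clone(x) };
        }
    }
    // No matching target: fall back to the clone without extra features.
    square_default_clone(x)
}
```

Calling square_manual on a slice squares each element in place, whichever clone ends up being selected.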

The following produces a nearly identical function, but instead of cloning the function, the implementations are manually specified. This is typically more useful when the implementations aren't identical, such as when using explicit SIMD instructions instead of relying on compiler optimizations. The multiversioned function is generated by the multiversion! macro.

use multiversion::multiversion;

multiversion!{
    fn square(x: &mut [f32])
    "[x86|x86_64]+avx" => square_avx,
    "x86+sse" => square_sse,
    default => square_generic,
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn square_avx(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

#[cfg(target_arch = "x86")]
#[target_feature(enable = "sse")]
unsafe fn square_sse(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

fn square_generic(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}
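
As an illustration of an implementation that is not identical to the generic one, the sketch below writes the SSE version with explicit intrinsics from std::arch. The names square_sse_explicit and square_explicit are assumptions made for this example, and the small safe wrapper exists only so the sketch can be called on any architecture; in practice the multiversion! macro would perform the selection.

```rust
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Explicit SSE version: squares four f32 lanes per iteration.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse")]
unsafe fn square_sse_explicit(x: &mut [f32]) {
    let mut chunks = x.chunks_exact_mut(4);
    for chunk in &mut chunks {
        // SAFETY: each chunk holds exactly four f32 values, and the
        // unaligned load/store intrinsics have no alignment requirement.
        unsafe {
            let v = _mm_loadu_ps(chunk.as_ptr());
            _mm_storeu_ps(chunk.as_mut_ptr(), _mm_mul_ps(v, v));
        }
    }
    // Handle the final 0 to 3 elements that do not fill a whole vector.
    for v in chunks.into_remainder() {
        *v *= *v;
    }
}

// Hypothetical safe wrapper used only to exercise the sketch.
fn square_explicit(x: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("sse") {
            // SAFETY: SSE support was confirmed at runtime just above.
            return unsafe { square_sse_explicit(x) };
        }
    }
    // Scalar fallback for CPUs or architectures without SSE.
    for v in x {
        *v *= *v;
    }
}
```

The scalar tail loop matters here: a slice whose length is not a multiple of four would otherwise leave its last elements unprocessed.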

Implementation details

The function version dispatcher consists of a function selector and an atomic function pointer. On the first invocation of a multiversioned function, the dispatcher loads the atomic and, since it is null, invokes the function selector. The result of the function selector is stored in the atomic and then invoked. On subsequent calls, the atomic is not null and the stored function is invoked directly.

Some comments on the benefits of this implementation:

  • The function selector is only invoked once. Subsequent calls are reduced to an atomic load, branch, and indirect function call.
  • If called from multiple threads, there is no contention. Two threads may call the same function before function selection has completed, in which case each thread invokes the function selector, but the atomic ensures that the results are stored safely.
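
The dispatch scheme described above can be sketched with the standard library's AtomicPtr. This is an approximation written for illustration, not the crate's generated code: the names SELECTED, SELECTOR_RUNS, select and square are hypothetical, the selector always returns the generic version instead of detecting CPU features, and the relaxed memory ordering is an assumption of this sketch.

```rust
use std::mem;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

type SquareFn = fn(&mut [f32]);

// Null until the first call; afterwards it holds the selected version.
static SELECTED: AtomicPtr<()> = AtomicPtr::new(ptr::null_mut());

// Counts selector runs, only so the sketch can show that it runs once.
static SELECTOR_RUNS: AtomicUsize = AtomicUsize::new(0);

fn square_generic(x: &mut [f32]) {
    for v in x {
        *v *= *v;
    }
}

// Hypothetical function selector. A real one would inspect CPU features
// and return the best matching version.
fn select() -> SquareFn {
    SELECTOR_RUNS.fetch_add(1, Ordering::Relaxed);
    square_generic
}

// Dispatcher: an atomic load, a branch, and an indirect call.
fn square(x: &mut [f32]) {
    let mut raw = SELECTED.load(Ordering::Relaxed);
    if raw.is_null() {
        // First call (or a concurrent first call on another thread):
        // run the selector and publish its result. Every thread would
        // store the same pointer, so a racing store is harmless.
        raw = select() as *const () as *mut ();
        SELECTED.store(raw, Ordering::Relaxed);
    }
    // SAFETY: `raw` is non-null and was created from a `SquareFn` above.
    let f: SquareFn = unsafe { mem::transmute::<*mut (), SquareFn>(raw) };
    f(x)
}
```

In a single-threaded program, calling square twice runs the selector only on the first call; the second call goes straight from the atomic load to the indirect call.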

Macros

multiversion

Provides function multiversioning by explicitly specifying function versions.

Attribute Macros

target_clones

Provides automatic function multiversioning by compiling clones of the function for each target.