Portable packed SIMD vectors

This crate is proposed for stabilization as std::packed_simd in RFC2366: std::simd.

The examples available in the examples/ sub-directory of the crate showcase how to use the library in practice.

Introduction
Vector types
Conditional operations
Conversions
Performance guide

Introduction

This crate exports [Simd<[T; N]>][Simd], a packed vector of N elements of type T, as well as many type aliases for this type. For example, [f32x4] is just an alias for Simd<[f32; 4]>.

The operations on packed vectors are, by default, "vertical", that is, they are applied to each vector lane independently of the others:

# use packed_simd::*;
let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(5, 6, 7, 8);
assert_eq!(a + b, i32x4::new(6, 8, 10, 12));

Many "horizontal" operations are also provided:

# use packed_simd::*;
# let a = i32x4::new(1, 2, 3, 4);
assert_eq!(a.wrapping_sum(), 10);

On virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, much slower. Therefore, the most portably efficient way to perform a reduction over a slice is to accumulate partial results into a vector using vertical operations, and to perform a single horizontal operation at the end:

# use packed_simd::*;
fn reduce(x: &[i32]) -> i32 {
    assert!(x.len() % 4 == 0);
    let mut sum = i32x4::splat(0); // [0, 0, 0, 0]
    for i in (0..x.len()).step_by(4) {
        sum += i32x4::from_slice_unaligned(&x[i..]);
    }
    sum.wrapping_sum()
}

let x = [0, 1, 2, 3, 4, 5, 6, 7];
assert_eq!(reduce(&x), 28);
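
The reduce function above requires the slice length to be a multiple of 4. The following is a minimal sketch of how the same "vertical accumulation, one horizontal step" pattern might be extended to slices of any length. It is written in plain Rust, with a [i32; 4] array standing in for i32x4 so that it compiles without the crate; the name reduce_any_len is hypothetical and not part of this crate's API.

```rust
/// Sums a slice of any length using four independent "lanes".
/// `[i32; 4]` is a stand-in for `i32x4`: this is a scalar model, not SIMD code.
fn reduce_any_len(x: &[i32]) -> i32 {
    const LANES: usize = 4;
    let mut acc = [0i32; LANES];

    let chunks = x.chunks_exact(LANES);
    // Elements left over after the last full chunk (fewer than LANES of them).
    let tail = chunks.remainder();

    // Vertical phase: add every full chunk to the accumulator, lane by lane.
    for chunk in chunks {
        for (lane, value) in acc.iter_mut().zip(chunk) {
            *lane = lane.wrapping_add(*value);
        }
    }

    // Single horizontal phase: fold the four lanes into one value.
    let lanes_sum = acc.iter().fold(0i32, |s, v| s.wrapping_add(*v));

    // Scalar epilogue: add the leftover elements one at a time.
    tail.iter().fold(lanes_sum, |s, v| s.wrapping_add(*v))
}
```

With the crate, the vertical phase would presumably use i32x4::from_slice_unaligned on each full chunk and the horizontal phase would be wrapping_sum, as in reduce above; only the scalar epilogue is new.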

Vector types

The vector type aliases are named according to the following scheme:

{element_type}x{number_of_lanes} == Simd<[element_type; number_of_lanes]>

where the following element types are supported:

i{element_width}: signed integer
u{element_width}: unsigned integer
f{element_width}: float
m{element_width}: mask (see below)
*{const,mut} T: const and mut pointers

Basic operations

# use packed_simd::*;
// Sets all elements to `0`:
let a = i32x4::splat(0);

// Reads a vector from a slice:
let mut arr = [0, 0, 0, 1, 2, 3, 4, 5];
let b = i32x4::from_slice_unaligned(&arr);

// Reads the fourth element (index 3) of a vector:
assert_eq!(b.extract(3), 1);

// Returns a new vector where the fourth element (index 3) is replaced with `1`:
let a = a.replace(3, 1);
assert_eq!(a, b);

// Writes a vector to a slice:
let a = a.replace(2, 1);
a.write_to_slice_unaligned(&mut arr[4..]);
assert_eq!(arr, [0, 0, 0, 1, 0, 0, 1, 1]);
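
The operations above all work by value: replace returns a new vector and leaves its input unchanged. As a rough mental model only, they can be sketched on a plain array. The type alias V4 and the free functions below are hypothetical stand-ins written for illustration and are not this crate's implementation.

```rust
/// Plain-array stand-in for `i32x4` (hypothetical, for illustration only).
type V4 = [i32; 4];

/// Model of `splat`: every lane holds the same value.
fn splat(value: i32) -> V4 {
    [value; 4]
}

/// Model of `extract`: reads the lane at a zero-based index.
/// Array indexing panics if `index` is 4 or larger.
fn extract(v: V4, index: usize) -> i32 {
    v[index]
}

/// Model of `replace`: returns a copy with one lane overwritten.
/// The caller's value is not modified, because `v` is passed by value.
fn replace(mut v: V4, index: usize, value: i32) -> V4 {
    v[index] = value;
    v
}
```

In this model, replace(splat(0), 3, 1) yields [0, 0, 0, 1], and the original splat(0) value is still all zeros afterwards.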

Conditional operations

One often needs to perform an operation on only some lanes of a vector. Vector masks, like m32x4, select the lanes on which an operation is performed:

# use packed_simd::*;
let a = i32x4::new(1, 1, 2, 2);

// Add `1` to the first two lanes of the vector.
let m = m16x4::new(true, true, false, false);
let a = m.select(a + 1, a);
assert_eq!(a, i32x4::splat(2));

The elements of a vector mask are either true or false. Here true means that a lane is "selected", while false means that a lane is not selected.

All vector masks implement a mask.select(a: T, b: T) -> T method that works on all vectors that have the same number of lanes as the mask. The resulting vector contains the elements of a for those lanes for which the mask is true, and the elements of b otherwise.

The example constructs a mask with the first two lanes set to true and the last two lanes set to false. This selects the first two lanes of a + 1 and the last two lanes of a, producing a vector where the first two lanes have been incremented by 1.

Note: mask select can be used on vector types that have the same number of lanes as the mask. The example shows this by using [m16x4] instead of [m32x4]. It is typically more performant to use a mask element width equal to the element width of the vectors being operated upon. This is, however, not true for 512-bit wide vectors when targeting AVX-512, where the most efficient masks use only 1 bit per element.
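
The semantics of select described above can be sketched as a scalar model on plain arrays. This is an illustration under the assumption that a mask behaves like one bool per lane; the free function select below is hypothetical and is not how the crate implements the operation.

```rust
/// Lane-wise select on plain arrays:
/// lane `i` of the result is `a[i]` where `mask[i]` is true, and `b[i]` otherwise.
fn select<T: Copy, const N: usize>(mask: [bool; N], a: [T; N], b: [T; N]) -> [T; N] {
    // Start from `b` and overwrite only the selected lanes with `a`.
    let mut out = b;
    for i in 0..N {
        if mask[i] {
            out[i] = a[i];
        }
    }
    out
}
```

Both arguments are fully computed before any lane is chosen. In the example above, a + 1 is evaluated for all four lanes, and the mask only decides which of those results are kept.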

All vertical comparison operations return masks:

# use packed_simd::*;
let a = i32x4::new(1, 1, 3, 3);
let b = i32x4::new(2, 2, 0, 0);

// ge: >= (greater than or equal; see also lt, le, gt, eq, ne).
let m = a.ge(i32x4::splat(2));

if m.any() {
    // all / any / none allow coherent control flow
    let d = m.select(a, b);
    assert_eq!(d, i32x4::new(2, 2, 3, 3));
}
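
As a scalar model of the example above, a vertical comparison yields one boolean per lane, and all / any / none reduce that mask to a single bool that ordinary control flow can branch on. The functions below are hypothetical stand-ins on plain arrays, written for illustration only.

```rust
/// Lane-wise `>=` on plain arrays, producing one `bool` per lane.
fn ge<const N: usize>(a: [i32; N], b: [i32; N]) -> [bool; N] {
    let mut m = [false; N];
    for i in 0..N {
        m[i] = a[i] >= b[i];
    }
    m
}

/// True if at least one lane of the mask is set.
fn any<const N: usize>(m: [bool; N]) -> bool {
    m.iter().any(|&lane| lane)
}

/// True if every lane of the mask is set.
fn all<const N: usize>(m: [bool; N]) -> bool {
    m.iter().all(|&lane| lane)
}

/// True if no lane of the mask is set.
fn none<const N: usize>(m: [bool; N]) -> bool {
    !any(m)
}
```

For a = [1, 1, 3, 3] compared against [2, 2, 2, 2], this model gives the mask [false, false, true, true], so any is true while all and none are both false.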

Conversions

lossless widening conversions: [From]/[Into] are implemented for vectors with the same number of lanes when the conversion is value preserving (same as in std).
safe bitwise conversions: The cargo feature into_bits provides the IntoBits/FromBits traits (x.into_bits()). These perform safe bitwise transmutes when all bit patterns of the source type are valid bit patterns of the target type, and they are also implemented for the architecture-specific vector types of std::arch. For example, let x: u8x8 = m8x8::splat(true).into_bits(); is provided because all m8x8 bit patterns are valid u8x8 bit patterns. The opposite is not true: not all u8x8 bit patterns are valid m8x8 bit patterns, so this operation cannot be performed safely using x.into_bits(). One needs to use unsafe { core::mem::transmute(x) } instead, making sure that the value in the u8x8 is a valid bit pattern of m8x8.
numeric casts (as): these are performed using [FromCast]/[Cast] (x.cast()) and behave just like as:
- casting integer vectors whose lane types have the same size (e.g. i32xN -> u32xN) is a no-op,
- casting from a larger integer to a smaller integer (e.g. u32xN -> u8xN) will truncate,
- casting from a smaller integer to a larger integer (e.g. u8xN -> u32xN) will:
  - zero-extend if the source is unsigned, or
  - sign-extend if the source is signed,
- casting from a float to an integer will round the float towards zero,
- casting from an integer to float will produce the floating point representation of the integer, rounding to nearest, ties to even,
- casting from an f32 to an f64 is perfect and lossless,
- casting from an f64 to an f32 rounds to nearest, ties to even.
Numeric casts are not very "precise": depending on the source and target types, they are sometimes lossy and sometimes value preserving.
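
Since the casts are described as behaving just like as, the rules listed above can be checked with scalar as applied to each lane of a plain array. The sketch below does exactly that. The helper names are hypothetical, the arrays are stand-ins for the vector types, and the sketch only covers in-range values.

```rust
/// Larger to smaller integer: truncates to the low 8 bits of each lane.
fn u32_to_u8(v: [u32; 4]) -> [u8; 4] {
    v.map(|x| x as u8)
}

/// Smaller unsigned to larger integer: zero-extends each lane.
fn u8_to_u32(v: [u8; 4]) -> [u32; 4] {
    v.map(|x| x as u32)
}

/// Smaller signed to larger integer: sign-extends each lane.
fn i8_to_i32(v: [i8; 4]) -> [i32; 4] {
    v.map(|x| x as i32)
}

/// Float to integer: rounds each lane towards zero.
fn f32_to_i32(v: [f32; 4]) -> [i32; 4] {
    v.map(|x| x as i32)
}

/// Integer to float: rounds each lane to nearest, ties to even.
fn i32_to_f32(v: [i32; 4]) -> [f32; 4] {
    v.map(|x| x as f32)
}
```

For instance, u32_to_u8([0, 255, 256, 300]) gives [0, 255, 0, 44], f32_to_i32([2.9, -2.9, 0.5, -0.5]) gives [2, -2, 0, 0], and i32_to_f32 maps 16_777_217 (2^24 + 1, which f32 cannot represent) to 16_777_216.0, the neighbor with an even mantissa.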