# Portable packed SIMD vectors

This crate is proposed for stabilization as `std::packed_simd` in RFC2366: `std::simd`.

The examples available in the `examples/` sub-directory of the crate showcase how to use the library in practice.

## Introduction

This crate exports `Simd<[T; N]>`, a packed vector of `N` elements of type `T`, as well as many type aliases for this type. For example, `f32x4` is just an alias for `Simd<[f32; 4]>`.

The operations on packed vectors are, by default, "vertical", that is, they are applied to each vector lane in isolation from the others:

```rust
use packed_simd::i32x4;

let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(5, 6, 7, 8);
assert_eq!(a + b, i32x4::new(6, 8, 10, 12));
```

Many "horizontal" operations are also provided:

`assert_eq!(a.wrapping_sum(), 10);`

On virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, much slower. Consequently, the most portably efficient way of performing a reduction over a slice is to accumulate the results into a vector using vertical operations, and to perform a single horizontal operation at the end:

```rust
use packed_simd::i32x4;

fn reduce(x: &[i32]) -> i32 {
    assert!(x.len() % 4 == 0);
    let mut sum = i32x4::splat(0); // [0, 0, 0, 0]
    for i in (0..x.len()).step_by(4) {
        // Vertical add: accumulate four lanes at a time.
        sum += i32x4::from_slice_unaligned(&x[i..]);
    }
    // Single horizontal reduction at the end.
    sum.wrapping_sum()
}

let x = [0, 1, 2, 3, 4, 5, 6, 7];
assert_eq!(reduce(&x), 28);
```
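The same "accumulate vertically, reduce once" pattern can be sketched in plain Rust without the crate. This is only a scalar model, not the crate's implementation: the `[i32; 4]` accumulator stands in for an `i32x4`, and the name `reduce_chunked` and the handling of lengths that are not a multiple of 4 are assumptions added for illustration.

```rust
/// Scalar model of the vectorized reduction: a `[i32; 4]` accumulator stands
/// in for an `i32x4` (hypothetical helper, not part of the crate).
fn reduce_chunked(x: &[i32]) -> i32 {
    const LANES: usize = 4;
    let mut acc = [0i32; LANES]; // models i32x4::splat(0)

    let chunks = x.chunks_exact(LANES);
    let tail = chunks.remainder();

    // "Vertical" step: each lane accumulates independently of the others.
    for chunk in chunks {
        for lane in 0..LANES {
            acc[lane] = acc[lane].wrapping_add(chunk[lane]);
        }
    }

    // Single "horizontal" step at the end: models wrapping_sum().
    let mut total = acc.iter().fold(0i32, |s, &v| s.wrapping_add(v));

    // Leftover elements that do not fill a whole vector are added one by one.
    for &v in tail {
        total = total.wrapping_add(v);
    }
    total
}
```

With this sketch, `reduce_chunked(&[0, 1, 2, 3, 4, 5, 6, 7])` gives `28`, matching the example above, and a slice whose length is not a multiple of 4 is accepted instead of triggering the assertion.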

## Vector types

The vector type aliases are named according to the following scheme:

`{element_type}x{number_of_lanes} == Simd<[element_type; number_of_lanes]>`

where the following element types are supported:

• `i{element_width}`: signed integer
• `u{element_width}`: unsigned integer
• `f{element_width}`: float
• `m{element_width}`: mask (see below)
• `*{const,mut} T`: `const` and `mut` pointers

## Basic operations

```rust
use packed_simd::i32x4;

// Sets all elements to `0`:
let a = i32x4::splat(0);

// Reads a vector from a slice:
let mut arr = [0, 0, 0, 1, 2, 3, 4, 5];
let b = i32x4::from_slice_unaligned(&arr);

// Reads the 4-th element of a vector:
assert_eq!(b.extract(3), 1);

// Returns a new vector where the 4-th element is replaced with `1`:
let a = a.replace(3, 1);
assert_eq!(a, b);

// Writes a vector to a slice:
let a = a.replace(2, 1);
a.write_to_slice_unaligned(&mut arr[4..]);
assert_eq!(arr, [0, 0, 0, 1, 0, 0, 1, 1]);
```

## Conditional operations

One often needs to perform an operation on some lanes of the vector. Vector masks, like `m32x4`, allow selecting on which vector lanes an operation is to be performed:

```rust
use packed_simd::{i32x4, m16x4};

let a = i32x4::new(1, 1, 2, 2);

// Add `1` to the first two lanes of the vector.
let m = m16x4::new(true, true, false, false);
let a = m.select(a + 1, a);
assert_eq!(a, i32x4::splat(2));
```

The elements of a vector mask are either `true` or `false`. Here `true` means that a lane is "selected", while `false` means that a lane is not selected.

All vector masks implement a `mask.select(a: T, b: T) -> T` method that works on all vectors that have the same number of lanes as the mask. The resulting vector contains the elements of `a` for those lanes for which the mask is `true`, and the elements of `b` otherwise.

The example constructs a mask with the first two lanes set to `true` and the last two lanes set to `false`. This selects the first two lanes of `a + 1` and the last two lanes of `a`, producing a vector where the first two lanes have been incremented by `1`.

Note: mask `select` can be used on any vector type that has the same number of lanes as the mask. The example shows this by using `m16x4` instead of `m32x4`. It is typically more performant to use a mask element width equal to the element width of the vectors being operated upon. This is, however, not true for 512-bit wide vectors when targeting AVX-512, where the most efficient masks use only 1 bit per element.
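The semantics of `select` described above can be modelled over plain arrays. The following is a minimal sketch, not the crate's implementation: the free function `select` and the use of `[bool; N]` as a stand-in for a mask are assumptions made for illustration. Only the number of lanes `N` ties the mask to the vectors, which mirrors why an `m16x4` can select between `i32x4` values.

```rust
/// Lane-wise select over plain arrays (hypothetical helper that models
/// `mask.select(a, b)`; it is not the crate's implementation).
fn select<T: Copy, const N: usize>(mask: [bool; N], a: [T; N], b: [T; N]) -> [T; N] {
    let mut out = b; // start from the "not selected" values
    for lane in 0..N {
        if mask[lane] {
            out[lane] = a[lane]; // `true` picks the lane from `a`
        }
    }
    out
}
```

Calling `select([true, true, false, false], [2, 2, 3, 3], [1, 1, 2, 2])` yields `[2, 2, 2, 2]`, the same result as the `m.select(a + 1, a)` example.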

All vertical comparison operations return masks:

```rust
use packed_simd::i32x4;

let a = i32x4::new(1, 1, 3, 3);
let b = i32x4::new(2, 2, 0, 0);

// ge: >= (Greater Equal; see also lt, le, gt, eq, ne).
let m = a.ge(i32x4::splat(2));

if m.any() {
    // all / any / none allow coherent control flow
    let d = m.select(a, b);
    assert_eq!(d, i32x4::new(2, 2, 3, 3));
}
```
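As a rough scalar model of what a vertical comparison and the mask reductions compute, the sketch below uses `[bool; N]` in place of a mask type. The free functions `ge`, `any`, `all` and `none` are hypothetical helpers named after the methods used above; they are not the crate's API.

```rust
/// Lane-wise `>=` over plain arrays; models `a.ge(b)` (hypothetical helper).
fn ge<const N: usize>(a: [i32; N], b: [i32; N]) -> [bool; N] {
    let mut m = [false; N];
    for lane in 0..N {
        m[lane] = a[lane] >= b[lane];
    }
    m
}

/// Horizontal mask reductions; model `m.any()`, `m.all()` and `m.none()`.
fn any<const N: usize>(m: [bool; N]) -> bool {
    m.iter().any(|&lane| lane)
}

fn all<const N: usize>(m: [bool; N]) -> bool {
    m.iter().all(|&lane| lane)
}

fn none<const N: usize>(m: [bool; N]) -> bool {
    !any(m)
}
```

For the inputs of the example, `ge([1, 1, 3, 3], [2, 2, 2, 2])` is `[false, false, true, true]`, so `any` is `true` while `all` and `none` are both `false`.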

## Conversions

• lossless widening conversions: `From`/`Into` are implemented for vectors with the same number of lanes when the conversion is value preserving (same as in `std`).

• safe bitwise conversions: The cargo feature `into_bits` provides the `IntoBits`/`FromBits` traits (`x.into_bits()`). These perform safe bitwise `transmute`s when all bit patterns of the source type are valid bit patterns of the target type, and they are also implemented for the architecture-specific vector types of `std::arch`. For example, `let x: u8x8 = m8x8::splat(true).into_bits();` is provided because all `m8x8` bit patterns are valid `u8x8` bit patterns. The opposite is not true: not all `u8x8` bit patterns are valid `m8x8` bit patterns, so this operation cannot be performed safely using `x.into_bits()`. One needs to use `unsafe { core::mem::transmute(x) }` for that, making sure that the value in the `u8x8` is a valid bit pattern of `m8x8`.

• numeric casts (`as`): performed using `FromCast`/`Cast` (`x.cast()`), with the same semantics as `as`:

  • casting integer vectors whose lane types have the same size (e.g. `i32xN` -> `u32xN`) is a no-op,

  • casting from a larger integer to a smaller integer (e.g. `u32xN` -> `u8xN`) will truncate,

  • casting from a smaller integer to a larger integer (e.g. `u8xN` -> `u32xN`) will:
    • zero-extend if the source is unsigned, or
    • sign-extend if the source is signed,

  • casting from a float to an integer will round the float towards zero,

  • casting from an integer to float will produce the floating point representation of the integer, rounding to nearest, ties to even,

  • casting from an `f32` to an `f64` is perfect and lossless,

  • casting from an `f64` to an `f32` rounds to nearest, ties to even.

Numeric casts are not very "precise": depending on the source and target types, they are sometimes lossy and sometimes value-preserving.
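Since the cast rules above follow scalar `as`, they can be illustrated over plain arrays with the standard library alone. The sketch below is not the crate's implementation: the helper names are hypothetical, and each one simply applies `as` to every lane to show one of the listed rules.

```rust
/// Lane-wise models of `x.cast()` over plain arrays, built on scalar `as`
/// (hypothetical helpers, not the crate's implementation).

// Larger -> smaller integer: truncates to the low 8 bits.
fn u32_to_u8<const N: usize>(v: [u32; N]) -> [u8; N] {
    v.map(|x| x as u8)
}

// Smaller -> larger integer, unsigned source: zero-extends.
fn u8_to_u32<const N: usize>(v: [u8; N]) -> [u32; N] {
    v.map(|x| x as u32)
}

// Smaller -> larger integer, signed source: sign-extends.
fn i8_to_i32<const N: usize>(v: [i8; N]) -> [i32; N] {
    v.map(|x| x as i32)
}

// Float -> integer: rounds towards zero.
fn f32_to_i32<const N: usize>(v: [f32; N]) -> [i32; N] {
    v.map(|x| x as i32)
}

// Integer -> float: rounds to nearest, ties to even.
fn u32_to_f32<const N: usize>(v: [u32; N]) -> [f32; N] {
    v.map(|x| x as f32)
}
```

For instance, `u32_to_u8([0x1234, 256])` gives `[0x34, 0]`, `f32_to_i32([1.9, -1.9])` gives `[1, -1]`, and `u32_to_f32([16_777_217])` gives `[16_777_216.0]` because 2^24 + 1 is not representable in `f32` and the tie rounds to the even neighbour.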

## Macros

| Macro | Description |
|---|---|
| `shuffle` | Shuffles vector elements. |

## Structs

| Struct | Description |
|---|---|
| `LexicographicallyOrdered` | Wrapper over `T` implementing a lexicographical order via the `PartialOrd` and/or `Ord` traits. |
| `Simd` | Packed SIMD vector type. |
| `m8` | 8-bit wide mask. |
| `m16` | 16-bit wide mask. |
| `m32` | 32-bit wide mask. |
| `m64` | 64-bit wide mask. |
| `m128` | 128-bit wide mask. |
| `msize` | isize-wide mask. |

## Traits

| Trait | Description |
|---|---|
| `Cast` | Numeric cast from `Self` to `T`. |
| `FromCast` | Numeric cast from `T` to `Self`. |
| `Mask` | This trait is implemented by all mask types. |
| `SimdArray` | Trait implemented by arrays that can be SIMD types. |
| `SimdVector` | This trait is implemented by all SIMD vector types. |

## Type Definitions

| Type | Description |
|---|---|
| `cptrx2` | A vector with 2 `*const T` lanes. |
| `cptrx4` | A vector with 4 `*const T` lanes. |
| `cptrx8` | A vector with 8 `*const T` lanes. |
| `f32x2` | A 64-bit vector with 2 `f32` lanes. |
| `f32x4` | A 128-bit vector with 4 `f32` lanes. |
| `f32x8` | A 256-bit vector with 8 `f32` lanes. |
| `f32x16` | A 512-bit vector with 16 `f32` lanes. |
| `f64x2` | A 128-bit vector with 2 `f64` lanes. |
| `f64x4` | A 256-bit vector with 4 `f64` lanes. |
| `f64x8` | A 512-bit vector with 8 `f64` lanes. |
| `i8x2` | A 16-bit vector with 2 `i8` lanes. |
| `i8x4` | A 32-bit vector with 4 `i8` lanes. |
| `i8x8` | A 64-bit vector with 8 `i8` lanes. |
| `i8x16` | A 128-bit vector with 16 `i8` lanes. |
| `i8x32` | A 256-bit vector with 32 `i8` lanes. |
| `i8x64` | A 512-bit vector with 64 `i8` lanes. |
| `i16x2` | A 32-bit vector with 2 `i16` lanes. |
| `i16x4` | A 64-bit vector with 4 `i16` lanes. |
| `i16x8` | A 128-bit vector with 8 `i16` lanes. |
| `i16x16` | A 256-bit vector with 16 `i16` lanes. |
| `i16x32` | A 512-bit vector with 32 `i16` lanes. |
| `i32x2` | A 64-bit vector with 2 `i32` lanes. |
| `i32x4` | A 128-bit vector with 4 `i32` lanes. |
| `i32x8` | A 256-bit vector with 8 `i32` lanes. |
| `i32x16` | A 512-bit vector with 16 `i32` lanes. |
| `i64x2` | A 128-bit vector with 2 `i64` lanes. |
| `i64x4` | A 256-bit vector with 4 `i64` lanes. |
| `i64x8` | A 512-bit vector with 8 `i64` lanes. |
| `i128x1` | A 128-bit vector with 1 `i128` lane. |
| `i128x2` | A 256-bit vector with 2 `i128` lanes. |
| `i128x4` | A 512-bit vector with 4 `i128` lanes. |
| `isizex2` | A vector with 2 `isize` lanes. |
| `isizex4` | A vector with 4 `isize` lanes. |
| `isizex8` | A vector with 8 `isize` lanes. |
| `m8x2` | A 16-bit vector mask with 2 `m8` lanes. |
| `m8x4` | A 32-bit vector mask with 4 `m8` lanes. |
| `m8x8` | A 64-bit vector mask with 8 `m8` lanes. |
| `m8x16` | A 128-bit vector mask with 16 `m8` lanes. |
| `m8x32` | A 256-bit vector mask with 32 `m8` lanes. |
| `m8x64` | A 512-bit vector mask with 64 `m8` lanes. |
| `m16x2` | A 32-bit vector mask with 2 `m16` lanes. |
| `m16x4` | A 64-bit vector mask with 4 `m16` lanes. |
| `m16x8` | A 128-bit vector mask with 8 `m16` lanes. |
| `m16x16` | A 256-bit vector mask with 16 `m16` lanes. |
| `m16x32` | A 512-bit vector mask with 32 `m16` lanes. |
| `m32x2` | A 64-bit vector mask with 2 `m32` lanes. |
| `m32x4` | A 128-bit vector mask with 4 `m32` lanes. |
| `m32x8` | A 256-bit vector mask with 8 `m32` lanes. |
| `m32x16` | A 512-bit vector mask with 16 `m32` lanes. |
| `m64x2` | A 128-bit vector mask with 2 `m64` lanes. |
| `m64x4` | A 256-bit vector mask with 4 `m64` lanes. |
| `m64x8` | A 512-bit vector mask with 8 `m64` lanes. |
| `m128x1` | A 128-bit vector mask with 1 `m128` lane. |
| `m128x2` | A 256-bit vector mask with 2 `m128` lanes. |
| `m128x4` | A 512-bit vector mask with 4 `m128` lanes. |
| `mptrx2` | A vector with 2 `*mut T` lanes. |
| `mptrx4` | A vector with 4 `*mut T` lanes. |
| `mptrx8` | A vector with 8 `*mut T` lanes. |
| `msizex2` | A vector mask with 2 `msize` lanes. |
| `msizex4` | A vector mask with 4 `msize` lanes. |
| `msizex8` | A vector mask with 8 `msize` lanes. |
| `u8x2` | A 16-bit vector with 2 `u8` lanes. |
| `u8x4` | A 32-bit vector with 4 `u8` lanes. |
| `u8x8` | A 64-bit vector with 8 `u8` lanes. |
| `u8x16` | A 128-bit vector with 16 `u8` lanes. |
| `u8x32` | A 256-bit vector with 32 `u8` lanes. |
| `u8x64` | A 512-bit vector with 64 `u8` lanes. |
| `u16x2` | A 32-bit vector with 2 `u16` lanes. |
| `u16x4` | A 64-bit vector with 4 `u16` lanes. |
| `u16x8` | A 128-bit vector with 8 `u16` lanes. |
| `u16x16` | A 256-bit vector with 16 `u16` lanes. |
| `u16x32` | A 512-bit vector with 32 `u16` lanes. |
| `u32x2` | A 64-bit vector with 2 `u32` lanes. |
| `u32x4` | A 128-bit vector with 4 `u32` lanes. |
| `u32x8` | A 256-bit vector with 8 `u32` lanes. |
| `u32x16` | A 512-bit vector with 16 `u32` lanes. |
| `u64x2` | A 128-bit vector with 2 `u64` lanes. |
| `u64x4` | A 256-bit vector with 4 `u64` lanes. |
| `u64x8` | A 512-bit vector with 8 `u64` lanes. |
| `u128x1` | A 128-bit vector with 1 `u128` lane. |
| `u128x2` | A 256-bit vector with 2 `u128` lanes. |
| `u128x4` | A 512-bit vector with 4 `u128` lanes. |
| `usizex2` | A vector with 2 `usize` lanes. |
| `usizex4` | A vector with 4 `usize` lanes. |
| `usizex8` | A vector with 8 `usize` lanes. |