Crate packed_simd
Portable packed SIMD vectors
This crate is proposed for stabilization as std::packed_simd in RFC2366: std::simd.
The examples available in the examples/ subdirectory of the crate showcase how to use the library in practice.
Introduction
This crate exports Simd<[T; N]>: a packed vector of N elements of type T, as well as many type aliases for this type. For example, f32x4 is just an alias for Simd<[f32; 4]>.
The operations on packed vectors are, by default, “vertical”: they are applied to each vector lane in isolation from the others:
let a = i32x4::new(1, 2, 3, 4);
let b = i32x4::new(5, 6, 7, 8);
assert_eq!(a + b, i32x4::new(6, 8, 10, 12));
Many “horizontal” operations are also provided:
assert_eq!(a.wrapping_sum(), 10);
On virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, much slower. That is, the most portably efficient way of performing a reduction over a slice is to accumulate the results into a vector using vertical operations, and then perform a single horizontal operation at the end:
fn reduce(x: &[i32]) -> i32 {
    assert_eq!(x.len() % 4, 0);
    let mut sum = i32x4::splat(0); // [0, 0, 0, 0]
    for i in (0..x.len()).step_by(4) {
        sum += i32x4::from_slice_unaligned(&x[i..]);
    }
    sum.wrapping_sum()
}
let x = [0, 1, 2, 3, 4, 5, 6, 7];
assert_eq!(reduce(&x), 28);
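The same accumulate-then-reduce pattern can be sketched in plain Rust, with [i32; 4] standing in for i32x4 so that it runs without this crate. The name reduce_with_tail and the scalar tail loop are assumptions made for illustration; they are not part of the crate's API.

```rust
/// Sums a slice four lanes at a time; `[i32; 4]` is a stand-in for `i32x4`.
/// Hypothetical helper, not part of packed_simd.
fn reduce_with_tail(x: &[i32]) -> i32 {
    let mut acc = [0i32; 4]; // plays the role of i32x4::splat(0)
    let chunks = x.chunks_exact(4);
    let tail = chunks.remainder(); // elements left over when len % 4 != 0

    // "Vertical" step: each lane accumulates independently of the others.
    for chunk in chunks {
        for (lane, value) in acc.iter_mut().zip(chunk) {
            *lane = lane.wrapping_add(*value);
        }
    }

    // Single "horizontal" step at the end, like wrapping_sum().
    let mut total = acc.iter().fold(0i32, |s, &v| s.wrapping_add(v));

    // Scalar clean-up for the leftover elements.
    for &v in tail {
        total = total.wrapping_add(v);
    }
    total
}
```

Unlike reduce above, this variant does not require the slice length to be a multiple of 4, because the remainder is summed with scalar code.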
Vector types
The vector type aliases are named according to the following scheme:
{element_type}x{number_of_lanes} == Simd<[element_type; number_of_lanes]>
where the following element types are supported:
- i{element_width}: signed integer
- u{element_width}: unsigned integer
- f{element_width}: float
- m{element_width}: mask (see below)
- *{const,mut} T: const and mut pointers
Basic operations
// Sets all elements to `0`:
let a = i32x4::splat(0);
// Reads a vector from a slice:
let mut arr = [0, 0, 0, 1, 2, 3, 4, 5];
let b = i32x4::from_slice_unaligned(&arr);
// Reads the 4th element of a vector:
assert_eq!(b.extract(3), 1);
// Returns a new vector where the 4th element is replaced with `1`:
let a = a.replace(3, 1);
assert_eq!(a, b);
// Writes a vector to a slice:
let a = a.replace(2, 1);
a.write_to_slice_unaligned(&mut arr[4..]);
assert_eq!(arr, [0, 0, 0, 1, 0, 0, 1, 1]);
Conditional operations
One often needs to perform an operation on only some lanes of a vector. Vector masks, like m32x4, allow selecting the vector lanes on which an operation is to be performed:
let a = i32x4::new(1, 1, 2, 2);
// Add `1` to the first two lanes of the vector.
let m = m16x4::new(true, true, false, false);
let a = m.select(a + 1, a);
assert_eq!(a, i32x4::splat(2));
The elements of a vector mask are either true or false. Here true means that a lane is “selected”, while false means that a lane is not selected.
All vector masks implement a mask.select(a: T, b: T) -> T method that works on all vectors that have the same number of lanes as the mask. The resulting vector contains the elements of a for those lanes for which the mask is true, and the elements of b otherwise.
The example constructs a mask with the first two lanes set to true and the last two lanes set to false. This selects the first two lanes of a + 1 and the last two lanes of a, producing a vector where the first two lanes have been incremented by 1.
Note: mask select can be used on vector types that have the same number of lanes as the mask. The example shows this by using m16x4 instead of m32x4. It is typically more performant to use a mask element width equal to the element width of the vectors being operated upon. This is, however, not true for 512-bit wide vectors when targeting AVX-512, where the most efficient masks use only 1 bit per element.
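The lane-wise behavior of select described above can be modeled in plain Rust. This is only a sketch of the semantics, assuming arrays as stand-ins for vectors and masks; the free function select below is hypothetical and says nothing about how the crate implements it.

```rust
/// Lane-wise select: takes `a[i]` where `mask[i]` is true, otherwise `b[i]`.
/// Arrays stand in for SIMD vectors; this is a model, not the crate's code.
fn select<T: Copy, const N: usize>(mask: [bool; N], a: [T; N], b: [T; N]) -> [T; N] {
    let mut out = b; // start from the "not selected" values
    for i in 0..N {
        if mask[i] {
            out[i] = a[i]; // selected lanes come from `a`
        }
    }
    out
}
```

Note that the mask only has to match the number of lanes, not the element type, which mirrors the m16x4-with-i32x4 combination used in the example.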
All vertical comparison operations return masks:
let a = i32x4::new(1, 1, 3, 3);
let b = i32x4::new(2, 2, 0, 0);
// ge: >= (Greater or Equal; see also lt, le, gt, eq, ne).
let m = a.ge(i32x4::splat(2));
if m.any() {
    // all / any / none allow coherent control flow
    let d = m.select(a, b);
    assert_eq!(d, i32x4::new(2, 2, 3, 3));
}
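To make the mask-producing comparison and the any / all / none reductions concrete, here is a plain-Rust model of their semantics. The free functions ge, any, all and none are hypothetical stand-ins operating on arrays; in the crate these are methods on the vector and mask types.

```rust
/// Lane-wise `>=`: each output lane is the result of comparing one pair of lanes.
fn ge<const N: usize>(a: [i32; N], b: [i32; N]) -> [bool; N] {
    let mut mask = [false; N];
    for i in 0..N {
        mask[i] = a[i] >= b[i];
    }
    mask
}

/// True if at least one lane of the mask is set.
fn any<const N: usize>(mask: [bool; N]) -> bool {
    mask.iter().any(|&lane| lane)
}

/// True if every lane of the mask is set.
fn all<const N: usize>(mask: [bool; N]) -> bool {
    mask.iter().all(|&lane| lane)
}

/// True if no lane of the mask is set.
fn none<const N: usize>(mask: [bool; N]) -> bool {
    !any(mask)
}
```

With a = [1, 1, 3, 3] compared against a vector of 2s, only the last two lanes are set, so any holds while all and none do not.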
Conversions

- lossless widening conversions: From/Into are implemented for vectors with the same number of lanes when the conversion is value preserving (same as in std).

- safe bitwise conversions: The cargo feature into_bits provides the IntoBits/FromBits traits (x.into_bits()). These perform safe bitwise transmutes when all bit patterns of the source type are valid bit patterns of the target type, and are also implemented for the architecture-specific vector types of std::arch. For example, let x: u8x8 = m8x8::splat(true).into_bits(); is provided because all m8x8 bit patterns are valid u8x8 bit patterns. However, the opposite is not true: not all u8x8 bit patterns are valid m8x8 bit patterns, so this operation cannot be performed safely using x.into_bits(). One needs to use unsafe { std::mem::transmute(x) } for that, making sure that the value in the u8x8 is a valid bit pattern of m8x8.

- numeric casts (as): are performed using FromCast/Cast (x.cast()), just like as:
  - casting integer vectors whose lane types have the same size (e.g. i32xN -> u32xN) is a no-op,
  - casting from a larger integer to a smaller integer (e.g. u32xN -> u8xN) will truncate,
  - casting from a smaller integer to a larger integer (e.g. u8xN -> u32xN) will zero-extend if the source is unsigned, or sign-extend if the source is signed,
  - casting from a float to an integer will round the float towards zero,
  - casting from an integer to float will produce the floating point representation of the integer, rounding to nearest, ties to even,
  - casting from an f32 to an f64 is perfect and lossless,
  - casting from an f64 to an f32 rounds to nearest, ties to even.

  Numeric casts are not very “precise”: depending on the types involved they are sometimes lossy and sometimes value preserving.
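Because the numeric casts are described as behaving just like as, the rules in the list can be illustrated with scalar as applied to each lane of an array. This is a sketch under that assumption: the helper names are hypothetical, arrays stand in for vectors, and only in-range values are used for the float-to-integer case.

```rust
// Each helper applies the scalar `as` cast to every lane of a 4-lane array.

/// Same-size integer cast: the bits are reinterpreted, nothing changes.
fn i32_to_u32(v: [i32; 4]) -> [u32; 4] { v.map(|x| x as u32) }

/// Larger to smaller integer: the high bits are truncated.
fn u32_to_u8(v: [u32; 4]) -> [u8; 4] { v.map(|x| x as u8) }

/// Smaller to larger, unsigned source: zero-extension.
fn u8_to_u32(v: [u8; 4]) -> [u32; 4] { v.map(|x| x as u32) }

/// Smaller to larger, signed source: sign-extension.
fn i8_to_i32(v: [i8; 4]) -> [i32; 4] { v.map(|x| x as i32) }

/// Float to integer: rounds towards zero.
fn f32_to_i32(v: [f32; 4]) -> [i32; 4] { v.map(|x| x as i32) }

/// Integer to float: rounds to nearest, ties to even.
fn i32_to_f32(v: [i32; 4]) -> [f32; 4] { v.map(|x| x as f32) }

/// f64 to f32: rounds to nearest, ties to even.
fn f64_to_f32(v: [f64; 4]) -> [f32; 4] { v.map(|x| x as f32) }
```

For instance, 16_777_217 (2^24 + 1) lies exactly halfway between two representable f32 values, so the integer-to-float cast resolves the tie to the even neighbor, 16_777_216.0.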

Hardware Features
This crate can use different hardware features based on your configured RUSTFLAGS. For example, with no configured RUSTFLAGS, u64x8 on x86_64 will use SSE2 operations like PCMPEQD. If you configure RUSTFLAGS='-C target-feature=+avx2,+avx' on supported x86_64 hardware, the same u64x8 may use wider AVX2 operations like VPCMPEQQ. It is important for performance and for hardware support requirements that you choose an appropriate set of target-feature and target-cpu options during builds. For more information, see the Performance guide.
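One way to check which target features a given build was actually compiled with is the standard cfg!(target_feature = "...") macro. The sketch below is not part of this crate; the function name compiled_simd_level and the choice of features to probe are assumptions made for illustration.

```rust
/// Reports the widest of a few x86 target features enabled at compile time.
/// Hypothetical helper; the result depends on the RUSTFLAGS used for the build.
fn compiled_simd_level() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "avx") {
        "avx"
    } else if cfg!(target_feature = "sse2") {
        "sse2"
    } else {
        // Non-x86 targets, or x86 builds without any of the features above.
        "baseline"
    }
}
```

Because cfg! is evaluated at compile time, the returned value reflects the target-feature and target-cpu options of the build, not the capabilities of the machine the binary later runs on.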
Re-exports
pub use crate::sealed::Shuffle;
Macros
- Shuffles vector elements.
Structs
- Wrapper over T implementing a lexicographical order via the PartialOrd and/or Ord traits.
- Packed SIMD vector type.
- 8-bit wide mask.
- 16-bit wide mask.
- 32-bit wide mask.
- 64-bit wide mask.
- 128-bit wide mask.
- isize-wide mask.
Traits
- Numeric cast from Self to T.
- FromBits (requires the into_bits feature): Safe lossless bitwise conversion from T to Self.
- Numeric cast from T to Self.
- IntoBits (requires the into_bits feature): Safe lossless bitwise conversion from Self to T.
- This trait is implemented by all mask types.
- Trait implemented by arrays that can be SIMD types.
- This trait is implemented by all SIMD vector types.
Type Definitions
- A vector with 2 *const T lanes.
- A vector with 4 *const T lanes.
- A vector with 8 *const T lanes.
- A 64-bit vector with 2 f32 lanes.
- A 128-bit vector with 4 f32 lanes.
- A 256-bit vector with 8 f32 lanes.
- A 512-bit vector with 16 f32 lanes.
- A 128-bit vector with 2 f64 lanes.
- A 256-bit vector with 4 f64 lanes.
- A 512-bit vector with 8 f64 lanes.
- A 16-bit vector with 2 i8 lanes.
- A 32-bit vector with 4 i8 lanes.
- A 64-bit vector with 8 i8 lanes.
- A 128-bit vector with 16 i8 lanes.
- A 256-bit vector with 32 i8 lanes.
- A 512-bit vector with 64 i8 lanes.
- A 32-bit vector with 2 i16 lanes.
- A 64-bit vector with 4 i16 lanes.
- A 128-bit vector with 8 i16 lanes.
- A 256-bit vector with 16 i16 lanes.
- A 512-bit vector with 32 i16 lanes.
- A 64-bit vector with 2 i32 lanes.
- A 128-bit vector with 4 i32 lanes.
- A 256-bit vector with 8 i32 lanes.
- A 512-bit vector with 16 i32 lanes.
- A 128-bit vector with 2 i64 lanes.
- A 256-bit vector with 4 i64 lanes.
- A 512-bit vector with 8 i64 lanes.
- A 128-bit vector with 1 i128 lane.
- A 256-bit vector with 2 i128 lanes.
- A 512-bit vector with 4 i128 lanes.
- A vector with 2 isize lanes.
- A vector with 4 isize lanes.
- A vector with 8 isize lanes.
- A 16-bit vector mask with 2 m8 lanes.
- A 32-bit vector mask with 4 m8 lanes.
- A 64-bit vector mask with 8 m8 lanes.
- A 128-bit vector mask with 16 m8 lanes.
- A 256-bit vector mask with 32 m8 lanes.
- A 512-bit vector mask with 64 m8 lanes.
- A 32-bit vector mask with 2 m16 lanes.
- A 64-bit vector mask with 4 m16 lanes.
- A 128-bit vector mask with 8 m16 lanes.
- A 256-bit vector mask with 16 m16 lanes.
- A 512-bit vector mask with 32 m16 lanes.
- A 64-bit vector mask with 2 m32 lanes.
- A 128-bit vector mask with 4 m32 lanes.
- A 256-bit vector mask with 8 m32 lanes.
- A 512-bit vector mask with 16 m32 lanes.
- A 128-bit vector mask with 2 m64 lanes.
- A 256-bit vector mask with 4 m64 lanes.
- A 512-bit vector mask with 8 m64 lanes.
- A 128-bit vector mask with 1 m128 lane.
- A 256-bit vector mask with 2 m128 lanes.
- A 512-bit vector mask with 4 m128 lanes.
- A vector with 2 *mut T lanes.
- A vector with 4 *mut T lanes.
- A vector with 8 *mut T lanes.
- A vector mask with 2 msize lanes.
- A vector mask with 4 msize lanes.
- A vector mask with 8 msize lanes.
- A 16-bit vector with 2 u8 lanes.
- A 32-bit vector with 4 u8 lanes.
- A 64-bit vector with 8 u8 lanes.
- A 128-bit vector with 16 u8 lanes.
- A 256-bit vector with 32 u8 lanes.
- A 512-bit vector with 64 u8 lanes.
- A 32-bit vector with 2 u16 lanes.
- A 64-bit vector with 4 u16 lanes.
- A 128-bit vector with 8 u16 lanes.
- A 256-bit vector with 16 u16 lanes.
- A 512-bit vector with 32 u16 lanes.
- A 64-bit vector with 2 u32 lanes.
- A 128-bit vector with 4 u32 lanes.
- A 256-bit vector with 8 u32 lanes.
- A 512-bit vector with 16 u32 lanes.
- A 128-bit vector with 2 u64 lanes.
- A 256-bit vector with 4 u64 lanes.
- A 512-bit vector with 8 u64 lanes.
- A 128-bit vector with 1 u128 lane.
- A 256-bit vector with 2 u128 lanes.
- A 512-bit vector with 4 u128 lanes.
- A vector with 2 usize lanes.
- A vector with 4 usize lanes.
- A vector with 8 usize lanes.