faster 0.4.3

Explicit SIMD for humans
Documentation
  • faster #+BEGIN_HTML

** SIMD for Humans Easy, powerful, portable, absurdly fast numerical calculations. Includes static dispatch with inlining based on your platform and vector types, zero-allocation iteration, vectorized loading/storing, and support for uneven collections.

It looks something like this: #+BEGIN_SRC rust let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter() .simd_map(f32s(0.0), |v| { f32s(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() - f32s(4.0) - f32s(2.0) }) .scalar_collect(); #+END_SRC

Which is analogous to this scalar code: #+BEGIN_SRC rust let lots_of_3s = (&[-123.456f32; 128][..]).iter() .map(|v| { 9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0 }) .collect::<Vec>(); #+END_SRC

The vector size is entirely determined by the machine you're compiling for - it attempts to use the largest vector size supported by your machine, and works on any platform or architecture (see below for details).

Compare this to traditional explicit SIMD: #+BEGIN_SRC rust use std::mem::transmute; use stdsimd::{f32x4, f32x8};

let lots_of_3s = &mut [-123.456f32; 128][..];

if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) { for ch in init.chunks_mut(4) { let v = f32x4::load(ch, 0); let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) }; let abs_mask = f32x4::splat(scalar_abs_mask); // There isn't actually an absolute value intrinsic for floats - you // have to look at the IEEE 754 spec and do some bit flipping v = unsafe { _mm_and_ps(v, abs_mask) }; v = unsafe { _mm_sqrt_ps(v) }; v = unsafe { _mm_rsqrt_ps(v) }; v = unsafe { _mm_ceil_ps(v) }; v = unsafe { _mm_sqrt_ps(v) }; v = unsafe { _mm_mul_ps(v, 9.0) }; v = unsafe { _mm_sub_ps(v, 4.0) }; v = unsafe { _mm_sub_ps(v, 2.0) }; f32x4::store(ch, 0); } } else if cfg!(all(not(target_feature = "avx512"), target_feature = "avx")) { for ch in init.chunks_mut(8) { let v = f32x8::load(ch, 0); let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) }; let abs_mask = f32x8::splat(scalar_abs_mask); v = unsafe { _mm256_and_ps(v, abs_mask) }; v = unsafe { _mm256_sqrt_ps(v) }; v = unsafe { _mm256_rsqrt_ps(v) }; v = unsafe { _mm256_ceil_ps(v) }; v = unsafe { _mm256_sqrt_ps(v) }; v = unsafe { _mm256_mul_ps(v, 9.0) }; v = unsafe { _mm256_sub_ps(v, 4.0) }; v = unsafe { _mm256_sub_ps(v, 2.0) }; f32x8::store(ch, 0); } } #+END_SRC Even with all of that boilerplate, this still only supports x86-64 machines with SSE or AVX - and you have to look up each intrinsic to ensure it's usable for your compilation target. ** Upcoming Features A rewrite of the iterator API is upcoming, as well as internal changes to better match the direction Rust is taking with explicit SIMD. ** Compatibility Faster currently supports any architecture with floating point support, although hardware acceleration is only enabled on machines with x86's vector extensions. ** Performance Here are some extremely unscientific benchmarks which, at least, prove that this isn't any worse than scalar iterators. Even on ancient CPUs, a lot of performance can be extracted out of SIMD. Surprisingly, using SIMD iterators performs better than scalar iterators even on the SSE-less Pentium.

However, intentionally worsening your program's locality for SIMD (as seen in tests::bench_determinmant2 and tests::bench_determinant3) is not a worthwhile tradeoff unless you are doing a significant amount of work per vector.

#+BEGIN_SRC shell $ RUSTFLAGS="-C target-cpu=ivybridge" cargo bench # host is ivybridge; target has AVX test tests::bench_determinant2_scalar ... bench: 391 ns/iter (+/- 2) test tests::bench_determinant2_simd ... bench: 375 ns/iter (+/- 1) test tests::bench_determinant3_scalar ... bench: 350 ns/iter (+/- 1) test tests::bench_determinant3_simd ... bench: 470 ns/iter (+/- 2) test tests::bench_map_scalar ... bench: 6,952 ns/iter (+/- 26) test tests::bench_map_simd ... bench: 875 ns/iter (+/- 3) test tests::bench_map_uneven_simd ... bench: 880 ns/iter (+/- 3) test tests::bench_nop_scalar ... bench: 37 ns/iter (+/- 0) test tests::bench_nop_simd ... bench: 34 ns/iter (+/- 0) test tests::bench_reduce_scalar ... bench: 6,876 ns/iter (+/- 16) test tests::bench_reduce_simd ... bench: 835 ns/iter (+/- 2) test tests::bench_reduce_uneven_simd ... bench: 836 ns/iter (+/- 2) test tests::bench_zip_nop_scalar ... bench: 624 ns/iter (+/- 2) test tests::bench_zip_nop_simd ... bench: 361 ns/iter (+/- 1) test tests::bench_zip_scalar ... bench: 862 ns/iter (+/- 4) test tests::bench_zip_simd ... bench: 771 ns/iter (+/- 2)

RUSTFLAGS="-C target-cpu=x86-64" cargo bench # host is ivybridge; target has SSE2 test tests::bench_determinant2_scalar ... bench: 426 ns/iter (+/- 2) test tests::bench_determinant2_simd ... bench: 376 ns/iter (+/- 2) test tests::bench_determinant3_scalar ... bench: 355 ns/iter (+/- 2) test tests::bench_determinant3_simd ... bench: 486 ns/iter (+/- 3) test tests::bench_map_scalar ... bench: 7,157 ns/iter (+/- 59) test tests::bench_map_simd ... bench: 1,886 ns/iter (+/- 10) test tests::bench_map_uneven_simd ... bench: 1,889 ns/iter (+/- 11) test tests::bench_nop_scalar ... bench: 38 ns/iter (+/- 0) test tests::bench_nop_simd ... bench: 34 ns/iter (+/- 0) test tests::bench_reduce_scalar ... bench: 7,002 ns/iter (+/- 29) test tests::bench_reduce_simd ... bench: 1,865 ns/iter (+/- 10) test tests::bench_reduce_uneven_simd ... bench: 1,937 ns/iter (+/- 7) test tests::bench_zip_nop_scalar ... bench: 623 ns/iter (+/- 1) test tests::bench_zip_nop_simd ... bench: 333 ns/iter (+/- 3) test tests::bench_zip_scalar ... bench: 971 ns/iter (+/- 5) test tests::bench_zip_simd ... bench: 525 ns/iter (+/- 3)

$ RUSTFLAGS="-C target-cpu=pentium" cargo bench # host is ivybridge; this only runs the polyfills! test tests::bench_determinant2_scalar ... bench: 427 ns/iter (+/- 2) test tests::bench_determinant2_simd ... bench: 402 ns/iter (+/- 1) test tests::bench_determinant3_scalar ... bench: 354 ns/iter (+/- 1) test tests::bench_determinant3_simd ... bench: 593 ns/iter (+/- 1) test tests::bench_map_scalar ... bench: 7,195 ns/iter (+/- 28) test tests::bench_map_simd ... bench: 6,271 ns/iter (+/- 22) test tests::bench_map_uneven_simd ... bench: 6,288 ns/iter (+/- 22) test tests::bench_nop_scalar ... bench: 38 ns/iter (+/- 0) test tests::bench_nop_simd ... bench: 69 ns/iter (+/- 0) test tests::bench_reduce_scalar ... bench: 7,004 ns/iter (+/- 17) test tests::bench_reduce_simd ... bench: 6,063 ns/iter (+/- 17) test tests::bench_reduce_uneven_simd ... bench: 6,107 ns/iter (+/- 11) test tests::bench_zip_nop_scalar ... bench: 623 ns/iter (+/- 2) test tests::bench_zip_nop_simd ... bench: 289 ns/iter (+/- 1) test tests::bench_zip_scalar ... bench: 972 ns/iter (+/- 3) test tests::bench_zip_simd ... bench: 621 ns/iter (+/- 3) #+END_SRC