rten_simd/lib.rs

//! Portable SIMD library.
//!
//! rten-simd is a library for defining operations that are accelerated using
//! [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data)
//! instruction sets such as AVX2, Arm Neon or WebAssembly SIMD. Operations are
//! defined once using safe, portable APIs, then _dispatched_ at runtime to
//! evaluate the operation using the best available SIMD instruction set
//! architecture (ISA) on the current CPU.
//!
//! The design is inspired by Google's
//! [Highway](https://github.com/google/highway) library for C++ and the
//! [pulp](https://docs.rs/pulp/latest/pulp/) crate.
//! ## Differences from `std::simd`
//!
//! In nightly Rust the standard library has a built-in portable SIMD API,
//! `std::simd`. This library differs in several ways:
//!
//! 1. It is available on stable Rust.
//! 2. The instruction set is selected at runtime rather than at compile time.
//!    On x86 an operation may be compiled for AVX-512, AVX2 and generic (SSE).
//!    If the binary is run on a system supporting AVX-512, that version will
//!    be used. The same binary on an older system may use the generic (SSE)
//!    version.
//! 3. Operations use the full available SIMD vector width, which varies by
//!    instruction set, as opposed to specifying a fixed width in the code.
//!    For example a SIMD vector with f32 elements has 4 lanes on Arm Neon and
//!    16 lanes under AVX-512.
//!
//!    The API is designed to support scalable vector ISAs such as [Arm
//!    SVE](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions)
//!    and RVV in the future, where the vector length is known only at runtime.
//!
//! 4. Semantics are chosen to be "performance portable". This means that the
//!    behavior is chosen based on what maps well to the hardware, rather than
//!    strictly matching Rust's behavior for scalars as `std::simd` generally
//!    does. It also means some operations may have different behaviors in edge
//!    cases on different platforms. This is similar to [WebAssembly Relaxed
//!    SIMD](https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md).
//!
//! ## Supported architectures
//!
//! The currently supported SIMD ISAs are:
//!
//! - AVX2
//! - AVX-512
//! - Arm Neon
//! - WebAssembly SIMD (including relaxed SIMD)
//!
//! There is also a generic fallback implemented using 128-bit arrays which is
//! designed to be autovectorization-friendly (i.e. it compiles on all
//! platforms, and should enable the compiler to use SSE or similar
//! instructions).
//!
//! ## Example
//!
//! This code defines an operation which squares each value in a slice and
//! evaluates it on a vector of floats:
//!
//! ```
//! use rten_simd::{Isa, SimdOp};
//! use rten_simd::ops::NumOps;
//! use rten_simd::functional::simd_map;
//!
//! struct Square<'a> {
//!     xs: &'a mut [f32],
//! }
//!
//! impl<'a> SimdOp for Square<'a> {
//!     type Output = &'a mut [f32];
//!
//!     #[inline(always)]
//!     fn eval<I: Isa>(self, isa: I) -> Self::Output {
//!         let ops = isa.f32();
//!         simd_map(ops, self.xs, #[inline(always)] |x| ops.mul(x, x))
//!     }
//! }
//!
//! let mut buf: Vec<_> = (0..32).map(|x| x as f32).collect();
//! let expected: Vec<_> = buf.iter().map(|x| *x * *x).collect();
//! let squared = Square { xs: &mut buf }.dispatch();
//! assert_eq!(squared, &expected);
//! ```
//!
//! This example shows the basic steps to define a vectorized operation:
//!
//! 1. Create a struct containing the operation's parameters.
//! 2. Implement the [`SimdOp`] trait for the struct to define how to evaluate
//!    the operation.
//! 3. Call [`SimdOp::dispatch`] to evaluate the operation using the best
//!    available instruction set. Here "best" refers to the ISA with the
//!    widest vectors, and thus the maximum amount of parallelism.
//!
//! Note the use of the `#[inline(always)]` attribute on closures and functions
//! called within `eval`. See the section on inlining below for an explanation.
//!
//! ## Separation of vector types and operations
//!
//! SIMD vectors are effectively arrays (like `[T; N]`) with a larger alignment.
//! A SIMD vector type can be created whether or not the associated instructions
//! are supported on the system.
//!
//! Performing a SIMD operation, however, requires the caller to first ensure
//! that the instructions are supported on the current system. To enforce this,
//! operations are separated from the vector type, and the types providing
//! access to SIMD operations ([`Isa`]) can only be instantiated if the
//! instruction set is supported.
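//!
//! As a sketch of how this gating works, ISA-specific code (here for AVX2 on
//! x86-64) might look like the following. The fallible `new()` constructor is
//! an assumption for illustration; the concrete way each ISA type is obtained
//! may differ.
//!
//! ```ignore
//! use rten_simd::Isa;
//! use rten_simd::isa::Avx2Isa;
//! use rten_simd::ops::NumOps;
//!
//! // Hypothetical fallible constructor: returns `Some` only if the current
//! // CPU supports AVX2. Holding an `Avx2Isa` value then serves as proof
//! // that its operations are safe to invoke.
//! if let Some(isa) = Avx2Isa::new() {
//!     let ops = isa.f32();
//!     let zero = ops.zero();
//! }
//! ```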
//!
//! ## Overview of key traits
//!
//! The [`SimdOp`] trait defines an _operation_ which can be vectorized using
//! different SIMD instruction sets. This trait has a
//! [`dispatch`](SimdOp::dispatch) method to perform the operation.
//!
//! An implementation of the [`Isa`] trait is passed to [`SimdOp::eval`]. The
//! [`Isa`] is the entry point for operations on SIMD vectors. It provides
//! access to implementations of the [`NumOps`](ops::NumOps) trait and
//! sub-traits for each element type. For example [`Isa::f32`] provides
//! operations on SIMD vectors with `f32` elements.
//!
//! The [`NumOps`](ops::NumOps) trait provides operations that are available on
//! all SIMD vectors. The sub-traits [`FloatOps`](ops::FloatOps) and
//! [`IntOps`](ops::IntOps) provide operations that are only available on SIMD
//! vectors with float and integer elements respectively. There is also
//! [`SignedIntOps`](ops::SignedIntOps) for signed integer operations. Finally
//! there are additional traits for operations available only on other subsets
//! of element types. For example [`Extend`](ops::Extend) widens each lane to
//! one with twice the bit-width.
//!
//! SIMD operations (e.g. [`NumOps::add`](ops::NumOps::add)) take SIMD vectors
//! as arguments. These vectors are either platform-specific types (e.g.
//! `float32x4_t` on Arm) or transparent wrappers around them. The [`Simd`]
//! trait is implemented for all vector types. The [`Elem`] trait is implemented
//! for supported element types, providing required numeric operations.
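//!
//! As an example of this array correspondence, a vector produced inside
//! [`SimdOp::eval`] can be converted back into a plain array with `to_array`
//! (this fragment assumes it runs inside an `eval<I: Isa>` body):
//!
//! ```ignore
//! let ops = isa.f32();
//! let v = ops.zero();
//! // The array length depends on the dispatched ISA: e.g. 4 f32 lanes on
//! // Arm Neon, 16 under AVX-512.
//! let arr = v.to_array();
//! ```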
//!
//! ## Use with slices
//!
//! SIMD operations are usually applied to a slice of elements. To support this,
//! the [`SimdIterable`] trait provides a way to iterate over SIMD vector-sized
//! chunks of a slice, using padding or masking to handle slice lengths that are
//! not a multiple of the vector size.
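//!
//! For example, a sum-of-squares reduction can be written by folding over
//! vector-sized chunks, in the same pattern as the generic sum example
//! further below:
//!
//! ```
//! use rten_simd::{Isa, Simd, SimdIterable, SimdOp};
//! use rten_simd::ops::NumOps;
//!
//! struct SumSquares<'a>(&'a [f32]);
//!
//! impl SimdOp for SumSquares<'_> {
//!     type Output = f32;
//!
//!     #[inline(always)]
//!     fn eval<I: Isa>(self, isa: I) -> f32 {
//!         let ops = isa.f32();
//!         // Accumulate per-lane partial sums, then reduce across lanes.
//!         let partial_sums = self.0.simd_iter(ops).fold(
//!             ops.zero(),
//!             |acc, x| ops.add(acc, ops.mul(x, x))
//!         );
//!         partial_sums.to_array().into_iter().sum()
//!     }
//! }
//!
//! assert_eq!(SumSquares(&[1.0, 2.0, 3.0]).dispatch(), 14.0);
//! ```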
//!
//! The [`functional`] module provides utilities for defining vectorized
//! transforms on slices (e.g. [`simd_map`](functional::simd_map)).
//!
//! The [`SliceWriter`] utility provides a way to incrementally initialize the
//! contents of a slice with the results of SIMD operations, by writing one
//! SIMD vector at a time.
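//!
//! A hypothetical sketch of this pattern is shown below. The method names
//! `new` and `write_vec` are assumptions for illustration; consult the
//! [`SliceWriter`] docs for the actual API.
//!
//! ```ignore
//! // Hypothetical API: incrementally fill `out` with squared values.
//! let mut writer = SliceWriter::new(&mut out);
//! for v in xs.simd_iter(ops) {
//!     // Write the result one SIMD vector at a time.
//!     writer.write_vec(ops.mul(v, v));
//! }
//! ```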
//!
//! The [`SimdUnaryOp`] trait provides a convenient way to define unary
//! operations (like [`Iterator::map`]) on slices.
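//!
//! An implementation might look like the following sketch. The trait's exact
//! method signature, including the `I::F32` associated vector type, is an
//! assumption here and may differ from the real definition.
//!
//! ```ignore
//! struct Cube;
//!
//! impl SimdUnaryOp<f32> for Cube {
//!     // Assumed signature: maps one input vector to one output vector.
//!     #[inline(always)]
//!     fn eval<I: Isa>(&self, isa: I, x: I::F32) -> I::F32 {
//!         let ops = isa.f32();
//!         ops.mul(ops.mul(x, x), x)
//!     }
//! }
//! ```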
//!
//! ## Importance of inlining
//!
//! In the example above, `#[inline(always)]` attributes are used to ensure
//! that the whole `eval` implementation is compiled to a single function. This
//! is required to ensure that the platform-specific intrinsics (from
//! [`core::arch`]) are compiled to direct instructions with no function call
//! overhead.
//!
//! Failure to inline these intrinsics will significantly harm performance,
//! since most of the runtime will be spent in function call overhead rather
//! than actual computation. This issue affects platforms where the availability
//! of the SIMD instruction set is not guaranteed at compile time. This
//! includes AVX2 and AVX-512 on x86-64, but not Arm Neon or WASM SIMD.
//!
//! If a vectorized operation performs more slowly than expected, use a profiler
//! such as [samply](https://github.com/mstange/samply) to verify that the
//! intrinsics have been inlined and thus do not appear in the list of called
//! functions.
//!
//! The need for this forced inlining is expected to change in the future with
//! updates to how Rust's [`target_feature`
//! attribute](https://github.com/rust-lang/rust/issues/69098) works.
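//!
//! As an illustration, helper functions called from `eval` should themselves
//! carry the attribute. The generic signature below, with a `NumOps` bound
//! and an associated `Simd` vector type, is an assumption for illustration:
//!
//! ```ignore
//! // Without `#[inline(always)]`, each call to `square` may compile to a
//! // real function call, preventing the intrinsics it uses from being
//! // inlined into the dispatched function.
//! #[inline(always)]
//! fn square<O: NumOps<f32>>(ops: O, x: O::Simd) -> O::Simd {
//!     ops.mul(x, x)
//! }
//! ```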
//!
//! ## Generic operations
//!
//! It is possible to define operations which are generic over the element type
//! by using the [`GetNumOps`](ops::GetNumOps) trait and related traits. These
//! are implemented for supported element types and provide a way to get the
//! [`NumOps`](ops::NumOps) implementation for that element type from an `Isa`.
//! This can be used to define [`SimdOp`]s which are generic over the element
//! type.
//!
//! This example defines an operation which can sum a slice of any supported
//! element type:
//!
//! ```
//! use std::iter::Sum;
//! use rten_simd::{Isa, Simd, SimdIterable, SimdOp};
//! use rten_simd::ops::{GetNumOps, NumOps};
//!
//! struct SimdSum<'a, T>(&'a [T]);
//!
//! impl<'a, T: GetNumOps + Sum> SimdOp for SimdSum<'a, T> {
//!     type Output = T;
//!
//!     #[inline(always)]
//!     fn eval<I: Isa>(self, isa: I) -> Self::Output {
//!         let ops = T::num_ops(isa);
//!         let partial_sums = self.0.simd_iter(ops).fold(
//!             ops.zero(),
//!             |sum, x| ops.add(sum, x)
//!         );
//!         partial_sums.to_array().into_iter().sum()
//!     }
//! }
//!
//! assert_eq!(SimdSum(&[1.0f32, 2.0, 3.0]).dispatch(), 6.0);
//! assert_eq!(SimdSum(&[1i32, 2, 3]).dispatch(), 6);
//! assert_eq!(SimdSum(&[1u8, 2, 3]).dispatch(), 6u8);
//! ```

mod arch;
mod dispatch;
mod elem;
pub mod functional;
pub mod isa_detection;
mod iter;
pub mod ops;
mod simd;
pub mod span;
mod writer;

/// Target-specific [`Isa`] implementations.
///
/// Most code using this library will not need to use these types. Instead the
/// appropriate ISA will be constructed when using a dispatch method such as
/// [`SimdOp::dispatch`]. These types are exported for use in downstream code
/// which uses the portable SIMD APIs but also depends on ISA-specific
/// properties.
pub mod isa {
    pub use super::arch::generic::GenericIsa;

    #[cfg(target_arch = "aarch64")]
    pub use super::arch::aarch64::ArmNeonIsa;

    #[cfg(target_arch = "x86_64")]
    pub use super::arch::x86_64::Avx2Isa;

    #[cfg(target_arch = "x86_64")]
    pub use super::arch::x86_64::Avx512Isa;

    #[cfg(target_arch = "wasm32")]
    #[cfg(target_feature = "simd128")]
    pub use super::arch::wasm32::Wasm32Isa;
}

pub use dispatch::{SimdOp, SimdUnaryOp};
pub use elem::Elem;
pub use iter::{Iter, SimdIterable};
pub use ops::Isa;
pub use simd::{Mask, Simd};
pub use writer::SliceWriter;

#[cfg(target_arch = "x86_64")]
pub use isa_detection::is_avx512_supported;

#[cfg(test)]
pub(crate) use dispatch::test_simd_op;

/// Test that two [`Simd`] vectors are equal according to a [`PartialEq`]
/// comparison of their array representations.
#[cfg(test)]
macro_rules! assert_simd_eq {
    ($x:expr, $y:expr) => {
        assert_eq!($x.to_array(), $y.to_array());
    };
}

/// Test that two [`Simd`] vectors are not equal according to a [`PartialEq`]
/// comparison of their array representations.
#[cfg(test)]
macro_rules! assert_simd_ne {
    ($x:expr, $y:expr) => {
        assert_ne!($x.to_array(), $y.to_array());
    };
}

#[cfg(test)]
pub(crate) use {assert_simd_eq, assert_simd_ne};

#[cfg(test)]
mod tests {
    use super::functional::simd_map;
    use super::ops::NumOps;
    use super::{Isa, SimdOp};

    #[test]
    fn test_simd_f32_op() {
        struct Square<'a> {
            xs: &'a mut [f32],
        }

        impl<'a> SimdOp for Square<'a> {
            type Output = &'a mut [f32];

            fn eval<I: Isa>(self, isa: I) -> Self::Output {
                let ops = isa.f32();
                simd_map(ops, self.xs, |x| ops.mul(x, x))
            }
        }

        let mut buf: Vec<_> = (0..32).map(|x| x as f32).collect();
        let expected: Vec<_> = buf.iter().map(|x| *x * *x).collect();

        let squared = Square { xs: &mut buf }.dispatch();

        assert_eq!(squared, &expected);
    }

    #[test]
    fn test_simd_i32_op() {
        struct Square<'a> {
            xs: &'a mut [i32],
        }

        impl<'a> SimdOp for Square<'a> {
            type Output = &'a mut [i32];

            fn eval<I: Isa>(self, isa: I) -> Self::Output {
                let ops = isa.i32();
                simd_map(ops, self.xs, |x| ops.mul(x, x))
            }
        }

        let mut buf: Vec<_> = (0..32).collect();
        let expected: Vec<_> = buf.iter().map(|x| *x * *x).collect();

        let squared = Square { xs: &mut buf }.dispatch();

        assert_eq!(squared, &expected);
    }
}