Crate safe_arch
A crate that safely exposes arch intrinsics via #[cfg()].

safe_arch lets you safely use CPU intrinsics: those things in the core::arch modules. It works purely via #[cfg()] and compile-time CPU feature declaration. If you want to check for a feature at runtime and then call an intrinsic or use a fallback path based on that, this crate is sadly not for you.
SIMD register types are "newtype'd" so that better trait impls can be given to them, but the inner value is a pub field, so feel free to just grab it out if you need to. Trait impls of the newtypes include: Default (zeroed), From/Into of appropriate data types, and appropriate operator overloading.
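As an illustration of the newtype pattern described above, here's a portable stand-in that uses a plain array instead of a core::arch type (the name M128 and these particular impls are illustrative assumptions, not the crate's actual definitions):

```rust
// A stand-in for how a SIMD newtype with a public inner field can work.
// The real crate wraps core::arch types like __m128; [f32; 4] is used
// here only so the sketch runs anywhere.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
#[repr(transparent)]
pub struct M128 {
    pub inner: [f32; 4], // pub field: grab the raw value whenever you need it
}

impl From<[f32; 4]> for M128 {
    fn from(arr: [f32; 4]) -> Self {
        Self { inner: arr }
    }
}

impl From<M128> for [f32; 4] {
    fn from(m: M128) -> Self {
        m.inner
    }
}

// Operator overloading of the kind described: lanewise addition.
impl core::ops::Add for M128 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        let mut out = self.inner;
        for (o, r) in out.iter_mut().zip(rhs.inner.iter()) {
            *o += r;
        }
        Self { inner: out }
    }
}

fn main() {
    let zero = M128::default(); // Default is zeroed
    let a = M128::from([1.0, 2.0, 3.0, 4.0]);
    let sum = a + zero;
    let back: [f32; 4] = sum.into();
    println!("{:?}", back);
}
```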
- Most intrinsics (like addition and multiplication) are totally safe to use as long as the CPU feature is available. In this case, what you get is 1:1 with the actual intrinsic.
- Some intrinsics take a pointer of an assumed minimum alignment and validity span. For these, the safe_arch function takes a reference of an appropriate type to uphold safety.
  - Try the bytemuck crate (and turn on the bytemuck feature of this crate) if you want help safely casting between reference types.
- Some intrinsics are not safe unless you're very careful about how you use them, such as the streaming operations requiring you to use them in combination with an appropriate memory fence. Those operations aren't exposed here.
- Some intrinsics mess with the processor state, such as changing the floating point flags, saving and loading special register state, and so on. LLVM doesn't really support you messing with that within a high level language, so those operations aren't exposed here. Use assembly or something if you want to do that.
Naming Conventions
The actual names for each intrinsic are generally a flaming dumpster of letters that only make sense after you've learned all the names. They're very bad for learning what things do. Accordingly, safe_arch uses very verbose naming that (hopefully) improves the new-user experience.
- Function names start with the primary "verb" of the operation, and then any adverbs go after that. This makes for slightly awkward English but helps the list of all the functions sort a little better.
  - Eg: add_i32_m128i and add_i16_saturating_m128i
- Function names end with the register type they're most associated with. I say "most" because while most operations only work with a single register type at a time, there are occasional operations that use more than one register type.
  - Eg: and_m128 (for m128) and and_m128d (for m128d)
- If a function operates on just the lowest data lane it generally has _s after the register type, because it's a "scalar" operation. The higher lanes are generally just copied forward, or taken from a secondary argument, or something. Details vary.
  - Eg: sqrt_m128 (all lanes) and sqrt_m128_s (low lane only)
Of course, people can't even always agree on what words mean. The common verb names for this crate, and their conventions, are as follows:
- load: Reads memory into a register (derefs a &Foo to Foo).
- store: Writes a register to memory (writes a Foo to a &mut Foo).
- set: Packs values into a register (works like [1, 2, 3, 4] to build an array).
- splat: Modifies either a "load" or a "set". The input is copied as many times as possible across the bits of the output register size (works like [1_i32; LEN] array building).
- extract: Gets an individual lane out of a SIMD register (works like reg[i]). The lane to get has to be a const value.
- insert: Duplicates a register and then replaces the value of a specific lane (works like let mut reg2 = reg; reg2[i] = new;). The lane to overwrite has to be a const value.
- cast: Changes data types while preserving the bit pattern (like how transmute would do it).
- convert: Changes data types while trying to stick close to the numeric value (which might change the bits, like how as would do it).
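To make the verb glossary concrete, here's a scalar model of several of these operations over [i32; 4], with plain Rust standing in for the SIMD register (the function names mirror the crate's conventions, but these are illustrative models, not the crate's real implementations):

```rust
// Scalar models of the verb conventions, using [i32; 4] as the "register".

// load: deref a &[i32; 4] into an owned value.
fn load(src: &[i32; 4]) -> [i32; 4] {
    *src
}

// store: write an owned value through a &mut.
fn store(dst: &mut [i32; 4], reg: [i32; 4]) {
    *dst = reg;
}

// set: pack individual values into a register.
fn set(a: i32, b: i32, c: i32, d: i32) -> [i32; 4] {
    [a, b, c, d]
}

// splat: copy one input across every lane.
fn splat(x: i32) -> [i32; 4] {
    [x; 4]
}

// extract: read one lane; the lane index is a const generic,
// mirroring the "has to be a const value" rule.
fn extract<const LANE: usize>(reg: [i32; 4]) -> i32 {
    reg[LANE]
}

// insert: copy the register, then overwrite one lane.
fn insert<const LANE: usize>(reg: [i32; 4], new: i32) -> [i32; 4] {
    let mut out = reg;
    out[LANE] = new;
    out
}

fn main() {
    let mut mem = [0; 4];
    let reg = set(1, 2, 3, 4);
    // Move lane 3's value into lane 0, then store and reload.
    store(&mut mem, insert::<0>(reg, extract::<3>(reg)));
    println!("{:?}", load(&mem));
}
```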
This crate is pre-1.0 and if you feel that an operation should have a better name to improve the crate's consistency please file an issue.
Current Support
- x86/x86_64 (Intel, AMD, etc)
  - 128-bit: sse, sse2, sse3, ssse3, sse4.1, sse4.2
  - 256-bit: avx, avx2
  - Other: adx, aes, bmi1, bmi2, fma, lzcnt, pclmulqdq, popcnt, rdrand, rdseed
Compile Time CPU Target Features
At the time of me writing this, Rust enables the sse and sse2 CPU features by default for all i686 (x86) and x86_64 builds. Those CPU features are built into the design of x86_64, and you'd need a super old x86 CPU for it to not support at least sse and sse2, so they're a safe bet for the language to enable all the time. In fact, because the standard library is compiled with them enabled, simply trying to disable those features would actually cause ABI issues and fill your program with UB (link).
If you want additional CPU features available at compile time you'll have to enable them with an additional arg to rustc. For a feature named name you pass -C target-feature=+name, such as -C target-feature=+sse3 for sse3.
You can alternately enable all target features of the current CPU with -C target-cpu=native. This is primarily of use if you're building a program you'll only run on your own system.
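For a whole project, the same flags can live in a Cargo config file instead of being passed by hand on every build (a minimal sketch; the avx2 choice here is just an example feature):

```toml
# .cargo/config.toml
[build]
rustflags = ["-C", "target-feature=+avx2"]

# Or, for a binary that will only run on the build machine:
# rustflags = ["-C", "target-cpu=native"]
```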
It's sometimes hard to know if your target platform will support a given feature set, but the Steam Hardware Survey is generally taken as a guide to what you can expect people to have available. If you click "Other Settings" it'll expand into a list of CPU target features and how common they are. These days, it seems that sse3 can be safely assumed, and ssse3, sse4.1, and sse4.2 are pretty safe bets as well. The stuff above 128-bit isn't as common yet; give it another few years.
Please note that executing a program on a CPU that doesn't support the target features it was compiled for is Undefined Behavior.
Currently, Rust doesn't actually support an easy way for you to check that a feature enabled at compile time is actually available at runtime. There is the "feature_detected" family of macros, but if you enable a feature they will evaluate to a constant true instead of actually deferring the check for the feature to runtime. This means that, if you did want a check at the start of your program, to confirm that all the assumed features are present and error out when the assumptions don't hold, you can't use that macro. You gotta use CPUID and check manually. rip.
Hopefully we can make that process easier in a future version of this crate.
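A minimal sketch of the manual CPUID route on x86_64 (this checks the architecturally documented feature bits in CPUID leaf 1: SSE2 is EDX bit 26, SSE4.2 is ECX bit 20; it illustrates the idea and is not part of this crate's API):

```rust
// CPUID leaf 1 puts feature flags in ECX/EDX. The CPUID instruction
// itself is always available on x86_64, so calling it is sound there.

#[cfg(target_arch = "x86_64")]
fn cpu_has_sse2() -> bool {
    // SSE2 is EDX bit 26; always set on x86_64, useful as a sanity check.
    let info = unsafe { core::arch::x86_64::__cpuid(1) };
    info.edx & (1 << 26) != 0
}

#[cfg(target_arch = "x86_64")]
fn cpu_has_sse42() -> bool {
    // SSE4.2 is ECX bit 20.
    let info = unsafe { core::arch::x86_64::__cpuid(1) };
    info.ecx & (1 << 20) != 0
}

#[cfg(target_arch = "x86_64")]
fn main() {
    // The kind of start-of-program check described above: error out
    // if an assumed feature turns out to be missing.
    if !cpu_has_sse42() {
        eprintln!("assumed feature sse4.2 missing; bailing out");
        std::process::exit(1);
    }
    println!("sse4.2 available");
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```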
A Note On Working With Cfg
There are two main ways to use cfg:
- Via an attribute placed on an item, block, or expression: #[cfg(debug_assertions)] println!("hello");
- Via a macro used within an expression position: if cfg!(debug_assertions) { println!("hello"); }
The difference might seem small but it's actually very important:
- The attribute form will include code or not before deciding if all the items named and so forth really exist or not. This means that code that is configured via attribute can safely name things that don't always exist as long as the things they name do exist whenever that code is configured into the build.
- The macro form will include the configured code no matter what, and then the macro resolves to a constant true or false and the compiler uses dead code elimination to cut out the path not taken.
This crate uses cfg via the attribute, so the functions it exposes don't exist at all when the appropriate CPU target features aren't enabled. Accordingly, if you plan to call this crate or not depending on what features are enabled in the build, you'll also need to control your use of this crate via the cfg attribute, not the cfg macro.
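The difference can be seen with a feature every build has an answer for, debug_assertions (a self-contained sketch; nothing here is specific to this crate):

```rust
// Attribute form: this item simply does not exist in release builds,
// so only code that is itself cfg-gated may name it.
#[cfg(debug_assertions)]
fn debug_only_label() -> &'static str {
    "debug"
}

fn main() {
    // Attribute on a statement: compiled in only when the cfg holds,
    // so naming debug_only_label() here is always valid.
    #[cfg(debug_assertions)]
    println!("{}", debug_only_label());

    // Macro form: BOTH branches are compiled no matter what, then one is
    // removed as dead code. Naming debug_only_label() in the `true` branch
    // would fail to compile in release builds.
    if cfg!(debug_assertions) {
        println!("checks on");
    } else {
        println!("checks off");
    }
}
```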
Macros
aes_key_gen_assist_m128i | aes ? |
blend_i32_m128i | avx2 Blends the |
blend_imm_i16_m128i | Blends the |
blend_imm_i16_m256i | avx2 Blends the |
blend_imm_i32_m256i | avx2 Blends the |
blend_imm_m128d | Blends the lanes according to the immediate mask. |
blend_imm_m128 | Blends the lanes according to the immediate mask. |
blend_imm_m256d | avx Blends the |
blend_imm_m256 | avx Blends the |
byte_shl_u128_imm_m128i | Shifts all bits in the entire register left by a number of bytes. |
byte_shl_u128_imm_m256i | avx2 Shifts each |
byte_shr_u128_imm_m128i | Shifts all bits in the entire register right by a number of bytes. |
byte_shr_u128_imm_m256i | avx2 Shifts each |
cmp_op_mask_m128 | avx Compare |
cmp_op_mask_m128_s | avx Compare |
cmp_op_mask_m128d | avx Compare |
cmp_op_mask_m128d_s | avx Compare |
cmp_op_mask_m256 | avx Compare |
cmp_op_mask_m256d | avx Compare |
combined_byte_shr_imm_m128i | Counts |
combined_byte_shr_imm_m256i | Works like |
comparison_operator_translation | avx Turns a comparison operator token to the correct constant value. |
dot_product_m128d | Performs a dot product of two |
dot_product_m128 | Performs a dot product of two |
dot_product_m256 | avx This works like |
extract_f32_as_i32_bits_imm_m128 | Gets the |
extract_i16_as_i32_m128i | Gets an |
extract_i16_as_i32_m256i | avx2 Gets an |
extract_i32_from_m256i | avx Extracts an |
extract_i32_imm_m128i | Gets the |
extract_i64_from_m256i | avx Extracts an |
extract_i64_imm_m128i | Gets the |
extract_i8_as_i32_imm_m128i | Gets the |
extract_i8_as_i32_m256i | avx2 Gets an |
extract_m128_from_m256 | avx Extracts an |
extract_m128d_from_m256d | avx Extracts an |
extract_m128i_from_m256i | avx Extracts an |
extract_m128i_m256i | avx2 Gets an |
insert_f32_imm_m128 | Inserts a lane from |
insert_i16_from_i32_m128i | Inserts the low 16 bits of an |
insert_i16_to_m256i | avx Inserts an |
insert_i32_imm_m128i | Inserts a new value for the |
insert_i32_to_m256i | avx Inserts an |
insert_i64_imm_m128i | Inserts a new value for the |
insert_i64_to_m256i | avx Inserts an |
insert_i8_imm_m128i | Inserts a new value for the |
insert_i8_to_m256i | avx Inserts an |
insert_m128_to_m256 | avx Inserts an |
insert_m128d_to_m256d | avx Inserts an |
insert_m128i_to_m256i_slow_avx | avx Slowly inserts an |
insert_m128i_to_m256i | avx Inserts an |
mul_i64_carryless_m128i | pclmulqdq Performs a "carryless" multiplication of two |
multi_packed_sum_abs_diff_u8_m128i | Computes eight |
multi_packed_sum_abs_diff_u8_m256i | avx2 Computes eight |
permute_2x128_m256i | avx2 Permutes the lanes around. |
permute_f128_in_m256d | avx Permutes the lanes around. |
permute_f128_in_m256 | avx Permutes the lanes around. |
permute_i128_in_m256i | avx Permutes the lanes around. |
permute_i64_m256i | avx2 Permutes the lanes around. |
permute_m128d | avx Permutes the lanes around. |
permute_m128 | avx Permutes the lanes around. |
permute_m256 | avx Permutes the lanes around. |
permute_m256d | avx2 Permutes the lanes around. |
permute_within_m128d_m256d | avx Permutes the lanes around. |
round_m128d | Rounds each lane in the style specified. |
round_m128d_s | Rounds |
round_m128 | Rounds each lane in the style specified. |
round_m128_s | Rounds |
round_m256d | avx Rounds each lane in the style specified. |
round_m256 | avx Rounds each lane in the style specified. |
shl_i16_imm_m128i | Shifts all |
shl_i16_imm_m256i | avx2 Shifts all |
shl_i32_imm_m128i | Shifts all |
shl_i32_imm_m256i | avx2 Shifts all |
shl_i64_imm_m128i | Shifts both |
shl_i64_imm_m256i | avx2 Shifts all |
shr_i16_imm_m128i | Shifts all |
shr_i16_imm_m256i | avx2 Shifts all |
shr_i32_imm_m128i | Shifts all |
shr_i32_imm_m256i | avx2 Shifts all |
shr_u16_imm_m128i | Shifts all |
shr_u16_imm_m256i | avx2 Shifts all |
shr_u32_imm_m128i | Shifts all |
shr_u32_imm_m256i | avx2 Shifts all |
shr_u64_imm_m128i | Shifts both |
shr_u64_imm_m256i | avx2 Shifts all |
shuffle_i16_high_lanes_m128i | Shuffles the higher |
shuffle_i16_high_m256i | avx2 Shuffles the upper |
shuffle_i16_low_lanes_m128i | Shuffles the lower |
shuffle_i16_low_m256i | avx2 Shuffles the lower |
shuffle_i32_m128i | Shuffles the |
shuffle_i32_m256i | avx2 Shuffles the lanes around. |
shuffle_m128 | Shuffles the lanes around. |
shuffle_m128d | Shuffles the lanes around. |
shuffle_m256d | avx Shuffles the |
shuffle_m256 | avx Shuffles the |
string_search_for_index | sse4.2 Looks for |
string_search_for_mask | sse4.2 Looks for |
Structs
m128 | The data for a 128-bit SSE register of four |
m128d | The data for a 128-bit SSE register of two |
m128i | The data for a 128-bit SSE register of integer data. |
m256 | The data for a 256-bit AVX register of eight |
m256d | The data for a 256-bit AVX register of four |
m256i | The data for a 256-bit AVX register of integer data. |
Enums
Permute_2x128_m256i | Selects the output style of a |
Functions
abs_i16_m128i | ssse3 Lanewise absolute value with lanes as |
abs_i16_m256i | avx2 Absolute value of |
abs_i32_m128i | ssse3 Lanewise absolute value with lanes as |
abs_i32_m256i | avx2 Absolute value of |
abs_i8_m128i | ssse3 Lanewise absolute value with lanes as |
abs_i8_m256i | avx2 Absolute value of |
add_carry_u32 | adx Add two |
add_carry_u64 | adx Add two |
add_horizontal_i16_m128i | ssse3 Add horizontal pairs of |
add_horizontal_i16_m256i | avx2 Horizontal |
add_horizontal_i32_m128i | ssse3 Add horizontal pairs of |
add_horizontal_i32_m256i | avx2 Horizontal |
add_horizontal_m128d | sse3 Add each lane horizontally, pack the outputs as |
add_horizontal_m128 | sse3 Add each lane horizontally, pack the outputs as |
add_horizontal_m256d | avx Add adjacent |
add_horizontal_m256 | avx Add adjacent |
add_horizontal_saturating_i16_m128i | ssse3 Add horizontal pairs of |
add_horizontal_saturating_i16_m256i | avx2 Horizontal saturating |
add_i16_m128i | sse2 Lanewise |
add_i16_m256i | avx2 Lanewise |
add_i32_m128i | sse2 Lanewise |
add_i32_m256i | avx2 Lanewise |
add_i64_m128i | sse2 Lanewise |
add_i64_m256i | avx2 Lanewise |
add_i8_m128i | sse2 Lanewise |
add_i8_m256i | avx2 Lanewise |
add_m128 | sse Lanewise |
add_m128_s | sse Low lane |
add_m128d | sse2 Lanewise |
add_m128d_s | sse2 Lowest lane |
add_m256d | avx Lanewise |
add_m256 | avx Lanewise |
add_saturating_i16_m128i | sse2 Lanewise saturating |
add_saturating_i16_m256i | avx2 Lanewise saturating |
add_saturating_i8_m128i | sse2 Lanewise saturating |
add_saturating_i8_m256i | avx2 Lanewise saturating |
add_saturating_u16_m128i | sse2 Lanewise saturating |
add_saturating_u16_m256i | avx2 Lanewise saturating |
add_saturating_u8_m128i | sse2 Lanewise saturating |
add_saturating_u8_m256i | avx2 Lanewise saturating |
add_sub_m128d | sse3 Add the high lane and subtract the low lane. |
add_sub_m128 | sse3 Alternately, from the top, add a lane and then subtract a lane. |
add_sub_m256d | avx Alternately, from the top, add |
add_sub_m256 | avx Alternately, from the top, add |
aes_decrypt_last_m128i | aes Perform the last round of AES decryption flow on |
aes_decrypt_m128i | aes Perform one round of AES decryption flow on |
aes_encrypt_last_m128i | aes Perform the last round of AES encryption flow on |
aes_encrypt_m128i | aes Perform one round of AES encryption flow on |
aes_inv_mix_columns_m128i | aes Perform the InvMixColumns transform on |
and_m128 | sse Bitwise |
and_m128d | sse2 Bitwise |
and_m128i | sse2 Bitwise |
and_m256d | avx Bitwise |
and_m256 | avx Bitwise |
and_m256i | avx2 Bitwise |
andnot_m128 | sse Bitwise |
andnot_m128d | sse2 Bitwise |
andnot_m128i | sse2 Bitwise |
andnot_m256d | avx Bitwise |
andnot_m256 | avx Bitwise |
andnot_m256i | avx2 Bitwise |
andnot_u32 | bmi1 Bitwise |
andnot_u64 | bmi1 Bitwise |
average_u16_m128i | sse2 Lanewise average of the |
average_u16_m256i | avx2 Average |
average_u8_m128i | sse2 Lanewise average of the |
average_u8_m256i | avx2 Average |
bit_extract2_u32 | bmi1 Extract a span of bits from the |
bit_extract2_u64 | bmi1 Extract a span of bits from the |
bit_extract_u32 | bmi1 Extract a span of bits from the |
bit_extract_u64 | bmi1 Extract a span of bits from the |
bit_lowest_set_mask_u32 | bmi1 Gets the mask of all bits up to and including the lowest set bit in a |
bit_lowest_set_mask_u64 | bmi1 Gets the mask of all bits up to and including the lowest set bit in a |
bit_lowest_set_reset_u32 | bmi1 Resets (clears) the lowest set bit. |
bit_lowest_set_reset_u64 | bmi1 Resets (clears) the lowest set bit. |
bit_lowest_set_value_u32 | bmi1 Gets the value of the lowest set bit in a |
bit_lowest_set_value_u64 | bmi1 Gets the value of the lowest set bit in a |
bit_zero_high_index_u32 | bmi2 Zero out all high bits in a |
bit_zero_high_index_u64 | bmi2 Zero out all high bits in a |
blend_varying_i8_m128i | sse4.1 Blend the |
blend_varying_i8_m256i | avx2 Blend |
blend_varying_m128d | sse4.1 Blend the lanes according to a runtime varying mask. |
blend_varying_m128 | sse4.1 Blend the lanes according to a runtime varying mask. |
blend_varying_m256d | avx Blend the lanes according to a runtime varying mask. |
blend_varying_m256 | avx Blend the lanes according to a runtime varying mask. |
byte_swap_i32 | Swap the bytes of the given 32-bit value. |
byte_swap_i64 | Swap the bytes of the given 64-bit value. |
cast_from_m256_to_m256d | avx Bit-preserving cast from |
cast_from_m256_to_m256i | avx Bit-preserving cast from |
cast_from_m256d_to_m256 | avx Bit-preserving cast from |
cast_from_m256d_to_m256i | avx Bit-preserving cast from |
cast_from_m256i_to_m256d | avx Bit-preserving cast from |
cast_from_m256i_to_m256 | avx Bit-preserving cast from |
cast_to_m128_from_m128d | sse2 Bit-preserving cast to |
cast_to_m128_from_m128i | sse2 Bit-preserving cast to |
cast_to_m128d_from_m128 | sse2 Bit-preserving cast to |
cast_to_m128d_from_m128i | sse2 Bit-preserving cast to |
cast_to_m128i_from_m128d | sse2 Bit-preserving cast to |
cast_to_m128i_from_m128 | sse2 Bit-preserving cast to |
ceil_m128d | sse4.1 Round each lane to a whole number, towards positive infinity |
ceil_m128 | sse4.1 Round each lane to a whole number, towards positive infinity |
ceil_m128d_s | sse4.1 Round the low lane of |
ceil_m128_s | sse4.1 Round the low lane of |
ceil_m256d | avx Round |
ceil_m256 | avx Round |
cmp_eq_i32_m128_s | sse Low lane equality. |
cmp_eq_i32_m128d_s | sse2 Low lane |
cmp_eq_mask_i16_m128i | sse2 Lanewise |
cmp_eq_mask_i16_m256i | avx2 Compare |
cmp_eq_mask_i32_m128i | sse2 Lanewise |
cmp_eq_mask_i32_m256i | avx2 Compare |
cmp_eq_mask_i64_m128i | sse4.1 Lanewise |
cmp_eq_mask_i64_m256i | avx2 Compare |
cmp_eq_mask_i8_m128i | sse2 Lanewise |
cmp_eq_mask_i8_m256i | avx2 Compare |
cmp_eq_mask_m128 | sse Lanewise |
cmp_eq_mask_m128_s | sse Low lane |
cmp_eq_mask_m128d | sse2 Lanewise |
cmp_eq_mask_m128d_s | sse2 Low lane |
cmp_ge_i32_m128_s | sse Low lane greater than or equal to. |
cmp_ge_i32_m128d_s | sse2 Low lane |
cmp_ge_mask_m128 | sse Lanewise |
cmp_ge_mask_m128_s | sse Low lane |
cmp_ge_mask_m128d | sse2 Lanewise |
cmp_ge_mask_m128d_s | sse2 Low lane |
cmp_gt_i32_m128_s | sse Low lane greater than. |
cmp_gt_i32_m128d_s | sse2 Low lane |
cmp_gt_mask_i16_m128i | sse2 Lanewise |
cmp_gt_mask_i16_m256i | avx2 Compare |
cmp_gt_mask_i32_m128i | sse2 Lanewise |
cmp_gt_mask_i32_m256i | avx2 Compare |
cmp_gt_mask_i64_m128i | sse4.2 Lanewise |
cmp_gt_mask_i64_m256i | avx2 Compare |
cmp_gt_mask_i8_m128i | sse2 Lanewise |
cmp_gt_mask_i8_m256i | avx2 Compare |
cmp_gt_mask_m128 | sse Lanewise |
cmp_gt_mask_m128_s | sse Low lane |
cmp_gt_mask_m128d | sse2 Lanewise |
cmp_gt_mask_m128d_s | sse2 Low lane |
cmp_le_i32_m128_s | sse Low lane less than or equal to. |
cmp_le_i32_m128d_s | sse2 Low lane |
cmp_le_mask_m128 | sse Lanewise |
cmp_le_mask_m128_s | sse Low lane |
cmp_le_mask_m128d | sse2 Lanewise |
cmp_le_mask_m128d_s | sse2 Low lane |
cmp_lt_i32_m128_s | sse Low lane less than. |
cmp_lt_i32_m128d_s | sse2 Low lane |
cmp_lt_mask_i16_m128i | sse2 Lanewise |
cmp_lt_mask_i32_m128i | sse2 Lanewise |
cmp_lt_mask_i8_m128i | sse2 Lanewise |
cmp_lt_mask_m128 | sse Lanewise |
cmp_lt_mask_m128_s | sse Low lane |
cmp_lt_mask_m128d | sse2 Lanewise |
cmp_lt_mask_m128d_s | sse2 Low lane |
cmp_neq_i32_m128_s | sse Low lane not equal to. |
cmp_neq_i32_m128d_s | sse2 Low lane |
cmp_neq_mask_m128 | sse Lanewise |
cmp_neq_mask_m128_s | sse Low lane |
cmp_neq_mask_m128d | sse2 Lanewise |
cmp_neq_mask_m128d_s | sse2 Low lane |
cmp_nge_mask_m128 | sse Lanewise |
cmp_nge_mask_m128_s | sse Low lane |
cmp_nge_mask_m128d | sse2 Lanewise |
cmp_nge_mask_m128d_s | sse2 Low lane |
cmp_ngt_mask_m128 | sse Lanewise |
cmp_ngt_mask_m128_s | sse Low lane |
cmp_ngt_mask_m128d | sse2 Lanewise |
cmp_ngt_mask_m128d_s | sse2 Low lane |
cmp_nle_mask_m128 | sse Lanewise |
cmp_nle_mask_m128_s | sse Low lane |
cmp_nle_mask_m128d | sse2 Lanewise |
cmp_nle_mask_m128d_s | sse2 Low lane |
cmp_nlt_mask_m128 | sse Lanewise |
cmp_nlt_mask_m128_s | sse Low lane |
cmp_nlt_mask_m128d | sse2 Lanewise |
cmp_nlt_mask_m128d_s | sse2 Low lane |
cmp_ordinary_mask_m128 | sse Lanewise |
cmp_ordinary_mask_m128_s | sse Low lane |
cmp_ordinary_mask_m128d | sse2 Lanewise |
cmp_ordinary_mask_m128d_s | sse2 Low lane |
cmp_unord_mask_m128 | sse Lanewise |
cmp_unord_mask_m128_s | sse Low lane |
cmp_unord_mask_m128d | sse2 Lanewise |
cmp_unord_mask_m128d_s | sse2 Low lane |
convert_i16_lower2_to_i64_m128i | sse4.1 Convert the lower two |
convert_i16_lower4_to_i32_m128i | sse4.1 Convert the lower four |
convert_i16_m128i_lower4_m256i | avx2 Sign extend |
convert_i16_m128i_m256i | avx2 Sign extend |
convert_i32_lower2_to_i64_m128i | sse4.1 Convert the lower two |
convert_i32_m128i_m256i | avx2 Sign extend |
convert_i32_replace_m128_s | sse Convert |
convert_i32_replace_m128d_s | sse2 Convert |
convert_i64_replace_m128d_s | sse2 Convert |
convert_i8_lower2_to_i64_m128i | sse4.1 Convert the lower two |
convert_i8_lower4_to_i32_m128i | sse4.1 Convert the lower four |
convert_i8_lower8_to_i16_m128i | sse4.1 Convert the lower eight |
convert_i8_m128i_lower4_m256i | avx2 Sign extend the lower 4 |
convert_i8_m128i_lower8_m256i | avx2 Sign extend the lower 8 |
convert_i8_m128i_m256i | avx2 Sign extend |
convert_m128_s_replace_m128d_s | sse2 Converts the lower |
convert_m128d_s_replace_m128_s | sse2 Converts the low |
convert_to_f32_from_m256_s | avx Convert the lowest |
convert_to_f64_from_m256d_s | avx Convert the lowest |
convert_to_i32_from_m256i_s | avx Convert the lowest |
convert_to_i32_m128i_from_m256d | avx Convert |
convert_to_i32_m256i_from_m256 | avx Convert |
convert_to_m128_from_m128i | sse2 Rounds the four |
convert_to_m128_from_m128d | sse2 Rounds the two |
convert_to_m128_from_m256d | avx Convert |
convert_to_m128d_from_m128i | sse2 Rounds the lower two |
convert_to_m128d_from_m128 | sse2 Rounds the two |
convert_to_m128i_from_m128d | sse2 Rounds the two |
convert_to_m128i_from_m128 | sse2 Rounds the two |
convert_to_m128i_from_m256d | avx Convert |
convert_to_m256_from_i32_m256i | avx Convert |
convert_to_m256d_from_i32_m128i | avx Convert |
convert_to_m256d_from_m128 | avx Convert |
convert_to_m256i_from_m256 | avx Convert |
convert_u16_lower2_to_u64_m128i | sse4.1 Convert the lower two |
convert_u16_lower4_to_u32_m128i | sse4.1 Convert the lower four |
convert_u16_m128i_lower4_m256i | avx2 Zero extend lower 4 |
convert_u16_m128i_m256i | avx2 Zero extend |
convert_u32_lower2_to_u64_m128i | sse4.1 Convert the lower two |
convert_u32_m128i_m256i | avx2 Zero extend |
convert_u8_lower2_to_u64_m128i | sse4.1 Convert the lower two |
convert_u8_lower4_to_u32_m128i | sse4.1 Convert the lower four |
convert_u8_lower8_to_u16_m128i | sse4.1 Convert the lower eight |
convert_u8_m128i_lower4_m256i | avx2 Zero extend lower 4 |
convert_u8_m128i_lower8_m256i | avx2 Zero extend lower 8 |
convert_u8_m128i_m256i | avx2 Zero extend |
copy_i64_m128i_s | sse2 Copy the low |
copy_replace_low_f64_m128d | sse2 Copies the |
crc32_u8 | sse4.2 Accumulates the |
crc32_u16 | sse4.2 Accumulates the |
crc32_u32 | sse4.2 Accumulates the |
crc32_u64 | sse4.2 Accumulates the |
div_m128 | sse Lanewise |
div_m128_s | sse Low lane |
div_m128d | sse2 Lanewise |
div_m128d_s | sse2 Lowest lane |
div_m256d | avx Lanewise |
div_m256 | avx Lanewise |
duplicate_even_lanes_m128 | sse3 Duplicate the odd lanes to the even lanes. |
duplicate_even_lanes_m256 | avx Duplicate the even-indexed lanes to the odd lanes. |
duplicate_low_lane_m128d_s | sse3 Copy the low lane of the input to both lanes of the output. |
duplicate_odd_lanes_m128 | sse3 Duplicate the odd lanes to the even lanes. |
duplicate_odd_lanes_m256d | avx Duplicate the odd-indexed lanes to the even lanes. |
duplicate_odd_lanes_m256 | avx Duplicate the odd-indexed lanes to the even lanes. |
floor_m128d | sse4.1 Round each lane to a whole number, towards negative infinity |
floor_m128 | sse4.1 Round each lane to a whole number, towards negative infinity |
floor_m128d_s | sse4.1 Round the low lane of |
floor_m128_s | sse4.1 Round the low lane of |
floor_m256d | avx Round |
floor_m256 | avx Round |
fused_mul_add_m128 | fma Lanewise fused |
fused_mul_add_m128_s | fma Low lane fused |
fused_mul_add_m128d | fma Lanewise fused |
fused_mul_add_m128d_s | fma Low lane fused |
fused_mul_add_m256 | fma Lanewise fused |
fused_mul_add_m256d | fma Lanewise fused |
fused_mul_addsub_m128 | fma Lanewise fused |
fused_mul_addsub_m128d | fma Lanewise fused |
fused_mul_addsub_m256 | fma Lanewise fused |
fused_mul_addsub_m256d | fma Lanewise fused |
fused_mul_neg_add_m128 | fma Lanewise fused |
fused_mul_neg_add_m128_s | fma Low lane |
fused_mul_neg_add_m128d | fma Lanewise fused |
fused_mul_neg_add_m128d_s | fma Low lane |
fused_mul_neg_add_m256 | fma Lanewise fused |
fused_mul_neg_add_m256d | fma Lanewise fused |
fused_mul_neg_sub_m128 | fma Lanewise fused |
fused_mul_neg_sub_m128_s | fma Low lane fused |
fused_mul_neg_sub_m128d | fma Lanewise fused |
fused_mul_neg_sub_m128d_s | fma Low lane fused |
fused_mul_neg_sub_m256 | fma Lanewise fused |
fused_mul_neg_sub_m256d | fma Lanewise fused |
fused_mul_sub_m128 | fma Lanewise fused |
fused_mul_sub_m128_s | fma Low lane fused |
fused_mul_sub_m128d | fma Lanewise fused |
fused_mul_sub_m128d_s | fma Low lane fused |
fused_mul_sub_m256 | fma Lanewise fused |
fused_mul_sub_m256d | fma Lanewise fused |
fused_mul_subadd_m128 | fma Lanewise fused |
fused_mul_subadd_m128d | fma Lanewise fused |
fused_mul_subadd_m256 | fma Lanewise fused |
fused_mul_subadd_m256d | fma Lanewise fused |
get_f32_from_m128_s | sse Gets the low lane as an individual |
get_f64_from_m128d_s | sse2 Gets the lower lane as an |
get_i32_from_m128_s | sse Converts the low lane to |
get_i32_from_m128d_s | sse2 Converts the lower lane to an |
get_i32_from_m128i_s | sse2 Converts the lower lane to an |
get_i64_from_m128d_s | sse2 Converts the lower lane to an |
get_i64_from_m128i_s | sse2 Converts the lower lane to an |
leading_zero_count_u32 | lzcnt Count the leading zeroes in a |
leading_zero_count_u64 | lzcnt Count the leading zeroes in a |
load_f32_m128_s | sse Loads the |
load_f32_splat_m128 | sse Loads the |
load_f32_splat_m256 | avx Load an |
load_f64_m128d_s | sse2 Loads the reference into the low lane of the register. |
load_f64_splat_m128d | sse2 Loads the |
load_f64_splat_m256d | avx Load an |
load_i64_m128i_s | sse2 Loads the low |
load_m128 | sse Loads the reference into a register. |
load_m128d | sse2 Loads the reference into a register. |
load_m128i | sse2 Loads the reference into a register. |
load_m256d | avx Load data from memory into a register. |
load_m256 | avx Load data from memory into a register. |
load_m256i | avx Load data from memory into a register. |
load_m128_splat_m256 | avx Load an |
load_m128d_splat_m256d | avx Load an |
load_masked_i32_m128i | avx2 Loads the reference given and zeroes any |
load_masked_i32_m256i | avx2 Loads the reference given and zeroes any |
load_masked_i64_m128i | avx2 Loads the reference given and zeroes any |
load_masked_i64_m256i | avx2 Loads the reference given and zeroes any |
load_masked_m128d | avx Load data from memory into a register according to a mask. |
load_masked_m128 | avx Load data from memory into a register according to a mask. |
load_masked_m256d | avx Load data from memory into a register according to a mask. |
load_masked_m256 | avx Load data from memory into a register according to a mask. |
load_replace_high_m128d | sse2 Loads the reference into a register, replacing the high lane. |
load_replace_low_m128d | sse2 Loads the reference into a register, replacing the low lane. |
load_reverse_m128 | sse Loads the reference into a register with reversed order. |
load_reverse_m128d | sse2 Loads the reference into a register with reversed order. |
load_unaligned_hi_lo_m256d | avx Load data from memory into a register. |
load_unaligned_hi_lo_m256 | avx Load data from memory into a register. |
load_unaligned_hi_lo_m256i | avx Load data from memory into a register. |
load_unaligned_m128 | sse Loads the reference into a register. |
load_unaligned_m128d | sse2 Loads the reference into a register. |
load_unaligned_m128i | sse2 Loads the reference into a register. |
load_unaligned_m256d | avx Load data from memory into a register. |
load_unaligned_m256 | avx Load data from memory into a register. |
load_unaligned_m256i | avx Load data from memory into a register. |
max_i16_m128i | sse2 Lanewise |
max_i16_m256i | avx2 Lanewise |
max_i32_m128i | sse4.1 Lanewise |
max_i32_m256i | avx2 Lanewise |
max_i8_m128i | sse4.1 Lanewise |
max_i8_m256i | avx2 Lanewise |
max_m128 | sse Lanewise |
max_m128_s | sse Low lane |
max_m128d | sse2 Lanewise |
max_m128d_s | sse2 Low lane |
max_m256d | avx Lanewise |
max_m256 | avx Lanewise |
max_u16_m128i | sse4.1 Lanewise |
max_u16_m256i | avx2 Lanewise |
max_u32_m128i | sse4.1 Lanewise |
max_u32_m256i | avx2 Lanewise |
max_u8_m128i | sse2 Lanewise |
max_u8_m256i | avx2 Lanewise |
min_i16_m128i | sse2 Lanewise |
min_i16_m256i | avx2 Lanewise |
min_i32_m128i | sse4.1 Lanewise |
min_i32_m256i | avx2 Lanewise |
min_i8_m128i | sse4.1 Lanewise |
min_i8_m256i | avx2 Lanewise |
min_m128 | sse Lanewise |
min_m128_s | sse Low lane |
min_m128d | sse2 Lanewise |
min_m128d_s | sse2 Low lane |
min_m256d | avx Lanewise |
min_m256 | avx Lanewise |
min_position_u16_m128i | sse4.1 Min |
min_u16_m128i | sse4.1 Lanewise |
min_u16_m256i | avx2 Lanewise |
min_u32_m128i | sse4.1 Lanewise |
min_u32_m256i | avx2 Lanewise |
min_u8_m128i | sse2 Lanewise |
min_u8_m256i | avx2 Lanewise |
move_high_low_m128 | sse Move the high lanes of |
move_low_high_m128 | sse Move the low lanes of |
move_m128_s | sse Move the low lane of |
move_mask_i8_m128i | sse2 Gathers the |
move_mask_m128 | sse Gathers the sign bit of each lane. |
move_mask_m128d | sse2 Gathers the sign bit of each lane. |
move_mask_m256d | avx Collects the sign bit of each lane into a 4-bit value. |
move_mask_m256 | avx Collects the sign bit of each lane into a 4-bit value. |
move_mask_m256i | avx2 Create an |
mul_extended_u32 | bmi2 Multiply two |
mul_extended_u64 | bmi2 Multiply two |
mul_i16_horizontal_add_m128i | sse2 Multiply |
mul_i16_horizontal_add_m256i | avx2 Multiply |
mul_i16_keep_high_m128i | sse2 Lanewise |
mul_i16_keep_high_m256i | avx2 Multiply the |
mul_i16_keep_low_m128i | sse2 Lanewise |
mul_i16_keep_low_m256i | avx2 Multiply the |
mul_i16_scale_round_m128i | ssse3 Multiply |
mul_i16_scale_round_m256i | avx2 Multiply |
mul_i32_keep_low_m128i | sse4.1 Lanewise |
mul_i32_keep_low_m256i | avx2 Multiply the |
mul_i64_low_bits_m256i | avx2 Multiply the lower |
mul_i64_widen_low_bits_m128i | sse4.1 Multiplies the lower 32 bits (only) of each |
mul_m128 | sse Lanewise |
mul_m128_s | sse Low lane |
mul_m128d | sse2 Lanewise |
mul_m128d_s | sse2 Lowest lane |
mul_m256d | avx Lanewise |
mul_m256 | avx Lanewise |
mul_u16_keep_high_m128i | sse2 Lanewise |
mul_u16_keep_high_m256i | avx2 Multiply the |
mul_u64_low_bits_m256i | avx2 Multiply the lower |
mul_u64_widen_low_bits_m128i | sse2 Multiplies the lower 32 bits (only) of each |
mul_u8i8_add_horizontal_saturating_m128i | ssse3 This is dumb and weird. |
mul_u8i8_add_horizontal_saturating_m256i | avx2 This is dumb and weird. |
or_m128 | sse Bitwise `a | b`. |
or_m128d | sse2 Bitwise `a | b`. |
or_m128i | sse2 Bitwise `a | b`. |
or_m256d | avx Bitwise `a | b`. |
or_m256 | avx Bitwise `a | b`. |
or_m256i | avx2 Bitwise `a | b`. |
pack_i16_to_i8_m128i | sse2 Saturating convert `i16` to `i8`, and pack the values. |
pack_i16_to_i8_m256i | avx2 Saturating convert `i16` to `i8`, and pack the values. |
pack_i16_to_u8_m128i | sse2 Saturating convert `i16` to `u8`, and pack the values. |
pack_i16_to_u8_m256i | avx2 Saturating convert `i16` to `u8`, and pack the values. |
pack_i32_to_i16_m128i | sse2 Saturating convert `i32` to `i16`, and pack the values. |
pack_i32_to_i16_m256i | avx2 Saturating convert `i32` to `i16`, and pack the values. |
pack_i32_to_u16_m128i | sse4.1 Saturating convert `i32` to `u16`, and pack the values. |
pack_i32_to_u16_m256i | avx2 Saturating convert `i32` to `u16`, and pack the values. |
permute_i32_m256i | avx2 Permutes the 32-bit integer lanes. |
permute_m256 | avx2 Permutes the `f32` lanes. |
permute_varying_m128d | avx Permute with a runtime varying pattern. |
permute_varying_m128 | avx Permute with a runtime varying pattern. |
permute_varying_m256d | avx Permute with a runtime varying pattern. |
permute_varying_m256 | avx Permute with a runtime varying pattern. |
population_count_i32 | popcnt Count the number of bits set within an `i32`. |
population_count_i64 | popcnt Count the number of bits set within an `i64`. |
population_deposit_u32 | bmi2 Deposit contiguous low bits from a `u32` according to a mask. |
population_deposit_u64 | bmi2 Deposit contiguous low bits from a `u64` according to a mask. |
population_extract_u32 | bmi2 Extract bits from a `u32` according to a mask. |
population_extract_u64 | bmi2 Extract bits from a `u64` according to a mask. |
rdrand_u16 | rdrand Try to obtain a random `u16` from the hardware RNG. |
rdrand_u32 | rdrand Try to obtain a random `u32` from the hardware RNG. |
rdrand_u64 | rdrand Try to obtain a random `u64` from the hardware RNG. |
rdseed_u16 | rdseed Try to obtain a random `u16` from the hardware entropy source. |
rdseed_u32 | rdseed Try to obtain a random `u32` from the hardware entropy source. |
rdseed_u64 | rdseed Try to obtain a random `u64` from the hardware entropy source. |
read_timestamp_counter | Reads the CPU's timestamp counter value. |
read_timestamp_counter_p | Reads the CPU's timestamp counter value and stores the processor signature. |
reciprocal_m128 | sse Lanewise `1.0 / a` approximation. |
reciprocal_m128_s | sse Low lane `1.0 / a` approximation, other lanes unchanged. |
reciprocal_m256 | avx Reciprocal of `f32` lanes (approximate). |
reciprocal_sqrt_m128 | sse Lanewise `1.0 / sqrt(a)` approximation. |
reciprocal_sqrt_m128_s | sse Low lane `1.0 / sqrt(a)` approximation, other lanes unchanged. |
reciprocal_sqrt_m256 | avx Reciprocal of the square root of `f32` lanes (approximate). |
set_i16_m128i | sse2 Sets the args into an `m128i`, first arg is the high lane. |
set_i16_m256i | avx Sets the args into an `m256i`, first arg is the high lane. |
set_i32_m128i_s | sse2 Set an `i32` as the low 32-bit lane of an `m128i`, other lanes zeroed. |
set_i32_m128i | sse2 Sets the args into an `m128i`, first arg is the high lane. |
set_i32_m256i | avx Sets the args into an `m256i`, first arg is the high lane. |
set_i64_m128i_s | sse2 Set an `i64` as the low 64-bit lane of an `m128i`, other lane zeroed. |
set_i64_m128i | sse2 Sets the args into an `m128i`, first arg is the high lane. |
set_i8_m128i | sse2 Sets the args into an `m128i`, first arg is the high lane. |
set_i8_m256i | avx Sets the args into an `m256i`, first arg is the high lane. |
set_m128 | sse Sets the args into an `m128`, first arg is the high lane. |
set_m128_s | sse Sets the arg into the low lane of an `m128`, other lanes zeroed. |
set_m128d | sse2 Sets the args into an `m128d`, first arg is the high lane. |
set_m128d_s | sse2 Sets the args into the low lane of a `m128d`, high lane zeroed. |
set_m256d | avx Sets the args into an `m256d`, first arg is the high lane. |
set_m256 | avx Sets the args into an `m256`, first arg is the high lane. |
set_m128d_m256d | avx Set `m128d` args into an `m256d`. |
set_m128i_m256i | avx Set `m128i` args into an `m256i`. |
set_reversed_i16_m128i | sse2 Sets the args into an `m128i`, first arg is the low lane. |
set_reversed_i16_m256i | avx Sets the args into an `m256i`, first arg is the low lane. |
set_reversed_i32_m128i | sse2 Sets the args into an `m128i`, first arg is the low lane. |
set_reversed_i32_m256i | avx Sets the args into an `m256i`, first arg is the low lane. |
set_reversed_i8_m128i | sse2 Sets the args into an `m128i`, first arg is the low lane. |
set_reversed_i8_m256i | avx Sets the args into an `m256i`, first arg is the low lane. |
set_reversed_m128 | sse Sets the args into an `m128`, first arg is the low lane. |
set_reversed_m128d | sse2 Sets the args into an `m128d`, first arg is the low lane. |
set_reversed_m256d | avx Sets the args into an `m256d`, first arg is the low lane. |
set_reversed_m256 | avx Sets the args into an `m256`, first arg is the low lane. |
set_reversed_m128d_m256d | avx Set `m128d` args into an `m256d`, first arg is the low half. |
set_reversed_m128i_m256i | avx Set `m128i` args into an `m256i`, first arg is the low half. |
set_splat_i16_m128i | sse2 Splats the `i16` to all lanes of the `m128i`. |
set_splat_i16_m256i | avx Splat an `i16` across all lanes of an `m256i`. |
set_splat_i16_m128i_s_m256i | avx2 Sets the lowest `i16` lane of an `m128i` as all lanes of an `m256i`. |
set_splat_i32_m128i | sse2 Splats the `i32` to all lanes of the `m128i`. |
set_splat_i32_m256i | avx Splat an `i32` across all lanes of an `m256i`. |
set_splat_i32_m128i_s_m256i | avx2 Sets the lowest `i32` lane of an `m128i` as all lanes of an `m256i`. |
set_splat_i64_m128i | sse2 Splats the `i64` to both lanes of the `m128i`. |
set_splat_i64_m128i_s_m256i | avx2 Sets the lowest `i64` lane of an `m128i` as all lanes of an `m256i`. |
set_splat_i8_m128i | sse2 Splats the `i8` to all lanes of the `m128i`. |
set_splat_i8_m256i | avx Splat an `i8` across all lanes of an `m256i`. |
set_splat_i8_m128i_s_m256i | avx2 Sets the lowest `i8` lane of an `m128i` as all lanes of an `m256i`. |
set_splat_m128 | sse Splats the value to all lanes. |
set_splat_m128d | sse2 Splats the args into both lanes of the `m128d`. |
set_splat_m256d | avx Splat an `f64` across all lanes of an `m256d`. |
set_splat_m256 | avx Splat an `f32` across all lanes of an `m256`. |
set_splat_m128_s_m256 | avx2 Sets the lowest lane of an `m128` as all lanes of an `m256`. |
set_splat_m128d_s_m256d | avx2 Sets the lowest lane of an `m128d` as all lanes of an `m256d`. |
shl_i16_m128i | sse2 Shift each `i16` lane to the left by the lower `i64` lane of `count`. |
shl_i16_m256i | avx2 Lanewise shift left of `i16` lanes by the lower `i64` lane of `count`. |
shl_i32_each_m256i | avx2 Lanewise shift left; each `i32` lane shifted by its matching `count` lane. |
shl_i32_m128i | sse2 Shift each `i32` lane to the left by the lower `i64` lane of `count`. |
shl_i32_m256i | avx2 Lanewise shift left of `i32` lanes by the lower `i64` lane of `count`. |
shl_i64_each_m256i | avx2 Lanewise shift left; each `i64` lane shifted by its matching `count` lane. |
shl_i64_m128i | sse2 Shift each `i64` lane to the left by the lower `i64` lane of `count`. |
shl_i64_m256i | avx2 Lanewise shift left of `i64` lanes by the lower `i64` lane of `count`. |
shl_u32_each_m128i | avx2 Shift `u32` lanes to the left, each lane by its matching `count` lane. |
shl_u64_each_m128i | avx2 Shift `u64` lanes to the left, each lane by its matching `count` lane. |
shr_i16_m128i | sse2 Shift each `i16` lane to the right by the lower `i64` lane of `count`, sign extending. |
shr_i16_m256i | avx2 Lanewise arithmetic shift right of `i16` lanes by the lower `i64` lane of `count`. |
shr_i32_each_m128i | avx2 Shift `i32` lanes to the right, each lane by its matching `count` lane, sign extending. |
shr_i32_each_m256i | avx2 Lanewise arithmetic shift right; each `i32` lane shifted by its matching `count` lane. |
shr_i32_m128i | sse2 Shift each `i32` lane to the right by the lower `i64` lane of `count`, sign extending. |
shr_i32_m256i | avx2 Lanewise arithmetic shift right of `i32` lanes by the lower `i64` lane of `count`. |
shr_u16_m128i | sse2 Shift each `u16` lane to the right by the lower `i64` lane of `count`, zero filling. |
shr_u16_m256i | avx2 Lanewise logical shift right of `u16` lanes by the lower `i64` lane of `count`. |
shr_u32_each_m128i | avx2 Shift `u32` lanes to the right, each lane by its matching `count` lane, zero filling. |
shr_u32_each_m256i | avx2 Lanewise logical shift right; each `u32` lane shifted by its matching `count` lane. |
shr_u32_m128i | sse2 Shift each `u32` lane to the right by the lower `i64` lane of `count`, zero filling. |
shr_u32_m256i | avx2 Lanewise logical shift right of `u32` lanes by the lower `i64` lane of `count`. |
shr_u64_each_m128i | avx2 Shift `u64` lanes to the right, each lane by its matching `count` lane, zero filling. |
shr_u64_each_m256i | avx2 Lanewise logical shift right; each `u64` lane shifted by its matching `count` lane. |
shr_u64_m128i | sse2 Shift each `u64` lane to the right by the lower `i64` lane of `count`, zero filling. |
shr_u64_m256i | avx2 Lanewise logical shift right of `u64` lanes by the lower `i64` lane of `count`. |
shuffle_i8_m128i | ssse3 Shuffles the `i8` lanes of `a` using `v` as indices. |
shuffle_i8_m256i | avx2 Shuffle `i8` lanes within each 128-bit half using `v` as indices. |
sign_apply_i16_m128i | ssse3 Applies the sign of `i16` values in `b` to the values in `a`. |
sign_apply_i16_m256i | avx2 Lanewise `a * signum(b)` with lanes as `i16`. |
sign_apply_i32_m128i | ssse3 Applies the sign of `i32` values in `b` to the values in `a`. |
sign_apply_i32_m256i | avx2 Lanewise `a * signum(b)` with lanes as `i32`. |
sign_apply_i8_m128i | ssse3 Applies the sign of `i8` values in `b` to the values in `a`. |
sign_apply_i8_m256i | avx2 Lanewise `a * signum(b)` with lanes as `i8`. |
splat_i16_m128i_s_m128i | avx2 Splat the lowest 16-bit lane across the entire 128 bits. |
splat_i32_m128i_s_m128i | avx2 Splat the lowest 32-bit lane across the entire 128 bits. |
splat_i64_m128i_s_m128i | avx2 Splat the lowest 64-bit lane across the entire 128 bits. |
splat_i8_m128i_s_m128i | avx2 Splat the lowest 8-bit lane across the entire 128 bits. |
splat_m128_s_m128 | avx2 Splat the lowest `f32` lane across all four lanes. |
splat_m128d_s_m128d | avx2 Splat the lower `f64` lane across both lanes. |
splat_m128i_m256i | avx2 Splat the 128-bits across 256-bits. |
sqrt_m128 | sse Lanewise `sqrt(a)`. |
sqrt_m128_s | sse Low lane `sqrt(a)`, other lanes unchanged. |
sqrt_m128d | sse2 Lanewise `sqrt(a)`. |
sqrt_m128d_s | sse2 Low lane `sqrt(b)`, upper lane unchanged from `a`. |
sqrt_m256d | avx Lanewise `sqrt(a)`. |
sqrt_m256 | avx Lanewise `sqrt(a)`. |
store_high_m128d_s | sse2 Stores the high lane value to the reference given. |
store_i64_m128i_s | sse2 Stores the value to the reference given. |
store_m128 | sse Stores the value to the reference given. |
store_m128_s | sse Stores the low lane value to the reference given. |
store_m128d | sse2 Stores the value to the reference given. |
store_m128d_s | sse2 Stores the low lane value to the reference given. |
store_m128i | sse2 Stores the value to the reference given. |
store_m256d | avx Store data from a register into memory. |
store_m256 | avx Store data from a register into memory. |
store_m256i | avx Store data from a register into memory. |
store_masked_i32_m128i | avx2 Stores the `i32` masked lanes given to the reference. |
store_masked_i32_m256i | avx2 Stores the `i32` masked lanes given to the reference. |
store_masked_i64_m128i | avx2 Stores the `i64` masked lanes given to the reference. |
store_masked_i64_m256i | avx2 Stores the `i64` masked lanes given to the reference. |
store_masked_m128d | avx Store data from a register into memory according to a mask. |
store_masked_m128 | avx Store data from a register into memory according to a mask. |
store_masked_m256d | avx Store data from a register into memory according to a mask. |
store_masked_m256 | avx Store data from a register into memory according to a mask. |
store_reverse_m128 | sse Stores the value to the reference given in reverse order. |
store_reversed_m128d | sse2 Stores the value to the reference given in reverse order. |
store_splat_m128 | sse Stores the low lane value to all lanes of the reference given. |
store_splat_m128d | sse2 Stores the low lane value to all lanes of the reference given. |
store_unaligned_hi_lo_m256d | avx Store data from a register into memory. |
store_unaligned_hi_lo_m256 | avx Store data from a register into memory. |
store_unaligned_hi_lo_m256i | avx Store data from a register into memory. |
store_unaligned_m128 | sse Stores the value to the reference given. |
store_unaligned_m128d | sse2 Stores the value to the reference given. |
store_unaligned_m128i | sse2 Stores the value to the reference given. |
store_unaligned_m256d | avx Store data from a register into memory. |
store_unaligned_m256 | avx Store data from a register into memory. |
store_unaligned_m256i | avx Store data from a register into memory. |
sub_horizontal_i16_m128i | ssse3 Subtract horizontal pairs of `i16` values, pack the outputs as `a` then `b`. |
sub_horizontal_i16_m256i | avx2 Horizontal `a - b` with lanes as `i16`. |
sub_horizontal_i32_m128i | ssse3 Subtract horizontal pairs of `i32` values, pack the outputs as `a` then `b`. |
sub_horizontal_i32_m256i | avx2 Horizontal `a - b` with lanes as `i32`. |
sub_horizontal_m128d | sse3 Subtract each lane horizontally, pack the outputs as `a` then `b`. |
sub_horizontal_m128 | sse3 Subtract each lane horizontally, pack the outputs as `a` then `b`. |
sub_horizontal_m256d | avx Subtract adjacent `f64` lanes. |
sub_horizontal_m256 | avx Subtract adjacent `f32` lanes. |
sub_horizontal_saturating_i16_m128i | ssse3 Subtract horizontal pairs of `i16` values with saturation, pack the outputs as `a` then `b`. |
sub_horizontal_saturating_i16_m256i | avx2 Horizontal saturating `a - b` with lanes as `i16`. |
sub_i16_m128i | sse2 Lanewise `a - b` with lanes as `i16`. |
sub_i16_m256i | avx2 Lanewise `a - b` with lanes as `i16`. |
sub_i32_m128i | sse2 Lanewise `a - b` with lanes as `i32`. |
sub_i32_m256i | avx2 Lanewise `a - b` with lanes as `i32`. |
sub_i64_m128i | sse2 Lanewise `a - b` with lanes as `i64`. |
sub_i64_m256i | avx2 Lanewise `a - b` with lanes as `i64`. |
sub_i8_m128i | sse2 Lanewise `a - b` with lanes as `i8`. |
sub_i8_m256i | avx2 Lanewise `a - b` with lanes as `i8`. |
sub_m128 | sse Lanewise `a - b`. |
sub_m128_s | sse Low lane `a - b`, other lanes unchanged. |
sub_m128d | sse2 Lanewise `a - b`. |
sub_m128d_s | sse2 Lowest lane `a - b`, high lane unchanged. |
sub_m256d | avx Lanewise `a - b`. |
sub_m256 | avx Lanewise `a - b`. |
sub_saturating_i16_m128i | sse2 Lanewise saturating `a - b` with lanes as `i16`. |
sub_saturating_i16_m256i | avx2 Lanewise saturating `a - b` with lanes as `i16`. |
sub_saturating_i8_m128i | sse2 Lanewise saturating `a - b` with lanes as `i8`. |
sub_saturating_i8_m256i | avx2 Lanewise saturating `a - b` with lanes as `i8`. |
sub_saturating_u16_m128i | sse2 Lanewise saturating `a - b` with lanes as `u16`. |
sub_saturating_u16_m256i | avx2 Lanewise saturating `a - b` with lanes as `u16`. |
sub_saturating_u8_m128i | sse2 Lanewise saturating `a - b` with lanes as `u8`. |
sub_saturating_u8_m256i | avx2 Lanewise saturating `a - b` with lanes as `u8`. |
sum_of_u8_abs_diff_m128i | sse2 Compute "sum of `u8` absolute differences". |
sum_of_u8_abs_diff_m256i | avx2 Compute "sum of `u8` absolute differences". |
test_all_ones_m128i | sse4.1 Tests if all bits are 1. |
test_all_zeroes_m128i | sse4.1 Returns if all masked bits are 0, `(a & mask) as u128 == 0`. |
test_mixed_ones_and_zeroes_m128i | sse4.1 Returns if, among the masked bits, there are both 0s and 1s. |
trailing_zero_count_u32 | bmi1 Counts the number of trailing zero bits in a `u32`. |
trailing_zero_count_u64 | bmi1 Counts the number of trailing zero bits in a `u64`. |
transpose_four_m128 | sse Transpose four `m128` as if they were a 4x4 matrix of `f32` lanes. |
truncate_m128_to_m128i | sse2 Truncate the `f32` lanes to `i32` lanes. |
truncate_m128d_to_m128i | sse2 Truncate the `f64` lanes to the lower `i32` lanes, upper lanes zeroed. |
truncate_to_i32_m128d_s | sse2 Truncate the lower lane into an `i32`. |
truncate_to_i64_m128d_s | sse2 Truncate the lower lane into an `i64`. |
unpack_hi_m256d | avx Unpack and interleave the high lanes. |
unpack_hi_m256 | avx Unpack and interleave the high lanes. |
unpack_high_i16_m128i | sse2 Unpack and interleave high `i16` lanes of `a` and `b`. |
unpack_high_i16_m256i | avx2 Unpack and interleave high `i16` lanes of `a` and `b`. |
unpack_high_i32_m128i | sse2 Unpack and interleave high `i32` lanes of `a` and `b`. |
unpack_high_i32_m256i | avx2 Unpack and interleave high `i32` lanes of `a` and `b`. |
unpack_high_i64_m128i | sse2 Unpack and interleave high `i64` lanes of `a` and `b`. |
unpack_high_i64_m256i | avx2 Unpack and interleave high `i64` lanes of `a` and `b`. |
unpack_high_i8_m128i | sse2 Unpack and interleave high `i8` lanes of `a` and `b`. |
unpack_high_i8_m256i | avx2 Unpack and interleave high `i8` lanes of `a` and `b`. |
unpack_high_m128 | sse Unpack and interleave high lanes of `a` and `b`. |
unpack_high_m128d | sse2 Unpack and interleave high lanes of `a` and `b`. |
unpack_lo_m256d | avx Unpack and interleave the low lanes. |
unpack_lo_m256 | avx Unpack and interleave the low lanes. |
unpack_low_i16_m128i | sse2 Unpack and interleave low `i16` lanes of `a` and `b`. |
unpack_low_i16_m256i | avx2 Unpack and interleave low `i16` lanes of `a` and `b`. |
unpack_low_i32_m128i | sse2 Unpack and interleave low `i32` lanes of `a` and `b`. |
unpack_low_i32_m256i | avx2 Unpack and interleave low `i32` lanes of `a` and `b`. |
unpack_low_i64_m128i | sse2 Unpack and interleave low `i64` lanes of `a` and `b`. |
unpack_low_i64_m256i | avx2 Unpack and interleave low `i64` lanes of `a` and `b`. |
unpack_low_i8_m128i | sse2 Unpack and interleave low `i8` lanes of `a` and `b`. |
unpack_low_i8_m256i | avx2 Unpack and interleave low `i8` lanes of `a` and `b`. |
unpack_low_m128 | sse Unpack and interleave low lanes of `a` and `b`. |
unpack_low_m128d | sse2 Unpack and interleave low lanes of `a` and `b`. |
xor_m128 | sse Bitwise `a ^ b`. |
xor_m128d | sse2 Bitwise `a ^ b`. |
xor_m128i | sse2 Bitwise `a ^ b`. |
xor_m256d | avx Bitwise `a ^ b`. |
xor_m256 | avx Bitwise `a ^ b`. |
xor_m256i | avx2 Bitwise `a ^ b`. |
zero_extend_m128d | avx Zero extend an `m128d` to `m256d`. |
zero_extend_m128 | avx Zero extend an `m128` to `m256`. |
zero_extend_m128i | avx Zero extend an `m128i` to `m256i`. |
zeroed_m128 | sse All lanes zero. |
zeroed_m128i | sse2 All lanes zero. |
zeroed_m128d | sse2 Both lanes zero. |
zeroed_m256d | avx A zeroed `m256d`. |
zeroed_m256 | avx A zeroed `m256`. |
zeroed_m256i | avx A zeroed `m256i`. |
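Because the verbose names encode the element type and the operation directly, each entry's behavior can be read off lane by lane. As a sketch only, here are plain-Rust scalar models of what two of the entries above compute (the array types and loops are illustrative; the real functions operate on `m128i` registers, come from the `safe_arch` crate, and require the named CPU feature to be enabled at compile time):

```rust
/// Scalar model of `min_u32_m128i` (sse4.1): lanewise `min(a, b)`
/// with lanes as `u32`. An `m128i` holds four `u32` lanes.
fn min_u32_lanes(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = a[i].min(b[i]);
    }
    out
}

/// Scalar model of `sub_saturating_u8_m128i` (sse2): lanewise saturating
/// `a - b` with lanes as `u8` (results clamp at 0 instead of wrapping).
/// An `m128i` holds sixteen `u8` lanes.
fn sub_saturating_u8_lanes(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a[i].saturating_sub(b[i]);
    }
    out
}

fn main() {
    // Each output lane is the smaller of the two matching input lanes.
    assert_eq!(
        min_u32_lanes([1, 200, 3, 400], [100, 2, 300, 4]),
        [1, 2, 3, 4]
    );
    // 5 - 9 saturates to 0 in every lane rather than wrapping around.
    assert_eq!(sub_saturating_u8_lanes([5; 16], [9; 16]), [0; 16]);
}
```

The SIMD versions compute the same per-lane results, just for all lanes in a single instruction.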