Module simd_gather


SIMD gather operations for efficient indexed memory access

This module provides vectorized gather functions that load multiple values from memory using SIMD indices. On AVX-512, these use hardware gather instructions (_mm512_i32gather_epi32, _mm512_mask_i32gather_epi32); on other platforms, they fall back to scalar loops.

§CPU Feature Requirements (Intel x86_64)

§Optimal Performance (AVX-512)

  • gather_u32index_u8 / gather_masked_u32index_u8: Requires AVX512F + AVX512BW

    • Uses VPGATHERDD (_mm512_i32gather_epi32) + VPMOVDB (_mm512_cvtepi32_epi8)
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Scalar loop (works on all architectures)
  • gather_u32index_u32 / gather_masked_u32index_u32: Requires AVX512F

    • Uses VPGATHERDD (_mm512_i32gather_epi32)
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Scalar loop (works on all architectures)

§Fallback Behavior

All functions automatically fall back to scalar implementations when AVX-512 features are not available. The fallback implementations work on:

  • x86_64 without AVX-512 (uses AVX2 gather if available, or scalar)
  • aarch64 with ARM NEON (scalar fallback)
  • All other architectures (scalar fallback)
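The fallback rule above can be sketched as a runtime check followed by a plain loop. This is a minimal illustration, not the crate's implementation: `gather_u8_scalar` and `selected_path` are hypothetical names, and the sketch assumes dispatch happens at runtime via the standard `is_x86_feature_detected!` macro (the crate may instead use compile-time `cfg` checks).

```rust
/// Hypothetical scalar fallback: gathers 16 bytes, one lane at a time.
/// Panics on an out-of-bounds index, like any slice access.
fn gather_u8_scalar(indices: [u32; 16], data: &[u8]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (lane, &idx) in indices.iter().enumerate() {
        out[lane] = data[idx as usize];
    }
    out
}

/// Hypothetical helper: reports which path a dispatcher would pick
/// for the u8 gather on the current CPU.
fn selected_path() -> &'static str {
    // The detection macro only exists on x86 targets, so guard it with cfg.
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx512f")
            && std::is_x86_feature_detected!("avx512bw")
        {
            return "avx512";
        }
    }
    // Every other architecture, and x86_64 without AVX-512, lands here.
    "scalar"
}
```

On aarch64 and other non-x86 targets the `cfg` block is compiled out, so `selected_path` always returns "scalar".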

§Functions

§Important: Masked Gather Behavior on Intel

When using masked gather functions, be aware of two distinct behaviors:

§1. Architectural Fault Suppression (AVX-512)

AVX-512 masked gathers are architecturally designed to suppress page faults for masked-off elements. If a masked element (mask bit = 0) points to an invalid address, it will NOT cause a page fault. This is documented in the Intel® 64 and IA-32 Architectures SDM, Vol. 1, Section 15.6.4.

This means masked gathers are safe to use when some indices may be invalid, as long as those lanes are masked off.
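The architectural contract can be modeled in scalar code: a lane whose mask bit is 0 keeps its fallback value and its index is never dereferenced. The sketch below is illustrative only; `gather_masked_model` and its parameter order are assumptions and do not reflect the signatures of the crate's masked gather functions.

```rust
/// Scalar model of a masked 16-lane gather (hypothetical helper).
/// Bit `lane` of `mask` selects whether that lane is loaded from `data`.
fn gather_masked_model(
    indices: [u32; 16],
    data: &[u32],
    mask: u16,
    fallback: [u32; 16],
) -> [u32; 16] {
    // Start from the fallback so masked-off lanes are already correct.
    let mut out = fallback;
    for lane in 0..16 {
        if (mask >> lane) & 1 == 1 {
            // Only active lanes touch memory; an invalid index in a
            // masked-off lane is never used.
            out[lane] = data[indices[lane] as usize];
        }
    }
    out
}
```

In this model, a call with `mask = 0x000F` reads only lanes 0 through 3, so the remaining twelve indices may hold any value, including `u32::MAX`.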

§2. Speculative Memory Access (Performance Reality)

The hardware may still speculatively access every lane's memory location, regardless of mask state. This was the root cause of the Gather Data Sampling (GDS) vulnerability (CVE-2022-40982).

From Intel’s GDS documentation:

“When a gather instruction performs loads from memory, different data elements are merged into the destination vector register according to the mask specified. In some situations, due to hardware optimizations specific to gather instructions, stale data from previous usage of architectural or internal vector registers may get transiently forwarded to dependent instructions without being updated by the gather loads.”

Practical implications:

  • The mask does NOT reduce memory bandwidth - all lanes likely issue loads
  • The mask does NOT skip cache misses on masked lanes
  • Post-GDS microcode updates add latency but fix the speculation issue

§Architecture Comparison

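| Feature                      | AVX2 Gather  | AVX-512 Gather             |
|------------------------------|--------------|----------------------------|
| Masked fault suppression     | Limited/None | Architecturally guaranteed |
| Speculative access (pre-GDS) | Yes          | Yes                        |
| Post-GDS microcode           | N/A          | Adds latency, fixes spec   |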

§When to Use Masked Gathers

Good use cases:

  • Conditional semantics (keeping fallback values for some lanes)
  • Fault suppression (safe to have invalid pointers in masked lanes on AVX-512)
  • Avoiding branching in vectorized code
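One way to combine these use cases is to derive the mask from a bounds check, so that out-of-range indices are masked off rather than branched around. The helper below is a hypothetical sketch: `in_bounds_mask` is not part of this module, and it assumes the mask is a `u16` with one bit per lane, lane 0 in the least significant bit.

```rust
/// Hypothetical helper: builds a 16-bit lane mask in which bit `lane`
/// is set when `indices[lane]` is a valid index into a buffer of `len`.
fn in_bounds_mask(indices: &[u32; 16], len: usize) -> u16 {
    let mut mask = 0u16;
    for (lane, &idx) in indices.iter().enumerate() {
        // The comparison yields 0 or 1, so no per-lane branch is needed.
        mask |= (((idx as usize) < len) as u16) << lane;
    }
    mask
}
```

The resulting mask could then be passed to a masked gather together with a fallback vector, leaving invalid lanes at their fallback values.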

NOT useful for:

  • Reducing memory bandwidth (all locations still accessed)
  • Skipping expensive cache misses on masked lanes
  • Performance gains from partial masking

§References

§Example

use wide::u32x16;
use simd_lookup::simd_gather::gather_u32index_u8;

let data: Vec<u8> = (0..256).map(|i| i as u8).collect();
let indices = u32x16::from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);
let result = gather_u32index_u8(indices, &data, 1);
assert_eq!(result.to_array(), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);

Functions§

gather_masked_u32index_u8
Masked gather of 16 bytes from memory using u32 indices.
gather_masked_u32index_u32
Masked gather of 16 u32 values from memory using u32 indices.
gather_u32index_u8
Gather 16 bytes from memory using u32 indices.
gather_u32index_u32
Gather 16 u32 values from memory using u32 indices.