Expand description
SIMD gather operations for efficient indexed memory access
This module provides vectorized gather functions that load multiple values from
memory using SIMD indices. On AVX-512, these use hardware gather instructions
(_mm512_i32gather_epi32, _mm512_mask_i32gather_epi32); on other platforms,
they fall back to scalar loops.
§CPU Feature Requirements (Intel x86_64)
§Optimal Performance (AVX-512)
-
gather_u32index_u8/gather_masked_u32index_u8: Requires AVX512F + AVX512BW- Uses
VGATHERDPS(_mm512_i32gather_epi32) +VPMOVDB(_mm512_cvtepi32_epi8) - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
- Fallback: Scalar loop (works on all architectures)
- Uses
-
gather_u32index_u32/gather_masked_u32index_u32: Requires AVX512F- Uses
VGATHERDPS(_mm512_i32gather_epi32) - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
- Fallback: Scalar loop (works on all architectures)
- Uses
§Fallback Behavior
All functions automatically fall back to scalar implementations when AVX-512 features are not available. The fallback implementations work on:
- x86_64 without AVX-512 (uses AVX2 gather if available, or scalar)
- aarch64 (ARM NEON) - scalar fallback
- All other architectures (scalar fallback)
§Functions
gather_u32index_u8- Gather 16 bytes using u32 indicesgather_masked_u32index_u8- Masked gather of bytes with fallback valuesgather_u32index_u32- Gather 16 u32 values using u32 indicesgather_masked_u32index_u32- Masked gather of u32 values with fallback
§Important: Masked Gather Behavior on Intel
When using masked gather functions, be aware of two distinct behaviors:
§1. Architectural Fault Suppression (AVX-512)
AVX-512 masked gathers are architecturally designed to suppress page faults for masked-off elements. If a masked element (mask bit = 0) points to an invalid address, it will NOT cause a page fault. This is documented in the Intel® 64 and IA-32 Architectures SDM, Vol. 1, Section 15.6.4.
This means masked gathers are safe to use when some indices may be invalid, as long as those lanes are masked off.
§2. Speculative Memory Access (Performance Reality)
Despite the mask, the hardware may still speculatively access all memory locations regardless of mask state. This was the root cause of the Gather Data Sampling (GDS) vulnerability (CVE-2022-40982).
From Intel’s GDS documentation:
“When a gather instruction performs loads from memory, different data elements are merged into the destination vector register according to the mask specified. In some situations, due to hardware optimizations specific to gather instructions, stale data from previous usage of architectural or internal vector registers may get transiently forwarded to dependent instructions without being updated by the gather loads.”
Practical implications:
- The mask does NOT reduce memory bandwidth - all lanes likely issue loads
- The mask does NOT skip cache misses on masked lanes
- Post-GDS microcode updates add latency but fix the speculation issue
§Architecture Comparison
| Feature | AVX2 Gather | AVX-512 Gather |
|---|---|---|
| Masked fault suppression | Limited/None | Architecturally guaranteed |
| Speculative access (pre-GDS) | Yes | Yes |
| Post-GDS microcode | N/A | Adds latency, fixes spec |
§When to Use Masked Gathers
Good use cases:
- Conditional semantics (keeping fallback values for some lanes)
- Fault suppression (safe to have invalid pointers in masked lanes on AVX-512)
- Avoiding branching in vectorized code
NOT useful for:
- Reducing memory bandwidth (all locations still accessed)
- Skipping expensive cache misses on masked lanes
- Performance gains from partial masking
§References
- Intel® 64 and IA-32 Architectures SDM, Vol. 1, Section 15.6.4 (AVX-512 Masking)
- Intel Gather Data Sampling (GDS) Documentation
§Example
use wide::u32x16;
use simd_lookup::simd_gather::gather_u32index_u8;
let data: Vec<u8> = (0..256).map(|i| i as u8).collect();
let indices = u32x16::from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);
let result = gather_u32index_u8(indices, &data, 1);
assert_eq!(result.to_array(), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);Functions§
- gather_
masked_ u32index_ u8 - Masked gather of 16 bytes from memory using u32 indices.
- gather_
masked_ u32index_ u32 - Masked gather of 16 u32 values from memory using u32 indices.
- gather_
u32index_ u8 - Gather 16 bytes from memory using u32 indices.
- gather_
u32index_ u32 - Gather 16 u32 values from memory using u32 indices.