SIMD compress operations
This module provides compress/compact operations similar to AVX-512’s VCOMPRESS instruction. Elements where the corresponding mask bit is set are packed contiguously to the front of the destination buffer.
§CPU Feature Requirements
§Intel x86_64 - Optimal Performance (AVX-512)
- `compress_store_u32x8`/`compress_u32x8`: Requires AVX512F + AVX512VL
  - Uses `VPCOMPRESSD` instruction (`_mm256_mask_compressstoreu_epi32`)
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Shuffle-based table lookup (works on all architectures)
- `compress_store_u32x16`/`compress_u32x16`: Requires AVX512F
  - Uses `VPCOMPRESSD` instruction (`_mm512_mask_compressstoreu_epi32`)
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Two `compress_store_u32x8` operations (works on all architectures)
- `compress_store_u8x16`/`compress_u8x16`: Requires AVX512VBMI2 + AVX512VL
  - Uses `VPCOMPRESSB` instruction (`_mm256_mask_compressstoreu_epi8`)
  - Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
  - Fallback: NEON TBL shuffle on ARM, gather-style writes elsewhere
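Whether the AVX-512 paths can be taken is decidable at runtime with std's feature-detection macro. A minimal sketch of such a check (the helper names are hypothetical, and the crate's actual dispatch logic may differ):

```rust
// Hypothetical helpers illustrating runtime detection of the features
// listed above; they are not part of this crate's API.
#[cfg(target_arch = "x86_64")]
fn has_u32_compress() -> bool {
    // VPCOMPRESSD on 256-bit registers needs AVX512F plus AVX512VL.
    is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl")
}

#[cfg(target_arch = "x86_64")]
fn has_u8_compress() -> bool {
    // VPCOMPRESSB additionally requires AVX512VBMI2.
    has_u32_compress() && is_x86_feature_detected!("avx512vbmi2")
}

// Non-x86 targets never take the AVX-512 paths; they use NEON/scalar.
#[cfg(not(target_arch = "x86_64"))]
fn has_u32_compress() -> bool { false }

#[cfg(not(target_arch = "x86_64"))]
fn has_u8_compress() -> bool { false }
```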
§ARM aarch64 - NEON Optimizations (Apple Silicon M1/M2/M3)
On ARM processors, this module uses NEON-optimized implementations:
- `compress_store_u8x16`: Uses NEON `TBL` instruction via shuffle + copy
  - Eliminates 16 conditional branches from the scalar fallback
  - Uses precomputed shuffle index tables for O(1) index lookup
- `compress_store_u32x8`: Uses NEON `TBL` with byte-level shuffle indices
  - Uses `vqtbl1q_u8` for efficient 16-byte permutation (processed as 2 halves)
  - Precomputed byte-index table avoids runtime index-conversion overhead
- Bitmask expansion: Uses NEON parallel bit operations
  - Converts the bitmask to a vector mask without scalar loops
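The precomputed-table idea behind the NEON path can be illustrated in portable Rust. This is a sketch of the technique only (the real implementation shuffles bytes with `TBL`/`vqtbl1q_u8` rather than looping): a 256-entry table maps each possible 8-bit mask directly to the lane indices of its set bits, so compression becomes one table lookup plus a fixed-length permutation with no data-dependent branches.

```rust
/// Build a 256-entry table: entry `m` lists, in order, the lane indices of
/// the set bits of `m`, padded with 0. On NEON the analogous byte-index
/// table feeds a TBL shuffle instead of a scalar loop.
fn build_compress_table() -> [[u8; 8]; 256] {
    let mut table = [[0u8; 8]; 256];
    for m in 0..256usize {
        let mut k = 0;
        for lane in 0..8u8 {
            if m & (1 << lane) != 0 {
                table[m][k] = lane;
                k += 1;
            }
        }
    }
    table
}

/// Compress via the table: every lane performs a lookup, so there are no
/// per-lane branches; only the first `count` output lanes are meaningful,
/// matching the behavior of a SIMD shuffle.
fn compress_via_table(data: &[u32; 8], mask: u8, table: &[[u8; 8]; 256]) -> ([u32; 8], usize) {
    let idx = &table[mask as usize];
    let count = mask.count_ones() as usize;
    let mut out = [0u32; 8];
    for j in 0..8 {
        out[j] = data[idx[j] as usize];
    }
    (out, count)
}
```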
§Fallback Behavior
All functions automatically fall back to scalar/shuffle implementations when architecture-specific features are not available:
- x86_64 without AVX-512 (uses AVX2/SSE if available, or scalar)
- aarch64 without NEON (rare, uses scalar)
- All other architectures (scalar fallback)
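All of these paths compute the same result; the scalar fallback's semantics can be sketched as a simple loop (a reference sketch, not the crate's actual code):

```rust
/// Scalar reference for the compress operation: for each set bit `i` in
/// `mask`, copy `data[i]` to the next free slot of `out`, front-packed.
/// Returns the number of elements written.
fn compress_scalar_u32(data: &[u32; 8], mask: u8, out: &mut [u32; 8]) -> usize {
    let mut count = 0;
    for i in 0..8 {
        if mask & (1 << i) != 0 {
            out[count] = data[i];
            count += 1;
        }
    }
    count
}
```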
§Performance Impact
- AVX-512 compress instructions are 3-5× faster than shuffle-based fallbacks
- ARM NEON shuffle-based compress is ~2× faster than scalar conditional branches for typical mask densities (10-50% of elements selected)
§Example
```rust
use wide::u32x8;
use simd_lookup::simd_compress::compress_store_u32x8;

let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // Select elements at positions 1, 4, 5, 7
let mut output = [0u32; 8];
let count = compress_store_u32x8(data, mask, &mut output);
// count == 4
// output[0..4] == [20, 50, 60, 80]
```
§Functions
- `compress_store_u8x16`: Compress and store u8x16 elements where mask bits are set.
- `compress_store_u32x8`: Compress and store u32x8 elements where mask bits are set.
- `compress_store_u32x16`: Compress and store u32x16 elements where mask bits are set.
- `compress_u8x16`: Compress u8x16 and return both the compressed vector and the element count. Unwritten lanes contain undefined values.
- `compress_u32x8`: Compress u32x8 and return both the compressed vector and the element count. Unwritten lanes contain undefined values.
- `compress_u32x16`: Compress u32x16 and return both the compressed vector and the element count. Unwritten lanes contain undefined values.