Module simd_compress

SIMD compress operations

This module provides compress/compact operations similar to AVX-512’s VCOMPRESS instruction. Elements where the corresponding mask bit is set are packed contiguously to the front of the destination buffer.

§CPU Feature Requirements

§Intel x86_64 - Optimal Performance (AVX-512)

  • compress_store_u32x8 / compress_u32x8: Requires AVX512F + AVX512VL

    • Uses VPCOMPRESSD instruction (_mm256_mask_compressstoreu_epi32)
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Shuffle-based table lookup (works on all architectures)
  • compress_store_u32x16 / compress_u32x16: Requires AVX512F

    • Uses VPCOMPRESSD instruction (_mm512_mask_compressstoreu_epi32)
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Two compress_store_u32x8 operations (works on all architectures)
  • compress_store_u8x16 / compress_u8x16: Requires AVX512VBMI2 + AVX512VL

    • Uses VPCOMPRESSB instruction (_mm256_mask_compressstoreu_epi8)
    • Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
    • Fallback: NEON TBL shuffle on ARM, gather-style writes elsewhere
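The 16-lane fallback mentioned above can be sketched in portable scalar Rust: the 16-bit mask is split into two 8-bit halves, each half is compressed on its own, and the second result is appended directly after the first. The function names below are illustrative stand-ins, not the crate's actual internals.

```rust
// Hypothetical scalar stand-in for an 8-lane compress-store:
// pack the lanes whose mask bit is set to the front of `out`.
fn compress_store_u32x8_scalar(data: &[u32; 8], mask: u8, out: &mut [u32]) -> usize {
    let mut n = 0;
    for (i, &v) in data.iter().enumerate() {
        if mask & (1 << i) != 0 {
            out[n] = v;
            n += 1;
        }
    }
    n
}

// The 16-lane fallback: two 8-lane compresses, with the high half's
// output written immediately after the low half's `n` survivors.
fn compress_store_u32x16_scalar(data: &[u32; 16], mask: u16, out: &mut [u32; 16]) -> usize {
    let lo: [u32; 8] = data[..8].try_into().unwrap();
    let hi: [u32; 8] = data[8..].try_into().unwrap();
    let n = compress_store_u32x8_scalar(&lo, mask as u8, &mut out[..]);
    n + compress_store_u32x8_scalar(&hi, (mask >> 8) as u8, &mut out[n..])
}

fn main() {
    let data: [u32; 16] = core::array::from_fn(|i| (i as u32 + 1) * 10);
    let mut out = [0u32; 16];
    // Bits 0, 1, 8, 15 are set: selects elements 10, 20, 90, 160.
    let n = compress_store_u32x16_scalar(&data, 0b1000_0001_0000_0011, &mut out);
    println!("{} selected: {:?}", n, &out[..n]);
}
```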

§ARM aarch64 - NEON Optimizations (Apple Silicon M1/M2/M3)

On ARM processors, this module uses NEON-optimized implementations:

  • compress_store_u8x16: Uses NEON TBL instruction via shuffle + copy

    • Eliminates 16 conditional branches from the scalar fallback
    • Uses precomputed shuffle index tables for O(1) index lookup
  • compress_store_u32x8: Uses NEON TBL with byte-level shuffle indices

    • Uses vqtbl1q_u8 for efficient 16-byte permutation (processes as 2 halves)
    • Precomputed byte-index table avoids runtime index conversion overhead
  • Bitmask expansion: Uses NEON parallel bit operations

    • Converts bitmask to vector mask without scalar loops
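The table-driven idea behind the NEON TBL path can be illustrated in portable Rust: for every possible 8-bit mask, the lane indices of the set bits are precomputed once, so compressing becomes a single table lookup plus a branch-free permutation instead of one conditional branch per lane. This is a sketch of the technique, not the crate's implementation; all names here are hypothetical.

```rust
// Precompute, for each of the 256 possible masks, the lane indices of
// the set bits, packed to the front (unused slots stay 0).
fn build_index_table() -> [[u8; 8]; 256] {
    let mut table = [[0u8; 8]; 256];
    for mask in 0..256usize {
        let mut n = 0;
        for lane in 0..8u8 {
            if mask & (1 << lane) != 0 {
                table[mask][n] = lane;
                n += 1;
            }
        }
    }
    table
}

// Compress via one table lookup and a fixed-shape permutation.
// The per-lane gather loop below is what a single TBL instruction
// (e.g. vqtbl1q_u8) performs in hardware.
fn compress_u32x8_table(data: [u32; 8], mask: u8, table: &[[u8; 8]; 256]) -> ([u32; 8], usize) {
    let idx = &table[mask as usize];
    let mut out = [0u32; 8];
    for i in 0..8 {
        out[i] = data[idx[i] as usize];
    }
    (out, mask.count_ones() as usize)
}

fn main() {
    let table = build_index_table();
    let data = [10u32, 20, 30, 40, 50, 60, 70, 80];
    // Bits 1, 4, 5, 7 are set: selects 20, 50, 60, 80.
    let (out, n) = compress_u32x8_table(data, 0b1011_0010, &table);
    println!("{} selected: {:?}", n, &out[..n]);
}
```

Note that lanes past the survivor count hold whatever the permutation put there, mirroring the "unwritten lanes contain undefined values" contract of the `compress_*` functions below.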

§Fallback Behavior

All functions automatically fall back to scalar/shuffle implementations when architecture-specific features are not available:

  • x86_64 without AVX-512 (uses AVX2/SSE if available, or scalar)
  • aarch64 without NEON (rare, uses scalar)
  • All other architectures (scalar fallback)
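A runtime-dispatch skeleton for this fallback behavior might look like the sketch below (hypothetical structure; the crate's real dispatch may differ). The AVX-512 check is both `cfg`-gated and tested at runtime, so the function compiles and runs on every architecture, falling through to the scalar path when the features are absent.

```rust
// Hypothetical dispatch sketch for an 8-lane compress-store.
fn compress_store_u32x8(data: [u32; 8], mask: u8, out: &mut [u32; 8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl") {
            // A real implementation would return from an unsafe
            // _mm256_mask_compressstoreu_epi32-based path here.
        }
    }
    // Portable scalar fallback: one conditional write per lane.
    let mut n = 0;
    for (i, &v) in data.iter().enumerate() {
        if mask & (1 << i) != 0 {
            out[n] = v;
            n += 1;
        }
    }
    n
}

fn main() {
    let data = [10u32, 20, 30, 40, 50, 60, 70, 80];
    let mut out = [0u32; 8];
    let n = compress_store_u32x8(data, 0b1011_0010, &mut out);
    println!("{} selected: {:?}", n, &out[..n]);
}
```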

§Performance Impact

  • AVX-512 compress instructions are 3-5× faster than shuffle-based fallbacks
  • ARM NEON shuffle-based compress is ~2× faster than scalar conditional branches for typical mask densities (10-50% of elements selected)

§Example

use wide::u32x8;
use simd_lookup::simd_compress::compress_store_u32x8;

let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // Select elements at positions 1, 4, 5, 7
let mut output = [0u32; 8];

let count = compress_store_u32x8(data, mask, &mut output);
// count == 4
// output[0..4] == [20, 50, 60, 80]

§Functions

compress_store_u8x16
Compress and store u8x16 elements where mask bits are set.
compress_store_u32x8
Compress and store u32x8 elements where mask bits are set.
compress_store_u32x16
Compress and store u32x16 elements where mask bits are set.
compress_u8x16
Compress u8x16 and return both the compressed vector and element count. Unwritten lanes contain undefined values.
compress_u32x8
Compress u32x8 and return both the compressed vector and element count. Unwritten lanes contain undefined values.
compress_u32x16
Compress u32x16 and return both the compressed vector and element count. Unwritten lanes contain undefined values.