Module simd_compress

SIMD compress operations

This module provides compress/compact operations similar to AVX-512’s VCOMPRESS instruction. Elements where the corresponding mask bit is set are packed contiguously to the front of the destination buffer.

§CPU Feature Requirements

§Intel x86_64 - Optimal Performance (AVX-512)

  • compress_store_u32x8 / compress_u32x8: Requires AVX512F + AVX512VL

    • Uses VPCOMPRESSD instruction (_mm256_mask_compressstoreu_epi32)
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Shuffle-based table lookup (works on all architectures)
  • compress_store_u32x16 / compress_u32x16: Requires AVX512F

    • Uses VPCOMPRESSD instruction (_mm512_mask_compressstoreu_epi32)
    • Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
    • Fallback: Two compress_store_u32x8 operations (works on all architectures)
  • compress_store_u8x16 / compress_u8x16: Requires AVX512VBMI2 + AVX512VL

    • Uses VPCOMPRESSB instruction (_mm256_mask_compressstoreu_epi8)
    • Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
    • Fallback: NEON TBL shuffle on ARM, gather-style writes elsewhere
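The 16-lane fallback mentioned above can be sketched in portable scalar Rust: the 16-bit mask is split into two 8-bit halves, each half is compressed on its own, and the second result is appended directly after the first. The function names below are illustrative stand-ins, not the crate's actual internals.

```rust
// Hypothetical scalar stand-in for an 8-lane compress-store:
// pack the lanes whose mask bit is set to the front of `out`.
fn compress_store_u32x8_scalar(data: &[u32; 8], mask: u8, out: &mut [u32]) -> usize {
    let mut n = 0;
    for (i, &v) in data.iter().enumerate() {
        if mask & (1 << i) != 0 {
            out[n] = v;
            n += 1;
        }
    }
    n
}

// The 16-lane fallback: two 8-lane compresses, with the high half's
// output written immediately after the low half's `n` survivors.
fn compress_store_u32x16_scalar(data: &[u32; 16], mask: u16, out: &mut [u32; 16]) -> usize {
    let lo: [u32; 8] = data[..8].try_into().unwrap();
    let hi: [u32; 8] = data[8..].try_into().unwrap();
    let n = compress_store_u32x8_scalar(&lo, mask as u8, &mut out[..]);
    n + compress_store_u32x8_scalar(&hi, (mask >> 8) as u8, &mut out[n..])
}

fn main() {
    let data: [u32; 16] = core::array::from_fn(|i| (i as u32 + 1) * 10);
    let mut out = [0u32; 16];
    // Bits 0, 1, 8, 15 are set: selects elements 10, 20, 90, 160.
    let n = compress_store_u32x16_scalar(&data, 0b1000_0001_0000_0011, &mut out);
    println!("{} selected: {:?}", n, &out[..n]);
}
```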

§ARM aarch64 - NEON Optimizations (Apple Silicon M1/M2/M3)

On ARM processors, this module uses NEON-optimized implementations:

  • compress_store_u8x16: Uses NEON TBL instruction via shuffle + copy

    • Eliminates 16 conditional branches from the scalar fallback
    • Uses precomputed shuffle index tables for O(1) index lookup
  • compress_store_u32x8: Uses NEON TBL with byte-level shuffle indices

    • Uses vqtbl1q_u8 for efficient 16-byte permutation (processes as 2 halves)
    • Precomputed byte-index table avoids runtime index conversion overhead
  • Bitmask expansion: Uses NEON parallel bit operations

    • Converts bitmask to vector mask without scalar loops
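The table-driven idea behind the NEON TBL path can be illustrated in portable Rust: for every possible 8-bit mask, the lane indices of the set bits are precomputed once, so compressing becomes a single table lookup plus a branch-free permutation instead of one conditional branch per lane. This is a sketch of the technique, not the crate's implementation; all names here are hypothetical.

```rust
// Precompute, for each of the 256 possible masks, the lane indices of
// the set bits, packed to the front (unused slots stay 0).
fn build_index_table() -> [[u8; 8]; 256] {
    let mut table = [[0u8; 8]; 256];
    for mask in 0..256usize {
        let mut n = 0;
        for lane in 0..8u8 {
            if mask & (1 << lane) != 0 {
                table[mask][n] = lane;
                n += 1;
            }
        }
    }
    table
}

// Compress via one table lookup and a fixed-shape permutation.
// The per-lane gather loop below is what a single TBL instruction
// (e.g. vqtbl1q_u8) performs in hardware.
fn compress_u32x8_table(data: [u32; 8], mask: u8, table: &[[u8; 8]; 256]) -> ([u32; 8], usize) {
    let idx = &table[mask as usize];
    let mut out = [0u32; 8];
    for i in 0..8 {
        out[i] = data[idx[i] as usize];
    }
    (out, mask.count_ones() as usize)
}

fn main() {
    let table = build_index_table();
    let data = [10u32, 20, 30, 40, 50, 60, 70, 80];
    // Bits 1, 4, 5, 7 are set: selects 20, 50, 60, 80.
    let (out, n) = compress_u32x8_table(data, 0b1011_0010, &table);
    println!("{} selected: {:?}", n, &out[..n]);
}
```

Note that lanes past the survivor count hold whatever the permutation put there, mirroring the "unwritten lanes contain undefined values" contract of the `compress_*` functions below.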

§Fallback Behavior

All functions automatically fall back to scalar/shuffle implementations when architecture-specific features are not available:

  • x86_64 without AVX-512 (uses AVX2/SSE if available, or scalar)
  • aarch64 without NEON (rare, uses scalar)
  • All other architectures (scalar fallback)
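A runtime-dispatch skeleton for this fallback behavior might look like the sketch below (hypothetical structure; the crate's real dispatch may differ). The AVX-512 check is both `cfg`-gated and tested at runtime, so the function compiles and runs on every architecture, falling through to the scalar path when the features are absent.

```rust
// Hypothetical dispatch sketch for an 8-lane compress-store.
fn compress_store_u32x8(data: [u32; 8], mask: u8, out: &mut [u32; 8]) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl") {
            // A real implementation would return from an unsafe
            // _mm256_mask_compressstoreu_epi32-based path here.
        }
    }
    // Portable scalar fallback: one conditional write per lane.
    let mut n = 0;
    for (i, &v) in data.iter().enumerate() {
        if mask & (1 << i) != 0 {
            out[n] = v;
            n += 1;
        }
    }
    n
}

fn main() {
    let data = [10u32, 20, 30, 40, 50, 60, 70, 80];
    let mut out = [0u32; 8];
    let n = compress_store_u32x8(data, 0b1011_0010, &mut out);
    println!("{} selected: {:?}", n, &out[..n]);
}
```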

§Performance Impact

  • AVX-512 compress instructions are 3-5× faster than shuffle-based fallbacks
  • ARM NEON shuffle-based compress is ~2× faster than scalar conditional branches for typical mask densities (10-50% of elements selected)

§Example

use wide::u32x8;
use simd_lookup::simd_compress::compress_store_u32x8;

let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // Select elements at positions 1, 4, 5, 7
let mut output = [0u32; 8];

let count = compress_store_u32x8(data, mask, &mut output);
// count == 4
// output[0..4] == [20, 50, 60, 80]

§Functions

compress_store_u8x16
Compress and store u8x16 elements where mask bits are set.
compress_store_u32x8
Compress and store u32x8 elements where mask bits are set.
compress_store_u32x16
Compress and store u32x16 elements where mask bits are set.
compress_u8x16
Compress u8x16 and return both the compressed vector and element count. Unwritten lanes contain undefined values.
compress_u32x8
Compress u32x8 and return both the compressed vector and element count. Unwritten lanes contain undefined values.
compress_u32x16
Compress u32x16 and return both the compressed vector and element count. Unwritten lanes contain undefined values.