Module bmi2_paths

BMI2 Fast Paths for Bit Manipulation

This module provides PEXT/PDEP-accelerated bit packing and unpacking operations with a three-tier fallback ladder:

  1. BMI2 where PEXT/PDEP are fast (Intel Haswell+, AMD Zen 3+): native PEXT/PDEP instructions
  2. AVX2: SIMD-based bit extraction (no PEXT)
  3. Scalar: Portable loop-based implementation
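A minimal sketch of how the top and bottom rungs of this ladder can be selected at run time, using the standard library's is_x86_feature_detected! macro and the std::arch::x86_64::_pext_u64 intrinsic. The function names are hypothetical rather than this module's API, the AVX2 tier is omitted for brevity, and a production version would also apply the fast-PEXT check described under Performance Warning.

```rust
/// Portable fallback (hypothetical name): test each of the 64 mask positions
/// in turn and append the selected source bits to the low end of the result.
fn pext_u64_scalar(src: u64, mask: u64) -> u64 {
    let mut out = 0u64;
    let mut k = 0u32; // next output bit position
    for i in 0..64u32 {
        if (mask >> i) & 1 == 1 {
            out |= ((src >> i) & 1) << k;
            k += 1;
        }
    }
    out
}

/// BMI2 path: compiled with the feature enabled, so the intrinsic lowers to one PEXT.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn pext_u64_bmi2(src: u64, mask: u64) -> u64 {
    std::arch::x86_64::_pext_u64(src, mask)
}

/// Pick the best available path at run time (AVX2 tier omitted in this sketch).
pub fn pext_u64_dispatch(src: u64, mask: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        // The std macro caches the CPUID probe, so repeated calls are cheap.
        if is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was confirmed at run time just above.
            return unsafe { pext_u64_bmi2(src, mask) };
        }
    }
    pext_u64_scalar(src, mask)
}
```

For example, pext_u64_dispatch(0x1234_5678, 0xFF00) returns 0x56 on either path, because the mask selects bits 8 to 15.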

§Operations

  • PEXT (Parallel Extract): Extract bits at mask positions
  • PDEP (Parallel Deposit): Deposit bits at mask positions
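Both operations can be pinned down with a scalar reference sketch (the helpers pext and pdep here are hypothetical, not this module's functions) that walks the set bits of the mask from least to most significant. PEXT packs the selected bits of src into the low end of the result; PDEP does the inverse, so depositing what was extracted with the same mask returns src & mask.

```rust
/// Scalar PEXT: gather the bits of `src` selected by `mask` into the
/// low-order bits of the result, preserving their order.
fn pext(src: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut k = 0u32; // next output bit
    while mask != 0 {
        let bit = mask & mask.wrapping_neg(); // lowest set bit of mask
        if src & bit != 0 {
            out |= 1 << k;
        }
        k += 1;
        mask &= mask - 1; // clear that bit
    }
    out
}

/// Scalar PDEP: scatter the low-order bits of `src` to the positions of the
/// set bits of `mask`, lowest first; all other result bits are zero.
fn pdep(src: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut k = 0u32; // next input bit
    while mask != 0 {
        let bit = mask & mask.wrapping_neg(); // lowest set bit of mask
        if (src >> k) & 1 != 0 {
            out |= bit;
        }
        k += 1;
        mask &= mask - 1; // clear that bit
    }
    out
}
```

With mask 0b1110_0000, pdep(0b101, mask) yields 0b1010_0000, and pext(0b1010_0000, mask) recovers 0b101.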

§Use Cases

  • Unpacking 4-bit quantized values from packed storage
  • Extracting specific dimensions from compressed vectors
  • Bitmap operations for filtered candidate sets
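For the first use case, a single PDEP with the repeating mask 0x0F0F_0F0F_0F0F_0F0F spreads eight packed 4-bit codes into eight bytes, one code per byte. The sketch below assumes codes are stored low nibble first, which may differ from this module's layout, and uses a scalar stand-in where a BMI2 build would call _pdep_u64; unpack_8_nibbles is a hypothetical name.

```rust
/// Portable stand-in for the BMI2 `_pdep_u64` intrinsic.
fn pdep(src: u64, mut mask: u64) -> u64 {
    let (mut out, mut k) = (0u64, 0u32);
    while mask != 0 {
        let bit = mask & mask.wrapping_neg(); // lowest set bit of mask
        if (src >> k) & 1 != 0 {
            out |= bit;
        }
        k += 1;
        mask &= mask - 1;
    }
    out
}

/// Hypothetical helper: unpack eight 4-bit codes stored low-nibble-first in
/// `packed`. The 0x0F-per-byte mask moves nibble i into the low half of byte i.
fn unpack_8_nibbles(packed: u32) -> [u8; 8] {
    pdep(u64::from(packed), 0x0F0F_0F0F_0F0F_0F0F).to_le_bytes()
}
```

Under that layout, unpack_8_nibbles(0x8765_4321) returns [1, 2, 3, 4, 5, 6, 7, 8].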

§Performance Warning

AMD Zen and Zen 2 implement PEXT/PDEP in microcode, which makes them much slower (~18 cycles vs. 3 cycles on Intel). Use runtime feature detection (bmi2_fast, not just bmi2_available) to choose the appropriate path.
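One way to tell the slow microcoded parts apart is CPUID: Zen, Zen+ and Zen 2 report AMD family 0x17, while Zen 3 and Zen 4 report 0x19. The sketch below is an assumption about how a check like bmi2_fast could work, not its actual implementation; bmi2_fast_sketch and its helpers are hypothetical names, and it only recognizes the AMD vendor string.

```rust
/// Decode the display family from CPUID leaf 1 EAX: the extended family is
/// added only when the base family field is 0xF.
fn cpu_family(eax: u32) -> u32 {
    let base = (eax >> 8) & 0xF;
    let ext = (eax >> 20) & 0xFF;
    if base == 0xF { base + ext } else { base }
}

/// AMD family 0x17 covers Zen, Zen+ and Zen 2, which run PEXT/PDEP in microcode.
fn is_slow_pext_family(vendor_is_amd: bool, family: u32) -> bool {
    vendor_is_amd && family == 0x17
}

#[cfg(target_arch = "x86_64")]
pub fn bmi2_fast_sketch() -> bool {
    use std::arch::x86_64::__cpuid;
    if !is_x86_feature_detected!("bmi2") {
        return false;
    }
    // SAFETY: the CPUID instruction is available on every x86_64 CPU.
    let (leaf0, leaf1) = unsafe { (__cpuid(0), __cpuid(1)) };
    // The 12-byte vendor string is stored in EBX, EDX, ECX order.
    let mut vendor = [0u8; 12];
    vendor[0..4].copy_from_slice(&leaf0.ebx.to_le_bytes());
    vendor[4..8].copy_from_slice(&leaf0.edx.to_le_bytes());
    vendor[8..12].copy_from_slice(&leaf0.ecx.to_le_bytes());
    let amd = &vendor == b"AuthenticAMD";
    !is_slow_pext_family(amd, cpu_family(leaf1.eax))
}

#[cfg(not(target_arch = "x86_64"))]
pub fn bmi2_fast_sketch() -> bool {
    false // no BMI2 outside x86_64
}
```

For instance, a leaf 1 EAX of 0x0087_0F10 decodes to family 0x17 (treated as slow), and 0x00A2_0F10 decodes to family 0x19 (treated as fast).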

Functions§

bmi2_available
Check whether BMI2 is available on the current CPU.
bmi2_fast
Check if BMI2 is fast (Intel or AMD Zen3+). Returns false for AMD Zen/Zen2 where PEXT/PDEP are slow.
deposit_4bit_batch
Deposit multiple 4-bit values using PDEP. Processes 16 values per u64 word.
dispatch_info
Return information about the selected dispatch path, for debugging.
extract_4bit_batch
Extract multiple 4-bit values using PEXT. Processes 16 values per u64 word.
pack_4bit
Pack 4-bit values into a byte array. Each input value should be 0-15.
pack_nbits
Pack N-bit values (1-8 bits per value).
pdep_u32
32-bit version of pdep_u64.
pdep_u64
Parallel bit deposit: deposit bits from src to positions specified by mask.
pext_u32
32-bit version of pext_u64.
pext_u64
Parallel bit extract: extract bits from src at positions specified by mask.
unpack_4bit
Unpack 4-bit values from a byte array.
unpack_nbits
Unpack N-bit values (1-8 bits per value).
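The N-bit packing functions imply a bit-stream layout. A plausible scalar layout, assumed here rather than taken from this module, stores value i LSB-first at bit offset i * N, so a value may straddle a byte boundary; pack_nbits_sketch and unpack_nbits_sketch are hypothetical names. With N = 4 this reduces to low-nibble-first packing.

```rust
/// Hypothetical scalar N-bit packer (1..=8 bits per value), LSB-first:
/// value i occupies bits i*bits .. i*bits + bits of the output stream.
fn pack_nbits_sketch(values: &[u8], bits: u32) -> Vec<u8> {
    assert!((1..=8).contains(&bits));
    let total_bits = values.len() * bits as usize;
    let mut out = vec![0u8; total_bits.div_ceil(8)];
    let mask = ((1u16 << bits) - 1) as u8;
    for (i, &v) in values.iter().enumerate() {
        let pos = i * bits as usize;
        // A value may straddle two bytes, so shift it within a 16-bit window.
        let word = u16::from(v & mask) << (pos % 8);
        out[pos / 8] |= word as u8;
        if pos % 8 + bits as usize > 8 {
            out[pos / 8 + 1] |= (word >> 8) as u8;
        }
    }
    out
}

/// Inverse of `pack_nbits_sketch`: read `count` values of `bits` bits each.
fn unpack_nbits_sketch(packed: &[u8], bits: u32, count: usize) -> Vec<u8> {
    assert!((1..=8).contains(&bits));
    let mask = (1u16 << bits) - 1;
    (0..count)
        .map(|i| {
            let pos = i * bits as usize;
            let lo = u16::from(packed[pos / 8]);
            // The next byte may not exist for the final value.
            let hi = packed.get(pos / 8 + 1).map_or(0, |&b| u16::from(b));
            (((hi << 8 | lo) >> (pos % 8)) & mask) as u8
        })
        .collect()
}
```

Packing [5, 2, 7, 1] at 3 bits per value yields [0xD5, 0x03]: the third value straddles the byte boundary, and unpacking those two bytes restores the input.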