BMI2 Fast Paths for Bit Manipulation
This module provides PEXT/PDEP-accelerated bit packing and unpacking operations with a fallback ladder:
- BMI2 (Intel Haswell+, AMD Zen3+): Native PEXT/PDEP
- AVX2: SIMD-based bit extraction (no PEXT)
- Scalar: Portable loop-based implementation
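The ladder above can be sketched as a small selection routine. This is only an illustration of the ordering, not the module's actual dispatch code: the `Tier` enum and the `choose_tier` and `select_tier` names are hypothetical, and the sketch ignores whether BMI2 is fast on the detected CPU.

```rust
/// Hypothetical tiers mirroring the fallback ladder (names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Bmi2,
    Avx2,
    Scalar,
}

/// Pure selection logic: prefer BMI2, then AVX2, then the scalar path.
fn choose_tier(has_bmi2: bool, has_avx2: bool) -> Tier {
    if has_bmi2 {
        Tier::Bmi2
    } else if has_avx2 {
        Tier::Avx2
    } else {
        Tier::Scalar
    }
}

/// Runtime selection using the standard library's feature detection.
/// On non-x86 targets this always falls back to the scalar tier.
fn select_tier() -> Tier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        return choose_tier(
            is_x86_feature_detected!("bmi2"),
            is_x86_feature_detected!("avx2"),
        );
    }
    #[allow(unreachable_code)]
    choose_tier(false, false)
}
```

Keeping the decision in a pure function makes the ordering easy to unit test without depending on the host CPU.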
§Operations
- PEXT (Parallel Extract): Extract bits at mask positions
- PDEP (Parallel Deposit): Deposit bits at mask positions
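To make the two operations concrete, here is a portable reference sketch of their semantics. The function names `pext_scalar` and `pdep_scalar` are illustrative assumptions, not this module's API, and the loops are written for clarity rather than speed.

```rust
/// PEXT semantics: gather the bits of `src` at the set positions of `mask`
/// and pack them contiguously into the low bits of the result.
fn pext_scalar(src: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut k = 0u32; // next output bit position
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set mask bit
        if src & lowest != 0 {
            out |= 1u64 << k;
        }
        k += 1;
        mask &= mask - 1; // clear lowest set mask bit
    }
    out
}

/// PDEP semantics: take the low bits of `src` one at a time and scatter
/// them to the set positions of `mask`, lowest position first.
fn pdep_scalar(src: u64, mut mask: u64) -> u64 {
    let mut out = 0u64;
    let mut k = 0u32; // next input bit position
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg();
        if (src >> k) & 1 != 0 {
            out |= lowest;
        }
        k += 1;
        mask &= mask - 1;
    }
    out
}
```

For example, `pext_scalar(0b1011_0110, 0b1111_0000)` yields `0b1011`, and `pdep_scalar(0b1011, 0b1111_0000)` yields `0b1011_0000`. The two are inverses on the masked bits: depositing an extracted value with the same mask reproduces `src & mask`.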
§Use Cases
- Unpacking 4-bit quantized values from packed storage
- Extracting specific dimensions from compressed vectors
- Bitmap operations for filtered candidate sets
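As an illustration of the first use case, one common way to move between packed nibbles and one-value-per-byte form is to deposit or extract with the mask 0x0F0F_0F0F_0F0F_0F0F. The sketch below uses portable stand-ins for PDEP and PEXT so that it runs anywhere. All names are hypothetical, and the low-nibble-first layout is an assumption rather than something this page specifies.

```rust
/// Mask selecting the low nibble of each of the eight bytes in a u64.
const NIBBLE_LANES: u64 = 0x0F0F_0F0F_0F0F_0F0F;

/// Portable stand-in for PDEP (scatter low bits of `src` to `mask` positions).
fn pdep_scalar(src: u64, mut mask: u64) -> u64 {
    let (mut out, mut k) = (0u64, 0u32);
    while mask != 0 {
        if (src >> k) & 1 != 0 {
            out |= mask & mask.wrapping_neg();
        }
        k += 1;
        mask &= mask - 1;
    }
    out
}

/// Portable stand-in for PEXT (gather bits of `src` at `mask` positions).
fn pext_scalar(src: u64, mut mask: u64) -> u64 {
    let (mut out, mut k) = (0u64, 0u32);
    while mask != 0 {
        if src & (mask & mask.wrapping_neg()) != 0 {
            out |= 1u64 << k;
        }
        k += 1;
        mask &= mask - 1;
    }
    out
}

/// Spread eight packed 4-bit values into eight bytes (lowest nibble first).
fn unpack_8_nibbles(packed: u32) -> [u8; 8] {
    pdep_scalar(u64::from(packed), NIBBLE_LANES).to_le_bytes()
}

/// Inverse: gather the low nibble of each byte into a packed u32.
/// Bits above the low nibble of each input byte are discarded.
fn pack_8_nibbles(values: [u8; 8]) -> u32 {
    pext_scalar(u64::from_le_bytes(values), NIBBLE_LANES) as u32
}
```

With this layout, `unpack_8_nibbles(0x8765_4321)` gives `[1, 2, 3, 4, 5, 6, 7, 8]`, and `pack_8_nibbles` reverses it. On a CPU with fast BMI2, each stand-in would correspond to a single instruction.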
§Performance Warning
On AMD Zen and Zen2, PEXT/PDEP are implemented in microcode and are slow (~18 cycles vs. 3 cycles on Intel). Use feature detection (see bmi2_fast) to choose the appropriate path.
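One way such a check could be written is to combine the BMI2 feature flag with the CPUID vendor string and family. The sketch below is not this module's implementation: the function names are hypothetical, and it assumes that AMD family 17h covers Zen through Zen2 and that family 19h and later (Zen3+) have fast PEXT/PDEP.

```rust
/// Pure decision logic (hypothetical helper): BMI2 counts as "fast" unless
/// the CPU is an AMD part older than family 0x19 (assumed to be Zen3).
fn bmi2_is_fast(vendor: &[u8; 12], family: u32, has_bmi2: bool) -> bool {
    if !has_bmi2 {
        return false;
    }
    if vendor == b"AuthenticAMD" {
        return family >= 0x19;
    }
    true
}

/// Query the running CPU via CPUID (x86_64 only).
#[cfg(target_arch = "x86_64")]
fn detect_fast_bmi2() -> bool {
    use std::arch::x86_64::__cpuid;

    if !is_x86_feature_detected!("bmi2") {
        return false;
    }
    // Leaf 0: the vendor string is stored in EBX, EDX, ECX (in that order).
    #[allow(unused_unsafe)]
    let leaf0 = unsafe { __cpuid(0) };
    let mut vendor = [0u8; 12];
    vendor[0..4].copy_from_slice(&leaf0.ebx.to_le_bytes());
    vendor[4..8].copy_from_slice(&leaf0.edx.to_le_bytes());
    vendor[8..12].copy_from_slice(&leaf0.ecx.to_le_bytes());

    // Leaf 1: the extended family is added only when the base family is 0xF.
    #[allow(unused_unsafe)]
    let leaf1 = unsafe { __cpuid(1) };
    let base = (leaf1.eax >> 8) & 0xF;
    let family = if base == 0xF {
        base + ((leaf1.eax >> 20) & 0xFF)
    } else {
        base
    };
    bmi2_is_fast(&vendor, family, true)
}

/// Other architectures have no BMI2, so the fast path is never taken.
#[cfg(not(target_arch = "x86_64"))]
fn detect_fast_bmi2() -> bool {
    false
}
```

A caller would typically evaluate this once at startup and cache the result, since CPUID is comparatively expensive and the answer does not change while the process runs.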
Functions§
- bmi2_available - Check if BMI2 is available on current CPU.
- bmi2_fast - Check if BMI2 is fast (Intel or AMD Zen3+). Returns false for AMD Zen/Zen2 where PEXT/PDEP are slow.
- deposit_4bit_batch - Deposit multiple 4-bit values using PDEP. Processes 16 values per u64 word.
- dispatch_info - Dispatch info for debugging.
- extract_4bit_batch - Extract multiple 4-bit values using PEXT. Processes 16 values per u64 word.
- pack_4bit - Pack 4-bit values into a byte array. Each input value should be 0-15.
- pack_nbits - Pack N-bit values (1-8 bits per value).
- pdep_u32
- pdep_u64 - Parallel bit deposit: deposit bits from src to positions specified by mask.
- pext_u32 - 32-bit versions.
- pext_u64 - Parallel bit extract: extract bits from src at positions specified by mask.
- unpack_4bit - Unpack 4-bit values from a byte array.
- unpack_nbits - Unpack N-bit values (1-8 bits per value).
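The signatures of these functions are not shown on this page. As a rough illustration of what N-bit packing and unpacking involve on the scalar path, here is a minimal sketch. The names `pack_nbits_sketch` and `unpack_nbits_sketch`, their parameters, and the least-significant-bit-first layout are all assumptions, not the module's API.

```rust
/// Pack values of `bits` width (1..=8) into a byte stream, LSB first.
/// Input bits above the low `bits` of each value are discarded.
fn pack_nbits_sketch(values: &[u8], bits: u32) -> Vec<u8> {
    assert!((1..=8).contains(&bits), "bits must be in 1..=8");
    let mask = (1u32 << bits) - 1;
    let mut out = Vec::with_capacity((values.len() * bits as usize + 7) / 8);
    let mut acc = 0u32; // bit accumulator
    let mut filled = 0u32; // number of valid bits in `acc`
    for &v in values {
        acc |= (u32::from(v) & mask) << filled;
        filled += bits;
        while filled >= 8 {
            out.push(acc as u8); // emit the low byte
            acc >>= 8;
            filled -= 8;
        }
    }
    if filled > 0 {
        out.push(acc as u8); // final partial byte, zero padded
    }
    out
}

/// Unpack up to `count` values of `bits` width from an LSB-first byte stream.
/// Returns fewer than `count` values if the input runs out.
fn unpack_nbits_sketch(packed: &[u8], bits: u32, count: usize) -> Vec<u8> {
    assert!((1..=8).contains(&bits), "bits must be in 1..=8");
    let mask = (1u32 << bits) - 1;
    let mut out = Vec::with_capacity(count);
    let mut acc = 0u32;
    let mut filled = 0u32;
    let mut bytes = packed.iter();
    while out.len() < count {
        while filled < bits {
            match bytes.next() {
                Some(&b) => {
                    acc |= u32::from(b) << filled;
                    filled += 8;
                }
                None => return out,
            }
        }
        out.push((acc & mask) as u8);
        acc >>= bits;
        filled -= bits;
    }
    out
}
```

With `bits = 4` this reduces to the two-values-per-byte case: `pack_nbits_sketch(&[1, 2, 3, 4], 4)` produces `[0x21, 0x43]` under the assumed layout.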