# simd-lookup
High-performance SIMD utilities for fast table lookups, compression and data processing in Rust.
## Features

- Cross-platform SIMD: Automatic dispatch to the optimal implementation (AVX-512, AVX2, NEON)
- Zero-cost abstractions: Thin wrappers over platform intrinsics via the `wide` crate
- Comprehensive utilities: Compress, shuffle, widen, split, and bitmask operations
## CPU Feature Requirements
This crate automatically detects and uses the best available CPU features, with fallbacks for older CPUs. The crate is optimized for both ARM NEON (aarch64) and Intel AVX-512 (x86_64) architectures.
Note: Table64 is primarily optimized for ARM NEON using the TBL4 instruction, which provides
excellent performance on Apple Silicon and other ARMv8+ CPUs. On Intel x86_64, it requires newer AVX-512
features (Ice Lake+).
### Summary Table

| Module/Feature | Required CPU Features | Available CPUs | Fallback |
|---|---|---|---|
| `simd_compress` (`compress_store_u32x8`) | AVX512F + AVX512VL (x86), NEON TBL2 (ARM) | Skylake-X+, Ice Lake+, all ARM | NEON TBL on ARM, shuffle table elsewhere |
| `simd_compress` (`compress_store_u32x16`) | AVX512F | Skylake-X+, Ice Lake+ | Two u32x8 compresses |
| `simd_compress` (`compress_store_u8x16`) | AVX512VBMI2 + AVX512VL (x86), NEON TBL (ARM) | Ice Lake+, Tiger Lake+, all ARM | NEON TBL on ARM, gather-style writes elsewhere |
| `simd_gather` (`gather_u32index_u8`) | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar loop |
| `simd_gather` (`gather_u32index_u32`) | AVX512F | Skylake-X+, Ice Lake+ | Scalar loop |
| `Table64` | ARM NEON TBL4 (aarch64) or AVX512BW + AVX512VBMI (x86_64) | All ARMv8+ (Apple Silicon), Ice Lake+ | Scalar lookup (x86_64 only) |
| `Table2dU8xU8` | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar lookup |
| Cascading Lookup Kernel | AVX512F + AVX512VL + AVX512BW + AVX512VBMI2 | Ice Lake+, Tiger Lake+ | Scalar lookup |
## Detailed Requirements

### SIMD Compress Kernels (`simd_compress` module)
- `compress_store_u32x8`:
  - Intel x86_64: Requires AVX512F + AVX512VL, uses the `VPCOMPRESSD` instruction
  - ARM aarch64: Uses NEON TBL2 with precomputed byte-level shuffle indices
    - Eliminates 8 conditional branches from the scalar fallback
    - 256×32 byte lookup table for O(1) index computation
  - Available on: Intel Skylake-X+, all ARMv8+ (Apple Silicon M1/M2/M3)
  - Fallback: Shuffle-based table lookup (other architectures); a portable sketch of the shuffle-table approach follows this list

- `compress_store_u32x16`: Requires AVX512F
  - Uses the `VPCOMPRESSD` instruction (512-bit variant)
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Two `compress_store_u32x8` operations

- `compress_store_u8x16`:
  - Intel x86_64: Requires AVX512VBMI2 + AVX512VL, uses the `VPCOMPRESSB` instruction
  - ARM aarch64: Uses NEON TBL (`vqtbl1q_u8`) with precomputed shuffle indices
    - Eliminates 16 conditional branches from the scalar fallback
    - 1 MB lookup table (65536×16 bytes) for O(1) index computation
    - A single TBL instruction performs the entire 16-byte shuffle
  - Available on: Intel Ice Lake+, all ARMv8+ (Apple Silicon M1/M2/M3)
  - Fallback: Gather-style direct writes (other architectures)
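To make the shuffle-table fallback concrete, here is a minimal portable sketch of the technique described above: each possible 8-bit mask indexes a precomputed entry listing the positions of its set bits, so compaction becomes one table lookup plus a fixed-length permute instead of eight data-dependent branches. The `build_compress_table` and `compress8` helpers below are illustrative only and are not the crate's internals.

```rust
/// Illustrative only: build the 256-entry index table described above.
/// Entry `m` lists the positions of the set bits of `m`; unused slots
/// are padded with a fixed lane index.
fn build_compress_table() -> [[u8; 8]; 256] {
    let mut table = [[7u8; 8]; 256];
    for mask in 0..256usize {
        let mut out = 0;
        for bit in 0..8u8 {
            if mask & (1 << bit) != 0 {
                table[mask][out] = bit;
                out += 1;
            }
        }
    }
    table
}

/// Branch-free in spirit: one table lookup plus a fixed-length permute,
/// instead of eight data-dependent branches.
fn compress8(table: &[[u8; 8]; 256], data: [u32; 8], mask: u8) -> ([u32; 8], usize) {
    let idx = &table[mask as usize];
    let mut out = [0u32; 8];
    for lane in 0..8 {
        out[lane] = data[idx[lane] as usize]; // on NEON this loop is a single TBL2 permute
    }
    (out, mask.count_ones() as usize)
}
```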
### SIMD Gather Operations (`simd_gather` module)

- `gather_u32index_u8`: Requires AVX512F + AVX512BW
  - Uses the `VGATHERDPS` + `VPMOVDB` instructions
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Scalar loop (equivalent to the sketch after this list)

- `gather_u32index_u32`: Requires AVX512F
  - Uses the `VGATHERDPS` instruction
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Scalar loop
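For reference, the scalar fallback of a u32-indexed u8 gather is just one indexed load per lane; the AVX-512 path performs the same loads in parallel. The `gather_u8_scalar` helper below is a hypothetical illustration of what `gather_u32index_u8` computes, not the crate's actual signature.

```rust
/// Hypothetical scalar equivalent of a u32-indexed u8 gather:
/// out[lane] = table[indices[lane]] for every lane.
fn gather_u8_scalar(table: &[u8], indices: &[u32; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (lane, &idx) in indices.iter().enumerate() {
        // AVX-512 replaces this loop with VGATHERDPS + VPMOVDB.
        out[lane] = table[idx as usize];
    }
    out
}
```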
### Small Table Lookups (`small_table` module)

- `Table64`: Highly optimized for ARM NEON (the primary optimization target)
  - ARM aarch64 (Apple Silicon, etc.): Uses the ARM NEON `TBL4` instruction (`vqtbl4q_u8`)
    - Native hardware support on all ARMv8+ CPUs (including Apple M1/M2/M3)
    - Extremely efficient single-instruction 64-byte table lookup (see the intrinsics sketch after this list)
    - No fallback needed - full SIMD acceleration on ARM
  - Intel x86_64: Requires AVX512BW + AVX512VBMI
    - Uses the `VPERMB` instruction (`_mm512_permutexvar_epi8`) for 64-byte table lookups
    - Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
    - Fallback: Scalar lookup (works on all x86_64 CPUs)

- `Table2dU8xU8`: Requires AVX512F + AVX512BW (via `simd_gather`)
  - Uses `VGATHERDPS` + `VPMOVDB` for parallel lookups
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Scalar lookup
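To show why a 64-byte table maps so naturally onto TBL4, the sketch below drives the raw `core::arch::aarch64` intrinsics directly; it illustrates the instruction itself and is not `Table64`'s implementation. The table is loaded into four 16-byte registers and a single `vqtbl4q_u8` call resolves 16 lookups at once.

```rust
#[cfg(target_arch = "aarch64")]
unsafe fn tbl4_lookup_16(table: &[u8; 64], indices: &[u8; 16]) -> [u8; 16] {
    use core::arch::aarch64::*;

    // Load the 64-byte table into the four 16-byte registers TBL4 expects.
    let tbl = uint8x16x4_t(
        vld1q_u8(table.as_ptr()),
        vld1q_u8(table.as_ptr().add(16)),
        vld1q_u8(table.as_ptr().add(32)),
        vld1q_u8(table.as_ptr().add(48)),
    );
    let idx = vld1q_u8(indices.as_ptr());

    // One TBL4 instruction: 16 parallel lookups; indices >= 64 return 0.
    let hits = vqtbl4q_u8(tbl, idx);

    let mut out = [0u8; 16];
    vst1q_u8(out.as_mut_ptr(), hits);
    out
}
```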
### Cascading Lookup Kernels (`lookup_kernel` module)

- `SimdCascadingTableU32U8Lookup`: Requires AVX512F + AVX512VL + AVX512BW + AVX512VBMI2
  - Uses `compress_store_u8x16`, `compress_store_u32x16`, and `gather_u32index_u8`
  - Provides a 40-50% speedup over scalar implementations on large tables
  - Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
  - Fallback: Scalar lookup (works on all architectures)
## CPU Generation Reference
- Skylake-X (2017): AVX512F, AVX512VL, AVX512BW ✅ | AVX512VBMI ❌ | AVX512VBMI2 ❌
- Ice Lake (2019): AVX512F, AVX512VL, AVX512BW, AVX512VBMI, AVX512VBMI2 ✅
- Tiger Lake (2020): AVX512F, AVX512VL, AVX512BW, AVX512VBMI, AVX512VBMI2 ✅
- Apple Silicon (M1/M2/M3): ARM NEON (TBL4) ✅ - no AVX-512 equivalent needed
## Checking CPU Features

You can check which features your CPU supports:

```bash
# Linux: list the AVX-512 flags the kernel reports
grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
```

Or use Rust's runtime feature detection:
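A minimal sketch using the standard library's runtime detection macros; the feature strings match the requirements listed above:

```rust
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // These macros and feature names are stable Rust.
        println!("avx512f:     {}", std::arch::is_x86_feature_detected!("avx512f"));
        println!("avx512vl:    {}", std::arch::is_x86_feature_detected!("avx512vl"));
        println!("avx512bw:    {}", std::arch::is_x86_feature_detected!("avx512bw"));
        println!("avx512vbmi:  {}", std::arch::is_x86_feature_detected!("avx512vbmi"));
        println!("avx512vbmi2: {}", std::arch::is_x86_feature_detected!("avx512vbmi2"));
    }
    #[cfg(target_arch = "aarch64")]
    {
        println!("neon: {}", std::arch::is_aarch64_feature_detected!("neon"));
    }
}
```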
All functions automatically detect available CPU features at runtime and use the best available implementation.
## SIMD Utilities (`wide_utils` module)

This crate provides a rich set of SIMD utilities built on top of the `wide` crate, with optimized implementations for x86_64 (AVX-512/AVX2) and aarch64 (NEON).
### Compress Operations (`simd_compress` module)
Stream compaction similar to AVX-512's VCOMPRESS instruction — pack selected elements contiguously based on a bitmask.
🚀 Highly optimized for ARM NEON — achieves up to 12 Gelem/s on Apple Silicon!
```rust
use simd_lookup::simd_compress::compress_store_u32x8;
use wide::u32x8;

// Compress u32x8: select elements where mask bits are set
let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // Select positions 1, 4, 5, 7
let mut output = [0u32; 8]; // Must have room for the full vector!
let count = compress_store_u32x8(data, mask, &mut output);
// count == 4, output[0..4] == [20, 50, 60, 80]

// Also available for u32x16 (512-bit) and u8x16
```
Note: Destination buffer must have room for the full uncompressed vector (8/16 elements). This enables fast direct NEON stores instead of variable-length copies.
| Function | AVX-512 | ARM NEON | Throughput (ARM) |
|---|---|---|---|
| `compress_store_u32x8` | `VPCOMPRESSD` | TBL2 + direct store | ~4.3 Gelem/s |
| `compress_store_u32x16` | `VPCOMPRESSD` | 2× NEON u32x8 | ~5.3 Gelem/s |
| `compress_store_u8x16` | `VPCOMPRESSB` | TBL + direct store | ~12 Gelem/s |
### Shuffle/Permute Operations
Variable-index shuffle using the same SIMD type for indices (zero-copy from lookup tables):
```rust
use simd_lookup::wide_utils::WideUtilsExt;
use wide::u32x8;

let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let indices = u32x8::from([7, 6, 5, 4, 3, 2, 1, 0]); // Reverse
let reversed = data.shuffle(indices);
// reversed == [80, 70, 60, 50, 40, 30, 20, 10]
```
| Type | AVX2 | NEON | Scalar |
|---|---|---|---|
| `u32x8` | `VPERMD` | TBL2 (byte-level) | Loop |
| `u32x4` | — | TBL (byte-level) | Loop |
| `u8x16` | `PSHUFB` | TBL | Loop |
### Vector Splitting (`SimdSplit` trait)
Efficiently extract high/low halves of wide vectors:
```rust
use simd_lookup::wide_utils::{SimdSplit, u32x16}; // u32x16: 512-bit type; import path assumed

let data = u32x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);
let (lo, hi) = data.split_low_high();
// lo: u32x8 = [1,2,3,4,5,6,7,8]
// hi: u32x8 = [9,10,11,12,13,14,15,16]

// Or extract just one half
let low_half = data.low_half();
let high_half = data.high_half();
```
| Type | AVX-512 | Fallback |
|---|---|---|
| u32x16 → u32x8 | `_mm512_extracti64x4_epi64` | Array slicing |
| u64x8 → u64x4 | `_mm512_extracti64x4_epi64` | Array slicing |
### Widening Operations
Zero-extend smaller types to larger types:
```rust
use simd_lookup::wide_utils::{WideUtilsExt, u64x8}; // u64x8: 512-bit type; import path assumed
use wide::u32x8;

let input = u32x8::from([1, 2, 3, 4, 5, 6, 7, 8]);
let widened: u64x8 = input.widen_to_u64x8();
// widened == [1u64, 2, 3, 4, 5, 6, 7, 8]
```
| Type | AVX-512 | AVX2 | NEON |
|---|---|---|---|
| u32x8 → u64x8 | `VPMOVZXDQ` | 2× `VPMOVZXDQ` | `VMOVL` |
| u32x4 → u64x4 | — | `VPMOVZXDQ` | `VMOVL` |
### Bitmask to Vector Conversion
Convert a scalar bitmask to a SIMD mask vector:
```rust
use simd_lookup::wide_utils::{FromBitmask, u64x8}; // u64x8: 512-bit type; import path assumed

let mask = 0b10101010u8;
let mask_vec: u64x8 = u64x8::from_bitmask(mask);
// mask_vec == [0, MAX, 0, MAX, 0, MAX, 0, MAX]
```
| Type | AVX-512 | ARM NEON | AVX2/Other |
|---|---|---|---|
| u64x8 | `VPBROADCASTQ` + mask | `VCEQ` + `VMOVL` chain | Loop |
| u32x8 | `VPBROADCASTD` + mask | `VCEQ` + `VMOVL` chain | Loop |
### Double (`double()` method on `WideUtilsExt`)

Efficiently double each element via `self + self`. Addition is well supported on all architectures (NEON `vaddq`, SSE `paddb`), making this the most efficient way to multiply by powers of 2:
```rust
use simd_lookup::wide_utils::WideUtilsExt;
use wide::u8x16;

let a = u8x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);

// x * 2
let doubled = a.double();
// doubled == [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]

// x * 8 (chain three doubles)
let times_8 = a.double().double().double();
// times_8 == [8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128]
```
This is more efficient than scalar multiplication for types like u8x16 where x86 lacks native byte multiply/shift instructions.
### Shuffle Index Tables
Pre-computed shuffle indices for compress operations (256 entries for 8-element masks):
```rust
use simd_lookup::simd_compress::{get_compress_indices_u32x8, SHUFFLE_COMPRESS_IDX_U32X8};

// Raw array access, indexed by the 8-bit mask
let indices: [u32; 8] = SHUFFLE_COMPRESS_IDX_U32X8[0b10110010];
// indices == [1, 4, 5, 7, 7, 7, 7, 7] (unused positions filled with 7)

// Zero-cost SIMD access via transmute
let simd_indices = get_compress_indices_u32x8(0b10110010);
```
## Other Modules

### small_table — Small Table SIMD Lookup
64-entry lookup table primarily optimized for ARM NEON TBL4 (excellent performance on Apple Silicon)
and also supports AVX-512 VPERMB on Intel Ice Lake+. Useful for fast pattern detection and small dictionary lookups.
### prefetch — SIMD Memory Prefetch
Cross-platform memory prefetch utilities including masked prefetch for 8 addresses at once. Supports L1/L2/L3 cache hints.
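As a rough illustration of the masked-prefetch idea (a sketch built on the standard `_mm_prefetch` intrinsic, not this crate's API), the helper below prefetches the cache lines for the active lanes of an upcoming gather:

```rust
/// Conceptual sketch: prefetch the cache lines for up to 8 table slots
/// before a gather, honoring a mask of active lanes.
#[cfg(target_arch = "x86_64")]
unsafe fn prefetch_masked_l1(table: &[u8], indices: &[u32; 8], mask: u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    for lane in 0..8 {
        if mask & (1 << lane) != 0 {
            let p = table.as_ptr().add(indices[lane] as usize);
            // L1 hint; _MM_HINT_T1 / _MM_HINT_T2 target L2 / L3 instead.
            _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
        }
    }
}
```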
### bulk_vec_extender — Efficient Vec Extension
Utilities for efficiently extending Vec with SIMD-produced results, minimizing bounds checks and reallocations.
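The general pattern is to reserve space once per SIMD vector, write the full vector into spare capacity, and then expose only the valid elements. The `extend_from_simd` helper below is a conceptual sketch of that pattern, not this crate's API:

```rust
/// Conceptual sketch: append `count` valid elements from a full
/// SIMD-width buffer without per-element bounds checks.
fn extend_from_simd(out: &mut Vec<u32>, lanes: &[u32; 8], count: usize) {
    debug_assert!(count <= 8);
    out.reserve(8); // one capacity check for the whole vector
    let len = out.len();
    unsafe {
        // Copy the full 8-lane buffer into spare capacity, then only
        // expose the `count` valid elements.
        std::ptr::copy_nonoverlapping(lanes.as_ptr(), out.as_mut_ptr().add(len), 8);
        out.set_len(len + count);
    }
}
```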
### entropy_map_lookup — Entropy-Optimized Lookups
Lookup structures optimized for low-entropy (few unique values) data, using bitpacking and small lookup tables.
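As a conceptual sketch of the bitpacking-plus-small-table idea (an assumed illustration, not this crate's types): values drawn from a tiny dictionary can be stored as 2-bit codes and decoded through a 4-entry table.

```rust
/// Conceptual sketch: 2-bit codes for data with at most 4 unique values,
/// decoded through a tiny lookup table.
struct PackedLowEntropy {
    codes: Vec<u64>,      // 32 two-bit codes per u64 word
    dictionary: [u32; 4], // the unique values
    len: usize,
}

impl PackedLowEntropy {
    fn get(&self, i: usize) -> u32 {
        debug_assert!(i < self.len);
        let word = self.codes[i / 32];
        let code = ((word >> ((i % 32) * 2)) & 0b11) as usize;
        self.dictionary[code] // small-table lookup; SIMD can decode many codes at once
    }
}
```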
### eight_value_lookup — 8-Value Fast Path
Specialized lookup for tables with ≤8 unique values, using SIMD comparison and bitmask extraction.
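Conceptually, each input lane is compared against every candidate value with one SIMD compare per candidate, and matching lanes take that candidate's associated value. The `eight_value_lookup` function below is an illustrative sketch using `wide` types, not this crate's API:

```rust
use wide::u32x8;

/// Conceptual sketch: map each lane of `keys` to the value associated with
/// the candidate it matches (candidates assumed distinct).
fn eight_value_lookup(keys: u32x8, candidates: &[u32; 8], values: &[u32; 8]) -> u32x8 {
    let mut out = [0u32; 8];
    for (slot, &cand) in candidates.iter().enumerate() {
        // One SIMD compare per candidate; matching lanes become all-ones.
        let hits = keys.cmp_eq(u32x8::splat(cand)).to_array();
        for lane in 0..8 {
            if hits[lane] != 0 {
                out[lane] = values[slot];
            }
        }
    }
    u32x8::from(out)
}
```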
## Performance Notes

### ARM NEON Compress Performance (Apple Silicon M1/M2/M3)
The NEON compress operations achieve exceptional throughput through optimized direct vector stores:
| Operation | Throughput | vs Scalar |
|---|---|---|
| `compress_store_u8x16` | ~12 Gelem/s | ~8× faster |
| `compress_store_u32x8` | ~4.3 Gelem/s | ~3-4× faster |
| `compress_store_u32x16` | ~5.3 Gelem/s | ~5-6× faster |
Key optimizations:

- Direct NEON stores: Uses `vst1q_u8` to write full vectors instead of variable-length copies
- Single TBL instruction: `compress_store_u8x16` uses one `vqtbl1q_u8` for the 16-byte shuffle
- Precomputed byte indices: Lookup tables eliminate runtime index computation
- No branches: Mask-dependent branching is eliminated entirely
API note: Destination buffers must have room for the full uncompressed vector (8/16 elements). This enables the fast path—the mask is unknown at compile time, so callers should always allocate worst-case.
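For example, when streaming compressed results into a `Vec`, one way to satisfy this contract is to keep eight elements of slack past the current length and truncate afterwards. The sketch below assumes the `compress_store_u32x8(data, mask, &mut out)` call shape shown earlier and is illustrative only:

```rust
use simd_lookup::simd_compress::compress_store_u32x8;
use wide::u32x8;

/// Illustrative helper: compress a stream of (vector, mask) pairs into one Vec,
/// always leaving room for a full 8-lane store before each call.
fn compress_stream(chunks: &[(u32x8, u8)]) -> Vec<u32> {
    let mut out = Vec::new();
    for &(data, mask) in chunks {
        let len = out.len();
        out.resize(len + 8, 0); // worst case: all 8 lanes survive
        let kept = compress_store_u32x8(data, mask, &mut out[len..]);
        out.truncate(len + kept); // drop the unused slack
    }
    out
}
```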
### General Performance Notes
- AVX-512: Native compress instructions (`VPCOMPRESSD`, `VPCOMPRESSB`) are ~3-5× faster than the shuffle-based fallback
- NEON u32 shuffle: Uses `TBL`/`TBL2` with byte-level indexing (converts u32 indices to byte offsets)
- Bitmask expansion: A parallel `vceq`/`vmovl` chain replaces the scalar loop
- Lookup tables:
  - u32x8 compress indices: 256×8×4 = 8KB (fits in L1 cache)
  - u32x8 byte indices for NEON: 256×32 = 8KB (fits in L1 cache)
  - u8x16 compress indices for NEON: 65536×16 = 1MB (may cause cache pressure on hot paths)
- SimdSplit: AVX-512 uses a single extract instruction; the fallback is a zero-cost transmute
## TODO list

- Build proper SIMD extensions for memory prefetch, masked VGATHER, etc. that are reusable in different places. For example, build traits on top of wide's SIMD types and implement them for different architectures.
- Refactor and get rid of all of the ugly AI-generated intrinsic code
- Good-looking SIMD bitvec core, no AI-generated intrinsics
- As we build the SIMD intrinsics and other lookup utilities, add plenty of RustDoc detailing the WHYs, performance, space/memory, and other tradeoffs.