# simd-lookup
High-performance SIMD utilities for fast table lookups, compression and data processing in Rust.
## Features

- Cross-platform SIMD: Automatic dispatch to the optimal implementation (AVX-512, AVX2, NEON)
- Zero-cost abstractions: Thin wrappers over platform intrinsics via the `wide` crate
- Comprehensive utilities: Compress, shuffle, widen, split, and bitmask operations
## CPU Feature Requirements
This crate automatically detects and uses the best available CPU features, with fallbacks for older CPUs. The crate is optimized for both ARM NEON (aarch64) and Intel AVX-512 (x86_64) architectures.
Note: Table64 is primarily optimized for ARM NEON using the TBL4 instruction, which provides
excellent performance on Apple Silicon and other ARMv8+ CPUs. On Intel x86_64, it requires newer AVX-512
features (Ice Lake+).
### Summary Table

| Module/Feature | Required CPU Features | Available CPUs | Fallback |
|---|---|---|---|
| `simd_compress` (`compress_store_u32x8`) | AVX512F + AVX512VL (x86), NEON TBL2 (ARM) | Skylake-X+, Ice Lake+, all ARM | NEON TBL on ARM, shuffle table elsewhere |
| `simd_compress` (`compress_store_u32x16`) | AVX512F | Skylake-X+, Ice Lake+ | Two u32x8 compresses |
| `simd_compress` (`compress_store_u8x16`) | AVX512VBMI2 + AVX512VL (x86), NEON TBL (ARM) | Ice Lake+, Tiger Lake+, all ARM | NEON TBL on ARM, gather-style writes elsewhere |
| `simd_gather` (`gather_u32index_u8`) | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar loop |
| `simd_gather` (`gather_u32index_u32`) | AVX512F | Skylake-X+, Ice Lake+ | Scalar loop |
| `Table64` | ARM NEON TBL4 (aarch64) or AVX512BW + AVX512VBMI (x86_64) | All ARMv8+ (Apple Silicon), Ice Lake+ | Scalar lookup (x86_64 only) |
| `Table2dU8xU8` | AVX512F + AVX512BW | Skylake-X+, Ice Lake+ | Scalar lookup |
| Cascading Lookup Kernel | AVX512F + AVX512VL + AVX512BW + AVX512VBMI2 | Ice Lake+, Tiger Lake+ | Scalar lookup |
## Detailed Requirements

### SIMD Compress Kernels (`simd_compress` module)
- `compress_store_u32x8`:
  - Intel x86_64: Requires AVX512F + AVX512VL, uses the `VPCOMPRESSD` instruction
  - ARM aarch64: Uses NEON TBL2 with precomputed byte-level shuffle indices
    - Eliminates 8 conditional branches from the scalar fallback
    - 256×32 byte lookup table for O(1) index computation
  - Available on: Intel Skylake-X+, all ARMv8+ (Apple Silicon M1/M2/M3)
  - Fallback: Shuffle-based table lookup (other architectures); a portable sketch of the shuffle-table approach follows this list

- `compress_store_u32x16`: Requires AVX512F
  - Uses the `VPCOMPRESSD` instruction (512-bit variant)
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Two `compress_store_u32x8` operations

- `compress_store_u8x16`:
  - Intel x86_64: Requires AVX512VBMI2 + AVX512VL, uses the `VPCOMPRESSB` instruction
  - ARM aarch64: Uses NEON TBL (`vqtbl1q_u8`) with precomputed shuffle indices
    - Eliminates 16 conditional branches from the scalar fallback
    - 1 MB lookup table (65536×16 bytes) for O(1) index computation
    - A single TBL instruction performs the entire 16-byte shuffle
  - Available on: Intel Ice Lake+, all ARMv8+ (Apple Silicon M1/M2/M3)
  - Fallback: Gather-style direct writes (other architectures)
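To make the shuffle-table fallback concrete, here is a minimal portable sketch of the technique described above: each possible 8-bit mask indexes a precomputed entry listing the positions of its set bits, so compaction becomes one table lookup plus a fixed-length permute instead of eight data-dependent branches. The `build_compress_table` and `compress8` helpers below are illustrative only and are not the crate's internals.

```rust
/// Illustrative only: build the 256-entry index table described above.
/// Entry `m` lists the positions of the set bits of `m`; unused slots
/// are padded with a fixed lane index.
fn build_compress_table() -> [[u8; 8]; 256] {
    let mut table = [[7u8; 8]; 256];
    for mask in 0..256usize {
        let mut out = 0;
        for bit in 0..8u8 {
            if mask & (1 << bit) != 0 {
                table[mask][out] = bit;
                out += 1;
            }
        }
    }
    table
}

/// Branch-free in spirit: one table lookup plus a fixed-length permute,
/// instead of eight data-dependent branches.
fn compress8(table: &[[u8; 8]; 256], data: [u32; 8], mask: u8) -> ([u32; 8], usize) {
    let idx = &table[mask as usize];
    let mut out = [0u32; 8];
    for lane in 0..8 {
        out[lane] = data[idx[lane] as usize]; // on NEON this loop is a single TBL2 permute
    }
    (out, mask.count_ones() as usize)
}
```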
### SIMD Gather Operations (`simd_gather` module)

- `gather_u32index_u8`: Requires AVX512F + AVX512BW
  - Uses the `VGATHERDPS` + `VPMOVDB` instructions
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Scalar loop (equivalent to the sketch after this list)

- `gather_u32index_u32`: Requires AVX512F
  - Uses the `VGATHERDPS` instruction
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Scalar loop
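For reference, the scalar fallback of a u32-indexed u8 gather is just one indexed load per lane; the AVX-512 path performs the same loads in parallel. The `gather_u8_scalar` helper below is a hypothetical illustration of what `gather_u32index_u8` computes, not the crate's actual signature.

```rust
/// Hypothetical scalar equivalent of a u32-indexed u8 gather:
/// out[lane] = table[indices[lane]] for every lane.
fn gather_u8_scalar(table: &[u8], indices: &[u32; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (lane, &idx) in indices.iter().enumerate() {
        // AVX-512 replaces this loop with VGATHERDPS + VPMOVDB.
        out[lane] = table[idx as usize];
    }
    out
}
```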
### Small Table Lookups (`small_table` module)

- `Table64`: Highly optimized for ARM NEON (the primary optimization target)
  - ARM aarch64 (Apple Silicon, etc.): Uses the ARM NEON `TBL4` instruction (`vqtbl4q_u8`)
    - Native hardware support on all ARMv8+ CPUs (including Apple M1/M2/M3)
    - Extremely efficient single-instruction 64-byte table lookup (see the intrinsics sketch after this list)
    - No fallback needed - full SIMD acceleration on ARM
  - Intel x86_64: Requires AVX512BW + AVX512VBMI
    - Uses the `VPERMB` instruction (`_mm512_permutexvar_epi8`) for 64-byte table lookups
    - Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
    - Fallback: Scalar lookup (works on all x86_64 CPUs)

- `Table2dU8xU8`: Requires AVX512F + AVX512BW (via `simd_gather`)
  - Uses `VGATHERDPS` + `VPMOVDB` for parallel lookups
  - Available on: Intel Skylake-X (Xeon), Ice Lake, Tiger Lake, and later
  - Fallback: Scalar lookup
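To show why a 64-byte table maps so naturally onto TBL4, the sketch below drives the raw `core::arch::aarch64` intrinsics directly; it illustrates the instruction itself and is not `Table64`'s implementation. The table is loaded into four 16-byte registers and a single `vqtbl4q_u8` call resolves 16 lookups at once.

```rust
#[cfg(target_arch = "aarch64")]
unsafe fn tbl4_lookup_16(table: &[u8; 64], indices: &[u8; 16]) -> [u8; 16] {
    use core::arch::aarch64::*;

    // Load the 64-byte table into the four 16-byte registers TBL4 expects.
    let tbl = uint8x16x4_t(
        vld1q_u8(table.as_ptr()),
        vld1q_u8(table.as_ptr().add(16)),
        vld1q_u8(table.as_ptr().add(32)),
        vld1q_u8(table.as_ptr().add(48)),
    );
    let idx = vld1q_u8(indices.as_ptr());

    // One TBL4 instruction: 16 parallel lookups; indices >= 64 return 0.
    let hits = vqtbl4q_u8(tbl, idx);

    let mut out = [0u8; 16];
    vst1q_u8(out.as_mut_ptr(), hits);
    out
}
```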
### Cascading Lookup Kernels (`lookup_kernel` module)

- `SimdCascadingTableU32U8Lookup`: Requires AVX512F + AVX512VL + AVX512BW + AVX512VBMI2
  - Uses `compress_store_u8x16`, `compress_store_u32x16`, and `gather_u32index_u8`
  - Provides a 40-50% speedup over scalar implementations on large tables
  - Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
  - Fallback: Scalar lookup (works on all architectures)
## CPU Generation Reference
- Skylake-X (2017): AVX512F, AVX512VL, AVX512BW ✅ | AVX512VBMI ❌ | AVX512VBMI2 ❌
- Ice Lake (2019): AVX512F, AVX512VL, AVX512BW, AVX512VBMI, AVX512VBMI2 ✅
- Tiger Lake (2020): AVX512F, AVX512VL, AVX512BW, AVX512VBMI, AVX512VBMI2 ✅
- Apple Silicon (M1/M2/M3): ARM NEON (TBL4) ✅ - no AVX-512 equivalent needed
## Checking CPU Features

You can check which features your CPU supports:

```bash
# Linux: list the AVX-512 flags the kernel reports
grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
```

Or use Rust's runtime feature detection:
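A minimal sketch using the standard library's runtime detection macros; the feature strings match the requirements listed above:

```rust
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // These macros and feature names are stable Rust.
        println!("avx512f:     {}", std::arch::is_x86_feature_detected!("avx512f"));
        println!("avx512vl:    {}", std::arch::is_x86_feature_detected!("avx512vl"));
        println!("avx512bw:    {}", std::arch::is_x86_feature_detected!("avx512bw"));
        println!("avx512vbmi:  {}", std::arch::is_x86_feature_detected!("avx512vbmi"));
        println!("avx512vbmi2: {}", std::arch::is_x86_feature_detected!("avx512vbmi2"));
    }
    #[cfg(target_arch = "aarch64")]
    {
        println!("neon: {}", std::arch::is_aarch64_feature_detected!("neon"));
    }
}
```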
All functions automatically detect available CPU features at runtime and use the best available implementation.
## SIMD Utilities (`wide_utils` module)

This crate provides a rich set of SIMD utilities built on top of the `wide` crate, with optimized implementations for x86_64 (AVX-512/AVX2) and aarch64 (NEON).
### Compress Operations (`simd_compress` module)
Stream compaction similar to AVX-512's VCOMPRESS instruction — pack selected elements contiguously based on a bitmask.
🚀 Highly optimized for ARM NEON — achieves up to 12 Gelem/s on Apple Silicon!
```rust
use simd_lookup::simd_compress::compress_store_u32x8;
use wide::u32x8;

// Compress u32x8: select elements where mask bits are set
let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let mask = 0b10110010u8; // Select positions 1, 4, 5, 7
let mut output = [0u32; 8]; // Must have room for the full vector!
let count = compress_store_u32x8(data, mask, &mut output);
// count == 4, output[0..4] == [20, 50, 60, 80]

// Also available for u32x16 (512-bit) and u8x16
```
Note: Destination buffer must have room for the full uncompressed vector (8/16 elements). This enables fast direct NEON stores instead of variable-length copies.
| Function | AVX-512 | ARM NEON | Throughput (ARM) |
|---|---|---|---|
| `compress_store_u32x8` | `VPCOMPRESSD` | TBL2 + direct store | ~4.3 Gelem/s |
| `compress_store_u32x16` | `VPCOMPRESSD` | 2× NEON u32x8 | ~5.3 Gelem/s |
| `compress_store_u8x16` | `VPCOMPRESSB` | TBL + direct store | ~12 Gelem/s |
### Shuffle/Permute Operations
Variable-index shuffle using the same SIMD type for indices (zero-copy from lookup tables):
```rust
use simd_lookup::wide_utils::WideUtilsExt;
use wide::u32x8;

let data = u32x8::from([10, 20, 30, 40, 50, 60, 70, 80]);
let indices = u32x8::from([7, 6, 5, 4, 3, 2, 1, 0]); // Reverse
let reversed = data.shuffle(indices);
// reversed == [80, 70, 60, 50, 40, 30, 20, 10]
```
| Type | AVX2 | NEON | Scalar |
|---|---|---|---|
| `u32x8` | `VPERMD` | TBL2 (byte-level) | Loop |
| `u32x4` | — | TBL (byte-level) | Loop |
| `u8x16` | `PSHUFB` | TBL | Loop |
### Vector Splitting (`SimdSplit` trait)
Efficiently extract high/low halves of wide vectors:
```rust
use simd_lookup::wide_utils::{SimdSplit, u32x16}; // u32x16: 512-bit type; import path assumed

let data = u32x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);
let (lo, hi) = data.split_low_high();
// lo: u32x8 = [1,2,3,4,5,6,7,8]
// hi: u32x8 = [9,10,11,12,13,14,15,16]

// Or extract just one half
let low_half = data.low_half();
let high_half = data.high_half();
```
| Type | AVX-512 | Fallback |
|---|---|---|
| u32x16 → u32x8 | `_mm512_extracti64x4_epi64` | Array slicing |
| u64x8 → u64x4 | `_mm512_extracti64x4_epi64` | Array slicing |
### Widening Operations
Zero-extend smaller types to larger types:
```rust
use simd_lookup::wide_utils::{WideUtilsExt, u64x8}; // u64x8: 512-bit type; import path assumed
use wide::u32x8;

let input = u32x8::from([1, 2, 3, 4, 5, 6, 7, 8]);
let widened: u64x8 = input.widen_to_u64x8();
// widened == [1u64, 2, 3, 4, 5, 6, 7, 8]
```
| Type | AVX-512 | AVX2 | NEON |
|---|---|---|---|
| u32x8 → u64x8 | `VPMOVZXDQ` | 2× `VPMOVZXDQ` | `VMOVL` |
| u32x4 → u64x4 | — | `VPMOVZXDQ` | `VMOVL` |
### Bitmask to Vector Conversion
Convert a scalar bitmask to a SIMD mask vector:
```rust
use simd_lookup::wide_utils::{FromBitmask, u64x8}; // u64x8: 512-bit type; import path assumed

let mask = 0b10101010u8;
let mask_vec: u64x8 = u64x8::from_bitmask(mask);
// mask_vec == [0, MAX, 0, MAX, 0, MAX, 0, MAX]
```
| Type | AVX-512 | ARM NEON | AVX2/Other |
|---|---|---|---|
| u64x8 | `VPBROADCASTQ` + mask | `VCEQ` + `VMOVL` chain | Loop |
| u32x8 | `VPBROADCASTD` + mask | `VCEQ` + `VMOVL` chain | Loop |
### Double (`double()` method on `WideUtilsExt`)

Efficiently double each element via `self + self`. Addition is well supported on all architectures (NEON `vaddq`, SSE `paddb`), making this the most efficient way to multiply by powers of 2:
```rust
use simd_lookup::wide_utils::WideUtilsExt;
use wide::u8x16;

let a = u8x16::from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);

// x * 2
let doubled = a.double();
// doubled == [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]

// x * 8 (chain three doubles)
let times_8 = a.double().double().double();
// times_8 == [8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128]
```
This is more efficient than scalar multiplication for types like u8x16 where x86 lacks native byte multiply/shift instructions.
### Shuffle Index Tables
Pre-computed shuffle indices for compress operations (256 entries for 8-element masks):
```rust
use simd_lookup::simd_compress::{get_compress_indices_u32x8, SHUFFLE_COMPRESS_IDX_U32X8};

// Raw array access, indexed by the 8-bit mask
let indices: [u32; 8] = SHUFFLE_COMPRESS_IDX_U32X8[0b10110010];
// indices == [1, 4, 5, 7, 7, 7, 7, 7] (unused positions filled with 7)

// Zero-cost SIMD access via transmute
let simd_indices = get_compress_indices_u32x8(0b10110010);
```
## Other Modules

### small_table — Small Table SIMD Lookup
64-entry lookup table primarily optimized for ARM NEON TBL4 (excellent performance on Apple Silicon)
and also supports AVX-512 VPERMB on Intel Ice Lake+. Useful for fast pattern detection and small dictionary lookups.
### prefetch — SIMD Memory Prefetch
Cross-platform memory prefetch utilities including masked prefetch for 8 addresses at once. Supports L1/L2/L3 cache hints.
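As a rough illustration of the masked-prefetch idea (a sketch built on the standard `_mm_prefetch` intrinsic, not this crate's API), the helper below prefetches the cache lines for the active lanes of an upcoming gather:

```rust
/// Conceptual sketch: prefetch the cache lines for up to 8 table slots
/// before a gather, honoring a mask of active lanes.
#[cfg(target_arch = "x86_64")]
unsafe fn prefetch_masked_l1(table: &[u8], indices: &[u32; 8], mask: u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    for lane in 0..8 {
        if mask & (1 << lane) != 0 {
            let p = table.as_ptr().add(indices[lane] as usize);
            // L1 hint; _MM_HINT_T1 / _MM_HINT_T2 target L2 / L3 instead.
            _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
        }
    }
}
```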
### bulk_vec_extender — Efficient Vec Extension
Utilities for efficiently extending Vec with SIMD-produced results, minimizing bounds checks and reallocations.
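The general pattern is to reserve space once per SIMD vector, write the full vector into spare capacity, and then expose only the valid elements. The `extend_from_simd` helper below is a conceptual sketch of that pattern, not this crate's API:

```rust
/// Conceptual sketch: append `count` valid elements from a full
/// SIMD-width buffer without per-element bounds checks.
fn extend_from_simd(out: &mut Vec<u32>, lanes: &[u32; 8], count: usize) {
    debug_assert!(count <= 8);
    out.reserve(8); // one capacity check for the whole vector
    let len = out.len();
    unsafe {
        // Copy the full 8-lane buffer into spare capacity, then only
        // expose the `count` valid elements.
        std::ptr::copy_nonoverlapping(lanes.as_ptr(), out.as_mut_ptr().add(len), 8);
        out.set_len(len + count);
    }
}
```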
### entropy_map_lookup — Entropy-Optimized Lookups
Lookup structures optimized for low-entropy (few unique values) data, using bitpacking and small lookup tables.
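As a conceptual sketch of the bitpacking-plus-small-table idea (an assumed illustration, not this crate's types): values drawn from a tiny dictionary can be stored as 2-bit codes and decoded through a 4-entry table.

```rust
/// Conceptual sketch: 2-bit codes for data with at most 4 unique values,
/// decoded through a tiny lookup table.
struct PackedLowEntropy {
    codes: Vec<u64>,      // 32 two-bit codes per u64 word
    dictionary: [u32; 4], // the unique values
    len: usize,
}

impl PackedLowEntropy {
    fn get(&self, i: usize) -> u32 {
        debug_assert!(i < self.len);
        let word = self.codes[i / 32];
        let code = ((word >> ((i % 32) * 2)) & 0b11) as usize;
        self.dictionary[code] // small-table lookup; SIMD can decode many codes at once
    }
}
```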
### eight_value_lookup — 8-Value Fast Path
Specialized lookup for tables with ≤8 unique values, using SIMD comparison and bitmask extraction.
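Conceptually, each input lane is compared against every candidate value with one SIMD compare per candidate, and matching lanes take that candidate's associated value. The `eight_value_lookup` function below is an illustrative sketch using `wide` types, not this crate's API:

```rust
use wide::u32x8;

/// Conceptual sketch: map each lane of `keys` to the value associated with
/// the candidate it matches (candidates assumed distinct).
fn eight_value_lookup(keys: u32x8, candidates: &[u32; 8], values: &[u32; 8]) -> u32x8 {
    let mut out = [0u32; 8];
    for (slot, &cand) in candidates.iter().enumerate() {
        // One SIMD compare per candidate; matching lanes become all-ones.
        let hits = keys.cmp_eq(u32x8::splat(cand)).to_array();
        for lane in 0..8 {
            if hits[lane] != 0 {
                out[lane] = values[slot];
            }
        }
    }
    u32x8::from(out)
}
```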
## Performance Notes

### ARM NEON Compress Performance (Apple Silicon M1/M2/M3)
The NEON compress operations achieve exceptional throughput through optimized direct vector stores:
| Operation | Throughput | vs Scalar |
|---|---|---|
| `compress_store_u8x16` | ~12 Gelem/s | ~8× faster |
| `compress_store_u32x8` | ~4.3 Gelem/s | ~3-4× faster |
| `compress_store_u32x16` | ~5.3 Gelem/s | ~5-6× faster |
Key optimizations:

- Direct NEON stores: Uses `vst1q_u8` to write full vectors instead of variable-length copies
- Single TBL instruction: `compress_store_u8x16` uses one `vqtbl1q_u8` for the 16-byte shuffle
- Precomputed byte indices: Lookup tables eliminate runtime index computation
- No branches: Mask-dependent branching is eliminated entirely
API note: Destination buffers must have room for the full uncompressed vector (8/16 elements). This enables the fast path—the mask is unknown at compile time, so callers should always allocate worst-case.
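For example, when streaming compressed results into a `Vec`, one way to satisfy this contract is to keep eight elements of slack past the current length and truncate afterwards. The sketch below assumes the `compress_store_u32x8(data, mask, &mut out)` call shape shown earlier and is illustrative only:

```rust
use simd_lookup::simd_compress::compress_store_u32x8;
use wide::u32x8;

/// Illustrative helper: compress a stream of (vector, mask) pairs into one Vec,
/// always leaving room for a full 8-lane store before each call.
fn compress_stream(chunks: &[(u32x8, u8)]) -> Vec<u32> {
    let mut out = Vec::new();
    for &(data, mask) in chunks {
        let len = out.len();
        out.resize(len + 8, 0); // worst case: all 8 lanes survive
        let kept = compress_store_u32x8(data, mask, &mut out[len..]);
        out.truncate(len + kept); // drop the unused slack
    }
    out
}
```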
### General Performance Notes
- AVX-512: Native compress instructions (`VPCOMPRESSD`, `VPCOMPRESSB`) are ~3-5× faster than the shuffle-based fallback
- NEON u32 shuffle: Uses `TBL`/`TBL2` with byte-level indexing (converts u32 indices to byte offsets)
- Bitmask expansion: A parallel `vceq`/`vmovl` chain replaces the scalar loop
- Lookup tables:
  - u32x8 compress indices: 256×8×4 = 8KB (fits in L1 cache)
  - u32x8 byte indices for NEON: 256×32 = 8KB (fits in L1 cache)
  - u8x16 compress indices for NEON: 65536×16 = 1MB (may cause cache pressure on hot paths)
- SimdSplit: AVX-512 uses a single extract instruction; the fallback is a zero-cost transmute
## TODO list

- Build proper SIMD extensions for memory prefetch, masked VGATHER, etc. that are reusable in different places. For example, build traits on top of wide's SIMD types and implement them for different architectures.
- Refactor and get rid of all of the ugly AI-generated intrinsic code
- Good-looking SIMD bitvec core, no AI-generated intrinsics
- As we build the SIMD intrinsics and other lookup utilities, add plenty of RustDoc detailing the WHYs, performance, space/memory, and other tradeoffs.