Arrow-style “lookup kernels”, similar to the `arrow-select::take::take` kernel. The “columnar style” kernels perform table lookups against one table first and then cascade to another table; the cascading kernel uses SIMD extensively. The other kernels use scalar lookups, which are often as fast as a SIMD GATHER, but all of them allow SIMD functions to operate on the looked-up values.
Anecdotally, this is not significantly faster than SIMD gather.
§SIMD-based lookup kernels (“Arrow” style)
These kernels leverage constant-sized array reads from a slice, with SIMD operations on the looked-up values; this is especially fast for multiple table lookups and combinations thereof. It turns out they do not significantly improve on a series of scalar lookups.
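The scalar-lookup pattern described above can be sketched as follows. The function name `lookup_u32_u8` and its shape are illustrative, not this crate's actual API:

```rust
/// Look up each u32 key in `table`, writing one u8 result per key.
/// The inner loop is plain scalar code; the surrounding arithmetic on
/// the looked-up values is where SIMD functions can be applied.
fn lookup_u32_u8(keys: &[u32], table: &[u8], out: &mut [u8]) {
    assert_eq!(keys.len(), out.len());
    for (o, &k) in out.iter_mut().zip(keys) {
        // Constant-sized table read from a slice.
        *o = table[k as usize];
    }
}

fn main() {
    // Hypothetical table: maps i -> (i * 2) mod 256.
    let table: Vec<u8> = (0..256).map(|i| (i * 2 % 256) as u8).collect();
    let keys = [1u32, 3, 255, 0];
    let mut out = [0u8; 4];
    lookup_u32_u8(&keys, &table, &mut out);
    assert_eq!(out, [2, 6, 254, 0]);
}
```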
§CPU Feature Requirements
§Intel x86_64 (AVX-512)
§Cascading Lookup Kernel (SimdCascadingTableU32U8Lookup)
The cascading kernel is heavily optimized for AVX-512 and provides a significant performance improvement (40-50% speedup) over scalar implementations on large tables.
Required CPU features:
- AVX512F + AVX512VL + AVX512VBMI2 (for `compress_store_u8x16`)
- AVX512F (for `compress_store_u32x16`)
- AVX512F + AVX512BW (for `gather_u32index_u8` via the `simd_gather` module)
Summary: Requires AVX512F, AVX512VL, AVX512BW, and AVX512VBMI2
Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
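As a sketch, the feature set listed above can be verified at runtime with std's `is_x86_feature_detected!` macro; the helper name here is hypothetical and not part of this crate:

```rust
/// Returns true when all AVX-512 features required by the cascading
/// kernel (per the list above) are available at runtime.
/// Hypothetical helper for illustration.
fn cascading_kernel_supported() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f")
            && is_x86_feature_detected!("avx512vl")
            && is_x86_feature_detected!("avx512bw")
            && is_x86_feature_detected!("avx512vbmi2")
        {
            return true;
        }
    }
    // Non-x86_64 targets (e.g. aarch64) use the NEON paths instead.
    false
}

fn main() {
    if cascading_kernel_supported() {
        println!("using AVX-512 cascading kernel");
    } else {
        println!("falling back to scalar/NEON kernels");
    }
}
```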
§ARM aarch64 (Apple Silicon M1/M2/M3)
On ARM processors, the compress operations (`compress_store_u8x16`, `compress_store_u32x8`, etc.) use NEON-optimized implementations that provide significant speedups over the scalar fallbacks:
- `compress_store_u8x16`: Uses the NEON `TBL` instruction with precomputed shuffle indices
  - Eliminates 16 conditional branches from the scalar fallback
  - A single `vqtbl1q_u8` instruction performs the full 16-byte shuffle
- `compress_store_u32x8`: Uses the NEON `TBL2` instruction with byte-level indices
  - A precomputed 256×32 byte index table avoids runtime index conversion
  - Uses `vqtbl2q_u8` for an efficient 32-byte permutation
- Bitmask expansion: Uses NEON parallel bit operations (`vceq`, `vmovl`)
  - Converts an 8-bit mask to a vector mask without scalar loops
These optimizations are automatically enabled on ARM and require no user configuration.
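The precomputed-shuffle-table idea behind the compress operations can be illustrated with a portable scalar sketch. This is shown for 8 lanes to keep the table small, and the names are illustrative; the NEON version feeds such indices to `vqtbl1q_u8` instead of looping:

```rust
/// For each possible 8-bit mask, the indices of the set lanes, packed
/// to the front. This is the precomputed table that replaces per-lane
/// conditional branches.
fn build_shuffle_table() -> [[u8; 8]; 256] {
    let mut table = [[0u8; 8]; 256];
    for mask in 0..256usize {
        let mut n = 0;
        for lane in 0..8 {
            if mask & (1 << lane) != 0 {
                table[mask][n] = lane as u8;
                n += 1;
            }
        }
    }
    table
}

/// Compress-store: write the lanes of `src` selected by `mask`
/// contiguously into `dst`. Returns the number of lanes written.
fn compress_store_u8x8(
    src: &[u8; 8],
    mask: u8,
    dst: &mut [u8],
    table: &[[u8; 8]; 256],
) -> usize {
    let n = mask.count_ones() as usize;
    let idx = &table[mask as usize];
    // One table-driven shuffle instead of 8 branches.
    for i in 0..n {
        dst[i] = src[idx[i] as usize];
    }
    n
}

fn main() {
    let table = build_shuffle_table();
    let src = [10u8, 11, 12, 13, 14, 15, 16, 17];
    let mut dst = [0u8; 8];
    // Select lanes 0, 2, and 5.
    let n = compress_store_u8x8(&src, 0b0010_0101, &mut dst, &table);
    assert_eq!(&dst[..n], &[10, 12, 15]);
}
```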
§Fallback Behavior
All kernels fall back to scalar/shuffle implementations when architecture-specific features are not available. The fallback works on all architectures.
§Other Kernels
- `SimdSingleTableU32U8Lookup`: Uses scalar lookups (works on all architectures)
- `SimdDualTableU32U8Lookup`: Uses scalar lookups (works on all architectures)
- `PipelinedSingleTableU32U8Lookup`: Uses scalar lookups with software prefetching
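The software-prefetching idea behind a pipelined kernel can be sketched as follows. The structure and the lookahead distance are illustrative assumptions, not this crate's code: while processing element `i`, issue a prefetch for the table slot that element `i + DIST` will need, hiding memory latency.

```rust
// Lookahead distance; a tuning assumption for this sketch.
const DIST: usize = 8;

#[inline]
fn prefetch(table: &[u8], index: usize) {
    // On x86_64, hint the table slot into cache; elsewhere this is a
    // no-op so the sketch stays portable.
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(table.as_ptr().add(index) as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let _ = (table, index);
    }
}

fn pipelined_lookup(keys: &[u32], table: &[u8], out: &mut [u8]) {
    for i in 0..keys.len() {
        // Prefetch the slot a future iteration will read.
        if i + DIST < keys.len() {
            prefetch(table, keys[i + DIST] as usize);
        }
        out[i] = table[keys[i] as usize];
    }
}

fn main() {
    // Identity table: i -> i.
    let table: Vec<u8> = (0u8..=255).collect();
    let keys = [5u32, 250, 17];
    let mut out = [0u8; 3];
    pipelined_lookup(&keys, &table, &mut out);
    assert_eq!(out, [5, 250, 17]);
}
```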
Structs§
- `PipelinedSingleTableU32U8Lookup`: Pipelined single-table lookup kernel - a u32-to-u8 lookup table kernel with prefetch pipelining.
- `SimdCascadingTableU32U8Lookup`: SIMD “Cascading” 2nd/3rd table lookup kernel.
- `SimdDualTableU32U8Lookup`: Dual lookup table kernel - a u32-to-u8 lookup table kernel with a custom SIMD function for combining the results. The second table is only consulted when the first lookup returns a non-zero value. The lookups themselves are scalar; apart from the SIMD combining function, no SIMD is used. The code is very simple, making this kernel a good fit for non-AVX architectures (like Apple M*) where faster SIMD instructions such as VCOMPRESS are not available. On Intel AVX-512 architectures, however, the [CascadingTableU32U8Lookup] kernel is faster.
- `SimdDualTableU32U8LookupV2`: Dual table lookup kernel - a u32-to-u8 lookup table kernel with a custom SIMD function for combining the results. It reduces cache thrashing by internally using the single-table kernel to write first-table results to a local temporary buffer, then looking up the second table only for entries where the first table returned a non-zero result, minimizing reads from the second table.
- `SimdDualTableWithHashLookup`: Dual table lookup kernel using an FxHashMap for table2.
- `SimdSingleTableU32U8Lookup`: Single-table lookup kernel with a SIMD function - a u32-to-u8 lookup table kernel. The user is responsible for generating the lookup table, so this can serve different use cases, including CASE..WHEN and bitmasking/filtering.
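The conditional dual-table pattern that the dual-table kernels describe can be sketched in scalar form. The function name and signature here are illustrative, not this crate's API: table2 is consulted only when table1 yields a non-zero value, and a user-supplied function combines the two results.

```rust
/// Look up table1 for each key; only on a non-zero hit, also look up
/// table2 and combine the two values. Hypothetical helper for illustration.
fn dual_lookup<F>(keys: &[u32], table1: &[u8], table2: &[u8], combine: F, out: &mut [u8])
where
    F: Fn(u8, u8) -> u8,
{
    for (o, &k) in out.iter_mut().zip(keys) {
        let v1 = table1[k as usize];
        *o = if v1 != 0 {
            // Only non-zero first-table hits pay for the second lookup.
            combine(v1, table2[k as usize])
        } else {
            0
        };
    }
}

fn main() {
    let table1 = vec![0u8, 1, 2, 0];
    let table2 = vec![9u8, 10, 20, 30];
    let keys = [0u32, 1, 2, 3];
    let mut out = [0u8; 4];
    dual_lookup(&keys, &table1, &table2, |a, b| a.wrapping_add(b), &mut out);
    // Keys 0 and 3 hit zero in table1, so table2 is never read for them.
    assert_eq!(out, [0, 11, 22, 0]);
}
```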