Arrow-style “lookup kernels”, similar to the `arrow-select::take::take` kernel. The “columnar style” kernels perform table lookups against one table first and then cascade to another table; the cascading kernel uses SIMD extensively. The other kernels use scalar lookups, which are often as fast as a SIMD GATHER, but all of them allow SIMD functions to operate on the looked-up values.
Anecdotally, this is not significantly faster than SIMD gather.
§SIMD-based lookup kernels (“Arrow” style)
These kernels leverage constant-sized array reads from a slice, with SIMD operations on the looked-up values; this is especially fast for multiple table lookups and combinations thereof. It turns out they do not significantly improve on a series of scalar lookups.
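The scalar-lookup pattern described above can be sketched as follows. The function name `lookup_u32_u8` and its shape are illustrative, not this crate's actual API:

```rust
/// Look up each u32 key in `table`, writing one u8 result per key.
/// The inner loop is plain scalar code; the surrounding arithmetic on
/// the looked-up values is where SIMD functions can be applied.
fn lookup_u32_u8(keys: &[u32], table: &[u8], out: &mut [u8]) {
    assert_eq!(keys.len(), out.len());
    for (o, &k) in out.iter_mut().zip(keys) {
        // Constant-sized table read from a slice.
        *o = table[k as usize];
    }
}

fn main() {
    // Hypothetical table: maps i -> (i * 2) mod 256.
    let table: Vec<u8> = (0..256).map(|i| (i * 2 % 256) as u8).collect();
    let keys = [1u32, 3, 255, 0];
    let mut out = [0u8; 4];
    lookup_u32_u8(&keys, &table, &mut out);
    assert_eq!(out, [2, 6, 254, 0]);
}
```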
§CPU Feature Requirements
§Intel x86_64 (AVX-512)
§Cascading Lookup Kernel (SimdCascadingTableU32U8Lookup)
The cascading kernel is heavily optimized for AVX-512 and provides a significant performance improvement (40-50% speedup) over scalar implementations on large tables.
Required CPU features:
- AVX512F + AVX512VL + AVX512VBMI2 (for `compress_store_u8x16`)
- AVX512F (for `compress_store_u32x16`)
- AVX512F + AVX512BW (for `gather_u32index_u8` via the `simd_gather` module)
Summary: Requires AVX512F, AVX512VL, AVX512BW, and AVX512VBMI2
Available on: Intel Ice Lake, Tiger Lake, and later (not available on Skylake-X)
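As a sketch, the feature set listed above can be verified at runtime with std's `is_x86_feature_detected!` macro; the helper name here is hypothetical and not part of this crate:

```rust
/// Returns true when all AVX-512 features required by the cascading
/// kernel (per the list above) are available at runtime.
/// Hypothetical helper for illustration.
fn cascading_kernel_supported() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f")
            && is_x86_feature_detected!("avx512vl")
            && is_x86_feature_detected!("avx512bw")
            && is_x86_feature_detected!("avx512vbmi2")
        {
            return true;
        }
    }
    // Non-x86_64 targets (e.g. aarch64) use the NEON paths instead.
    false
}

fn main() {
    if cascading_kernel_supported() {
        println!("using AVX-512 cascading kernel");
    } else {
        println!("falling back to scalar/NEON kernels");
    }
}
```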
§ARM aarch64 (Apple Silicon M1/M2/M3)
On ARM processors, the compress operations (`compress_store_u8x16`, `compress_store_u32x8`, etc.) use NEON-optimized implementations that provide significant speedups over the scalar fallbacks:
- `compress_store_u8x16`: Uses the NEON `TBL` instruction with precomputed shuffle indices
  - Eliminates 16 conditional branches from the scalar fallback
  - A single `vqtbl1q_u8` instruction performs the full 16-byte shuffle
- `compress_store_u32x8`: Uses the NEON `TBL2` instruction with byte-level indices
  - A precomputed 256×32 byte index table avoids runtime index conversion
  - Uses `vqtbl2q_u8` for an efficient 32-byte permutation
- Bitmask expansion: Uses NEON parallel bit operations (`vceq`, `vmovl`)
  - Converts an 8-bit mask to a vector mask without scalar loops
These optimizations are automatically enabled on ARM and require no user configuration.
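The precomputed-shuffle-table idea behind the compress operations can be illustrated with a portable scalar sketch. This is shown for 8 lanes to keep the table small, and the names are illustrative; the NEON version feeds such indices to `vqtbl1q_u8` instead of looping:

```rust
/// For each possible 8-bit mask, the indices of the set lanes, packed
/// to the front. This is the precomputed table that replaces per-lane
/// conditional branches.
fn build_shuffle_table() -> [[u8; 8]; 256] {
    let mut table = [[0u8; 8]; 256];
    for mask in 0..256usize {
        let mut n = 0;
        for lane in 0..8 {
            if mask & (1 << lane) != 0 {
                table[mask][n] = lane as u8;
                n += 1;
            }
        }
    }
    table
}

/// Compress-store: write the lanes of `src` selected by `mask`
/// contiguously into `dst`. Returns the number of lanes written.
fn compress_store_u8x8(
    src: &[u8; 8],
    mask: u8,
    dst: &mut [u8],
    table: &[[u8; 8]; 256],
) -> usize {
    let n = mask.count_ones() as usize;
    let idx = &table[mask as usize];
    // One table-driven shuffle instead of 8 branches.
    for i in 0..n {
        dst[i] = src[idx[i] as usize];
    }
    n
}

fn main() {
    let table = build_shuffle_table();
    let src = [10u8, 11, 12, 13, 14, 15, 16, 17];
    let mut dst = [0u8; 8];
    // Select lanes 0, 2, and 5.
    let n = compress_store_u8x8(&src, 0b0010_0101, &mut dst, &table);
    assert_eq!(&dst[..n], &[10, 12, 15]);
}
```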
§Fallback Behavior
All kernels fall back to scalar/shuffle implementations when architecture-specific features are not available. The fallback works on all architectures.
§Other Kernels
- `SimdSingleTableU32U8Lookup`: Uses scalar lookups (works on all architectures)
- `SimdDualTableU32U8Lookup`: Uses scalar lookups (works on all architectures)
- `PipelinedSingleTableU32U8Lookup`: Uses scalar lookups with software prefetching
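The software-prefetching idea behind a pipelined kernel can be sketched as follows. The structure and the lookahead distance are illustrative assumptions, not this crate's code: while processing element `i`, issue a prefetch for the table slot that element `i + DIST` will need, hiding memory latency.

```rust
// Lookahead distance; a tuning assumption for this sketch.
const DIST: usize = 8;

#[inline]
fn prefetch(table: &[u8], index: usize) {
    // On x86_64, hint the table slot into cache; elsewhere this is a
    // no-op so the sketch stays portable.
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(table.as_ptr().add(index) as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let _ = (table, index);
    }
}

fn pipelined_lookup(keys: &[u32], table: &[u8], out: &mut [u8]) {
    for i in 0..keys.len() {
        // Prefetch the slot a future iteration will read.
        if i + DIST < keys.len() {
            prefetch(table, keys[i + DIST] as usize);
        }
        out[i] = table[keys[i] as usize];
    }
}

fn main() {
    // Identity table: i -> i.
    let table: Vec<u8> = (0u8..=255).collect();
    let keys = [5u32, 250, 17];
    let mut out = [0u8; 3];
    pipelined_lookup(&keys, &table, &mut out);
    assert_eq!(out, [5, 250, 17]);
}
```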
Structs§
- `PipelinedSingleTableU32U8Lookup`: Pipelined single-table lookup kernel - a u32-to-u8 lookup table kernel with prefetch pipelining.
- `SimdCascadingTableU32U8Lookup`: SIMD “Cascading” 2nd/3rd table lookup kernel.
- `SimdDualTableU32U8Lookup`: Dual lookup table kernel - a u32-to-u8 lookup table kernel with a custom SIMD function for combining the results. The second table is only consulted when the first lookup returns a non-zero value. The lookups themselves are scalar; apart from the SIMD combining function, no SIMD is used. The code is very simple, making this kernel a good fit for non-AVX architectures (like Apple M*) where faster SIMD instructions such as VCOMPRESS are not available. On Intel AVX-512 architectures, however, the [CascadingTableU32U8Lookup] kernel is faster.
- `SimdDualTableU32U8LookupV2`: Dual table lookup kernel - a u32-to-u8 lookup table kernel with a custom SIMD function for combining the results. It reduces cache thrashing by internally using the single-table kernel to write first-table results to a local temporary buffer, then looking up the second table only for entries where the first table returned a non-zero result, minimizing reads from the second table.
- `SimdDualTableWithHashLookup`: Dual table lookup kernel using an FxHashMap for table2.
- `SimdSingleTableU32U8Lookup`: Single-table lookup kernel with a SIMD function - a u32-to-u8 lookup table kernel. The user is responsible for generating the lookup table, so this can serve different use cases, including CASE..WHEN and bitmasking/filtering.
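The conditional dual-table pattern that the dual-table kernels describe can be sketched in scalar form. The function name and signature here are illustrative, not this crate's API: table2 is consulted only when table1 yields a non-zero value, and a user-supplied function combines the two results.

```rust
/// Look up table1 for each key; only on a non-zero hit, also look up
/// table2 and combine the two values. Hypothetical helper for illustration.
fn dual_lookup<F>(keys: &[u32], table1: &[u8], table2: &[u8], combine: F, out: &mut [u8])
where
    F: Fn(u8, u8) -> u8,
{
    for (o, &k) in out.iter_mut().zip(keys) {
        let v1 = table1[k as usize];
        *o = if v1 != 0 {
            // Only non-zero first-table hits pay for the second lookup.
            combine(v1, table2[k as usize])
        } else {
            0
        };
    }
}

fn main() {
    let table1 = vec![0u8, 1, 2, 0];
    let table2 = vec![9u8, 10, 20, 30];
    let keys = [0u32, 1, 2, 3];
    let mut out = [0u8; 4];
    dual_lookup(&keys, &table1, &table2, |a, b| a.wrapping_add(b), &mut out);
    // Keys 0 and 3 hit zero in table1, so table2 is never read for them.
    assert_eq!(out, [0, 11, 22, 0]);
}
```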