Struct SimdCascadingTableU32U8Lookup

pub struct SimdCascadingTableU32U8Lookup<'a> { /* private fields */ }

SIMD “Cascading” 2nd/3rd Table Lookup Kernel

This kernel is designed to “cascade” and build on top of the primary SingleTable kernel to efficiently look up secondary or additional tables. How does this work?

First, call SimdSingleTableU32U8Lookup to look up the primary table, using the lookup_compress_into_nonzeroes() method. This returns the compressed nonzero results and the indices of those results.
Next, feed these Vecs into this kernel, which uses the compressed output to do a packed lookup into the second table. This is faster than filtering all the results from the first kernel.
The lookup function f is called with the nonzero table-1 results and the corresponding values looked up from the second table, and should return results for all 16 lanes of the u8x16.
Finally, this kernel compresses the results of f and again outputs only the nonzero results and their indices, filtered from the input.

Basically, this kernel can be cascaded for additional tables.
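
The steps above can be modeled in scalar code. The following is a minimal sketch of the data flow only, not the SIMD implementation: the function names compress_nonzero_scalar and cascade_scalar are hypothetical, the mixing function works on single bytes instead of u8x16, and table bounds are assumed to cover every key.

```rust
/// Scalar model of the first stage (hypothetical helper): look up every key,
/// keep only the nonzero results, and record the position of each surviving key.
fn compress_nonzero_scalar(table: &[u8], values: &[u32]) -> (Vec<u8>, Vec<u32>) {
    let mut results = Vec::new();
    let mut indices = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        let r = table[v as usize];
        if r != 0 {
            results.push(r);
            indices.push(i as u32);
        }
    }
    (results, indices)
}

/// Scalar model of one cascade stage (hypothetical helper): only the positions
/// that survived the previous stage are looked up in this stage's table.
fn cascade_scalar(
    table: &[u8],
    values: &[u32],
    in_results: &[u8],
    in_indices: &[u32],
    mix: impl Fn(u8, u8) -> u8,
) -> (Vec<u8>, Vec<u32>) {
    let mut out_results = Vec::new();
    let mut out_indices = Vec::new();
    for (&prev, &idx) in in_results.iter().zip(in_indices) {
        // `idx` points into the ORIGINAL, unfiltered `values` slice.
        let looked_up = table[values[idx as usize] as usize];
        let mixed = mix(prev, looked_up);
        if mixed != 0 {
            out_results.push(mixed);
            out_indices.push(idx);
        }
    }
    (out_results, out_indices)
}
```

Because the output of cascade_scalar has the same shape as its input (nonzero results plus indices into the original values), it can be fed into another stage for a third table. The work per stage is proportional to the number of surviving entries, not to the length of values.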

The theory is that this cascading, packed-lookup approach comes closest to a kernel whose runtime, even with multiple tables, is roughly O(num_nonzero_lookups). UPDATE 12/2/2025: Intel Xeon results show that, even with huge (15M) tables, this gives a 40% speedup over the V2 kernel. The speedup grows as tables shrink: 4M shows over 50%, and smaller tables show even more, which suggests that this design scales well.

Implementations

impl<'a> SimdCascadingTableU32U8Lookup<'a>

pub fn new(lookup_table: &'a [u8]) -> Self

pub fn cascading_lookup<F>(
    &self,
    values: &[u32],
    in_nonzero_results: &[u8],
    in_indices: &[u32],
    f: F,
    out_results: &mut Vec<u8>,
    out_indices: &mut Vec<u32>,
)
where
    F: Fn(u8x16, u8x16) -> u8x16,

Given a slice of u32 values, looks up each one. Designed to work in cascading mode: pass in the nonzero results and indices output by SimdSingleTableU32U8Lookup::lookup_compress_into_nonzeroes(), along with the values (which are the keys for the lookup table in this struct).

For this to be efficient, the length of values probably should be at least hundreds or thousands of values.

Arguments

values - &[u32] of keys (indices into this lookup table) to look up. NOTE: these are the ORIGINAL values, NOT filtered, so this slice should have the same length as the values fed into the SimdSingleTableU32U8Lookup kernel. In other words, values will usually be longer than in_nonzero_results.
in_nonzero_results - &[u8] of nonzero results from SimdSingleTableU32U8Lookup::lookup_compress_into_nonzeroes()
in_indices - &[u32] of indices from SimdSingleTableU32U8Lookup::lookup_compress_into_nonzeroes(). These should be indices into the values slice.
f - function that mixes in_nonzero_results with the values looked up from this lookup table. The u8x16 it returns is zero-compressed, together with the indices, to produce the nonzero output.
out_results - &mut Vec<u8> that receives the nonzero results returned by the mixing function f
out_indices - &mut Vec<u32> that receives the input indices whose mixed result is nonzero
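
To illustrate the contract of f and the compression that follows it, here is a minimal scalar sketch. It makes several assumptions: a plain [u8; 16] array (the Lanes alias) stands in for the real u8x16 type, and mix_and and compress_lanes are hypothetical names, not part of this crate. The real kernel performs these steps with SIMD instructions.

```rust
/// Stand-in for a 16-lane SIMD byte vector; the real kernel uses `u8x16`.
type Lanes = [u8; 16];

/// Example mixing function with the same shape as `f`: AND the two inputs
/// lane by lane, so a lane stays nonzero only if both results share a set bit.
fn mix_and(prev: Lanes, looked_up: Lanes) -> Lanes {
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = prev[i] & looked_up[i];
    }
    out
}

/// Scalar model of the zero-compression applied to the output of `f`:
/// append each nonzero lane and its index, and skip lanes that are zero.
fn compress_lanes(
    mixed: Lanes,
    lane_indices: [u32; 16],
    out_results: &mut Vec<u8>,
    out_indices: &mut Vec<u32>,
) {
    for i in 0..16 {
        if mixed[i] != 0 {
            out_results.push(mixed[i]);
            out_indices.push(lane_indices[i]);
        }
    }
}
```

Note that f must produce a value for every lane, including lanes it wants to drop; returning zero in a lane is what removes that entry from out_results and out_indices.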