Module simd_ops

Auto-vectorized SIMD operations for columnar data processing.

§Performance Architecture

These functions are structured so that LLVM can auto-vectorize them. In the benchmarks below they match the performance of explicit SIMD (e.g., the wide crate) without the complexity of platform-specific code.

§Why This Pattern Works

LLVM can auto-vectorize loops when:

  1. Loop bounds are known or predictable
  2. Memory access is sequential
  3. Operations are independent across lanes

The 4-accumulator pattern breaks loop-carried dependencies, allowing LLVM to use SIMD registers effectively:

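// BAD: a single accumulator creates a loop-carried dependency chain
for x in data { sum += x; }  // each add waits for the previous one

// GOOD: four independent accumulators can map onto SIMD lanes.
// chunks_exact(4) never yields a short chunk, so chunk[3] cannot go out of
// bounds; the leftover elements (chunks_exact(4).remainder()) are added separately.
for chunk in data.chunks_exact(4) {
    s0 += chunk[0];  // these four adds do not depend on
    s1 += chunk[1];  // each other, so they can execute
    s2 += chunk[2];  // simultaneously
    s3 += chunk[3];
}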
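
For reference, a complete function in this style might look like the following. This is a minimal sketch, not the module's actual source: the name `sum_f64_sketch` is hypothetical, and the way the tail and the four partial sums are combined is an assumption.

```rust
/// Sketch of the 4-accumulator pattern (hypothetical name, not the real `sum_f64`).
pub fn sum_f64_sketch(data: &[f64]) -> f64 {
    // Four independent partial sums: no add depends on another lane's result.
    let mut acc = [0.0f64; 4];

    let chunks = data.chunks_exact(4);
    // Elements left over when data.len() is not a multiple of 4.
    let tail = chunks.remainder();

    for chunk in chunks {
        acc[0] += chunk[0];
        acc[1] += chunk[1];
        acc[2] += chunk[2];
        acc[3] += chunk[3];
    }

    // Combine the lanes once, after the hot loop.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    // Scalar loop over at most three trailing elements.
    for &x in tail {
        sum += x;
    }
    sum
}
```

Because the additions are grouped differently from a left-to-right sum, the result can differ from `.iter().sum()` in the last bits for general floating-point input.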

§Benchmark Results (10M elements, Apple Silicon)

Operation   wide crate   auto-vectorized   naive iter
sum_f64     2.0 ms       2.0 ms (1.0x)     7.8 ms
min_f64     1.5 ms       1.5 ms (1.0x)     1.4 ms

§WARNING

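DO NOT “simplify” these functions to use .iter().sum() or similar patterns. While cleaner-looking, such code can be 3-4x slower for floating-point sums (7.8 ms vs. 2.0 ms for sum_f64 in the table above). Floating-point addition is not associative, so the compiler is not allowed to reorder a strictly sequential sum into SIMD lanes.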
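
The associativity constraint is easy to see with three values. The sketch below (the function name `regroup_demo` is made up for illustration) adds the same numbers with two different groupings and gets two different answers, which is why the compiler must preserve the order written in the source.

```rust
/// Returns the same three-term sum computed with two different groupings.
pub fn regroup_demo() -> (f64, f64) {
    let (a, b, c) = (1e17_f64, -1e17_f64, 1.0_f64);

    // Left to right: a and b cancel exactly, then c is added.
    let left = (a + b) + c;

    // Regrouped: c is smaller than half the spacing between adjacent
    // f64 values near 1e17, so b + c rounds back to b and c is lost.
    let right = a + (b + c);

    (left, right)
}
```

Here `left` is 1.0 and `right` is 0.0. Writing the accumulators out by hand makes the regrouping part of the program, so the compiler no longer needs permission to reorder anything.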

If you need to modify these functions, run the SIMD benchmark first:

cargo bench --bench tpch -- Q6

Structs§

PackedMask
A packed bitmask where each bit represents whether a row passes a filter.
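
To make the idea of a packed mask concrete, here is a minimal sketch that stores one bit per row in 64-bit words. The type `PackedMaskSketch`, its fields, and its methods are all hypothetical; the real `PackedMask` layout and API are not shown on this page and may differ.

```rust
/// Hypothetical stand-in for a packed filter mask: bit i is set if row i passes.
pub struct PackedMaskSketch {
    words: Vec<u64>, // 64 rows per word, least significant bit first (assumed layout)
    len: usize,      // number of rows represented
}

impl PackedMaskSketch {
    /// Pack a slice of booleans into 64-bit words.
    pub fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &b) in bits.iter().enumerate() {
            if b {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Self { words, len: bits.len() }
    }

    /// Whether row `i` passes the filter.
    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "row index out of bounds");
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Number of rows passing the filter, one popcount per word.
    pub fn count_set(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// In-place AND with another mask of the same length.
    pub fn and_inplace(&mut self, other: &Self) {
        assert_eq!(self.len, other.len, "mask lengths must match");
        for (a, b) in self.words.iter_mut().zip(&other.words) {
            *a &= *b;
        }
    }
}
```

Compared with one `bool` per row, this representation uses an eighth of the memory and lets counting and AND operate on 64 rows per instruction.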

Functions§

and_masks_inplace
In-place AND of two boolean masks.
between_f64
Check if f64 values are in range [low, high] (inclusive) with epsilon tolerance. More efficient than separate ge + le comparisons + AND. Uses epsilon tolerance to handle f32→f64 conversion precision artifacts.
between_f64_packed
Check if f64 values are in range [low, high] (inclusive), returning packed mask. Uses epsilon tolerance to handle f32→f64 conversion precision artifacts.
between_i32
Check if i32 values are in range [low, high] (inclusive). More efficient than separate ge + le comparisons + AND.
between_i64
Check if i64 values are in range [low, high] (inclusive). More efficient than separate ge + le comparisons + AND.
between_i32_packed
Check if i32 values are in range [low, high] (inclusive), returning packed mask.
between_i64_packed
Check if i64 values are in range [low, high] (inclusive), returning packed mask.
count_filtered
Count of true values in filter mask (number of rows passing filter).
count_packed_filtered
Count of set bits in filter mask (number of rows passing filter).
eq_f64
Equality comparison for f64 with epsilon tolerance.
eq_f64_packed
Equality comparison for f64 returning packed mask with epsilon tolerance.
eq_i32
eq_i64
eq_i32_packed
Equality comparison for i32 returning packed mask.
eq_i64_packed
Equality comparison for i64 returning packed mask.
ge_f64
Greater-than-or-equal comparison for f64 with epsilon tolerance.
ge_f64_packed
Greater-than-or-equal comparison for f64 returning packed mask with epsilon tolerance.
ge_i32
ge_i64
ge_i32_packed
Greater-than-or-equal comparison for i32 returning packed mask.
ge_i64_packed
Greater-than-or-equal comparison for i64 returning packed mask.
gt_f64
Greater-than comparison for f64 with epsilon tolerance.
gt_f64_packed
Greater-than comparison for f64 returning packed mask with epsilon tolerance.
gt_i32
gt_i64
gt_i32_packed
Greater-than comparison for i32 returning packed mask.
gt_i64_packed
Greater-than comparison for i64 returning packed mask.
le_f64
Less-than-or-equal comparison for f64 with epsilon tolerance.
le_f64_packed
Less-than-or-equal comparison for f64 returning packed mask with epsilon tolerance.
le_i32
le_i64
le_i32_packed
Less-than-or-equal comparison for i32 returning packed mask.
le_i64_packed
Less-than-or-equal comparison for i64 returning packed mask.
lt_f64
Less-than comparison for f64 with epsilon tolerance.
lt_f64_packed
Less-than comparison for f64 returning packed mask with epsilon tolerance.
lt_i32
lt_i64
lt_i32_packed
Less-than comparison for i32 returning packed mask.
lt_i64_packed
Less-than comparison for i64 returning packed mask.
max_f64
Maximum of f64 values using 4-lane parallel reduction.
max_f64_filtered
Max of f64 values where filter_mask[i] == true.
max_f64_packed_filtered
Max of f64 values where the corresponding bit in filter_mask is set.
max_i64
Maximum of i64 values using 4-lane parallel reduction.
max_i64_filtered
Max of i64 values where filter_mask[i] == true.
max_i64_packed_filtered
Max of i64 values where the corresponding bit in filter_mask is set.
min_f64
Minimum of f64 values using 4-lane parallel reduction.
min_f64_filtered
Min of f64 values where filter_mask[i] == true.
min_f64_packed_filtered
Min of f64 values where the corresponding bit in filter_mask is set.
min_i64
Minimum of i64 values using 4-lane parallel reduction.
min_i64_filtered
Min of i64 values where filter_mask[i] == true.
min_i64_packed_filtered
Min of i64 values where the corresponding bit in filter_mask is set.
ne_f64
Inequality comparison for f64 with epsilon tolerance.
ne_f64_packed
Inequality comparison for f64 returning packed mask with epsilon tolerance.
ne_i32
ne_i64
ne_i32_packed
Inequality comparison for i32 returning packed mask.
ne_i64_packed
Inequality comparison for i64 returning packed mask.
sum_f64
Sum of f64 values using 4-accumulator auto-vectorization pattern.
sum_f64_filtered
Sum of f64 values where filter_mask[i] == true, using 4-accumulator pattern.
sum_f64_packed_filtered
Sum of f64 values where the corresponding bit in filter_mask is set.
sum_i64
Sum of i64 values using 4-accumulator auto-vectorization pattern.
sum_i64_filtered
Sum of i64 values where filter_mask[i] == true.
sum_i64_packed_filtered
Sum of i64 values where the corresponding bit in filter_mask is set.
sum_product_f64
Sum of element-wise product of two f64 arrays using 4-accumulator pattern.
sum_product_f64_filtered
Sum of element-wise product of two f64 arrays with filter mask.
sum_product_f64_filtered_masked
Sum of element-wise product with filter mask AND null masks.
sum_product_f64_masked
Sum of element-wise product with null mask using 4-accumulator pattern.
sum_product_f64_packed_filtered
Sum of element-wise product of two f64 arrays with packed filter mask.
sum_product_f64_packed_filtered_masked
Sum of element-wise product with packed filter mask AND null masks.
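
The comparison and aggregation functions listed above are meant to be combined: build a filter mask, then aggregate only the rows that pass. The following is a minimal sketch of that flow under stated assumptions. Both function names are hypothetical, the signatures are guesses rather than the module's real API, and the tolerance is passed explicitly as `eps` because the epsilon the module actually uses is not documented here.

```rust
/// Sketch of an inclusive range check with tolerance (hypothetical signature).
/// A value passes if it lies in [low - eps, high + eps].
pub fn between_f64_sketch(values: &[f64], low: f64, high: f64, eps: f64) -> Vec<bool> {
    values
        .iter()
        .map(|&v| v >= low - eps && v <= high + eps)
        .collect()
}

/// Sketch of a filtered sum of products using four accumulators.
pub fn sum_product_f64_filtered_sketch(a: &[f64], b: &[f64], mask: &[bool]) -> f64 {
    assert!(a.len() == b.len() && a.len() == mask.len(), "length mismatch");

    // Split each input into a multiple-of-4 prefix and a short tail.
    let main = a.len() - a.len() % 4;
    let (a_main, a_tail) = a.split_at(main);
    let (b_main, b_tail) = b.split_at(main);
    let (m_main, m_tail) = mask.split_at(main);

    let mut acc = [0.0f64; 4];
    for ((xa, xb), xm) in a_main
        .chunks_exact(4)
        .zip(b_main.chunks_exact(4))
        .zip(m_main.chunks_exact(4))
    {
        for lane in 0..4 {
            // A select instead of a skipped iteration: each lane always adds
            // something, either the product or zero.
            acc[lane] += if xm[lane] { xa[lane] * xb[lane] } else { 0.0 };
        }
    }

    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for ((&x, &y), &m) in a_tail.iter().zip(b_tail).zip(m_tail) {
        if m {
            sum += x * y;
        }
    }
    sum
}
```

With prices `[2.0, 4.0, 8.0, 16.0, 32.0]`, discounts `[0.25, 0.5, 0.75, 1.0, 0.5]`, and a range of `[0.25, 0.75]`, the fourth row is filtered out and the result is 24.5. The tolerance matters for values that originated as f32: `0.1f32 as f64` is slightly larger than `0.1f64`, so an exact comparison against 0.1 rejects it while a small `eps` accepts it.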