Module simd_ops

Auto-vectorized SIMD operations for columnar data processing.

§Performance Architecture

These functions are structured so that LLVM can auto-vectorize them. In the benchmarks below they match the performance of explicit SIMD (e.g., the wide crate) without the complexity of platform-specific code.

§Why This Pattern Works

LLVM can auto-vectorize loops when:

Loop bounds are known or predictable
Memory access is sequential
Operations are independent across lanes

The 4-accumulator pattern splits one loop-carried dependency chain into four independent chains, so LLVM can map the accumulators onto SIMD lanes:

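// BAD: Single accumulator creates dependency chain
for x in data { sum += x; }  // Each add waits for the previous one

// GOOD: Four accumulators enable parallel execution
// chunks_exact(4) guarantees full chunks; chunks(4) could yield a short
// final chunk and panic on chunk[3]. Leftover elements are summed separately.
for chunk in data.chunks_exact(4) {
    s0 += chunk[0];  // These four adds can execute
    s1 += chunk[1];  // simultaneously in SIMD lanes
    s2 += chunk[2];
    s3 += chunk[3];
}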
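
As a fuller illustration of the fragment above, here is a minimal sketch of a complete four-accumulator sum, including the leftover elements when the length is not a multiple of 4. The name `sum_f64_sketch` is hypothetical; this is not the module's actual `sum_f64` implementation, only one way the pattern could be written.

```rust
/// Illustrative four-accumulator sum (hypothetical name, not the module's code).
pub fn sum_f64_sketch(data: &[f64]) -> f64 {
    let chunks = data.chunks_exact(4);
    // Elements left over when data.len() is not a multiple of 4.
    let tail = chunks.remainder();

    let (mut s0, mut s1, mut s2, mut s3) = (0.0_f64, 0.0_f64, 0.0_f64, 0.0_f64);
    for c in chunks {
        // Four independent dependency chains, one per accumulator.
        s0 += c[0];
        s1 += c[1];
        s2 += c[2];
        s3 += c[3];
    }

    // Combine the partial sums, then fold in the tail sequentially.
    let mut total = (s0 + s1) + (s2 + s3);
    for &x in tail {
        total += x;
    }
    total
}
```

Because the partial sums are combined in a different order than a left-to-right loop, the result can differ from a sequential sum in the last few bits for general floating-point input.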

§Benchmark Results (10M elements, Apple Silicon)

Operation	wide crate	auto-vectorized	naive iter
sum_f64	2.0 ms	2.0 ms (1.0x)	7.8 ms
min_f64	1.5 ms	1.5 ms (1.0x)	1.4 ms

§WARNING

DO NOT “simplify” these functions to use .iter().sum() or similar patterns. Such code looks cleaner but can be 3-4x slower: floating-point addition is not associative, so the compiler must preserve the strict left-to-right order and cannot vectorize the reduction.
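
The associativity constraint can be seen with three ordinary values. The sketch below (the function name `grouping_demo` is made up for illustration) adds the same numbers with two different groupings and gets two different f64 results, which is why the compiler may not regroup a sequential sum on its own.

```rust
/// Adds 0.1, 0.2 and 0.3 with two different groupings.
/// Hypothetical helper, for illustration only.
pub fn grouping_demo() -> (f64, f64) {
    let (a, b, c) = (0.1_f64, 0.2_f64, 0.3_f64);
    let left = (a + b) + c; // grouping used by a left-to-right loop
    let right = a + (b + c); // a regrouped sum
    (left, right)
}
```

The two results differ only in the last bit, but that is enough to make the regrouping observable, so it has to be written explicitly in the source, as the multi-accumulator functions in this module do.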

If you need to modify these functions, run the SIMD benchmark first:

cargo bench --bench tpch -- Q6

Structs§

PackedMask: A packed bitmask where each bit represents whether a row passes a filter.
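
To make the idea of a packed bitmask concrete, here is a minimal sketch that stores one bit per row in 64-bit words. Everything in it is an assumption made for illustration: the type name `PackedMaskSketch`, its fields, and its methods are hypothetical, and the real `PackedMask` layout and API are not shown on this page.

```rust
/// Hypothetical stand-in for a packed bitmask: one bit per row, 64 rows per word.
pub struct PackedMaskSketch {
    words: Vec<u64>,
    len: usize,
}

impl PackedMaskSketch {
    /// Pack a slice of booleans into bits.
    pub fn from_bools(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &b) in bits.iter().enumerate() {
            if b {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        Self { words, len: bits.len() }
    }

    /// Whether row `i` passes the filter.
    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "row index out of range");
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Number of rows passing the filter (one popcount per word).
    pub fn count_ones(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }

    /// In-place AND with another mask of the same length, 64 rows at a time.
    pub fn and_inplace(&mut self, other: &Self) {
        assert_eq!(self.len, other.len, "mask lengths must match");
        for (a, b) in self.words.iter_mut().zip(&other.words) {
            *a &= *b;
        }
    }
}
```

A layout like this uses one eighth of the memory of a `Vec<bool>` and lets combining and counting operate on 64 rows per instruction.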

Functions§

and_masks_inplace: In-place AND of two boolean masks.
between_f64: Check if f64 values are in range [low, high] (inclusive). More efficient than separate ge + le comparisons + AND. Uses an epsilon tolerance to handle f32→f64 conversion precision artifacts.
between_f64_packed: Check if f64 values are in range [low, high] (inclusive), returning packed mask. Uses epsilon tolerance to handle f32→f64 conversion precision artifacts.
between_i32: Check if i32 values are in range [low, high] (inclusive). More efficient than separate ge + le comparisons + AND.
between_i64: Check if i64 values are in range [low, high] (inclusive). More efficient than separate ge + le comparisons + AND.
between_i32_packed: Check if i32 values are in range [low, high] (inclusive), returning packed mask.
between_i64_packed: Check if i64 values are in range [low, high] (inclusive), returning packed mask.
count_filtered: Count of true values in filter mask (number of rows passing filter).
count_packed_filtered: Count of set bits in filter mask (number of rows passing filter).
eq_f64: Equality comparison for f64 with epsilon tolerance.
eq_f64_packed: Equality comparison for f64 returning packed mask with epsilon tolerance.
eq_i32
eq_i64
eq_i32_packed: Equality comparison for i32 returning packed mask.
eq_i64_packed: Equality comparison for i64 returning packed mask.
ge_f64: Greater-than-or-equal comparison for f64 with epsilon tolerance.
ge_f64_packed: Greater-than-or-equal comparison for f64 returning packed mask with epsilon tolerance.
ge_i32
ge_i64
ge_i32_packed: Greater-than-or-equal comparison for i32 returning packed mask.
ge_i64_packed: Greater-than-or-equal comparison for i64 returning packed mask.
gt_f64: Greater-than comparison for f64 with epsilon tolerance.
gt_f64_packed: Greater-than comparison for f64 returning packed mask with epsilon tolerance.
gt_i32
gt_i64
gt_i32_packed: Greater-than comparison for i32 returning packed mask.
gt_i64_packed: Greater-than comparison for i64 returning packed mask.
le_f64: Less-than-or-equal comparison for f64 with epsilon tolerance.
le_f64_packed: Less-than-or-equal comparison for f64 returning packed mask with epsilon tolerance.
le_i32
le_i64
le_i32_packed: Less-than-or-equal comparison for i32 returning packed mask.
le_i64_packed: Less-than-or-equal comparison for i64 returning packed mask.
lt_f64: Less-than comparison for f64 with epsilon tolerance.
lt_f64_packed: Less-than comparison for f64 returning packed mask with epsilon tolerance.
lt_i32
lt_i64
lt_i32_packed: Less-than comparison for i32 returning packed mask.
lt_i64_packed: Less-than comparison for i64 returning packed mask.
max_f64: Maximum of f64 values using 4-lane parallel reduction.
max_f64_filtered: Max of f64 values where filter_mask[i] == true.
max_f64_packed_filtered: Max of f64 values where the corresponding bit in filter_mask is set.
max_i64: Maximum of i64 values using 4-lane parallel reduction.
max_i64_filtered: Max of i64 values where filter_mask[i] == true.
max_i64_packed_filtered: Max of i64 values where the corresponding bit in filter_mask is set.
min_f64: Minimum of f64 values using 4-lane parallel reduction.
min_f64_filtered: Min of f64 values where filter_mask[i] == true.
min_f64_packed_filtered: Min of f64 values where the corresponding bit in filter_mask is set.
min_i64: Minimum of i64 values using 4-lane parallel reduction.
min_i64_filtered: Min of i64 values where filter_mask[i] == true.
min_i64_packed_filtered: Min of i64 values where the corresponding bit in filter_mask is set.
ne_f64: Inequality comparison for f64 with epsilon tolerance.
ne_f64_packed: Inequality comparison for f64 returning packed mask with epsilon tolerance.
ne_i32
ne_i64
ne_i32_packed: Inequality comparison for i32 returning packed mask.
ne_i64_packed: Inequality comparison for i64 returning packed mask.
sum_f64: Sum of f64 values using 4-accumulator auto-vectorization pattern.
sum_f64_filtered: Sum of f64 values where filter_mask[i] == true, using 4-accumulator pattern.
sum_f64_packed_filtered: Sum of f64 values where the corresponding bit in filter_mask is set.
sum_i64: Sum of i64 values using 4-accumulator auto-vectorization pattern.
sum_i64_filtered: Sum of i64 values where filter_mask[i] == true.
sum_i64_packed_filtered: Sum of i64 values where the corresponding bit in filter_mask is set.
sum_product_f64: Sum of element-wise product of two f64 arrays using 4-accumulator pattern.
sum_product_f64_filtered: Sum of element-wise product of two f64 arrays with filter mask.
sum_product_f64_filtered_masked: Sum of element-wise product with filter mask AND null masks.
sum_product_f64_masked: Sum of element-wise product with null mask using 4-accumulator pattern.
sum_product_f64_packed_filtered: Sum of element-wise product of two f64 arrays with packed filter mask.
sum_product_f64_packed_filtered_masked: Sum of element-wise product with packed filter mask AND null masks.
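
The comparison and filtered-aggregate functions listed above are typically used together: a comparison produces a mask, and an aggregate consumes it. The sketch below shows that flow under stated assumptions. The function names, the signatures (slices in, `Vec<bool>` mask out), and the `EPSILON` value are all hypothetical, chosen for illustration; the module's real signatures and tolerance are not given on this page.

```rust
/// Hypothetical tolerance; the module's actual epsilon is not stated here.
const EPSILON: f64 = 1e-6;

/// Inclusive range check with tolerance, one bool per row (illustrative).
pub fn between_f64_sketch(values: &[f64], low: f64, high: f64) -> Vec<bool> {
    values
        .iter()
        .map(|&v| v >= low - EPSILON && v <= high + EPSILON)
        .collect()
}

/// Sum of values whose mask entry is true, using four accumulators (illustrative).
pub fn sum_f64_filtered_sketch(values: &[f64], mask: &[bool]) -> f64 {
    assert_eq!(values.len(), mask.len(), "values and mask must align");
    let v_chunks = values.chunks_exact(4);
    let m_chunks = mask.chunks_exact(4);
    let (v_tail, m_tail) = (v_chunks.remainder(), m_chunks.remainder());

    let (mut s0, mut s1, mut s2, mut s3) = (0.0_f64, 0.0_f64, 0.0_f64, 0.0_f64);
    for (v, m) in v_chunks.zip(m_chunks) {
        // Select the value or 0.0 per lane; filtered-out rows contribute
        // nothing, even if they hold NaN.
        s0 += if m[0] { v[0] } else { 0.0 };
        s1 += if m[1] { v[1] } else { 0.0 };
        s2 += if m[2] { v[2] } else { 0.0 };
        s3 += if m[3] { v[3] } else { 0.0 };
    }

    let mut total = (s0 + s1) + (s2 + s3);
    for (&v, &m) in v_tail.iter().zip(m_tail) {
        if m {
            total += v;
        }
    }
    total
}
```

The tolerance matters for data that started as f32: `0.1_f32 as f64` is slightly greater than the f64 literal `0.1`, so an exact `<= 0.1` check would reject a row that the epsilon-tolerant check accepts.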
