Auto-vectorized SIMD operations for columnar data processing.
§Performance Architecture
These functions are structured to enable LLVM auto-vectorization. In the
benchmarks below, they match the performance of explicit SIMD (e.g., the wide
crate) without the complexity of platform-specific code.
§Why This Pattern Works
LLVM can auto-vectorize loops when:
- Loop bounds are known or predictable
- Memory access is sequential
- Operations are independent across lanes
The 4-accumulator pattern breaks loop-carried dependencies, allowing LLVM to use SIMD registers effectively:
// BAD: a single accumulator creates a loop-carried dependency chain
for x in data { sum += x; } // each add waits for the previous one
// GOOD: four independent accumulators enable parallel execution
// (chunks_exact skips a trailing partial chunk; add its remainder() separately)
for chunk in data.chunks_exact(4) {
    s0 += chunk[0]; // these four adds can execute
    s1 += chunk[1]; // simultaneously in SIMD lanes
    s2 += chunk[2];
    s3 += chunk[3];
}
§Benchmark Results (10M elements, Apple Silicon)
| Operation | wide crate | auto-vectorized | naive iter |
|---|---|---|---|
| sum_f64 | 2.0 ms | 2.0 ms (1.0x) | 7.8 ms |
| min_f64 | 1.5 ms | 1.5 ms (1.0x) | 1.4 ms |
§WARNING
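DO NOT “simplify” these functions to use .iter().sum() or similar patterns.
Although they look cleaner, such patterns can be 3-4x slower for floating-point
sums (7.8 ms vs 2.0 ms for sum_f64 in the table above). Floating-point addition
is not associative, so LLVM must preserve the order of the additions and cannot
vectorize a single-accumulator reduction.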
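For reference, the 4-accumulator pattern described above can be sketched as a complete function. This is a minimal illustration, not the module's actual source: the name `sum_f64_sketch` and its signature are assumptions, and the remainder handling is one reasonable choice among several.

```rust
/// Hypothetical stand-in for `sum_f64`; the real signature may differ.
fn sum_f64_sketch(data: &[f64]) -> f64 {
    // Four independent accumulators: no add depends on the add
    // in a neighbouring lane, so the loop body can map onto SIMD lanes.
    let (mut s0, mut s1, mut s2, mut s3) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);

    let chunks = data.chunks_exact(4);
    // Elements left over when the length is not a multiple of 4.
    let tail = chunks.remainder();

    for c in chunks {
        s0 += c[0];
        s1 += c[1];
        s2 += c[2];
        s3 += c[3];
    }

    // Combine the lanes, then fold in the scalar tail.
    let mut sum = (s0 + s1) + (s2 + s3);
    for &x in tail {
        sum += x;
    }
    sum
}
```

Because the additions are grouped differently than in a left-to-right sum, the result may differ from `.iter().sum()` in the last bits for general floating-point inputs; for small integer-valued inputs the two agree exactly.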
If you need to modify these functions, run the SIMD benchmark first:
cargo bench --bench tpch -- Q6
Structs§
- PackedMask - A packed bitmask where each bit represents whether a row passes a filter.
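To make the one-bit-per-row idea concrete, here is a small sketch of a packed bitmask. It is not the definition of `PackedMask`: the type name `PackedMaskSketch`, the `u64` word size, the bit order, and every method shown are assumptions made for illustration.

```rust
/// Hypothetical packed bitmask; NOT the crate's actual `PackedMask`.
/// Bit `i % 64` of word `i / 64` records whether row `i` passes the filter.
struct PackedMaskSketch {
    words: Vec<u64>,
    len: usize,
}

impl PackedMaskSketch {
    /// Pack one bool per row into one bit per row.
    fn from_bools(bools: &[bool]) -> Self {
        let mut words = vec![0u64; (bools.len() + 63) / 64];
        for (i, &b) in bools.iter().enumerate() {
            words[i / 64] |= (b as u64) << (i % 64);
        }
        Self { words, len: bools.len() }
    }

    /// Whether row `i` passes the filter.
    fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "row index out of range");
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Number of rows passing the filter (population count per word).
    fn count_ones(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }
}
```

A layout like this uses one eighth of the memory of a `Vec<bool>` and lets counting proceed a word at a time, which is presumably the motivation for the `_packed` function variants listed below.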
Functions§
- and_masks_inplace - In-place AND of two boolean masks.
- between_f64 - Check if f64 values are in range [low, high] (inclusive) with epsilon tolerance. More efficient than separate ge + le comparisons + AND. Uses epsilon tolerance to handle f32→f64 conversion precision artifacts.
- between_f64_packed - Check if f64 values are in range [low, high] (inclusive), returning packed mask. Uses epsilon tolerance to handle f32→f64 conversion precision artifacts.
- between_i32 - Check if i32 values are in range [low, high] (inclusive). More efficient than separate ge + le comparisons + AND.
- between_i64 - Check if i64 values are in range [low, high] (inclusive). More efficient than separate ge + le comparisons + AND.
- between_i32_packed - Check if i32 values are in range [low, high] (inclusive), returning packed mask.
- between_i64_packed - Check if i64 values are in range [low, high] (inclusive), returning packed mask.
- count_filtered - Count of true values in filter mask (number of rows passing filter).
- count_packed_filtered - Count of set bits in filter mask (number of rows passing filter).
- eq_f64 - Equality comparison for f64 with epsilon tolerance.
- eq_f64_packed - Equality comparison for f64 returning packed mask with epsilon tolerance.
- eq_i32
- eq_i64
- eq_i32_packed - Equality comparison for i32 returning packed mask.
- eq_i64_packed - Equality comparison returning packed mask.
- ge_f64 - Greater-than-or-equal comparison for f64 with epsilon tolerance.
- ge_f64_packed - Greater-than-or-equal comparison for f64 returning packed mask with epsilon tolerance.
- ge_i32
- ge_i64
- ge_i32_packed - Greater-than-or-equal comparison for i32 returning packed mask.
- ge_i64_packed - Greater-than-or-equal comparison returning packed mask.
- gt_f64 - Greater-than comparison for f64 with epsilon tolerance.
- gt_f64_packed - Greater-than comparison for f64 returning packed mask with epsilon tolerance.
- gt_i32
- gt_i64
- gt_i32_packed - Greater-than comparison for i32 returning packed mask.
- gt_i64_packed - Greater-than comparison returning packed mask.
- le_f64 - Less-than-or-equal comparison for f64 with epsilon tolerance.
- le_f64_packed - Less-than-or-equal comparison for f64 returning packed mask with epsilon tolerance.
- le_i32
- le_i64
- le_i32_packed - Less-than-or-equal comparison for i32 returning packed mask.
- le_i64_packed - Less-than-or-equal comparison returning packed mask.
- lt_f64 - Less-than comparison for f64 with epsilon tolerance.
- lt_f64_packed - Less-than comparison for f64 returning packed mask with epsilon tolerance.
- lt_i32
- lt_i64
- lt_i32_packed - Less-than comparison for i32 returning packed mask.
- lt_i64_packed - Less-than comparison returning packed mask.
- max_f64 - Maximum of f64 values using 4-lane parallel reduction.
- max_f64_filtered - Max of f64 values where filter_mask[i] == true.
- max_f64_packed_filtered - Max of f64 values where the corresponding bit in filter_mask is set.
- max_i64 - Maximum of i64 values using 4-lane parallel reduction.
- max_i64_filtered - Max of i64 values where filter_mask[i] == true.
- max_i64_packed_filtered - Max of i64 values where the corresponding bit in filter_mask is set.
- min_f64 - Minimum of f64 values using 4-lane parallel reduction.
- min_f64_filtered - Min of f64 values where filter_mask[i] == true.
- min_f64_packed_filtered - Min of f64 values where the corresponding bit in filter_mask is set.
- min_i64 - Minimum of i64 values using 4-lane parallel reduction.
- min_i64_filtered - Min of i64 values where filter_mask[i] == true.
- min_i64_packed_filtered - Min of i64 values where the corresponding bit in filter_mask is set.
- ne_f64 - Inequality comparison for f64 with epsilon tolerance.
- ne_f64_packed - Inequality comparison for f64 returning packed mask with epsilon tolerance.
- ne_i32
- ne_i64
- ne_i32_packed - Inequality comparison for i32 returning packed mask.
- ne_i64_packed - Inequality comparison returning packed mask.
- sum_f64 - Sum of f64 values using 4-accumulator auto-vectorization pattern.
- sum_f64_filtered - Sum of f64 values where filter_mask[i] == true, using 4-accumulator pattern.
- sum_f64_packed_filtered - Sum of f64 values where the corresponding bit in filter_mask is set.
- sum_i64 - Sum of i64 values using 4-accumulator auto-vectorization pattern.
- sum_i64_filtered - Sum of i64 values where filter_mask[i] == true.
- sum_i64_packed_filtered - Sum of i64 values where the corresponding bit in filter_mask is set.
- sum_product_f64 - Sum of element-wise product of two f64 arrays using 4-accumulator pattern.
- sum_product_f64_filtered - Sum of element-wise product of two f64 arrays with filter mask.
- sum_product_f64_filtered_masked - Sum of element-wise product with filter mask AND null masks.
- sum_product_f64_masked - Sum of element-wise product with null mask using 4-accumulator pattern.
- sum_product_f64_packed_filtered - Sum of element-wise product of two f64 arrays with packed filter mask.
- sum_product_f64_packed_filtered_masked - Sum of element-wise product with packed filter mask AND null masks.
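The comparison and filtered-aggregate functions listed above are meant to compose: a comparison produces a mask, and an aggregate consumes it. The sketch below shows one way an inclusive range check with epsilon tolerance could feed a filtered 4-accumulator sum. Everything in it is an assumption made for illustration: the function names and signatures, the `Vec<bool>` mask representation, and the `EPS` value, which these docs do not specify.

```rust
/// Assumed tolerance; the module's actual epsilon is not stated in these docs.
const EPS: f64 = 1e-9;

/// Hypothetical analogue of `between_f64`: inclusive range check,
/// widened by EPS on both sides, done as a single pass.
fn between_f64_sketch(values: &[f64], low: f64, high: f64) -> Vec<bool> {
    values
        .iter()
        .map(|&v| v >= low - EPS && v <= high + EPS)
        .collect()
}

/// Hypothetical analogue of `sum_f64_filtered`: sums values[i] where mask[i] is true.
fn sum_f64_filtered_sketch(values: &[f64], mask: &[bool]) -> f64 {
    assert_eq!(values.len(), mask.len());
    let mut acc = [0.0f64; 4];

    let v = values.chunks_exact(4);
    let m = mask.chunks_exact(4);
    let (v_tail, m_tail) = (v.remainder(), m.remainder());

    for (vc, mc) in v.zip(m) {
        for lane in 0..4 {
            // Select instead of branching out of the loop, so each
            // lane does the same work regardless of the mask.
            acc[lane] += if mc[lane] { vc[lane] } else { 0.0 };
        }
    }

    // Combine the lanes, then handle the leftover elements.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (&x, &keep) in v_tail.iter().zip(m_tail) {
        if keep {
            sum += x;
        }
    }
    sum
}
```

For example, with values `[0.5, 1.0, 2.0, 3.0, 4.0, 5.0]` and the range `[1.0, 4.0]`, the mask keeps the middle four values and the filtered sum is 10.0. Whether a loop like this actually vectorizes depends on the compiler version and target, so the benchmark mentioned in the warning remains the way to check.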