Quicksim
Quick-sim(d) provides several SIMD-accelerated, drop-in replacements for common algorithms.
It's designed to make speeding up trivial parts of your code as easy as possible - no manual SIMD programming required.
AVX features are automatically detected at runtime and enabled whenever possible.
Quick-sim(d) is not intended as a replacement for Portable SIMD. Instead, it aims to completely hide vectorization logic from the API, keeping usage simple and ergonomic.
The project is currently in alpha state, and does not yet provide vectorized implementations for all vectorizable functions in the Rust standard library.
The implementations in this crate generally achieve significant speedups compared to naive versions. However, there are some trade-offs. For more information, see the Limitations section.
Note: Iterators are usually fast enough for most cases. Use this crate carefully, and make sure to identify critical hot-paths before replacing all of your implementations with the accelerated functions provided by this crate.
Add to your project
quicksim = "0.1"
Example
use quicksim::prelude::*;
fn main() {
let array: Vec<u32> = vec![0, 1, 2, 3, 4];
assert_eq!(array.max_simd(), array.iter().max().copied());
assert_eq!(array.min_simd(), array.iter().min().copied());
assert_eq!(array.find_simd(4), Some(4));
assert!(array.contains_simd(4));
}
Limitations
The SIMD implementation becomes effective for arrays with more than 32 items. This means that if your array length is below 32 more than 50% of the time, using the *_simd()
functions of this crate will generally be slower on average.
Additionally, if no optimized implementation exists for a specific architecture, the SIMD version will always be slower than the regular implementation due to the overhead of checking for SIMD support and falling back to the default version.
Attribution
Many algorithms implemented here are based on those from Algorithmica, which helped me to get deeper into SIMD and implement some of the algorithms in this crate.
Check this book out if you're interested in SIMD programming yourself, or if you want to learn more about computers, algorithms and low-level optimizations.
Benchmarks
The SIMD implementation becomes effective for arrays with more than 32 items. For smaller arrays (32 items or fewer), the SIMD versions are actually slower than the standard implementations.
Please note that the benchmark results may contain some noise, so they shouldn't be treated as 100% accurate.
All measurements were taken on a Ryzen 5 5600x CPU with 32 GB of RAM.
Some benchmarks include two cases: -avg
and -worst
. These represent the average-case and worst-case performance of the algorithm.
For example, the find algorithm’s average case assumes the element is located at index size/2
, while its worst case assumes the element is the last item in the collection.
&[u32]
🔍 Find - avg
size |
simd (ns) |
iter (ns) |
speedup |
16 |
2.8042 |
2.367 |
0.844091 |
32 |
2.1073 |
4.7695 |
2.26332 |
64 |
2.5788 |
8.4581 |
3.27986 |
150 |
3.5278 |
22.757 |
6.45076 |
530 |
9.2504 |
67.652 |
7.31341 |
1028 |
16.543 |
126.14 |
7.62498 |
5010 |
85.771 |
595.31 |
6.94069 |
8000 |
132.41 |
948.74 |
7.16517 |
🔍 Find - worst
size |
simd (ns) |
iter (ns) |
speedup |
16 |
5.1783 |
4.1654 |
0.804395 |
32 |
2.3703 |
8.2243 |
3.46973 |
64 |
3.0473 |
15.68 |
5.14554 |
150 |
11.059 |
40.107 |
3.62664 |
530 |
20.414 |
129.42 |
6.33977 |
1028 |
36.596 |
245.98 |
6.7215 |
5010 |
153.56 |
1178.3 |
7.67322 |
8000 |
243.77 |
1880.8 |
7.71547 |
Count
size |
simd (ns) |
iter (ns) |
speedup |
16 |
2.3437 |
2.6018 |
1.11013 |
32 |
2.8135 |
4.6383 |
1.64859 |
64 |
3.0854 |
8.9749 |
2.90883 |
150 |
7.3019 |
20.38 |
2.79105 |
530 |
12.712 |
73.599 |
5.78973 |
1028 |
18.706 |
137.6 |
7.35593 |
5010 |
83.04 |
658.32 |
7.92775 |
8000 |
126.9 |
1061.6 |
8.36564 |
Min
size |
simd (ns) |
iter (ns) |
speedup |
16 |
6.8805 |
4.4347 |
0.644532 |
32 |
2.5866 |
7.578 |
2.92971 |
64 |
3.2832 |
15.214 |
4.63389 |
150 |
10.318 |
35.093 |
3.40114 |
530 |
13.826 |
116.56 |
8.43049 |
1028 |
18.393 |
270.9 |
14.7284 |
5010 |
80.394 |
1195.6 |
14.8718 |
8000 |
126.19 |
1885.2 |
14.9394 |
Max
size |
simd (ns) |
iter (ns) |
speedup |
16 |
5.3713 |
4.9116 |
0.914416 |
32 |
2.5786 |
9.9717 |
3.8671 |
64 |
3.0447 |
24.564 |
8.06779 |
150 |
12.016 |
65.061 |
5.41453 |
530 |
14.317 |
244.71 |
17.0923 |
1028 |
18.343 |
479.27 |
26.1282 |
5010 |
80.861 |
2337 |
28.9014 |
8000 |
126.19 |
3734 |
29.5903 |
Contains - avg
size |
simd (ns) |
iter (ns) |
speedup |
16 |
2.5754 |
2.5422 |
0.987109 |
32 |
1.8953 |
4.401 |
2.32206 |
64 |
2.3444 |
8.6579 |
3.69301 |
150 |
3.4732 |
18.912 |
5.44512 |
530 |
8.1109 |
68.4 |
8.4331 |
1028 |
17.607 |
127.71 |
7.25337 |
5010 |
88.245 |
590.65 |
6.6933 |
8000 |
129.19 |
941.28 |
7.28601 |
Contains - worst
size |
simd (ns) |
iter (ns) |
speedup |
16 |
4.8998 |
4.9446 |
1.00914 |
32 |
2.1081 |
8.6596 |
4.10777 |
64 |
2.5934 |
20.38 |
7.85841 |
150 |
12.114 |
40.873 |
3.37403 |
530 |
21.691 |
130.17 |
6.00111 |
1028 |
31.116 |
249.47 |
8.01742 |
5010 |
154.94 |
1192.1 |
7.69395 |
8000 |
245.39 |
1874.9 |
7.64049 |
&[f32]
Min
size |
simd (ns) |
iter (ns) |
speedup |
16 |
3.46 |
3.2625 |
0.942919 |
32 |
2.3321 |
5.1986 |
2.22915 |
64 |
2.8477 |
10.93 |
3.83819 |
150 |
6.8676 |
27.82 |
4.05091 |
530 |
14.967 |
118.27 |
7.90205 |
1028 |
21.059 |
234.49 |
11.1349 |
5010 |
79.785 |
1177.4 |
14.7572 |
8000 |
128.59 |
1859.9 |
14.4638 |
Max
size |
simd (ns) |
iter (ns) |
speedup |
16 |
3.6937 |
3.2061 |
0.867991 |
32 |
2.3294 |
5.0832 |
2.18219 |
64 |
3.299 |
10.929 |
3.31282 |
150 |
6.7782 |
28.02 |
4.13384 |
530 |
14.808 |
118.77 |
8.02066 |
1028 |
25.172 |
234.02 |
9.29684 |
5010 |
120.41 |
1157.3 |
9.61133 |
8000 |
190.97 |
1861.4 |
9.74708 |