Crate iterator_ilp

source ·
Expand description

This crate implements an Iterator extension that provides instruction-parallel reductions


Have you ever wondered why Iterator::sum() performs so poorly on floating-point data?

On my machine, a benchmark of summing the elements of a Vec<f32> that fits in the CPU’s L1d cache (which is a precondition for maximal computation performance) sums about a billion numbers per second. This may seem reasonable until you realize that modern CPUs have multi-GHz clock rates, can process multiple instructions per clock cycle, and can sum multiple floating-point numbers per instruction. Then you come to the realization that the orders of magnitude aren’t right.

The problem lies not in the implementation of Iterator::sum(), but in its definition. This code…

let sum = floats.iter().sum::<f32>();

…correctly compiles down to the same assembly as that loop…

let mut sum = 0.0;
for &float in &floats {
    sum += float;

…but that loop itself isn’t right for modern hardware, because it does not expose enough instruction-level parallelism (ILP).

To give some context, the Rust compiler does not allow itself to reorder floating-point operations with respect to what the user wrote. This is a good thing in general because floating-point arithmetic is not associative, which means such optimizations would make program output nondeterministic (it depends on what compiler optimizations were applied) and could break the numerical stability of some trickier algorithms.

But in the case of the loop above, it also means that whatever optimizations are applied, the final code must only use one accumulator, and sum the first number into that accumulator, then the second number, then the third one…

Doing so turns our whole program into a gigantic dependency chain of scalar floating-point operations, with no opportunities for compilers or hardware to extract parallelism. Without parallelism, hardware capabilities for small-scale parallelism go to waste, and execution speed becomes limited by the CPU’s addition latency (how much time it takes to compute one addition) rather than its addition throughput (how many unrelated additions it can compute per second).

This problem is not unique to Iterator::sum(), or even to floating-point data. All iterator methods that perform data reduction (take an iterator of elements and produce a single result) are affected by this problem to some degree. All it takes is an operation whose semantics depend on the number of observed iterator items or the order in which operations are performed, and the compiler will generate bad code.

And since most of these Iterator methods are documented to perform data reduction in a specific way, this problem cannot be solved by improving the standard library’s Iterator implementation, at least not without breaking the API of Iterator, which would be Very Bad (tm) and is thus highly unlikely to happen.

§What this crate provides

IteratorILP is an Iterator extension trait that can be implemented for all iterators of known length, and provides variants of the standard reduction methods with a STREAMS const generic parameter. By tuning up this parameter, you divide the iterator reduction work across more and more instruction streams, exposing more and more instruction-level parallelism for the compiler and hardware to take advantage of.

This is effectively loop unrolling, but instead of making your code unreadable by manually implementing the operation yourself, you let IteratorILP do it for you by just providing it with the tuning parameter it needs.

So to reuse the above example…

use iterator_ilp::IteratorILP;
let sum = floats.iter().sum_ilp::<16, f32>();

…would sum a slice of floating-point numbers using 16 independent instruction streams, achieving a more respectable throughput of 17 billion floating-point sums per second on an Intel i9-10900 CPU running at 2.5 GHz. This corresponds to 6.8 additions per CPU cycle, which is reasonable when considering that the hardware can do 16 additions per second on correctly aligned SIMD data, but here the data is not correctly aligned and hence reaching about half the hardware throughput is expected.

§How many instruction streams do I need?

The number of independent instruction streams you need to reach peak hardware performance depends on many factors:

  • What hardware you are running on
  • What hardware features (e.g. SIMD instruction set extensions) are used by the compiled program
  • What type of data you are manipulating
  • How complex your input iterator and the reduction operation are

Further, you cannot just tune the number of streams up indefinitely because managing more instruction streams requires more hardware resources (CPU registers, instruction cache…), and at some point you will run out of these scarce resources and your runtime performance will drop. Even before that, some internal compiler optimizer code complexity threshold may be reached at which point the compiler will decide to stop optimizing the code of individual instruction streams as much at it would optimize that of a normal reduction, which will reduce performance as well.

To give orders of magnitude, simple reductions, like the floating point sum discussed above, can benefit from being spread over 16 instruction streams or more on some hardware (e.g. x86 CPUs with AVX enabled), while complex reductions (e.g. those using manually vectorized data) may not benefit from more than 2 streams, or could even already exhibit decreased performance in that configuration.

Since the rules for determining the right number of streams are so complex, and involve information unknown to this crate, we leave it up to you to tune this parameter. An empirical approach is advised: take a workload whose performance you care about, benchmark it with various numbers of streams on a broad range of target configurations, and find out the right compromise for you (which may be a hardware-dependent compromise selected via #[cfg()] attributes and/or runtime detection if needed).