Crate iterator_ilp
This crate implements an Iterator extension that provides instruction-parallel reductions.
§Motivation
Have you ever wondered why Iterator::sum() performs so poorly on floating-point data?
On my machine, a benchmark that sums the elements of a Vec<f32> small enough to fit in the CPU's L1d cache (a precondition for maximal computation performance) processes about a billion numbers per second. This may seem reasonable until you realize that modern CPUs have multi-GHz clock rates, can process multiple instructions per clock cycle, and can sum multiple floating-point numbers per instruction.
Then you come to the realization that the orders of magnitude aren’t right.
The problem lies not in the implementation of Iterator::sum(), but in its definition. This code…
let sum = floats.iter().sum::<f32>();
…correctly compiles down to the same assembly as that loop…
let mut sum = 0.0;
for &float in &floats {
    sum += float;
}
…but that loop itself isn’t right for modern hardware, because it does not expose enough instruction-level parallelism (ILP).
To give some context, the Rust compiler does not allow itself to reorder floating-point operations with respect to what the user wrote. This is a good thing in general because floating-point arithmetic is not associative, which means such optimizations would make program output nondeterministic (it depends on what compiler optimizations were applied) and could break the numerical stability of some trickier algorithms.
But in the case of the loop above, it also means that whatever optimizations are applied, the final code must only use one accumulator, and sum the first number into that accumulator, then the second number, then the third one…
Doing so turns our whole program into a gigantic dependency chain of scalar floating-point operations, with no opportunities for compilers or hardware to extract parallelism. Without parallelism, hardware capabilities for small-scale parallelism go to waste, and execution speed becomes limited by the CPU’s addition latency (how much time it takes to compute one addition) rather than its addition throughput (how many unrelated additions it can compute per second).
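To make this concrete, here is a minimal sketch (not this crate's actual implementation) of what exposing ILP by hand looks like: four independent accumulators whose additions carry no data dependency on each other, merged only at the end. Note that this changes the summation order, which is exactly what the compiler is not allowed to do on its own.

```rust
// Sketch only: sum a slice with four independent accumulators so that
// superscalar hardware can overlap the additions of different streams.
fn sum_4_streams(floats: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let mut chunks = floats.chunks_exact(4);
    for chunk in &mut chunks {
        // These four additions target distinct accumulators, so they do
        // not form a single dependency chain and can execute in parallel.
        for i in 0..4 {
            acc[i] += chunk[i];
        }
    }
    // Handle the 0..=3 leftover elements, then merge the accumulators.
    let tail: f32 = chunks.remainder().iter().sum();
    acc.iter().sum::<f32>() + tail
}

fn main() {
    let floats = vec![1.0f32; 1000];
    assert_eq!(sum_4_streams(&floats), 1000.0);
    println!("{}", sum_4_streams(&floats));
}
```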
This problem is not unique to Iterator::sum(), or even to floating-point data. All iterator methods that perform data reduction (take an iterator of elements and produce a single result) are affected by this problem to some degree. All it takes is an operation whose semantics depend on the number of observed iterator items, or on the order in which operations are performed, and the compiler will generate bad code.
And since most of these Iterator methods are documented to perform their reduction in a specific way, this problem cannot be solved by improving the standard library's Iterator implementation, at least not without breaking the API of Iterator, which would be Very Bad (tm) and is thus highly unlikely to happen.
§What this crate provides
IteratorILP is an Iterator extension trait that can be implemented for all iterators of known length, and provides variants of the standard reduction methods with a STREAMS const generic parameter. By tuning up this parameter, you divide the iterator reduction work across more and more instruction streams, exposing more and more instruction-level parallelism for the compiler and hardware to take advantage of.

This is effectively loop unrolling, but instead of making your code unreadable by manually implementing the unrolled operation yourself, you let IteratorILP do it for you by just providing it with the tuning parameter it needs.
So to reuse the above example…
use iterator_ilp::IteratorILP;
let sum = floats.iter().sum_ilp::<16, f32>();
…would sum a slice of floating-point numbers using 16 independent instruction streams, achieving a more respectable throughput of 17 billion floating-point sums per second on an Intel i9-10900 CPU running at 2.5 GHz. This corresponds to 6.8 additions per CPU cycle, which is reasonable considering that this hardware can do 16 f32 additions per cycle on correctly aligned SIMD data; here the data is not correctly aligned, so reaching about half of the peak hardware throughput is expected.
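If you want to observe this effect on your own machine without pulling in any crate, here is a self-contained toy comparison between one accumulator and a hand-written 16-stream sum. This is an illustrative sketch, not a rigorous benchmark: absolute timings vary across hardware, and an unoptimized (debug) build may show no speedup at all.

```rust
use std::time::Instant;

// Baseline: one accumulator, one long dependency chain.
fn sum_1_stream(v: &[f32]) -> f32 {
    v.iter().sum()
}

// 16 independent accumulators, merged at the end.
fn sum_16_streams(v: &[f32]) -> f32 {
    let mut acc = [0.0f32; 16];
    let mut chunks = v.chunks_exact(16);
    for chunk in &mut chunks {
        for i in 0..16 {
            acc[i] += chunk[i];
        }
    }
    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
}

fn main() {
    // 4096 f32s = 16 KiB, small enough for a typical L1d cache.
    let v = vec![1.0f32; 4096];

    let start = Instant::now();
    let a = sum_1_stream(&v);
    let t1 = start.elapsed();

    let start = Instant::now();
    let b = sum_16_streams(&v);
    let t16 = start.elapsed();

    // Summation order differs, but for this uniform input both agree exactly.
    assert_eq!(a, b);
    println!("1 stream: {t1:?}, 16 streams: {t16:?}");
}
```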
§How many instruction streams do I need?
The number of independent instruction streams you need to reach peak hardware performance depends on many factors:
- What hardware you are running on
- What hardware features (e.g. SIMD instruction set extensions) are used by the compiled program
- What type of data you are manipulating
- How complex your input iterator and the reduction operation are
Further, you cannot just tune the number of streams up indefinitely, because managing more instruction streams requires more hardware resources (CPU registers, instruction cache…), and at some point you will run out of these scarce resources and runtime performance will drop. Even before that, some internal code complexity threshold of the compiler's optimizer may be reached, at which point the compiler will stop optimizing the code of individual instruction streams as aggressively as it would optimize a normal reduction, which also reduces performance.
To give orders of magnitude, simple reductions, like the floating point sum discussed above, can benefit from being spread over 16 instruction streams or more on some hardware (e.g. x86 CPUs with AVX enabled), while complex reductions (e.g. those using manually vectorized data) may not benefit from more than 2 streams, or could even already exhibit decreased performance in that configuration.
Since the rules for determining the right number of streams are so complex, and involve information unknown to this crate, we leave it up to you to tune this parameter. An empirical approach is advised: take a workload whose performance you care about, benchmark it with various numbers of streams on a broad range of target configurations, and find the right compromise for you (which may be a hardware-dependent compromise, selected via #[cfg()] attributes and/or runtime detection if needed).
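As a sketch of such a hardware-dependent compromise, the stream count could be selected at compile time with a cfg-gated constant. The cfg conditions and the values 16 and 4 below are illustrative placeholders, not benchmarked recommendations; substitute the numbers your own measurements produce.

```rust
// Illustrative only: pick a stream count based on the compiled-in
// target features. Replace the values with your own benchmark results.
#[cfg(target_feature = "avx")]
const STREAMS: usize = 16;
#[cfg(not(target_feature = "avx"))]
const STREAMS: usize = 4;

fn main() {
    // With the IteratorILP trait in scope, one would then write e.g.
    // `floats.iter().sum_ilp::<STREAMS, f32>()`.
    assert!(STREAMS == 16 || STREAMS == 4);
    println!("using {STREAMS} instruction streams");
}
```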
Structs§
- Manual implementation of TrustedLowerBound for an iterator
Traits§
- Iterator extension that provides instruction-parallel reductions
- An iterator that reports an accurate lower bound using size_hint()