§orx-parallel
High performance, configurable and expressive parallel computation library.
- Parallel Computation by Iterators
- Parallelizable Collections
- Performance and Benchmarks
- Fallible Parallel Iterators
- Using Mutable Variables
- Configurations
- Runner: Pools and Executors
- Contributing
§Parallel Computation by Iterators
Parallel computation is defined using the parallel iterator trait ParIter. The goal is to convert an expressive sequential program into an efficient parallel program simply by replacing iter with par, and into_iter with into_par.
The following is a naive traveling salesperson algorithm which randomly generates sequences and picks the one with the minimum duration as the best tour. The example demonstrates chaining of the very common and useful map, filter and reduce (min_by_key) operations. Notice that the only difference between the sequential and parallel programs is the par() call.
```rust
use orx_parallel::*;
use rand::prelude::*;

struct Tour(Vec<usize>);

impl Tour {
    fn random(n: usize) -> Self {
        let mut cities: Vec<_> = (0..n).collect();
        cities.shuffle(&mut rand::rng());
        Self(cities)
    }

    fn not_in_standard_order(&self) -> bool {
        self.0.iter().enumerate().any(|(i, c)| i != *c)
    }

    fn duration(&self) -> usize {
        let mut total = 0;
        let links = self.0.iter().zip(self.0.iter().skip(1));
        for (a, b) in links {
            total += (*a as i64 - *b as i64).abs() as usize;
        }
        total
    }
}

let num_tours = 1_000_000;
let num_cities = 10;

// sequential
let best_tour = (0..num_tours)
    .map(|_| Tour::random(num_cities))
    .filter(|t| t.not_in_standard_order())
    .min_by_key(|t| t.duration())
    .unwrap();

// parallel
let best_tour = (0..num_tours)
    .par() // parallelized !!
    .map(|_| Tour::random(num_cities))
    .filter(|t| t.not_in_standard_order())
    .min_by_key(|t| t.duration())
    .unwrap();
```
§Parallelizable Collections
Inputs that can be used in parallel computations can be categorized into three groups:
- i. directly parallelizable collections
- ii. parallelization of any iterator
- iii. parallelization of any collection
§i. Directly Parallelizable Collections
These are collections which are parallelized by utilizing their specific structure to achieve high performance.
This crate provides direct implementations for std collections; the table below lists the current set of direct implementations.
Type | Over References &T | Over Mut References &mut T | Over Owned Values T |
---|---|---|---|
v: Vec<T> | v.par() | v.par_mut() | v.into_par() |
v: VecDeque<T> | v.par() | | v.into_par() |
s: &[T] | s.par(), s.into_par() | | |
s: &mut [T] | | s.into_par() | |
r: Range<usize> | | | r.par(), r.into_par() |
Implementations of custom collections belong to their respective crates as they most likely require access to internals. Currently, the following collections are known to allow parallel computation using this crate:
│ SplitVec │ FixedVec │ LinkedList │ Tree │ ImpVec │
Since these implementations are particularly optimized for the collection type, it is preferable to define the parallel computation starting from the collection whenever available. In other words, for a directly parallelizable collection col, col.par().map(_).filter(_).reduce(_) is a better approach than col.iter().iter_into_par().map(_).filter(_).reduce(_), which will be explained in the next subsection.
Extensibility: note that any input collection or generator that implements IntoConcurrentIter automatically implements IntoParIter. Therefore, a new collection can be parallelized provided that its concurrent iterator is implemented.
In addition, there exist the following special parallel iterators that can be directly created from the collection.
Type | Method | Definition |
---|---|---|
v: Vec<T> | v.par_drain(range) | Parallel counterpart of v.drain(range) |
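A usage sketch, assuming par_drain mirrors Vec::drain over the given range:

```rust
use orx_parallel::*;

let mut v: Vec<i32> = (0..1000).collect();

// drain the first half in parallel while transforming the drained elements
let doubled: Vec<i32> = v.par_drain(0..500).map(|x| x * 2).collect();

assert_eq!(v.len(), 500);
assert_eq!(doubled.len(), 500);
```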
§ii. Parallelization of Any Iterator
Any arbitrary sequential Iterator implements the IterIntoParIter trait and can be converted into a parallel iterator using the iter_into_par method. As demonstrated below, the item type of the Iterator can also be a mutable reference.
```rust
use orx_parallel::*;
use std::collections::HashMap;

let mut map: HashMap<_, _> = (0..1024).map(|x| (x.to_string(), x)).collect();

// mutable parallel iterator from an Iterator
let par = map.values_mut().iter_into_par();
par.filter(|x| **x != 42).for_each(|x| *x *= 0);

// parallel iterator from an Iterator
assert_eq!(map.values().iter_into_par().sum(), 42);
```
This is powerful since it allows us to parallelize all iterables, including pretty much every collection and more. On the other hand, being a generic implementation without collection-specific optimizations, the parallelized computation might underperform its sequential counterpart if the work to be done on each input element is insignificant. For instance, with i being an arbitrary iterator of numbers, i.sum() will most likely be faster than i.iter_into_par().sum().

This being said, ParIter takes advantage of certain optimizations, such as buffering and chunk size optimization, in order to improve performance. Therefore, whenever the computation on the iterator elements is more involved than just returning them or adding numbers, we can benefit from parallelization. The respective section of the benchmarks presents significant improvements achieved consistently.
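For illustration, a minimal sketch where the per-element work (a naive primality test, a hypothetical helper here) is heavy enough for parallelization to pay off:

```rust
use orx_parallel::*;

// hypothetical helper: naive primality test, non-trivial work per element
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

// an arbitrary sequential iterator, not a directly parallelizable collection
let iter = (0..100_000u64).map(|x| 3 * x + 1);

// per-element work is significant, so iter_into_par can pay off
let num_primes = iter.iter_into_par().filter(|x| is_prime(*x)).count();
```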
§iii. Parallelization of Any Collection
Lastly, consider a collection which does not provide a direct concurrent iterator implementation. This might be our custom collection, say MyCollection; or an external collection without a concurrent iterator implementation, such as HashSet<T>.

There are two methods to parallelize computations over such collections:

- (ii) parallelize using the collection’s iterator, or
- (i) collect the elements into a vector and then parallelize the work over the vector.

The following table demonstrates these methods for the HashSet; however, they are applicable to any collection with iter and into_iter methods.
Type | Method | Over References &T | Over Owned Values T |
---|---|---|---|
h: HashSet<T> | ii | h.iter().iter_into_par() | h.into_iter().iter_into_par() |
| i | h.iter().collect::<Vec<_>>().par() | h.into_iter().collect::<Vec<_>>().into_par() |
Note that each approach can be more efficient in different scenarios. For large elements, (ii) might be preferred to avoid allocation of the vector. For insignificant tasks to be performed on each element, (i) might be preferred to take full benefit of vector-specific optimizations.
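A quick sketch of both approaches for a HashSet; as in the earlier example, it assumes that references to numbers can be summed:

```rust
use orx_parallel::*;
use std::collections::HashSet;

let h: HashSet<u64> = (0..1024).collect();

// (ii) parallelize the collection's iterator directly
let sum_ii: u64 = h.iter().iter_into_par().sum();

// (i) collect into a vector first, then use vector-specific parallelization
let v: Vec<u64> = h.iter().copied().collect();
let sum_i: u64 = v.into_par().sum();

assert_eq!(sum_ii, sum_i);
```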
§Performance and Benchmarks
You may find some sample parallel programs in the examples directory. These examples allow expressing parallel computations as iterator method compositions and running quick experiments with different approaches. The examples use GenericIterator. As the name suggests, it is a generalization of the sequential iterator, rayon’s parallel iterator and orx-parallel’s parallel iterator, and hence allows for convenient experiments. You may play with the code, update the tested computations and run these examples by including the generic_iterator feature, such as:

cargo run --release --features generic_iterator --example benchmark_collect -- --len 123456 --num-repetitions 10

Actual benchmark files are located in the benches directory. The tables below report average execution times in microseconds. The numbers in parentheses represent the ratio of execution time to that of the sequential computation, which is used as the baseline (1.00). Parallelized executions of all benchmarks are carried out with default settings.

Computations are separated into three categories with respect to how the iterator is consumed: collect, reduce and early-exit. Further, two additional categories are created to test the parallelization of arbitrary iterators (ii) and the flexibility in composing computations.
§Collect
In this group of benchmarks, outputs of parallel computations are collected into vectors. Details of the iterator chains and tested functions can be found in the respective benchmark files (you may use the link in the file column).
(s) Outputs can also be collected into a SplitVec, which can provide further improvements by avoiding memory copies. Note that a split vector provides constant time random access; and despite being split into fragments, it asymptotically inherits the advantages of contiguous vectors.
file | computation | sequential | rayon | orx-parallel | orx-parallel (s) |
---|---|---|---|---|---|
⇨ | .filter(_).collect() | 2.74 (1.00) | 12.14 (4.43) | 1.80 (0.66) | 1.87 (0.68) |
⇨ | .filter_map(_).collect() | 6.96 (1.00) | 13.28 (1.91) | 3.51 (0.50) | 3.35 (0.48) |
⇨ | .flat_map(_).collect() | 77.93 (1.00) | 239.83 (3.08) | 31.73 (0.41) | 23.79 (0.31) |
⇨ | .map(_).filter(_).collect() | 19.24 (1.00) | 9.99 (0.52) | 6.21 (0.32) | 5.98 (0.31) |
⇨ | .map(_).collect() | 18.08 (1.00) | 7.98 (0.44) | 5.28 (0.29) | 6.09 (0.34) |
⇨ | .map(_).filter(_).collect() ☆ | 19.41 (1.00) | 7.54 (0.39) | 5.90 (0.30) | 5.77 (0.30) |
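A sketch of the (s) column above; it assumes the SplitVec type from the orx-split-vec crate (not re-exported by this crate) and that it implements ParCollectInto as the note above states:

```rust
use orx_parallel::*;
use orx_split_vec::SplitVec;

let inputs: Vec<u64> = (0..1024).collect();

// collect into a SplitVec instead of a Vec to avoid memory copies during growth
let out: SplitVec<u64> = inputs.par().map(|x| x * 2).filter(|x| x % 3 == 0).collect();
assert_eq!(out.len(), 342);
```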
§Reduce
In this group, instead of collecting outputs, the results are reduced to a single value. Some common reductions are sum, count, min, etc.
file | computation | sequential | rayon | orx-parallel |
---|---|---|---|---|
⇨ | .map(_).filter(_).reduce(_) | 14.15 (1.00) | 7.55 (0.53) | 3.86 (0.27) |
⇨ | .map(_).reduce(_) | 13.81 (1.00) | 6.25 (0.45) | 4.15 (0.30) |
⇨ | .reduce(_) | 0.97 (1.00) | 10.58 (10.91) | 0.90 (0.93) |
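A usage sketch (not one of the benchmark computations), assuming reduce returns an Option like std’s Iterator::reduce:

```rust
use orx_parallel::*;

let inputs: Vec<u64> = (0..1024).collect();

// reduce the mapped and filtered values to a single value
let max = inputs
    .par()
    .map(|x| x * 2)
    .filter(|x| x % 3 == 0)
    .reduce(|a, b| a.max(b));

assert_eq!(max, Some(2046));
```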
§Find
Here, computations that allow for early exit or short-circuiting are investigated. As an example, experiments on the find method are presented; methods such as find_any, any or all lead to similar results.
file | computation | sequential | rayon | orx-parallel |
---|---|---|---|---|
⇨ | .flat_map(_).find(_) | 160.24 (1.00) | 127.37 (0.79) | 27.66 (0.17) |
⇨ | .map(_).filter(_).find(_) | 43.01 (1.00) | 11.14 (0.26) | 8.61 (0.20) |
⇨ | .find(_) | 2.94 (1.00) | 12.85 (4.37) | 1.54 (0.52) |
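A usage sketch; it assumes that, under the default Ordered iteration, find returns the first element satisfying the predicate, just like its sequential counterpart:

```rust
use orx_parallel::*;

let inputs: Vec<u64> = (0..10_000).collect();

// early exit: threads stop as soon as the result is found
let first = inputs.par().map(|x| x * x).find(|x| *x > 1_000);

assert_eq!(first, Some(1024)); // 32 * 32
```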
§Parallelization of Arbitrary Iterators
As discussed in ii, parallelization of regular iterators is a powerful feature. The benchmarks in this category demonstrate that improvements can be achieved provided that the computation on elements is not insignificant.
file | computation | sequential | rayon | orx-parallel |
---|---|---|---|---|
⇨ | …long_chain.collect() | 19.72 (1.00) | 32.54 (1.65) | 6.12 (0.31) |
⇨ | .map(_).filter(_).reduce(_) | 15.17 (1.00) | 118.28 (7.80) | 4.98 (0.33) |
⇨ | .map(_).filter(_).find(_) | 42.58 (1.00) | 63.60 (1.49) | 7.98 (0.19) |
§Parallel Mutable Iterators
In this group, we investigate the performance of parallel computations which mutate the input elements. In the benchmarks, we filter elements and update the ones which satisfy the given criterion within the for_each call.
file | computation | sequential | rayon | orx-parallel |
---|---|---|---|---|
⇨ | slice.par_mut().filter(_).for_each(_) | 62.61 (1.00) | 14.08 (0.22) | 8.45 (0.13) |
⇨ | iter.iter_into_par().filter(_).for_each(_) | 77.63 (1.00) | 78.69 (1.01) | 10.03 (0.13) |
§Composition
In the final category of benchmarks, the impact of long chains of transformations on computation time is tested. You may see such example long chains in the benchmark computations below, where long_chain is shorthand for .map(map1).filter(filter1).map(map2).filter(filter2).map(map3).map(map4).filter(filter4). Notice that the caller could actually shorten the chain by composing some of the calls; an obvious one is the .map(map3).map(map4) call, which could have been a single call like map(map3-then-map4). However, this is not always possible as the computation might be conditionally built up in stages. Further, breaking transformations into smaller pieces helps in achieving more descriptive computation definitions.
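A sketch of such a chain with hypothetical closures standing in for map1..map4 and filter1..filter4:

```rust
use orx_parallel::*;

// hypothetical closures standing in for the benchmark's map and filter functions
let map1 = |x: usize| x + 1;
let filter1 = |x: &usize| x % 2 == 0;
let map2 = |x: usize| x * 3;
let filter2 = |x: &usize| *x > 6;
let map3 = |x: usize| x - 1;
let map4 = |x: usize| x / 2;
let filter4 = |x: &usize| *x < 1_000;

let result: Vec<usize> = (0..1024usize)
    .par()
    .map(map1)
    .filter(filter1)
    .map(map2)
    .filter(filter2)
    .map(map3)
    .map(map4)
    .filter(filter4)
    .collect();
```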
The results suggest that the functions are efficiently composed by the parallel iterator.
file | computation | sequential | rayon | orx-parallel |
---|---|---|---|---|
⇨ | …long_chain.collect() | 14.27 (1.00) | 6.33 (0.44) | 3.80 (0.27) |
⇨ | …long_chain.reduce(_) | 15.08 (1.00) | 6.10 (0.40) | 4.03 (0.27) |
§Fallible Parallel Iterators
We enjoy Rust’s ? operator when working with fallible computations. It allows us to focus on and code only the success path. A failure at any step of the computation leads to a short-circuit that immediately returns from the function.
```rust
fn try_to_parse() -> Result<i32, std::num::ParseIntError> {
    let x: i32 = "123".parse()?; // x = 123
    let y: i32 = "24a".parse()?; // returns an Err() immediately
    Ok(x + y) // doesn't run
}
```
However, we do not have this convenience while working with iterators; collect is the only exception. Normally, it allows us to pick the container to collect the items into.
```rust
let into_vec: Vec<usize> = (0..10).collect();
let into_set: std::collections::HashSet<usize> = (0..10).collect();
```
But it also does something exceptional when the item type is a result:

- The first computation below is similar to the above; it simply collects each element into the container, which is defined as a vector.
- The second computation, however, is fundamentally different. It collects the elements iff all elements are of the Ok variant. Further, it short-circuits the computation as soon as an Err is observed. This is exactly how the ? operator behaves.
```rust
let into_vec_of_results: Vec<Result<usize, char>> = (0..10).map(|x| Ok(x)).collect();
let into_result_of_vec: Result<Vec<usize>, char> = (0..10).map(|x| Ok(x)).collect();
```
Although convenient, this change in the behavior of the collect computation might be considered unexpected, at least by me. Further, we do not have short-circuiting methods for computations other than collect. For instance, it is not as convenient to compute the sum of the numbers of an iterator provided that all elements are of the Ok variant, and to receive the error otherwise. In general, the requirement to exit early in fallible computations is common and important, both for performance and convenience reasons.
For parallel computation, this crate proposes to explicitly transform an iterator with fallible elements into a fallible parallel iterator.
```rust
use orx_parallel::*;
use std::num::ParseIntError;

let collect: Result<Vec<i32>, ParseIntError> = vec!["7", "2", "34"]
    .into_par()
    .map(|x| x.parse::<i32>())
    .into_fallible_result() // <-- explicit transformation to fallible iterator
    .collect();
```
Currently, there exist two fallible parallel iterators, ParIterResult and ParIterOption. The transformation is as follows:

Regular Iterator | Transformation Method | Fallible Iterator |
---|---|---|
ParIter<Item=Result<T, E>> | into_fallible_result() | ParIterResult<Item=T, Error=E> |
ParIter<Item=Option<T>> | into_fallible_option() | ParIterOption<Item=T> |
After converting into a fallible iterator, each chained transformation is based on the success item type. Similar to the ? operator, this allows us to focus on the success path, while any error case is handled by early returning from the iterator with the error.
```rust
use orx_parallel::*;
use std::num::ParseIntError;

let sum: Result<i32, ParseIntError> = vec!["7", "2", "34"]
    .into_par()
    .map(|x| x.parse::<i32>()) // Item = Result<i32, ParseIntError>
    .into_fallible_result() // we are only working with the success type after this point
    .map(|x| x + 1)
    .filter(|x| x % 2 == 0)
    .flat_map(|x| [x, x + 1, x + 2])
    .sum(); // returns Result, rather than i32
assert_eq!(sum, Ok(27));

let sum: Result<i32, ParseIntError> = vec!["7", "!!!", "34"]
    .into_par()
    .map(|x| x.parse::<i32>())
    .into_fallible_result()
    .map(|x| x + 1)
    .filter(|x| x % 2 == 0)
    .flat_map(|x| [x, x + 1, x + 2])
    .sum();
assert!(sum.is_err());
```
As demonstrated above, not only collect but all computation methods return a Result.
To summarize:

- We can use all iterator methods with fallible iterators as well.
- The transformations are based on the success type. All computations return a Result:
  - if all computations succeed, it is Ok of the value that an infallible iterator would return;
  - it is the first discovered Err if any of the computations fails.
- Finally, all computations immediately return in case of an error.
The optional fallible iterator behaves exactly the same, except that None is treated as the failure case.
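A sketch of the optional variant, assuming ParIterOption mirrors ParIterResult with None as the failure case:

```rust
use orx_parallel::*;

let sum: Option<i32> = vec!["7", "2", "34"]
    .into_par()
    .map(|x| x.parse::<i32>().ok()) // Item = Option<i32>
    .into_fallible_option()
    .map(|x| x + 1)
    .sum();
assert_eq!(sum, Some(46)); // 8 + 3 + 35

let sum: Option<i32> = vec!["7", "!!!", "34"]
    .into_par()
    .map(|x| x.parse::<i32>().ok())
    .into_fallible_option()
    .map(|x| x + 1)
    .sum();
assert_eq!(sum, None);
```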
§Using Mutable Variables
Iterator methods allow us to define expressive computations using closures. For sequential iterators, these closures are often FnMut, allowing them to mutably capture variables from the scope. Clearly, this is not possible for parallel iterators, as it would lead to race conditions when multiple threads simultaneously try to access the captured mutable variable. Therefore, the parallel counterparts of the iterator methods often accept closures implementing Fn.

However, mutable variables are necessary for certain programs. A very common example is computations requiring random number generators, which are stateful and can create random numbers only through a mutable reference.

The using transformation aims to provide a general and safe solution to this problem as follows:

- One mutable variable per thread; hence, no race conditions.
- The mutable variable is explicitly and mutably available to all iterator methods.
The following two examples demonstrate the idea and usage:

- using takes a closure with the thread index as its argument, describing how the mutable variable should be created for each thread.
- using_clone, on the other hand, takes the value to be used as the mutable variable and shares a clone of it with each thread (just a shorthand for using(|_| sender.clone())).

In either case, there will be exactly n mutable variables created, provided that the parallel computation uses n threads.
```rust
input
    .into_par()
    .using(|t_idx| ChaCha20Rng::seed_from_u64(42 * t_idx as u64)) // <-- explicit using
    .map(|_, i| fibonacci((i % 50) + 1) % 10) // rng: &mut ChaCha20Rng
    .filter(|rng, _: &u64| rng.random_bool(0.4)) // is accessible for
    .map(|rng, i: u64| rng.random_range(0..i)) // all iterator methods
    .sum()
```
```rust
use std::sync::mpsc::channel;

let (sender, receiver) = channel();

(0..5)
    .into_par()
    .using_clone(sender)
    .for_each(|s, x| s.send(x).unwrap());

let mut res: Vec<_> = receiver.iter().collect();
```
Further details can be found in using.md.
§Configurations
§Configuration per Computation
Each parallel computation is governed by two straightforward parameters.
- NumThreads is the degree of parallelization. This is a capacity parameter used to limit the resources that can be used by the computation.
  - Auto: All available threads can be used, but not necessarily.
  - Max(n): The computation can spawn at most n threads.
  - Max(1): Falls back to sequential execution on the main thread.
- ChunkSize represents the number of elements a parallel worker pulls and processes every time it becomes idle. This is an optimization parameter that can be tuned to balance the overhead of parallelization against the cost of task heterogeneity.
  - Auto: Lets the parallel executor decide dynamically; achieves high performance in general and can be used unless we have useful computation-specific knowledge.
  - Exact(c): Chunks will have exactly c elements; gives complete control to the caller. Useful when we have very good knowledge of the computation or want to tune it for certain data.
  - Min(c): Every chunk will have at least c elements. The parallel executor, however, might decide to pull more if each computation is handled very fast.
See also the last parameter, IterationOrder, with variants Ordered (default) and Arbitrary, which is another useful optimization parameter for specific use cases.
When omitted, NumThreads::Auto and ChunkSize::Auto will be used. Configuring a parallel computation is straightforward and specific to the computation, rather than going through a global setting.
```rust
use orx_parallel::*;
use std::num::NonZeroUsize;

let n = 1024;

_ = (0..n).par().sum(); // NumThreads::Auto & ChunkSize::Auto
_ = (0..n).par().num_threads(4).sum(); // <= 4 threads
_ = (0..n).par().num_threads(1).sum(); // sequential
_ = (0..n).par().num_threads(0).sum(); // shorthand for NumThreads::Auto
_ = (0..n).par().chunk_size(64).sum(); // chunks of exactly 64 elements

let c = ChunkSize::Min(NonZeroUsize::new(16).unwrap());
_ = (0..n).par().chunk_size(c).sum(); // chunks of at least 16 elements

_ = (0..n).par().num_threads(4).chunk_size(16).sum(); // set both params
```
Note that NumThreads::Max(1) executes the computation sequentially.

This gives the consumer, who actually executes the defined computation, complete control to:

- execute in parallel with the given configuration, or
- execute sequentially, or
- execute in parallel with any number of threads that it decides.

This is guaranteed by the fact that both the consuming computation calls and the configuration methods require ownership (self) of the iterator.
§Global Configuration
Additionally, the maximum number of threads that can be used by parallel computations can be globally bounded by the environment variable ORX_PARALLEL_MAX_NUM_THREADS. Please see the corresponding example for details.
§Runner: Pools and Executors
This crate defines parallel computation by combining two basic components.
Pulling inputs

- Pulling inputs in parallel is achieved through ConcurrentIter. Concurrent iterator implementations are lock-free, efficient and support the pull-by-chunks optimization to reduce the parallelization overhead. A thread can pull any number of inputs from the concurrent iterator every time it becomes idle. This provides the means to decide on the chunk sizes dynamically.
- Furthermore, this allows reducing the overhead of creating tasks. To illustrate: provided that the computation will be handled by n threads, a closure holding a reference to the input concurrent iterator is defined to represent the computation. This same closure is passed to n threads; i.e., n spawn calls are made. Each of these threads keeps pulling elements from the input until the computation is completed, without requiring another task to be defined.
Writing outputs

- When we collect results, writing outputs is handled using lock-free containers such as ConcurrentBag and ConcurrentOrderedBag, which aim for high performance collection of results.
There are two main decisions to be taken while executing these components:
- how many threads do we use?
- what is the chunk size; i.e., how many input items does a thread pull each time?
A ParallelRunner is a combination of a ParThreadPool and a ParallelExecutor, which are responsible for these decisions, respectively.
§ParThreadPool: number of threads
The ParThreadPool trait generalizes thread pools that can be used for parallel computations. This allows the parallel computation to be generic over thread pools.

When not explicitly set, the DefaultPool is used:

- When the std feature is enabled, the default pool is the StdDefaultPool. In other words, all available native threads can be used by the parallel computation. This number can be globally bounded by the “ORX_PARALLEL_MAX_NUM_THREADS” environment variable when set.
- When working in a no-std environment, the default pool is the SequentialPool. As the name suggests, this pool executes the parallel computation sequentially on the main thread. It can be considered a placeholder to be overwritten by the with_pool or with_runner methods to achieve parallelism.
Note that the thread pool defines the resource, or the upper bound. This upper bound can be tightened further by the num_threads configuration. Finally, the parallel executor might choose not to use all available threads if it decides that the computation is small enough.
To overwrite the defaults and explicitly set the thread pool to be used for the computation, the with_pool or with_runner methods are used.
```rust
use orx_parallel::*;

let inputs: Vec<_> = (0..42).collect();

// uses the DefaultPool;
// assuming "std" is enabled, StdDefaultPool will be used; i.e., native threads
let sum = inputs.par().sum();

// equivalent to:
let sum2 = inputs.par().with_pool(StdDefaultPool::default()).sum();
assert_eq!(sum, sum2);

#[cfg(feature = "scoped_threadpool")]
{
    let mut pool = scoped_threadpool::Pool::new(8);
    // uses the scoped_threadpool::Pool created with 8 threads
    let sum2 = inputs.par().with_pool(&mut pool).sum();
    assert_eq!(sum, sum2);
}

#[cfg(feature = "rayon-core")]
{
    let pool = rayon_core::ThreadPoolBuilder::new()
        .num_threads(8)
        .build()
        .unwrap();
    // uses the rayon_core::ThreadPool created with 8 threads
    let sum2 = inputs.par().with_pool(&pool).sum();
    assert_eq!(sum, sum2);
}

#[cfg(feature = "yastl")]
{
    let pool = YastlPool::new(8);
    // uses the yastl::Pool created with 8 threads
    let sum2 = inputs.par().with_pool(&pool).sum();
    assert_eq!(sum, sum2);
}
```
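Building on the example above, the pool defines the upper bound on threads, while num_threads can tighten it per computation. A sketch, assuming the scoped_threadpool feature and that the configuration methods remain available after with_pool:

```rust
use orx_parallel::*;

#[cfg(feature = "scoped_threadpool")]
{
    let inputs: Vec<u64> = (0..1024).collect();
    let mut pool = scoped_threadpool::Pool::new(8);

    // the pool offers up to 8 threads; num_threads(4) bounds this computation to 4
    let _sum = inputs.par().with_pool(&mut pool).num_threads(4).sum();
}
```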
ParThreadPool implementations for several thread pools are provided in this crate as optional features (see the features section). Provided that the pool supports scoped computations, it is trivial to implement this trait in most cases (see the existing implementations for examples).

In most cases, rayon-core, scoped_threadpool and scoped-pool perform better than the others, and get close to the native-thread performance of the StdDefaultPool.

Since parallel computations are generic over the thread pools, performances can conveniently be compared for specific use cases. Such an example benchmark can be found in the collect_filter_map file. For quick tests, you may also use the example benchmark_pools.
§ParallelExecutor: chunk size
Once the thread pool provides the computation resources, it is the ParallelExecutor’s task to distribute the work to the available threads. As mentioned above, all threads receive exactly the same closure. This closure continues to pull elements from the input concurrent iterator and operate on them until all elements are processed.

The critical decision the parallel executor makes is the chunk size. Depending on the state of the computation, it can dynamically decide on the number of elements to pull from the input iterator. The tradeoff it tries to resolve is as follows:
- the larger the chunk size,
  - the smaller the parallelization overhead; but also
  - the larger the risk of imbalance in cases of heterogeneity.
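For illustration, a sketch that opts for small minimum chunks when per-element costs are heterogeneous; expensive is a hypothetical workload:

```rust
use orx_parallel::*;
use std::num::NonZeroUsize;

// hypothetical workload with heterogeneous per-element cost
fn expensive(x: u64) -> u64 {
    (0..(x % 100)).map(|i| i * i).sum()
}

let inputs: Vec<u64> = (0..10_000).collect();

// small minimum chunks let threads that finish early rebalance by pulling more
let c = ChunkSize::Min(NonZeroUsize::new(4).unwrap());
let _total: u64 = inputs.par().map(|x| expensive(*x)).chunk_size(c).sum();
```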
§Features
- std: This is a no-std crate, with std included as a default feature. Please use the --no-default-features flag for no-std use cases. The std feature enables StdDefaultPool, the default thread provider which uses native threads.
- rayon-core: This feature enables using rayon_core::ThreadPool for parallel computations.
- scoped_threadpool: This feature enables using scoped_threadpool::Pool.
- scoped-pool: This feature enables using scoped_pool::Pool.
- yastl: This feature enables using yastl::Pool.
- pond: This feature enables using pond::Pool.
- poolite: This feature enables using poolite::Pool.
§Contributing
Contributions are welcome! Please open an issue or create a PR if you:

- notice an error,
- have a question or think something could be improved,
- have an input collection or generator that needs to be parallelized,
- want to use a particular thread pool with parallel iterators,
- are having trouble representing a particular parallel computation with parallel iterators,
- or anything else :)

Finally, feel free to contact me if you are interested in optimizing the parallel runner to further improve performance, for example through dynamic optimization of chunk size decisions with respect to online collection and analysis of metrics.
§License
Dual-licensed under Apache 2.0 or MIT.
Re-exports§
pub use executor::DefaultExecutor;
pub use executor::ParallelExecutor;
pub use executor::ThreadExecutor;
pub use using::ParIterUsing;
pub use runner::DefaultPool;
pub use runner::DefaultRunner;
pub use runner::ParallelRunner;
pub use runner::RunnerWithPool;
pub use runner::SequentialPool;
pub use runner::PondPool;
pub use runner::StdDefaultPool;
pub use runner::YastlPool;
Modules§
- computational_variants: Module containing variants of parallel iterators.
- executor: Module defining the parallel runner trait and the default parallel runner.
- generic_iterator: Module defining the GenericIterator, which is a generalization over the sequential iterator, rayon’s parallel iterator and orx-parallel’s parallel iterator. This is particularly useful for running experiments and comparing results of computations with different methods.
- iter: Module for creating special iterators.
- runner: ParallelRunner for parallel execution and managing threads.
- using: Module defining parallel iterators with mutable access to values distributed to each thread.
Structs§
- Params: Parameters of a parallel computation.
Enums§
- ChunkSize: Represents the batch size of elements each thread will pull from the main iterator once it becomes idle again. It is possible to define a minimum or exact chunk size.
- IterationOrder: Order of parallel iteration, which might be Ordered (default) or Arbitrary.
- NumThreads: Represents the degree of parallelization. It is possible to define an upper bound on the number of threads to be used for the parallel computation. When set to 1, the computation will be executed sequentially without any overhead. In this sense, parallel iterators defined in this crate are a union of sequential and parallel execution.
Traits§
- IntoParIter: Trait to convert a source (collection or generator) into a parallel iterator, i.e., ParIter, using its into_par method.
- IterIntoParIter: Any regular iterator implements the IterIntoParIter trait, allowing it to be used as a parallel iterator, i.e., ParIter, by calling iter_into_par.
- ParCollectInto: Collection types into which the outputs of parallel computations can be collected.
- ParIter: Parallel iterator.
- ParIterOption: A parallel iterator for which the computation either completely succeeds, or fails and early exits with None.
- ParIterResult: A parallel iterator for which the computation either completely succeeds, or fails and early exits with an error.
- ParThreadPool: A thread pool that can be used for parallel computation.
- ParallelDrainableOverSlice: A type which can create a parallel draining iterator over any of its sub-slices.
- Parallelizable: Parallelizable types are those from which parallel iterators can be created multiple times using the par method, since this method call does not consume the source.
- ParallelizableCollection: A type implementing ParallelizableCollection is a collection owning the elements such that …
- ParallelizableCollectionMut: A type implementing ParallelizableCollectionMut is a collection owning the elements such that …
- Sum: Number that can be summed over.