# Fork Union 🍴
The `fork_union` library is a thread-pool for "Fork-Join" SIMT-style parallelism in Rust and C++.
It's quite different from most open-source thread-pool implementations, which are generally designed around heap-allocated "queues of tasks" synchronized by a mutex.
In C++, wrapping tasks into a `std::function` is expensive, as is growing the `std::queue` and locking the `std::mutex` under contention.
The same holds in Rust.
When you can avoid that overhead, you should.
OpenMP-like use-cases are the perfect example of that!

OpenMP, however, isn't great for fine-grained parallelism, where different pieces of your application logic need to work on different sets of threads.
This is where `fork_union` comes in with a minimalistic STL implementation of a thread-pool, avoiding dynamic memory allocations and exceptions on the hot path, and prioritizing lock-free and CAS-free user-space "atomics" over system calls.
## Usage
The `fork_union` API is dead-simple!
There is no nested parallelism, exception handling, or "futures promises".
The thread pool has just one core API - `broadcast`, to launch a callback on each thread.
The higher-level APIs for index-addressable tasks are:

- `for_n` - for individual evenly-sized tasks.
- `for_n_dynamic` - for individual unevenly-sized tasks.
- `for_slices` - for slices of evenly-sized tasks.

All three are available in both C++ and Rust.
### Usage in Rust
A minimal example may look like this (a sketch - the exact spawn and callback signatures may differ between library versions):

```rust
use fork_union as fu;

fn main() {
    // Spawn a pool with a fixed number of threads
    let pool = fu::spawn(2);
    // Run a callback on every thread in the pool
    pool.broadcast(|thread_index| {
        println!("Hello from thread # {}", thread_index + 1);
    });
}
```
Higher-level APIs distribute tasks across the threads in the pool; their shape is roughly the following (signatures are approximate):

```rust
// Evenly-sized tasks: one call per index
pool.for_n(100, |i| { /* process task `i` */ });
// Evenly-sized tasks, handed out in contiguous slices
pool.for_slices(100, |start, count| { /* process tasks [start, start + count) */ });
// Unevenly-sized tasks: threads "steal" the next index dynamically
pool.for_n_dynamic(100, |i| { /* process task `i` */ });
```
A safer `try_spawn_in` interface is recommended, using the Allocator API.
A more realistic example may look like this (a sketch - the pool type and constructor names are approximate):

```rust
#![feature(allocator_api)] // nightly: required for the Allocator API
use std::alloc::Global;
use std::error::Error;
use std::thread;
use fork_union as fu;

fn main() -> Result<(), Box<dyn Error>> {
    // One thread per hardware core, allocated via an explicit allocator,
    // with spawn failures reported as errors instead of panics
    let cores = thread::available_parallelism()?.get();
    let pool = fu::ThreadPool::try_spawn_in(cores, Global)?;
    pool.for_n(1000, |i| {
        // ... process task `i` ...
    });
    Ok(())
}
```
### Usage in C++
To integrate into your C++ project, either copy the `include/fork_union.hpp` file into your project, add a Git submodule, or use CMake.
For a Git submodule, run:

```sh
git submodule add https://github.com/ashvardanian/fork_union.git
```
Alternatively, using CMake's `FetchContent`:

```cmake
include(FetchContent)
FetchContent_Declare(
    fork_union
    GIT_REPOSITORY https://github.com/ashvardanian/fork_union
)
FetchContent_MakeAvailable(fork_union)
target_link_libraries(your_target PRIVATE fork_union::fork_union)
```
Then, include the header in your C++ code.
A minimal example may look like this (a sketch - the namespace and pool type names are approximate):

```cpp
#include <cstdio>
#include <thread>
#include <fork_union.hpp>

namespace fu = ashvardanian::fork_union;

int main() {
    // Spawn one worker per hardware core; failures are reported, not thrown
    fu::pool_t pool;
    if (!pool.try_spawn(std::thread::hardware_concurrency())) return 1;
    // Evenly split 100 tasks across the pool
    pool.for_n(100, [](std::size_t i) { std::printf("Task %zu\n", i); });
    return 0;
}
```

That's it.
## Why Not Use $X$
There are many other thread-pool implementations that are more feature-rich, but have different limitations and design goals:

- Modern C++: `taskflow/taskflow`, `progschj/ThreadPool`, `bshoshany/thread-pool`
- Traditional C++: `vit-vit/CTPL`, `mtrebi/thread-pool`
- Rust: `tokio-rs/tokio`, `rayon-rs/rayon`, `smol-rs/smol`

Those are not designed for the same OpenMP-like use-cases as `fork_union`.
Instead, they primarily focus on task queueing, which requires a lot more work.
### Locks and Mutexes
Unlike `std::atomic` operations, acquiring a `std::mutex` may require a system call, which makes it expensive to acquire and release under contention.
Its implementations generally have 2 executable paths:
- the fast path, where the mutex is not contended, where it first tries to grab the mutex via a compare-and-swap operation, and if it succeeds, it returns immediately.
- the slow path, where the mutex is contended, and it has to go through the kernel to block the thread until the mutex is available.
On Linux, the latter translates to a "futex" syscall, which is expensive.
### Memory Allocations
C++ has rich functionality for concurrent applications, like `std::future`, `std::packaged_task`, `std::function`, `std::queue`, `std::condition_variable`, and so on.
Most of those, I believe, are unusable in Big-Data applications, where you always operate in memory-constrained environments:

- The idea of raising a `std::bad_alloc` exception when there is no memory left, and just hoping that someone up the call stack will catch it, is simply not a great design idea for any Systems Engineering.
- The threat of having to synchronize ~200 physical CPU cores across 2-8 sockets, and potentially dozens of NUMA nodes, around a shared global memory allocator practically means you can't have predictable performance.

As we focus on a simpler parallelism model, we can avoid the complexity of allocating shared state, wrapping callbacks into heap-allocated "tasks", and a lot of other boilerplate.
Less work - more performance.
### Atomics and CAS
Once you get to the lowest-level concurrency primitives, you end up with `std::atomic` and a small set of hardware-supported atomic instructions.
Hardware implements them differently:

- x86 is built around the "Total Store Order" (TSO) memory consistency model and provides `LOCK` variants of `ADD` and `CMPXCHG`, which act as full-blown "fences" - no loads or stores can be reordered across them.
- Arm, on the other hand, has a "weak" memory model and provides a set of atomic instructions that are not fences and match the C++ concurrency model, offering `acquire`, `release`, and `acq_rel` variants of each atomic instruction - such as `LDADD`, `STADD`, and `CAS` - which allow precise control over visibility and ordering, especially with the introduction of "Large System Extension" (LSE) instructions in Armv8.1.
In practice, a locked atomic on x86 requires the cache line in the Exclusive state in the requester's L1 cache. This will incur a coherence transaction (Read-for-Ownership) if some other core had the line. Both Intel and AMD handle this similarly.
This asymmetry makes Arm and Power much more suitable for lock-free programming and concurrent data structures, but some observations hold for both platforms. Most importantly, "Compare and Swap" (CAS) is a very expensive operation and should be avoided at all costs.
On x86, for example, the LOCK ADD can easily take 50 CPU cycles, being 50x slower than a regular ADD instruction, but still easily 5-10x faster than a LOCK CMPXCHG instruction.
Once the contention rises, the gap naturally widens, and is further amplified by the increased "failure" rate of the CAS operation, when the value being compared has already changed.
That's why for the "dynamic" mode, we resort to using an additional atomic variable as opposed to more typical CAS-based implementations.
### Alignment
Assuming a thread-pool is a heavy object anyway, nobody will care if it's a bit larger than expected.
That allows us to over-align the internal counters to std::hardware_destructive_interference_size to avoid false sharing.
In that case, even on x86, where each cache line will be exclusively owned by a single thread, in eager mode we end up effectively "pipelining" the execution: one thread may be incrementing the "in-flight" counter while another is decrementing the "remaining" counter, and others are executing the loop body in between.
## Performance
One of the most common parallel workloads is the N-body simulation ¹.
An implementation is available in both C++ and Rust in scripts/nbody.cpp and scripts/nbody.rs respectively.
Both are extremely light-weight and involve little logic outside of number-crunching, so both can be easily profiled with time and introspected with perf Linux tools.
C++ benchmarking results for $N=128$ bodies and $I=1e6$ iterations:
| Machine | OpenMP (D) | OpenMP (S) | Fork Union (D) | Fork Union (S) |
|---|---|---|---|---|
| 16x Intel SPR | 20.3s | 16.0s | 18.1s | 10.3s |
| 12x Apple M2 | ? | 1m:16.7s | 1m:30.3s ² | 1m:40.7s ² |
| 96x Graviton 4 | 32.2s | 20.8s | 39.8s | 26.0s |
Rust benchmarking results for $N=128$ bodies and $I=1e6$ iterations:
| Machine | Rayon (D) | Rayon (S) | Fork Union (D) | Fork Union (S) |
|---|---|---|---|---|
| 16x Intel SPR | 51.4s | 38.1s | 15.9s | 9.8s |
| 12x Apple M2 | 3m:23.5s | 2m:0.6s | 4m:8.4s | 1m:20.8s |
| 96x Graviton 4 | 2m:13.9s | 1m:35.6s | 18.9s | 10.1s |
¹ Another common workload is "Parallel Reductions", covered in a separate repository.

² When a combination of performance and efficiency cores is used, dynamic stealing may be more efficient than static slicing.
## Safety & Logic
There are only 3 core atomic variables in this thread-pool, and some of them are practically optional.
Let's call every invocation of a `for_*` API a "fork", and every exit from it a "join".
| Variable | User's Perspective | Internal Usage |
|---|---|---|
| `stop` | Stop the entire thread-pool | Tells workers when to exit the loop |
| `fork_generation` | "Forks" called since init | Tells workers to wake up on new forks |
| `threads_to_sync` | Threads not joined this fork | Tells main thread when workers finish |
Why don't we need atomics for `total_threads`?
The only way to change the number of threads is to `stop_and_reset` the entire thread-pool and then `try_spawn` it again.
Either of those operations can only be called from one thread at a time and never coincides with any running tasks.
That's ensured by the `stop` flag.
Why don't we need atomics for a "job pointer"?
A new task can only be submitted from one thread, which updates the number of parts for each new fork.
During that update, the workers are asleep, spinning on the old values of `fork_generation` and `stop`.
They only wake up and access the new value once `fork_generation` increments, ensuring safety.
How do we deal with overflows and `SIZE_MAX`-sized tasks?
The library entirely avoids saturating multiplication and only uses one saturating addition in "release" builds.
To test the consistency of arithmetic, the C++ template class can be instantiated with a custom `index_t`, such as `std::uint8_t` or `std::uint16_t`.
In the former case, no more than 255 threads can operate and no more than 255 tasks can be addressed, allowing us to easily test every weird corner case of [0:255] threads competing for [0:255] tasks.
## Testing and Benchmarking
To run the C++ tests, use CMake (commands approximate - target names may differ):

```sh
cmake -B build_release -D CMAKE_BUILD_TYPE=Release
cmake --build build_release --config Release
ctest --test-dir build_release
```
For C++ debug builds, consider using the VS Code debugger presets or commands along these lines:

```sh
cmake -B build_debug -D CMAKE_BUILD_TYPE=Debug
cmake --build build_debug --config Debug
```
To build with an alternative compiler, like LLVM Clang, override the compiler variables:

```sh
CC=clang CXX=clang++ cmake -B build_release -D CMAKE_BUILD_TYPE=Release
cmake --build build_release --config Release
```
For Rust, use Cargo:

```sh
cargo test --release
```