Crate burst_pool
A thread pool optimised for bursts of activity.
Consider the following use-case: a single thread produces work which must then be performed by a pool of worker threads. The work is produced infrequently, but in bursts. Under normal operation, therefore, the threads in the pool sleep until some event requires many of them to be suddenly woken at once. Those threads perform some work before going back to sleep again.
Most thread pools schedule work based on thread readiness. This invariably means that work is pushed onto a single queue which is shared by all the workers, and the workers steal work from the queue when they're ready to accept it. This usually works very well, but not in our use-case: since many threads become runnable at the same time, they immediately contend to read from the queue.
Instead, we use a round-robin scheduling strategy, in which the work-producing thread sends work to specific workers. This eliminates contention (since each worker has its own queue). The trade-off is that it performs badly when utilisation is high and the workloads are uneven.
Usage
Normally a thread pool will spawn a fixed number of worker threads; once the threads are running, you send closures to the pool, which are then executed by one of the workers. BurstPool's API works a bit differently.

A BurstPool is parameterised over the kind of data which will be sent to the workers. (This data could be boxed closures, if you want to mimic the normal API.) The user provides work by calling BurstPool::send(), and one of the workers is chosen to receive it. Each worker has its own function which it uses to handle work sent to it. This function is provided when the thread is spawned.
```rust
use burst_pool::BurstPool;

let mut pool = BurstPool::new();

// Spawn a worker in the pool
pool.spawn(|x| println!("Received {}!", x));

// Send some work to the worker
pool.send(36).unwrap();
```
Performance
I'm using the following benchmark:
- Create a pool with 10 workers.
- Send 10 pieces of work to the pool.
- When a thread receives some work it reads the clock, records its latency, and goes back to sleep.
For comparison, the benchmark was replicated using some of the other thread pools on crates.io (which are not optimised for this use-case), and I also benchmarked the time taken to suddenly unpark 10 parked threads.
crate | mean latency | 20 %ⁱˡᵉ | 40 %ⁱˡᵉ | 60 %ⁱˡᵉ | 80 %ⁱˡᵉ |
---|---|---|---|---|---|
burst_pool | 8.3 μs | <3.4 μs | <5.1 μs | <7.6 μs | <7.6 μs |
threadpool | 17.4 μs | <7.6 μs | <11.4 μs | <17.1 μs | <17.1 μs |
scoped_threadpool | 18.7 μs | <7.6 μs | <11.4 μs | <17.1 μs | <17.1 μs |
(unpark only) | 7.7 μs | <3.4 μs | <5.1 μs | <7.6 μs | <7.6 μs |
- The profile of BurstPool shows that it doesn't add much latency over just calling unpark(). Almost all of the time goes to the linux scheduler.
- The mean latency is heavily skewed by outliers lying above the 80th percentile. You can expect latencies better than 7.6 μs most of the time, with the occasional long wait.
- Getting results as good as these is heavily dependent on setting max_cstate = 0 - this makes a huge difference to thread wake-up times.
You can run the benchmarks yourself with cargo bench. If your results are significantly worse than those above, your kernel might be powering down CPU cores too eagerly. If you care about latency more than battery life, consider setting max_cstate = 0.
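On Linux, one common way to restrict C-states is via kernel boot parameters. The parameter names below are real kernel options, but the GRUB workflow is an assumption about your setup; consult your distro's documentation before changing boot configuration.

```shell
# In /etc/default/grub, append to the existing kernel command line
# (keep whatever options are already there in place of "..."),
# then run update-grub and reboot. This disables deep CPU sleep
# states, trading power consumption for faster thread wake-ups.
GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0 processor.max_cstate=0"
```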
Structs

- BurstPool — A thread pool optimised for bursts of activity.

Enums

- BurstError