Expand description
Sort a large number of items in memory
This library provides types, most prominently SortBuf, and traits for accumulating a large number of items in memory and iterating over them in ascending or descending order (as per Ord). The implementation
- avoids potentially costly reallocations,
- releases chunks of memory every now and then during iteration,
- doesn’t introduce much memory overhead,
- allows reacting to allocation failures,
- supports multi-threaded insertion and
- isn’t awfully slow.
§Examples
Natively, SortBuf will prepare items for descending iteration:
let mut sortbuf = sortbuf::SortBuf::new();
let mut inserter = sortbuf::Inserter::new(&mut sortbuf);
inserter.insert_items([10, 20, 5, 17]).expect("Failed to insert items");
drop(inserter);
assert!(sortbuf.into_iter().eq([20, 17, 10, 5]));
For ascending iteration, items need to be wrapped in std::cmp::Reverse. However, the library provides convenience functions for handling the (un)wrapping:
let mut sortbuf = sortbuf::SortBuf::new();
let mut inserter = sortbuf::Inserter::new(&mut sortbuf);
inserter.insert_items_reversed([10, 20, 5, 17]).expect("Failed to insert items");
drop(inserter);
assert!(sortbuf.unreversed().eq([5, 10, 17, 20]));
Multithreaded insertion is supported via multiple Inserters:
use std::sync::{Arc, Mutex};
let sortbuf: Arc<Mutex<sortbuf::SortBuf<_>>> = Default::default();
let workers: Vec<_> = (0..4).map(|n| {
let mut inserter = sortbuf::Inserter::new(sortbuf.clone());
std::thread::spawn(move || inserter
.insert_items((0..1000).map(|i| 4*i+n))
.expect("Failed to insert items"))
}).collect();
workers.into_iter().try_for_each(|h| h.join()).unwrap();
assert!(sortbuf.lock().unwrap().take().into_iter().eq((0..4000).rev()));
§Approach and comparison
As indicated in the examples above, adding new items to a buffer is done via Inserters. These accumulate items in pre-sorted Buckets and commit them to their target buffer. Later, that buffer can be converted to an Iterator which yields items taken from those Buckets, which involves selecting the Bucket with the current greatest item in the buffer.
While a significant amount of time is spent during insertion, the majority of time is usually spent during iteration. Performance is usually better, and skews towards more time spent in the parallelizable insertion state, with fewer, bigger Buckets. As Buckets are pre-allocated, this comes at the cost of flexibility regarding memory.
§Comparison to Vec and sort
Buffering and sorting items can also be done using a Vec (for buffering) in conjunction with slice::sort, slice::sort_unstable or another sorting function. The process is then usually split into an insertion, a sorting and an iteration phase, with sort being the most computational intensive phase.
Sorting and iteration over the items in a Vec is generally faster than with the utilities provided by this library —in the single-threaded case. However, this library does allow insertion from multiple threads, contrary to a bare Vec. In addition, the performance of inserting items into a Vec hinges on the reallocation performance, which might be poor in some cases, e.g. if multiple buffers are involved.
The need of a single, separate, computational intensive sorting phase may also have some implications on overall performance in some use-cases. With the types provided by this library, sorting will likely interleave with I/O linked to insertion and/or the final iteration, spread out over the entire process. Thus, the underlying OS may have more opportunities to perform background operations related to reads (insertion stage) and writes (iteration stage), increasing the overall throughput.
§Comparison to BTreeSet
Another option for sorting items without the need for a separate sorting
phase would be an BTreeSet. Contrary to the
sortbuf
approach, most of the time is spent in the insertion phase rather
than the iteration phase. Using a BTreeSet is
usually slower than a SortBuf with sufficiently large Buckets, not
parallelizable and incurs a higher memory overhead.
Modules§
- error
- Types and utilities related to error handling and reporting
Structs§
- Bucket
- A collection of items to be committed to a SortBuf
- Inserter
- Item feeder for BucketAccumulators
- SortBuf
- Data structure for preparing a large number of items for sorted iteration
Constants§
- DEFAULT_
BUCKET_ BYTESIZE - Default size for Buckets
Traits§
- Bucket
Accumulator - Accumulator for Buckets