Crate sortbuf

Source
Expand description

Sort a large number of items in memory

This library provides types, most prominently SortBuf, and traits for accumulating a large number of items in memory and iterating over them in ascending or descending order (as per Ord). The implementation

  • avoids potentially costly reallocations,
  • releases chunks of memory every now and then during iteration,
  • doesn’t introduce much memory overhead,
  • allows reacting to allocation failures,
  • supports multi-threaded insertion and
  • isn’t awfully slow.

§Examples

Natively, SortBuf will prepare items for descending iteration:

let mut sortbuf = sortbuf::SortBuf::new();
let mut inserter = sortbuf::Inserter::new(&mut sortbuf);
inserter.insert_items([10, 20, 5, 17]).expect("Failed to insert items");
drop(inserter);
assert!(sortbuf.into_iter().eq([20, 17, 10, 5]));

For ascending iteration, items need to be wrapped in std::cmp::Reverse. However, the library provides convenience functions for handling the (un)wrapping:

let mut sortbuf = sortbuf::SortBuf::new();
let mut inserter = sortbuf::Inserter::new(&mut sortbuf);
inserter.insert_items_reversed([10, 20, 5, 17]).expect("Failed to insert items");
drop(inserter);
assert!(sortbuf.unreversed().eq([5, 10, 17, 20]));

Multithreaded insertion is supported via multiple Inserters:

use std::sync::{Arc, Mutex};
let sortbuf: Arc<Mutex<sortbuf::SortBuf<_>>> = Default::default();
let workers: Vec<_> = (0..4).map(|n| {
    let mut inserter = sortbuf::Inserter::new(sortbuf.clone());
    std::thread::spawn(move || inserter
        .insert_items((0..1000).map(|i| 4*i+n))
        .expect("Failed to insert items"))
}).collect();
workers.into_iter().try_for_each(|h| h.join()).unwrap();
assert!(sortbuf.lock().unwrap().take().into_iter().eq((0..4000).rev()));

§Approach and comparison

As indicated in the examples above, adding new items to a buffer is done via Inserters. These accumulate items in pre-sorted Buckets and commit them to their target buffer. Later, that buffer can be converted to an Iterator which yields items taken from those Buckets, which involves selecting the Bucket with the current greatest item in the buffer.

While a significant amount of time is spent during insertion, the majority of time is usually spent during iteration. Performance is usually better, and skews towards more time spent in the parallelizable insertion state, with fewer, bigger Buckets. As Buckets are pre-allocated, this comes at the cost of flexibility regarding memory.

§Comparison to Vec and sort

Buffering and sorting items can also be done using a Vec (for buffering) in conjunction with slice::sort, slice::sort_unstable or another sorting function. The process is then usually split into an insertion, a sorting and an iteration phase, with sort being the most computational intensive phase.

Sorting and iteration over the items in a Vec is generally faster than with the utilities provided by this library —in the single-threaded case. However, this library does allow insertion from multiple threads, contrary to a bare Vec. In addition, the performance of inserting items into a Vec hinges on the reallocation performance, which might be poor in some cases, e.g. if multiple buffers are involved.

The need of a single, separate, computational intensive sorting phase may also have some implications on overall performance in some use-cases. With the types provided by this library, sorting will likely interleave with I/O linked to insertion and/or the final iteration, spread out over the entire process. Thus, the underlying OS may have more opportunities to perform background operations related to reads (insertion stage) and writes (iteration stage), increasing the overall throughput.

§Comparison to BTreeSet

Another option for sorting items without the need for a separate sorting phase would be an BTreeSet. Contrary to the sortbuf approach, most of the time is spent in the insertion phase rather than the iteration phase. Using a BTreeSet is usually slower than a SortBuf with sufficiently large Buckets, not parallelizable and incurs a higher memory overhead.

Modules§

error
Types and utilities related to error handling and reporting

Structs§

Bucket
A collection of items to be committed to a SortBuf
Inserter
Item feeder for BucketAccumulators
SortBuf
Data structure for preparing a large number of items for sorted iteration

Constants§

DEFAULT_BUCKET_BYTESIZE
Default size for Buckets

Traits§

BucketAccumulator
Accumulator for Buckets