Expand description

Key-Value Parallel Merge Sort

Sort Pod (key, value) data sets that don’t fit in memory.

This crate provides the kv_par_merge_sort library, which enables the user to sort Chunks of (key, value) pairs (AKA entries) via a SortingPipeline. The sorted output lands in two files: one for keys and one for values. The keys file is sorted, while the values file is “parallel” to the key file.

More precisely, we denote the input stream as [(k[0], v[0]), (k[1], v[1]), ...]. The final key and value files are laid out as [k[a], k[b], ...] and [v[a], v[b], ...] respectively, such that the key array is sorted. The reason for separate files is to ensure correct data type alignment (for zero-copy reads) without wasting space to padding.

Sorting ~17 GB data set (half a billion entries)

$ time RUST_LOG=debug cargo run --release --example large_data_set -- -o /ssd_data/bench_data/ -t /ssd_data/tmp/
[2022-05-28T08:24:56Z INFO  large_data_set] Random input data set will contain 18 unsorted chunks of at most 28071681 entries each
[2022-05-28T08:25:36Z INFO  large_data_set] Done generating random chunks
[2022-05-28T08:26:00Z INFO  kv_par_merge_sort] Running merge of 16 persisted chunks
[2022-05-28T08:26:01Z INFO  kv_par_merge_sort] All chunks sorted, only merge work remains
[2022-05-28T08:27:02Z INFO  kv_par_merge_sort] Running merge of 3 persisted chunks
[2022-05-28T08:28:30Z INFO  kv_par_merge_sort] Done merging! Performed 2 merge(s) total

real    3m33.830s
user    3m31.733s
sys     0m42.923s

Implementation

To sort an arbitrarily large data set without running out of memory, we must resort to an “external” sorting algorithm that uses the file system for scratch space; we use a parallel merge sort. Each Chunk is sorted separately, in parallel and streamed to a pair of files. These files are consumed by a merging thread, which (also in parallel) iteratively merges groups of up to merge_k similarly-sized chunks.

File Handles

WARNING: It’s possible to exceed your system’s limit on open file handles if Chunks are too small.

Memory Usage

WARNING: If you are running out of memory, make sure you can actually fit max_sort_concurrency Chunks in memory. Also note that std::env::temp_dir might actually be an in-memory tmpfs.

File System Usage

This algorithm requires twice the size of the input data in free file system space in order to perform the final merge.

Structs

An arbitrarily ordered vector of key-value pairs.

The files storing a sorted chunk of key-value pairs.

Accepts an arbitrarily large stream of unordered key-value pairs (AKA entries) from the user and streams them into two binary files: one for keys and one for values.

Functions