Skip to main content

Module external_sort

Module external_sort 

Source
Expand description

External sorting for minimizer tuples

This module implements RAM-bounded external sorting for large datasets, following the C++ sshash implementation precisely:

  1. Each thread has a RAM-bounded buffer
  2. When buffer fills → parallel sort + flush to temp binary file
  3. After all tuples processed → k-way merge of temp files
  4. Merge uses buffered file I/O (not mmap) to avoid resident page bloat

§Memory Management

Buffer size per thread = (ram_limit_gib * GiB) / (2 * sizeof(tuple) * num_threads)

The factor of 2 accounts for temporary memory during parallel sort.

After merge, the sorted file is accessed via sequential buffered readers (FileTuples) rather than mmap, keeping RSS proportional to the I/O buffer size (~4 MB) instead of the file size (~8-10 GB).

Structs§

BucketScan
Metadata about a single bucket, gathered during a sequential scan.
ExternalSorter
External sorter for minimizer tuples
FileBucketIter
Sequential iterator over buckets in the merged file via buffered I/O.
FileTuples
Handle to the merged external sort file for sequential buffered access.
MergeResult
Result of merge operation
MinimizerTupleExternal
Packed minimizer tuple for disk I/O (matches C++ #pragma pack(push, 2))
SequentialTupleReader
Sequential reader for tuples from the merged file.

Constants§

GIB
Bytes per GiB
TUPLE_SIZE_BYTES
Size of MinimizerTupleExternal in bytes (packed, no padding)