Expand description
External sorting for minimizer tuples
This module implements RAM-bounded external sorting for large datasets, following the C++ sshash implementation precisely:
- Each thread has a RAM-bounded buffer
- When buffer fills → parallel sort + flush to temp binary file
- After all tuples processed → k-way merge of temp files
- Merge uses buffered file I/O (not mmap) to avoid resident page bloat
§Memory Management
Buffer size per thread = (ram_limit_gib * GiB) / (2 * sizeof(tuple) * num_threads)
The factor of 2 accounts for temporary memory during parallel sort.
After merge, the sorted file is accessed via sequential buffered readers
(FileTuples) rather than mmap, keeping RSS proportional to the I/O buffer
size (~4 MB) instead of the file size (~8-10 GB).
Structs§
- Bucket
Scan - Metadata about a single bucket, gathered during a sequential scan.
- External
Sorter - External sorter for minimizer tuples
- File
Bucket Iter - Sequential iterator over buckets in the merged file via buffered I/O.
- File
Tuples - Handle to the merged external sort file for sequential buffered access.
- Merge
Result - Result of merge operation
- Minimizer
Tuple External - Packed minimizer tuple for disk I/O (matches C++
#pragma pack(push, 2)) - Sequential
Tuple Reader - Sequential reader for tuples from the merged file.
Constants§
- GIB
- Bytes per GiB
- TUPLE_
SIZE_ BYTES - Size of
MinimizerTupleExternalin bytes (packed, no padding)