Module sais_fallback

Module sais_fallback 

Source
Expand description

The fallback bwt_sort algorithm for the Rust version of the standard BZIP2 library.

This is a simple SA-IS implemenation of a sorting algorithm, using sentinels. SA-IS is particularly suited to repetative data such as is found in genetic sequencing.

SA-IS does not lend itself to multi-threading when the data sets are smaller (except for frequency counting). The BZIP2 algorithm has a maximum block size of 900kb, consequently no attempt at multithreading within the sorting portion of the algorithm is made. (Multiple blocks will be processed in parallel, though.)

In order to determine whether the data is more likely to be better sorted using SA-IS, the lms_complexity function can test a sample of the data to determine whether SA-IS is suited to the data.

NOTE 1: This implementation effectively uses compressed LMS data in order to reduce cache misses.

NOTE 2: It is difficult to determine the best sorting alogithm based on only a sample of the data. The approach used in this version is based on LMS complexity (my term). This technique has proven to be the most reliable method I have discovered to return a decent result based on analysis of a small sample of the larger data set.

Functionsยง

lms_complexity
Given a sample slice of the data (5k suggested), compute the data complexity. A complexity of less than 0.3 indicates that SA-IS is the better algorithm than the native sort_unstable algorithm.
sais_entry
Entry point for Simple SA-IS sort for Burrow-Wheeler Transform. Takes u8 slice and returns u32 key and u8 vector in BWT format.