Struct rscompress_transformation::BurrowWheeler [−][src]
pub struct BurrowWheeler { /* fields omitted */ }
Expand description
Burrow-Wheeler transformation
Implementation of the Burrow-Wheeler Transformation as described here.
Algorithm
In the following is a rough sketch of the algorithm and its inner workings.
▶ Transformation
The transformation process involves three steps:
- Create a
N x N
matrix with rotations of datad
, whereN
is the length ofd
- Sort the rows of this matrix
- Output the last characters of each row + the final row index of
d
The last characters and the row index is enough information to reproduce d
.
Implementation
Most implementations of the Burrow-Wheeler Transform use
Suffix Arrays for generating the
transformations. A suffix appears often in concatenation with strings and substrings.
A substring is a string appearing in the original string.
While pp
, ppl
, or le
are all a substring of apple
; ale
does not appear in apple
.
If the substring appears at the beginning of the word, it is called a prefix e.g. a
, ap
, or app
.
If it appears at the end of the word, it is called a suffix e.g. e
, le
, or ple
.
This definition of prefix and suffix are also being used in context of byte vectors.
Suffix Arrays are sorted matrices of all suffixes of the original data. Often they are represented by their row indices vector of the length of the original data. The following example shows the sorted suffix indices.
Example
apple$ > $, apple, e, le, ple, pple > [5, 0, 4, 3, 2, 1]
Some implementations of suffix arrays also include an end character often symbolized using $
.
This can, but must not be included for the Burrow Wheeler Transform.
From this array the last chars and the index position need to be calculated.
The latter is simply the position of 0
in the suffix indices vector.
In this it is at index position 1
. The former is a bit trickier.
What is needed is the last character in a rotated suffix list.
The sorted suffix list is given above. For the BWT algorithm these words need to be filled
with the actual word, to the length of the original data.
apple, e, le, ple, pple > apple, eappl, leapp, pleap, pleap, pplea > [e, l, p, p, a]
The kean reader might have observed that the last letter
is always the one previous in the original word.
Therefore the last characters of [5, 4, 4, 3, 2, 1]
are [4, -1, 3, 2, 1, 0]
.
Since the special character $
is not important the final vector information is [e, l, p, p, a]
and 1
for the index position.
◀ Reverse
TODO: Improve explanation
The reverse algorithm of BWT uses three vectors of length N
with N
being the number of data vector.
The first vector is the actual vector to be reversed i.e. last characters of the suffix array.
The second vector is the first vector sorted.
The third and last vector is a count of the second vector.
The third vector saves the information on the position of the byte.
The following is an example for the apple
string as above.
[e, l, p, p, a] > [1, 1, 1, 2, 1]
[e, l, p, p, a]
[a, e, l, p, p]
[1, 1, 1, 2, 1]
i: 0,
The first output is the letter at the index position of the second vector.
In the above example this is pos 1
in [$, a, e, l, p, p]
and therefore a
.
The second output is at the index position of the n
th occurence of the letter just output in the first vector.
The n
th occurence is described by the third vector.
In this case it is the first occurence of a
in vector 1.
This is at the last position. Therefore the output is p
.
After that is the first occurence of the letter p
in vector 1.
This is at position 2.
Example
use rscompress_transformation::BurrowWheeler;
Implementations
impl BurrowWheeler
[src]
impl BurrowWheeler
[src]Trait Implementations
impl Debug for BurrowWheeler
[src]
impl Debug for BurrowWheeler
[src]impl Default for BurrowWheeler
[src]
impl Default for BurrowWheeler
[src]