1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
use crate::{Transform, TransformError};
use log::debug;
use std::collections::HashMap;
use suffix_array::SuffixArray;

/// Burrow-Wheeler transformation
///
/// Implementation of the Burrow-Wheeler Transformation as
/// described [here](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform).
///
/// # Algorithm
/// In the following is a rough sketch of the algorithm and its inner workings.
///
/// ## ▶ Transformation
/// The transformation process involves three steps:
///
/// 1. Create a `N x N` matrix with rotations of data `d`, where `N` is the length of `d`
/// 2. Sort the rows of this matrix
/// 3. Output the last characters of each row + the final row index of `d`
///
/// The last characters and the row index is enough information to reproduce `d`.
///
/// ### Implementation
/// Most implementations of the Burrow-Wheeler Transform use
/// [Suffix Arrays](https://en.wikipedia.org/wiki/Suffix_array) for generating the
/// transformations. A suffix appears often in concatenation with strings and substrings.
/// A substring is a string appearing in the original string.
/// While `pp`, `ppl`, or `le` are all a substring of `apple`; `ale` does not appear in `apple`.
/// If the substring appears at the beginning of the word, it is called a *prefix* e.g. `a`, `ap`, or `app`.
/// If it appears at the end of the word, it is called a *suffix* e.g. `e`, `le`, or `ple`.
/// This definition of prefix and suffix are also being used in context of byte vectors.
///
/// Suffix Arrays are sorted matrices of all suffixes of the original data.
/// Often they are represented by their row indices vector of the length of the original data.
/// The following example shows the sorted suffix indices.
///
/// Example
/// ```text
/// apple$ > $, apple, e, le, ple, pple > [5, 0, 4, 3, 2, 1]
/// ```
///
/// Some implementations of suffix arrays also include an end character often symbolized using `$`.
/// This can, but must not be included for the Burrow Wheeler Transform.
/// From this array the last chars and the index position need to be calculated.
///
/// The latter is simply the position of `0` in the suffix indices vector.
/// In this it is at index position `1`. The former is a bit trickier.
/// What is needed is the last character in a rotated suffix list.
/// The sorted suffix list is given above. For the BWT algorithm these words need to be filled
/// with the actual word, to the length of the original data.
///
/// ```text
/// apple, e, le, ple, pple > apple, eappl, leapp, pleap, pleap, pplea > [e, l, p, p, a]
/// ```
/// The kean reader might have observed that the last letter
/// is always the one previous in the original word.
/// Therefore the last characters of `[5, 4, 4, 3, 2, 1]` are `[4, -1, 3, 2, 1, 0]`.
/// Since the special character `$` is not important the final vector information is `[e, l, p, p, a]` and `1`
/// for the index position.
///
/// ## ◀ Reverse
/// TODO: Improve explanation
/// The reverse algorithm of BWT uses three vectors of length `N` with `N` being the number of data vector.
/// The first vector is the actual vector to be reversed i.e. last characters of the suffix array.
/// The second vector is the first vector sorted.
/// The third and last vector is a count of the second vector.
/// The third vector saves the information on the position of the byte.
/// The following is an example for the `apple` string as above.
///
/// ```text
/// [e, l, p, p, a] > [1, 1, 1, 2, 1]
/// ```
///
///
/// [e, l, p, p, a]
/// [a, e, l, p, p]
/// [1, 1, 1, 2, 1]
/// i: 0,
/// The first output is the letter at the index position of the second vector.
/// In the above example this is pos `1` in `[$, a, e, l, p, p]` and therefore  `a`.
/// The second output is at the index position of the `n`th occurence of the letter just output in the first vector.
/// The `n`th occurence is described by the third vector.
/// In this case it is the first occurence of `a` in vector 1.
/// This is at the last position. Therefore the output is `p`.
/// After that is the first occurence of the letter `p` in vector 1.
/// This is at position 2.
///
/// # Example
/// ```rust
/// use rscompress_transformation::BurrowWheeler;
/// ```
#[derive(Debug)]
pub struct BurrowWheeler {
    ix: Option<usize>,
    size: usize,
}

impl BurrowWheeler {
    pub fn new() -> Self {
        BurrowWheeler { ix: None, size: 0 }
    }
    pub fn reset(&mut self) {
        self.ix = None
    }
    pub fn with_ix_and_size(ix: usize, size: usize) -> Self {
        BurrowWheeler { ix: Some(ix), size }
    }
}

impl Default for BurrowWheeler {
    fn default() -> Self {
        Self::new()
    }
}

impl Transform for BurrowWheeler {
    /// Transformation of the initial source data
    fn transform(&mut self, source: &[u8]) -> Result<Vec<u8>, TransformError> {
        self.size = source.len();
        if source.is_empty() {
            return Err(TransformError::EmptyBufferError);
        }
        debug!("Input {:?}", source);
        let (_, mut sarr) = SuffixArray::new(source).into_parts();
        debug!("SARR {:?}", sarr);
        self.ix = sarr.iter().position(|&x| x == 0);
        let ix = self.ix.ok_or(TransformError::MissingIndex)?;
        sarr[ix] = sarr.len() as u32 + 1;
        let mut last_column: Vec<_> = sarr
            .iter()
            .map(|&x| *source.get((x - 1) as usize).unwrap_or(&0u8))
            .collect();
        last_column.remove(ix);
        Ok(last_column)
    }
    /// Reversing the initial transformation
    fn reverse(&mut self, source: &[u8]) -> Result<Vec<u8>, TransformError> {
        debug!("{:?}", self);
        let bix = self.ix.ok_or(TransformError::MissingIndex)?;

        // generate sorted vector
        let mut sorted = source.to_vec();
        sorted.sort_unstable();

        // generate counts vector
        let counts = get_counts(&sorted)?;

        // generate mapping
        let mapping = get_position_map(source);

        debug!("Source: {:?}", source);
        debug!("Sorted: {:?}", sorted);
        debug!("Counts: {:?}", counts);
        debug!("Self: {:?}", self);
        let mut result: Vec<u8> = vec![0u8; self.size];
        let mut pos = bix - 1;
        for r in result.iter_mut() {
            let reversed = sorted[pos];
            *r = reversed;
            let c = counts[pos];
            let m = mapping.get(r).ok_or(TransformError::MissingMapping(*r))?;
            let ix = *m.get(c).ok_or(TransformError::MissingCountMap(*r, c))?;
            pos = ix - (ix != 0 && ix < bix) as usize;
            debug!("{:?}", r);
        }
        Ok(result)
    }
}

fn get_position_map(data: &[u8]) -> HashMap<u8, Vec<usize>> {
    let mut result: HashMap<u8, Vec<usize>> = HashMap::new();
    for (i, d) in data.iter().enumerate() {
        let v = result.get_mut(d);
        match v {
            Some(v) => v.push(i),
            None => {
                let k = vec![i];
                let _ = result.insert(*d, k);
            }
        };
    }
    result
}

fn get_counts(sorted: &[u8]) -> Result<Vec<usize>, TransformError> {
    let mut v = *sorted.first().ok_or(TransformError::EmptyBufferError)?;
    let mut counter = 0;

    let mut result: Vec<usize> = sorted[1..]
        .iter()
        .map(|&x| {
            if x == v {
                counter += 1
            } else {
                counter = 0;
                v = x
            }
            counter
        })
        .collect();

    result.insert(0, 0);
    Ok(result)
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::tests::{random_roundtrip, roundtrip, transform};

    #[test]
    fn test_counts() {
        let mut data: [u8; 7] = [123, 139, 39, 62, 139, 139, 139];
        data.sort_unstable();

        let counts = get_counts(&data).unwrap();
        assert_eq!(counts, [0, 0, 0, 0, 1, 2, 3])
    }

    #[test]
    fn test_should_return_error() {
        let mut m = BurrowWheeler::new();
        let k = m.reverse("reverse".as_bytes());
        assert!(k.is_err())
    }

    #[test]
    fn test_easy_transforms() {
        transform::<BurrowWheeler>(&[123, 139, 39, 62, 139], &[139, 139, 39, 62, 123]);
        transform::<BurrowWheeler>(&[230, 183, 108, 102, 230], &[230, 108, 183, 230, 102]);
        transform::<BurrowWheeler>("banana".as_bytes(), "annbaa".as_bytes());
        transform::<BurrowWheeler>("compressioncode".as_bytes(), "enodrsooccimpse".as_bytes());
    }

    // Simple reverse tests will not be working, since information needs to
    // be transmitted from each tranformed block to successfully reverse it.
    // TODO: Add tests for multi step transformations.

    #[test]
    fn test_easy_roundtrip() {
        roundtrip::<BurrowWheeler>(&[34, 10, 0, 206, 40]);
        roundtrip::<BurrowWheeler>(&[123, 139, 39, 62, 139]);
        roundtrip::<BurrowWheeler>(&[230, 183, 108, 102, 230]);
        roundtrip::<BurrowWheeler>("compressioncode".as_bytes());
        roundtrip::<BurrowWheeler>("apple".as_bytes());
        roundtrip::<BurrowWheeler>("banana".as_bytes());
    }
    #[test]
    fn test_random_roundtrip() {
        random_roundtrip::<BurrowWheeler>(100, 10_000);
        random_roundtrip::<BurrowWheeler>(100, 10_000);
        random_roundtrip::<BurrowWheeler>(100, 10_000);
        random_roundtrip::<BurrowWheeler>(100, 10_000);
        random_roundtrip::<BurrowWheeler>(100, 10_000);
    }
}