
Fast smaz

A fast and pure Rust port of the antirez/smaz library.

Example

use fast_smaz::Smaz;
let my_message = "Hello world!";
let compressed = my_message.smaz_compress();
let decompressed = compressed.smaz_decompress().unwrap();
assert_eq!(my_message.as_bytes(), decompressed);

Unicode/ASCII compatibility

fast-smaz makes no assumptions about whether the input is encoded as Unicode or ASCII; the entire API is designed to work only with byte slices (&[u8]). However, the original smaz codebook uses only ASCII-encoded strings, so the input will only really compress (via codebook matches) if its encoding is a superset of ASCII, which includes UTF-8 (but not UTF-16 or UTF-32).

Compressing strings in any encoding other than ASCII or UTF-8 is therefore not guaranteed to actually shrink the input, but the output is still a correct encoding of the data conforming to the SMAZ spec. This means you can still get the original data back; it just will probably never be smaller.

This also means you can pass non-UTF-8 and non-ASCII encoded inputs, get the original bytes back, and recreate the string using the same input encoding. This is guaranteed to be safe as long as the data passed to decompression conforms to the SMAZ spec, i.e. was produced by a SMAZ compression algorithm.

Compressing non-ASCII characters (2-byte, 3-byte or 4-byte wide)

Since the original codebook only has ASCII characters, non-ASCII characters are simply stored as-is in the output and restored when decompressed. Nothing is lost when using SMAZ, even if the input cannot be compressed, and even if it is a byte sequence that is not a valid character in any existing encoding.

How smaz works

Here we dive into SMAZ algorithm details.

The codebook and reverse codebook

The whole process is quite simple once you understand the structure and logic of the SMAZ algorithm; below I’ve described how it works. This can be used as a resource to understand the algorithm, to write a new implementation of it, or for anything else useful.

Compression

The codebook is a simple hash table with multiple buckets that provides fast lookup for substrings that can be compressed.

It is basically constructed in this form:

An entry consists of:

[NUMBER_OF_BYTES][BYTES_OF_ASCII_STRING][INDEX_IN_REVERSE_CODEBOOK]

And a bucket is just a sequence of entries, like this:

[ENTRY][ENTRY][ENTRY]...

There is no separator character between entries; the layout relies on the fact that NUMBER_OF_BYTES occupies a single byte (u8), followed by exactly that many bytes, followed by the final INDEX_IN_REVERSE_CODEBOOK value, which also occupies a single byte (u8).

Because of this, you can seek to any entry in the bucket by stepping over the previous one, which gives the well-known O(n) time complexity for searching an entry inside a hash table bucket (but instead of a linked list, we have a contiguous sequence of data, an array).

Every element in the codebook array (i.e. every slice) is a bucket, and we can use the original SMAZ hashing algorithm to find the buckets that probably contain the subslice we are looking for.

Bucket example

For example, given the words foo and bar, assuming they reside in the same bucket and are located at indices 4 and 5 of the reverse codebook, respectively:

[3, 'f', 'o', 'o', 4, 3, 'b', 'a', 'r', 5]
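As a sketch, a linear scan over a bucket laid out this way might look like the following (find_in_bucket is an illustrative name, not part of the crate’s API):

```rust
// Scan one SMAZ bucket for a matching substring. Bucket layout,
// per the entry format described above:
// [len, byte_0 .. byte_{len-1}, reverse_index] repeated, no separators.
fn find_in_bucket(bucket: &[u8], needle: &[u8]) -> Option<u8> {
    let mut i = 0;
    while i < bucket.len() {
        let len = bucket[i] as usize;           // NUMBER_OF_BYTES
        let word = &bucket[i + 1..i + 1 + len]; // BYTES_OF_ASCII_STRING
        let index = bucket[i + 1 + len];        // INDEX_IN_REVERSE_CODEBOOK
        if word == needle {
            return Some(index);
        }
        i += len + 2; // step over this entry to reach the next one
    }
    None
}

fn main() {
    // The example bucket above: "foo" -> 4, "bar" -> 5.
    let bucket = [3, b'f', b'o', b'o', 4, 3, b'b', b'a', b'r', 5];
    assert_eq!(find_in_bucket(&bucket, b"foo"), Some(4));
    assert_eq!(find_in_bucket(&bucket, b"bar"), Some(5));
    assert_eq!(find_in_bucket(&bucket, b"baz"), None);
}
```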
Hash table example

The hash table must follow the distribution based on the hashing algorithm used to find the buckets, which is hard-coded in the SMAZ algorithm. SMAZ uses 3 different algorithms, one for subslices with only one element, one for subslices with 2 elements and one for subslices with 3 or more elements.

Each bucket is one element in the codebook array, for example:

[
    &[3, 'f', 'o', 'o', 4, 3, 'b', 'a', 'r', 5], // first bucket
    &[3, 'b', 'a', 'z', 6],                      // second bucket
]
What if we don’t find any entry?

In this case, we put the byte in a buffer and process the next one, trying to find a subslice in the hash table.

Now there are two scenarios. If we find an entry, the previously unencoded byte is emitted after a \254 code: for example, if the byte 54 is queued unencoded, we push [\254, 54] to the output vector.

If we don’t find an entry, we repeat the process, pushing the byte into the buffer and processing the next one. If we then find an entry, we have more than one byte in the queue, so they are emitted after a \255 code and an additional length byte. For example, if the bytes 54 and 88 are queued, we push [\255, 1, 54, 88] to the output vector; with a third byte 44 queued, we push [\255, 2, 54, 88, 44]. Note that the second value is the number of additional bytes: we already know there must be at least one byte after the length (otherwise the run would be encoded with \254), so we push N - 1 instead of N.

Why? This lets us store one additional byte before reaching the maximum value of a u8 (u8::MAX).

In this way, we can also preserve the input integrity even if it cannot be compressed.
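The escape scheme above can be sketched as a small helper, assuming the \254/\255 codes and the N - 1 length convention described here (flush_verbatim is a hypothetical name, not the crate’s API):

```rust
// Emit a queue of bytes that could not be matched in the codebook,
// using the verbatim escapes described above.
fn flush_verbatim(queue: &[u8], out: &mut Vec<u8>) {
    match queue.len() {
        0 => {}                           // nothing pending
        1 => out.extend([254, queue[0]]), // single verbatim byte
        n => {
            out.push(255);                // run of verbatim bytes
            out.push((n - 1) as u8);      // length is stored as N - 1
            out.extend_from_slice(queue);
        }
    }
}

fn main() {
    let mut out = Vec::new();
    flush_verbatim(&[54], &mut out);
    assert_eq!(out, [254, 54]); // the single-byte example above

    out.clear();
    flush_verbatim(&[54, 88, 44], &mut out);
    assert_eq!(out, [255, 2, 54, 88, 44]); // the three-byte example above
}
```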

Also, one last important detail: the maximum subslice length we look for is 7, which is the length of the longest string in the codebook (http://), and the minimum is 1.

And empty inputs translate into empty outputs, no matter whether you pass them to compress, encode, or decompress.

Note that we reslice the input data to skip the bytes we have already processed, which happens whenever we find an entry or feed byte(s) into the buffer.
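Putting the pieces together, the compression loop described above can be sketched as follows. The hard-coded SMAZ hash and bucket scan are abstracted behind a lookup closure, the queue is flushed early so N - 1 always fits in a u8, and all names are illustrative rather than the crate’s API:

```rust
// Sketch of the SMAZ compression outer loop. `lookup` stands in for
// the hash + bucket search and returns the reverse-codebook index of
// a subslice, if any.
fn compress(mut input: &[u8], lookup: impl Fn(&[u8]) -> Option<u8>) -> Vec<u8> {
    let mut out = Vec::new();
    let mut queue: Vec<u8> = Vec::new();
    while !input.is_empty() {
        // Try the longest subslice first: 7 is the length of "http://",
        // the longest string in the codebook.
        let max = input.len().min(7);
        let found = (1..=max)
            .rev()
            .find_map(|len| lookup(&input[..len]).map(|code| (code, len)));
        match found {
            Some((code, len)) => {
                flush(&mut queue, &mut out); // emit pending verbatim bytes
                out.push(code);
                input = &input[len..]; // reslice past the match
            }
            None => {
                queue.push(input[0]); // defer this byte
                input = &input[1..];
                if queue.len() == 256 {
                    flush(&mut queue, &mut out); // keep N - 1 within a u8
                }
            }
        }
    }
    flush(&mut queue, &mut out);
    out
}

// The \254 / \255 verbatim escapes described earlier.
fn flush(queue: &mut Vec<u8>, out: &mut Vec<u8>) {
    match queue.len() {
        0 => {}
        1 => out.extend([254, queue[0]]),
        n => {
            out.push(255);
            out.push((n - 1) as u8);
            out.append(queue); // moves the bytes, clearing the queue
        }
    }
    queue.clear();
}

fn main() {
    // Toy codebook: only "the" is known, at reverse index 1.
    let lookup = |s: &[u8]| (s == b"the").then_some(1u8);
    assert_eq!(compress(b"the", lookup), [1]);
    assert_eq!(compress(b"Xthe", lookup), [254, b'X', 1]);
    assert_eq!(compress(b"", lookup), Vec::<u8>::new());
}
```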

Decompression

For decompression, we use the reverse codebook. We take the first byte and check:

  • If it is \254, take the next byte and push it to the result vector.
  • If it is \255, take the next byte, which is the number of additional bytes to read (call it N), then read the byte right after the length and push it to the result vector, then take the next N bytes and push them all to the result vector.
  • Otherwise, use the byte as an index into the reverse codebook, and push all the characters found there to the result vector.

Note that after each of those steps we take a new slice of the encoded input starting right after the last processed byte: in the first case we skip the first 2 bytes; in the second, the first 3 bytes (\255, the length, and a single byte) plus the additional bytes read; and in the third, a single byte.
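The three cases above can be sketched as a loop (decompress and the toy reverse codebook are illustrative; the real reverse codebook has many more entries):

```rust
// Sketch of the SMAZ decompression loop, driven by a reverse codebook.
fn decompress(mut input: &[u8], reverse: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    while let Some(&code) = input.first() {
        match code {
            254 => {
                out.push(input[1]);  // one verbatim byte
                input = &input[2..]; // skip \254 + byte
            }
            255 => {
                let n = input[1] as usize; // number of *additional* bytes
                out.extend_from_slice(&input[2..2 + 1 + n]); // 1 + n bytes
                input = &input[3 + n..]; // skip \255 + length + bytes
            }
            idx => {
                out.extend_from_slice(reverse[idx as usize]); // codebook entry
                input = &input[1..]; // skip the single code byte
            }
        }
    }
    out
}

fn main() {
    // Toy reverse codebook: index 0 -> "foo", index 1 -> "bar".
    let reverse: &[&[u8]] = &[b"foo", b"bar"];
    // "foo", verbatim 54 ('6'), verbatim run 88, 44 ("X,"), "bar".
    let encoded = [0, 254, 54, 255, 1, 88, 44, 1];
    assert_eq!(decompress(&encoded, reverse), b"foo6X,bar");
}
```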

Constants

Codebook taken from: antirez/smaz.

Reverse codebook taken from: antirez/smaz.

Traits

Provides extension functions to compress and decompress using SMAZ algorithm.

Functions

Compresses the input using the SMAZ algorithm.

Decompresses the input using the SMAZ algorithm.

Encodes the string to conform to the SMAZ spec, but does not compress it.