Fast smaz
A fast and pure Rust port of the antirez/smaz library.
Example
use fast_smaz::Smaz;
let my_message = "Hello world!";
let compressed = my_message.smaz_compress();
let decompressed = compressed.smaz_decompress().unwrap();
assert_eq!(my_message.as_bytes(), decompressed);
Unicode/ASCII compatibility
fast-smaz does not make any assumptions about whether the input is encoded as Unicode or ASCII; the entire API is designed to work only with byte slices (&[u8]). However, the original smaz codebook uses only ASCII-encoded strings, so it will only really compress the data (given the codebook words) if the input encoding is a superset of ASCII, which includes UTF-8 (but not UTF-16 or UTF-32).
Trying to compress strings encoded with anything other than ASCII or UTF-8 is therefore not guaranteed to actually shrink the input, but the output is still expected to correctly encode the input data conforming to the SMAZ spec. This means you can still get back the original data; the string will just probably never get smaller.
This also means that you can pass non-UTF-8 and non-ASCII encoded inputs, get back the original data, and recreate the string using the same input encoding. This is guaranteed to be safe as long as the data passed to decompress conforms to the SMAZ spec and was compressed using a SMAZ compression algorithm.
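For instance, assuming the Smaz extension trait is implemented for byte slices (the API works on &[u8], as noted above), a round trip over bytes that are not valid in any common text encoding might look like this:

use fast_smaz::Smaz;

// 0xFF and 0xFE never appear in valid UTF-8.
let raw: &[u8] = &[0x68, 0x69, 0xFF, 0xFE];
let compressed = raw.smaz_compress();
let decompressed = compressed.smaz_decompress().unwrap();
assert_eq!(raw, decompressed.as_slice());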
Compressing non-ASCII characters (2-byte, 3-byte or 4-byte wide)
Since the original codebook only has ASCII characters, non-ASCII characters are simply stored as-is in the result data and restored when decompressed. Nothing is lost when using SMAZ, even if the input cannot be compressed, and even if it is a sequence of bytes that is not a valid character in any existing encoding.
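For example, using the same Smaz trait from the example above, a UTF-8 string with 2-byte and 4-byte characters still round-trips losslessly, even though those characters are never matched by the codebook:

use fast_smaz::Smaz;

let msg = "héllo wörld 👋"; // 'é' and 'ö' are 2 bytes in UTF-8, the emoji is 4
let compressed = msg.smaz_compress();
let decompressed = compressed.smaz_decompress().unwrap();
assert_eq!(msg.as_bytes(), decompressed);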
How smaz works
Here we dive into the details of the SMAZ algorithm.
The codebook and reverse codebook
The whole process is quite simple once you understand the structure and logic of the SMAZ algorithm; below I’ve described how it works. This can be used as a resource to understand the algorithm, to write a new implementation of it, or for anything else useful.
Compression
The codebook is a simple hash table with multiple buckets that provides fast lookup for substrings that can be compressed.
It is basically constructed in this form:
An entry consists of:
[NUMBER_OF_BYTES][BYTES_OF_ASCII_STRING][INDEX_IN_REVERSE_CODEBOOK]
And a bucket is just a sequence of entries, like this:
[ENTRY][ENTRY][ENTRY]...
There is no separator character between entries; the layout relies solely on the fact that NUMBER_OF_BYTES occupies a single byte (u8), followed by N bytes (where N is the value of NUMBER_OF_BYTES), and a final value, INDEX_IN_REVERSE_CODEBOOK, which occupies a single byte as well (u8).
Because of this, you can seek to any entry in the bucket by looking at the previous one, which gives the well-known time complexity of O(n) for searching an entry inside a bucket of a hash table (but instead of a linked list, we have a contiguous sequence of data: an array).
Every element in the codebook array (i.e. every slice) is a bucket, and we can use the original SMAZ hashing algorithm to find the bucket that probably contains the subslice we are looking for.
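As a rough sketch (not the crate’s actual internals), finding a word inside a bucket works like this; the hard-coded hash functions that select the bucket are omitted:

// Each entry is [len, ...len bytes..., reverse-codebook index].
fn find_in_bucket(bucket: &[u8], needle: &[u8]) -> Option<u8> {
    let mut i = 0;
    while i < bucket.len() {
        let len = bucket[i] as usize;
        let word = &bucket[i + 1..i + 1 + len];
        let index = bucket[i + 1 + len];
        if word == needle {
            return Some(index);
        }
        // Seek past this entry: 1 length byte + `len` word bytes + 1 index byte.
        i += len + 2;
    }
    None
}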
Bucket example
For example, given the words foo and bar, assuming they reside in the same bucket and are located at indexes 4 and 5 of the reverse codebook, respectively:
[3, b'f', b'o', b'o', 4, 3, b'b', b'a', b'r', 5]
Hash table example
The hash table must follow the distribution based on the hashing algorithm used to find the buckets, which is hard-coded in the SMAZ algorithm. SMAZ uses 3 different algorithms: one for subslices with only one element, one for subslices with 2 elements, and one for subslices with 3 or more elements.
Each bucket is one element in the codebook array, for example:
[
    &[3, b'f', b'o', b'o', 4, 3, b'b', b'a', b'r', 5], // first bucket
    &[3, b'b', b'a', b'z', 6], // second bucket
]
What if we don’t find any entry?
In this case, we put the byte in a buffer and process the next one, trying to find a subslice in the hash table. Now we have two scenarios. If we find an entry, the previous non-encoded byte is emitted after a \254 code: for example, if the byte 54 is not encoded and is sitting in the queue, we push [\254, 54] to the output result vector.
If we don’t find an entry, we repeat the same process, pushing the byte to the buffer and processing the next one. Now, if we do find an entry, we have more than one byte in our queue, so all of them are emitted after a \255 code and an additional length byte. For example, if the bytes 54 and 88 are in the queue, we push [\255, 1, 54, 88] to the output result vector; if we had three bytes, the previous ones plus 44, we push [\255, 2, 54, 88, 44]. Note that the second value is the number of additional bytes: we already know there must be at least one byte after the length (otherwise it would be encoded with \254), so we push N - 1 as the number of additional bytes instead of N.
Why? This allows us to store one additional byte before we reach the maximum value of u8 (u8::MAX).
In this way, we can also preserve the input integrity even if it cannot be compressed.
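As a minimal sketch (assuming output is the compressed buffer and verbatim is the queue of non-encoded bytes described above), flushing the queue looks like this:

fn flush_verbatim(output: &mut Vec<u8>, verbatim: &mut Vec<u8>) {
    match verbatim.len() {
        0 => {}
        1 => {
            // A single byte: [\254, byte].
            output.push(254);
            output.push(verbatim[0]);
        }
        n => {
            // Multiple bytes: [\255, n - 1, ...bytes...]. Storing n - 1 lets a
            // u8 length cover one extra byte; a full implementation flushes
            // before the queue grows past 256 bytes.
            output.push(255);
            output.push((n - 1) as u8);
            output.extend_from_slice(verbatim);
        }
    }
    verbatim.clear();
}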
One last important detail: the maximum subslice length we look for is 7, which is the length of the longest string in the codebook (http://), and the minimum is 1.
Also, empty inputs translate into empty outputs, no matter whether you pass them to compress, encode, or decompress.
Note that we reslice the input data to skip the bytes we already processed, which happens when we find an entry or when we feed the byte(s) to the buffer.
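Putting it together, a sketch of the whole compression loop could look like the following, reusing find_in_bucket and flush_verbatim from above; bucket_for stands in for SMAZ’s hard-coded hash functions and is hypothetical here:

fn compress_sketch(
    input: &[u8],
    codebook: &[&[u8]],
    bucket_for: impl Fn(&[u8]) -> usize,
) -> Vec<u8> {
    let mut output = Vec::new();
    let mut verbatim = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // Try the longest candidate first: 7 bytes down to 1.
        let mut matched = None;
        for len in (1..=rest.len().min(7)).rev() {
            let candidate = &rest[..len];
            if let Some(code) = find_in_bucket(codebook[bucket_for(candidate)], candidate) {
                matched = Some((code, len));
                break;
            }
        }
        if let Some((code, len)) = matched {
            flush_verbatim(&mut output, &mut verbatim); // emit queued bytes first
            output.push(code);
            rest = &rest[len..]; // reslice past the matched word
        } else {
            verbatim.push(rest[0]);
            rest = &rest[1..];
            if verbatim.len() == 256 {
                flush_verbatim(&mut output, &mut verbatim); // keep the u8 length valid
            }
        }
    }
    flush_verbatim(&mut output, &mut verbatim);
    output
}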
Decompression
For decompression, we use the reverse codebook. We take the first byte and check:
- If it is \254, take the next byte and push it to the result vector.
- If it is \255, take the next byte, which is the number of additional bytes to read (let’s assign it to N), then read the byte right after the size and push it to the result vector, then take the next N bytes and push all of them to the result vector.
- Otherwise, use the byte as the index to look up in the reverse codebook, and push all the characters found in the reverse codebook at the provided index to the result vector.
Note that after each of those steps we take a new slice of the input encoded data starting right after the last processed byte. For example, in the first case we start after the first 2 bytes; in the second, after the first 3 bytes (\255, the length, and a single byte) plus the additional bytes read; and in the third, after only one byte.
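Under the same assumptions, a sketch of the decompression loop (returning None on malformed input instead of the crate’s real error type), where reverse is the reverse codebook indexed by code byte:

fn decompress_sketch(input: &[u8], reverse: &[&[u8]]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut rest = input;
    while let Some(&first) = rest.first() {
        rest = match first {
            254 => {
                // [\254, byte]: one verbatim byte.
                out.push(*rest.get(1)?);
                &rest[2..]
            }
            255 => {
                // [\255, n, byte, ...n more bytes...]: n + 1 verbatim bytes.
                let n = *rest.get(1)? as usize;
                out.extend_from_slice(rest.get(2..2 + n + 1)?);
                &rest[2 + n + 1..]
            }
            code => {
                // Any other byte is an index into the reverse codebook.
                out.extend_from_slice(reverse.get(code as usize)?);
                &rest[1..]
            }
        };
    }
    Some(out)
}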
Constants
Codebook taken from: antirez/smaz.
Reverse codebook taken from: antirez/smaz.
Traits
Provides extension functions to compress and decompress using the SMAZ algorithm.
Functions
Compresses the input using the SMAZ algorithm.
Decompresses the input using the SMAZ algorithm.