Byte Size

A short string compressor/decompressor that can store 20,000+ words in three bytes or less.

Similar to smaz, byte-size is able to compress small strings, something that other conventional compression algorithms struggle with.

However, byte-size is typically better than smaz, certainly for very commonly used words (out of 10000 most common words, ony 1% had better compression with smaz) byte-size can also represent numbers, repeated sequences and non-alphanumeric characters more efficiently than smaz. It can encode unicode characters, but not very efficiently. If your text includes a few unicode characters it should still compress better, but if your strings are mostly unicode characters, other schemes such as Unishox are better.

Cost

byte-size uses several tables with over 18000 total entries. Obviously this will incur a large runtime memory and binary file size cost, but if you have the memory available, it is worth it to compress more effectively.

To match these, currently we use a poor algorithm that lops over EVERY entry in EVERY table to obtain the best map. Future versions will use a phf hash table approach.

Examples

Using examples directly from smaz we have:

[Insert examples]

We can see how every example is compressed more with byte-size than smaz.

Encoding

the one byte wonder sequences are taken from smaz
- We filtered out certain sequences that don't work well
the two byte common and three byte uncommon words are chosen from here
- Words are only chosen from the list if their representation is more compact (for example the word 'the' does not appear in either 2 byte or 3 byte tables as it can be represented in 1 byte)

The Snaz encoding is as follows:

The one byte wonders consist of 240 one byte sequences. These one byte wonders are the ascii values in their normal positions, with other common sequences filling the gaps to
- This leaves 16 values remaining for multi-byte codes.
  - One of these values is used to indicate a Unicode scalar value will follow
  - This leaves 15*256=3840 combinations of two byte sequences, which are divided as such:
    - 3586 are used to encode the 2 byte common words, 1793 with a space prefix and 1793 without
    - 32 are used to encode custom words
    - 32 are used to encode sequence repetitions of anywhere between 4 and 35 repeating units
    - 32 are used to encode numbers. 32 values means 5 bits in total, 3 for the number of bytes used, and 2 for the number itself
    - 29 are used to encode the non-printable control characters
    - 129 are used for the 3 byte codes. 129*256 = 33024 combinations
      - All 33024 combinations are used for the 3 byte uncommon words, 16512 with a space prefix, and 16512 without

byte-size 0.2.0

Byte Size

Cost

Examples

Encoding