Sequential Compression

This is a proof-of-concept, "toy", compression algorithm that takes sequences of matching bytes and represents them as text.

This allows for unparalleled compression for a string such as aaaaaaaaaaaaaaaaaaaaaaaaaaa, but falls apart with anything more complex.

For example the string above would be represented as 97x27, 97 being the UTF-8, binary representation of the character "a". This representation takes up a mere 5 bytes whereas the uncompressed string takes up 27 bytes.

Benchmarks

I compressed a 1GB file was 0 (not the character a binary 0) repeated 8 billion times. This was compressed to the string 0x8000000000, which is equivalent to about a 9999999985% decrease in size.

By comparison if you try to compress the Cargo.toml file in this repository, it goes from 180 bytes to 977 bytes which is a 442.778% increase in size.

Spec

Every compression algorithm needs a specification :)

Compressed files must be represented as plain text files using UTF-8 encoding.
Every sequence of bytes is represented as the byte, represented as an integer, followed by the character x, followed by the number of times it was repeated.
Every sequence is newline delimited

Made with 💗 by Juliette Cordor

seqcompress 0.1.0

Sequential Compression

Benchmarks

Spec