mayda
A Rust library to compress integer arrays (all primitive integer types are supported). The design favors decompression speed and the ability to index the compressed array over the compression ratio, on the principle that the runtime penalty for using compressed arrays should be as small as possible.
This crate provides three variations on a single compression algorithm. The
Uniform
type can decompress around five billion u32
s per second, or 20
GiB/s of decompressed integers, on a 2.6 GHz Intel Core i7-6700HQ processor.
The Monotone
and Unimodal
types decompress at around half the speed,
but can have much better compression ratios depending on the distribution of
the integers.
Compiling mayda
requires the nightly compiler and CPU support for the
SSE2 instruction set (any Intel or AMD processor manufactured after 2003). The
basic approach is further described in Zukowski2006 and Lemire2015.
Documentation
The module documentation provides further examples and some more detailed descriptions of the algorithms involved.
Usage
Add this to your Cargo.toml
:
[]
= "0.1"
and this to your crate root:
extern crate mayda;
Example: encoding and decoding
The Uniform
struct is recommended for general purpose integer compression.
Use the Encodable
trait to encode and decode the array.
extern crate mayda;
use ;
Example: indexing
Use the Access
trait to index the compressed array. This is similar to Index
,
but returns a vector instead of a slice.
extern crate mayda;
use ;
Performance
Consider the following three vectors:
extern crate rand;
use ;
let mut rng = thread_rng;
let length: usize = 1024;
// `input1` contains random integers
let mut input1: = Vec with_capacity;
let range = new;
for _ in 0..length
// `input2` contains monotone increasing integers
let mut input2: = input1.clone;
input2.sort;
// `input3` contains Gaussian distributed integers with outliers
let mut input3: = Vec with_capacity;
let gaussian = new;
for _ in 0..length
let index = new;
let outliers = new;
for _ in 0..
The performance of the Uniform
, Monotone
and Unimodal
types for
these three vectors on a 2.6 GHz Intel Core i7-6700HQ processor is given below.
Encoding and decoding speeds are reported in billions of integers per second,
time required to index the last entry is reported in nanoseconds, and
compression is reported as bits per integer. Native encoding and decoding
speeds allocate memory and perform a memcpy. The Shannon entropy is a
reasonable target for the bits per integer.
For input1
the Shannon entropy is 10.00. Uniform
is preferrable in every
respect for the general case.
Encode (BInt/s) | Decode (BInt/s) | Index (ns) | Bits/Int | |
---|---|---|---|---|
Uniform | 1.26 | 5.28 | 29 | 10.63 |
Monotone | 1.19 | 2.45 | 68 | 32.63 |
Unimodal | 0.21 | 2.13 | 53 | 11.16 |
Native | 14.03 | 14.03 | 0 | 32 |
For input2
the Shannon entropy is 10.00, but the additional structure is used
by Monotone
to improve compression.
Encode (BInt/s) | Decode (BInt/s) | Index (ns) | Bits/Int | |
---|---|---|---|---|
Uniform | 1.21 | 5.22 | 29 | 8.25 |
Monotone | 1.25 | 2.41 | 68 | 3.5 |
Unimodal | 0.24 | 2.35 | 52 | 8.13 |
Native | 14.03 | 14.03 | 0 | 32 |
For input3
the Shannon entropy is 12.46, but compression is difficult due to
the presence of outliers. Unimodal
gives the most robust compression.
Encode (BInt/s) | Decode (BInt/s) | Index (ns) | Bits/Int | |
---|---|---|---|---|
Uniform | 1.21 | 5.25 | 29 | 32.63 |
Monotone | 1.16 | 2.45 | 66 | 32.63 |
Unimodal | 0.18 | 1.61 | 63 | 12.50 |
Native | 14.03 | 14.03 | 0 | 32 |
License
mayda
is licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.