Crate mayda [] [src]

mayda is a library to compress integer arrays (all primitive integer types are supported). The design favors decompression speed and the ability to index the compressed array over the compression ratio, on the principle that the runtime penalty for using compressed arrays should be as small as possible.

This crate provides three variations on a single compression algorithm. The Uniform type can decompress around six billion u32s per second, or 24 GiB/s of decompressed integers, on a 2.6 GHz Intel Core i7-6700HQ processor (see below for specifics). The Monotone and Unimodal types decompress at a little less than half the speed, but can have much better compression ratios depending on the distribution of the integers. Overall performance is comparable to the fastest (known) libraries in any language.

Usage

Add this to your Cargo.toml:

[dependencies]
mayda = "^0.2"

and this to your crate root:

extern crate mayda;

Example: encoding and decoding

The Uniform struct is recommended for general purpose integer compression. Use the Encode trait to encode and decode the array.

use std::mem;
use mayda::{Uniform, Encode};

// Integers in some interval of length 255, require one byte per integer
let input: Vec<u32> = (0..128).map(|x| (x * 73) % 181 + 307).collect();

let mut uniform = Uniform::new();
uniform.encode(&input).unwrap();

let i_bytes = std::mem::size_of_val(input.as_slice());
let u_bytes = std::mem::size_of_val(uniform.storage());

// 128 bytes for encoded entries, 12 bytes of overhead
assert_eq!(i_bytes, 512);
assert_eq!(u_bytes, 140);

let output: Vec<u32> = uniform.decode();
assert_eq!(input, output);

Example: indexing

Use the Access or AccessInto traits to index the compressed array. This is similar to Index, but returns a vector instead of a slice.

use mayda::{Uniform, Encode, Access};

// All primitive integer types supported
let input: Vec<i16> = (-64..64).collect();

let mut uniform = Uniform::new();
uniform.encode(&input).unwrap();

let val: i16 = uniform.access(64);
assert_eq!(val, 0);

// Vector is returned to give ownership to the caller
let range: Vec<i16> = uniform.access(..10);
assert_eq!(range, (-64..-54).collect::<Vec<_>>());

Algorithm

The compression algorithm relies on the observation that for many integer arrays, the probability distribution of a block of 128 entries is not uniform over all values that can be represented by the integer type. For example, an array of indices into a second array with 256 entries has entries that are bounded below by 0 and above by 255. This requires only eight bits per entry, rather than the 32 or 64 generally set aside for a usize index. The basic idea of the compression algorithm is to represent all of the entries in a block with a single minimum necessary bit width. This allows SIMD operations and manual loop unrolling to avoid branches to be used to accelerate encoding and decoding.

This approach does not perform well for probability distributions with outliers though, or for situations where the median value is nonzero. This crate provides three types that handle this situation by applying different initial transformations to the input integer array.

The Uniform type targets encoding and decoding speed. The only transformation is to subtract the minimum value from the entries, with the result that the compression depends only on the difference between the largest and smallest entries.

The Monotone type is specifically intended for arrays with monotone increasing entries. A blocks of entries is transformed into an offset and an array of differences of successive entries, and the array of differences is encoded by the approach above. The compression depends only on the largest entry in the difference array.

The Unimodal type performs the most extensive transformations and offers the most robust compression. The median value is subtracted from the entries, the entries are mapped to the unsigned integers by the zig-zag encoding, and the most significant bits of any outliers are removed and stored separately. The result is that the compression effectively depends only on the standard deviation of the probability distribution of the block entries.

Indexing

Indexing an object of the Uniform, Monotone or Unimodal types is not a simple pointer offset. Part of the header encodes the relative offsets to every block in a compressed form, with the result that random access via the Access trait is an O(log(idx)) operation, where idx is the value of the index (not the length of the array). The overhead of this part of the header is around a tenth of a bit per encoded integer.

If you need to access several nearby entries, consider accessing the range and indexing the returned vector for performance.

Safety

As a general rule, you should not decode or access objects of the Uniform, Monotone or Unimodal types from untrusted sources.

mayda performs wildly unsafe pointer operations during encoding and decoding. Changing the header information with mut_storage can cause data to be written to or read from arbitrary addresses in memory.

That said, the situation is the same for any of the data structures in the standard library (consider the set_len method of a Vec).

Re-exports

pub use self::error::Error;
pub use self::monotone::Monotone;
pub use self::uniform::Uniform;
pub use self::unimodal::Unimodal;
pub use self::utility::Access;
pub use self::utility::AccessInto;
pub use self::utility::Encode;

Modules

error

Defines the mayda::Error type. Currently only used for the return types of functions defined in the Encode trait, but intended to allow for more complex error handling in the future.

monotone

Monotone encoding of integer arrays. Intended for cases where the entries are monotonically increasing. Implemented for all primitive integer types.

uniform

Uniform encoding of integer arrays. Intended for cases where encoding and decoding speed is desired, or the probability distribution of the entries is uniform within certain bounds. Implemented for all primitive integer types.

unimodal

Unimodal encoding of integer arrays. Intended for cases where information about the probability distribution of the entries is not known, and the presence of outliers reduces the compression ratio of the other types. Implemented for all primitive integer types.

utility

Contains constants, enums, traits and functions used by all of the encoding types provided by the mayda crate.