Module constriction::stream

Stream Codes (entropy codes that amortize over several symbols)

Module Overview

This module provides implementations of stream codes and utilities for defining entropy models for these stream codes. This module is the heart of the constriction crate.

Provided Implementations of Stream Codes

Currently, the following stream codes are provided (see below for a comparison):

  • Asymmetric Numeral Systems (ANS): a highly efficient modern stream code with near-optimal compression effectiveness; ANS Coding operates as a stack (“last in first out”) and is surjective, which enables advanced use cases like bits-back coding in hierarchical entropy models. See submodule stack.
  • Range Coding: a modern variant of Arithmetic Coding; it has essentially the same compression effectiveness as ANS Coding but operates as a queue (“first in first out”), which makes it preferable for autoregressive entropy models but less useful for hierarchical models (see comparison). See submodule queue.
  • Chain Coding: an experimental new entropy coder that combines the (net) effectiveness of stream codes with the locality of symbol codes; it is meant for experimental new compression approaches that perform joint inference, quantization, and bits-back coding in an end-to-end optimization. See submodule chain.

All of these stream codes are provided through types that implement the Encode and Decode traits defined in this module.

Provided Utilities for Entropy Models

To encode or decode a sequence of symbols with one of the above stream codes, you have to specify an EntropyModel for each symbol. The submodule model provides utilities for defining EntropyModels.

Examples

See submodules stack and queue.
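For a first impression, here is a minimal round trip with the ANS coder, modeled on the examples in the stack submodule. As in the crate’s own examples, the Gaussian distribution comes from the probability crate; treat this as a sketch rather than the canonical example:

```rust
use constriction::stream::{model::DefaultLeakyQuantizer, stack::DefaultAnsCoder, Decode};
use probability::distribution::Gaussian;

fn main() {
    // Define an entropy model: a Gaussian, quantized over the integer symbols in
    // the interval [-100, 100] (the "leaky" quantizer assigns a nonzero
    // probability to every symbol in this range).
    let quantizer = DefaultLeakyQuantizer::new(-100..=100);
    let model = quantizer.quantize(Gaussian::new(10.3, 7.1));

    // Encode the symbols in *reverse* order so that decoding (which pops symbols
    // off a stack, i.e., "last in first out") returns them in the original order.
    let symbols = vec![15, -3, 21, 9];
    let mut coder = DefaultAnsCoder::new();
    coder.encode_iid_symbols_reverse(&symbols, &model).unwrap();

    // Decode and verify the round trip.
    let reconstructed = coder
        .decode_iid_symbols(symbols.len(), &model)
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    assert_eq!(reconstructed, symbols);
}
```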

Which Stream Code Should I Use?

The list below shows a detailed comparison of the two major stream codes provided, ANS Coding and Range Coding (most users won’t want to use the Chain Coder).

TL;DR: Don’t overthink it. For many practical use cases, both ANS and Range Coding will probably do the job, and if you end up painting yourself into a corner then you can still switch between the methods later relatively easily since most operations are implemented in trait methods. If you want to use the bits-back trick or similar advanced techniques, then definitely use the ANS Coder because the provided Range Coder isn’t surjective (see below). If you have an autoregressive entropy model then you might prefer Range Coding. In any other case, a possibly biased recommendation from the author of this paragraph is to use ANS Coding by default for its simplicity and decoding speed.

Here’s the more detailed comparison between ANS Coding and Range Coding:

  • Read/write semantics: The main practical difference between ANS and Range Coding is that ANS Coding operates as a stack (“last in first out”) whereas Range Coding operates as a queue (“first in first out”). Whether a stack or a queue is better for you depends on your entropy model:
    • Autoregressive entropy models: point for Range Coding. If you use an ANS coder to encode a sequence of symbols then it is advisable to use the method AnsCoder::encode_symbols_reverse (see the code sketch after this list). As the name suggests, this method encodes the symbols onto the stack in reverse order so that, when you decode them at a later time, you’ll recover them in their original order. However, iterating over symbols and, in particular, over their associated entropy models in reverse order may be inconvenient for autoregressive models with long-term memory. Using an ANS coder with such a model means that you may end up having to first build up the full autoregressive model in memory front-to-back and only then iterate over it back-to-front for encoding (the decoder can still iterate as normal, front-to-back). Range Coding makes this easier since both encoding and decoding can iterate in the natural front-to-back direction.
    • Hierarchical entropy models: point for ANS Coding. Sometimes a stack (as in ANS Coding) is just what you need because having only a single “open end” at the top of the stack makes things easier. By contrast, a queue (as in Range Coding) has two separate open ends for reading and writing, respectively. This leads to a number of corner cases when the two ends come close to each other (which can happen either within a single word or on two neighboring words). To avoid the computational overhead for checking for these corner cases on every encoded or decoded symbol, the provided Range Coder strictly distinguishes between a RangeEncoder and a RangeDecoder type. Switching from a RangeEncoder to a RangeDecoder requires “sealing” the write end and “opening up” the read end, both of which are irreversible operations, i.e., you can’t switch back from a RangeDecoder with partially consumed data to a RangeEncoder to continue writing to it. By contrast, the provided ANS Coder just has a single AnsCoder type that supports arbitrary sequences of (possibly interleaved) read and write operations. Being able to interleave reads and writes (i.e., decoding and encoding operations) is important for bits-back coding, a useful method for hierarchical entropy models.
  • Surjectivity: point for ANS Coding. This is another property that is important for bits-back coding, and that only ANS Coding satisfies: encoding data with an ANS Coder and with a given sequence of entropy models can lead to any arbitrary sequence of words of compressed data. This means that decoding with an ANS Coder is invertible, even if the binary data that you decode is arbitrary. You can load any given binary data into an ANS Coder (with AnsCoder::from_binary) and then decode some part of the data into a sequence of symbols. Regardless of the source of the binary data, re-encoding the symbols will restore it. By contrast, Range Coding is not surjective, and decoding with the provided Range Coder is not always invertible. Due to rounding, Range Coding has some (small) pockets of unrealizable bit strings for any given sequence of entropy models. While the Range Decoder in constriction is currently lenient in this regard (it still outputs valid symbols when decoding an unrealizable bit string), re-encoding the symbols you obtain from decoding an unrealizable bit string will not restore the original (unrealizable) bit string. This means that the provided Range Coder cannot be used for bits-back coding in a reliable way.
  • Compression effectiveness: virtually no difference unless misused. With the provided presets, the bitrates of both ANS Coding and Range Coding are very close to the theoretical bound for lossless compression. The lack of surjectivity in Range Coding leads to a small overhead, but it seems to be roughly countered by a slightly larger rounding overhead in ANS Coding. Keep in mind, however, that deviating from the presets can considerably degrade compression effectiveness if not done carefully. It is strongly recommended to always start with the DefaultXxx type aliases (for either ANS or Range Coding), which use well-tested presets, before experimenting with deviations from these defaults.
  • Computational efficiency: point for ANS Coding. ANS Coding is a simpler algorithm than Range Coding, especially the decoding part. This manifests itself in fewer branches and a smaller internal coder state. Empirically, our decoding benchmarks in the file benches/lookup.rs run more than twice as fast with an AnsCoder than with a RangeDecoder. However, please note that (i) these benchmarks use the highly optimized lookup models; if you use other entropy models then these will likely be the computational bottleneck, not the coder; (ii) future versions of constriction may introduce further run-time optimizations; and (iii) while decoding is more than two times faster with ANS, encoding is somewhat (~ 10 %) faster with Range Coding (this might be because encoding with ANS, unlike decoding, involves an integer division, which is a surprisingly slow operation on most hardware).
  • Random access decoding: minor point for ANS Coding. Both ANS and Range Coding support decoding with random access via the Seek trait, but it comes at different costs. In order to jump (“seek”) to a given location in a slice of compressed data, you have to provide an index for a position in the slice and an internal coder state (you can get both during encoding via Pos::pos). While the type of the internal coder state can be controlled through type parameters, its minimal (and also default) size is only two words for ANS Coding but four words for Range Coding. Thus, if you’re developing a container file format that contains a jump table for random access decoding, then choosing ANS Coding over Range Coding will allow you to reduce the memory footprint of the jump table. Whether or not this is significant depends on how fine-grained your jump table is.
  • Serialization: minor point for ANS Coding. The state of an ANS Coder is entirely described by the compressed bit string. This means that you can interrupt encoding and/or decoding with an ANS Coder at any time, serialize the ANS Coder to a bit string, recover it at a later time by deserialization, and continue encoding and/or decoding. The serialized form of the ANS Coder is simply the string of compressed bits that the coder has accumulated so far. By contrast, a Range Encoder has some additional internal state that is not needed for decoding and is therefore not part of the compressed bit string, but that you would need if you wanted to continue appending symbols to a Range Encoder after serializing and deserializing it.
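To make the stack-vs-queue distinction concrete, here is a minimal sketch of both round trips, modeled on the examples in the stack and queue submodules (an i.i.d. entropy model keeps the sketch short; constructor names such as from_compressed should be double-checked against those submodules’ documentation):

```rust
use constriction::stream::{
    model::DefaultLeakyQuantizer,
    queue::{DefaultRangeDecoder, DefaultRangeEncoder},
    stack::DefaultAnsCoder,
    Decode, Encode,
};
use probability::distribution::Gaussian;

fn main() {
    let quantizer = DefaultLeakyQuantizer::new(-100..=100);
    let model = quantizer.quantize(Gaussian::new(0.0, 10.0));
    let symbols = vec![12, -7, 33];

    // ANS Coding is a stack: encode in *reverse* so that popping symbols off
    // the stack during decoding yields them in their original order.
    let mut ans = DefaultAnsCoder::new();
    ans.encode_iid_symbols_reverse(&symbols, &model).unwrap();
    let decoded = ans
        .decode_iid_symbols(symbols.len(), &model)
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    assert_eq!(decoded, symbols);

    // Range Coding is a queue: encoding and decoding both run front to back,
    // but encoder and decoder are distinct types, and "sealing" the write end
    // when converting the one into the other is irreversible.
    let mut encoder = DefaultRangeEncoder::new();
    encoder.encode_iid_symbols(&symbols, &model).unwrap();
    let compressed = encoder.into_compressed().unwrap();
    let mut decoder = DefaultRangeDecoder::from_compressed(compressed).unwrap();
    let decoded = decoder
        .decode_iid_symbols(symbols.len(), &model)
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    assert_eq!(decoded, symbols);
}
```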

What’s a Stream Code?

A stream code is an entropy coding algorithm that encodes a sequence of symbols into a compressed bit string, one symbol at a time, in a way that amortizes compressed bits over several symbols. This amortization allows stream codes to reach near-optimal compression performance without giving up computational efficiency. In our experiments, both the provided ANS Coder and the Range Coder implementations showed about 0.1 % relative overhead over the theoretical bound on the bitrate (using “default” presets).

The near-optimal compression performance of stream codes is to be seen in contrast to symbol codes (see module symbol), such as the well-known Huffman code. Symbol codes do not amortize over symbols. Instead, they map each symbol to a fixed sequence of bits of integer length (a “codeword”). This leads to a typical overhead of 0.5 bits per symbol in the best case, and to an overhead of almost 1 bit per symbol for entropy models with very low (≪ 1 bit of) entropy per symbol, which is common for deep learning based entropy models. Stream codes do not suffer from this overhead.
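To make this overhead concrete: a symbol with probability 0.99 carries only about 0.014 bits of information, but a symbol code must still spend a whole codeword (at least one bit) on it. The toy calculation below, using only the standard library, illustrates the gap that stream codes avoid by amortizing fractional bits across symbols:

```rust
fn main() {
    // The information content of a single symbol with probability p is -log2(p).
    let p: f64 = 0.99;
    let information_content = -p.log2(); // ≈ 0.0145 bits

    // A symbol code (e.g., Huffman) must assign this symbol a codeword of at
    // least 1 bit, so almost a full bit per symbol is wasted. A stream code
    // instead amortizes the fractional 0.0145 bits over neighboring symbols.
    let symbol_code_overhead = 1.0 - information_content;
    println!("information content: {information_content:.4} bits");
    println!("symbol code overhead: {symbol_code_overhead:.4} bits per symbol");
}
```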

The computational efficiency of stream codes is to be seen in contrast to block codes. Block codes are symbol codes that operate on blocks of several consecutive symbols at once, which reduces the relative impact of the constant overhead of ~0.5 bits per block compared to regular symbol codes. Block codes typically require some method of constructing the code books dynamically (e.g., Deflate) to avoid the computational cost from growing exponentially in the block size. By contrast, stream codes amortize over long sequences of symbols without ever constructing an explicit code book for the entire sequence of symbols. This keeps their computational cost linear in the number of symbols.

Highly Customizable Implementations With Sane Presets

Users who need precise control over the trade-off between compression effectiveness, runtime performance, and memory consumption can fine tune the provided implementations of stream codes and entropy models at compile time through type parameters discussed below. For most users, however, we recommend using one of the following presets.

Presets

As finding optimal settings can be tricky, we provide type aliases for two useful presets: a “default” preset (type aliases prefixed with Default, such as DefaultAnsCoder) and a “small” preset (prefixed with Small, such as SmallAnsCoder). The concrete parameter choices behind each preset are listed under “Customizations for Advanced Use Cases” below.

You’ll usually want to use matching presets for entropy coders and entropy models. However, it is legal to use entropy models with the “small” preset for an entropy coder with the (larger) “default” preset (the opposite way is statically prohibited on purpose by the type system). Using “small” models with a “default” coder may make sense if you want to mix entropy models with “default” and “small” presets (e.g., if some of the models are lookup tables but others aren’t).
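Here is a sketch of such mixing. The LeakyQuantizer type parameter order ⟨float type, symbol, probability, PRECISION⟩ is an assumption; consult the model submodule for the authoritative signature:

```rust
use constriction::stream::{model::LeakyQuantizer, stack::DefaultAnsCoder, Decode};
use probability::distribution::Gaussian;

fn main() {
    // A "small"-preset entropy model: Probability = u16, PRECISION = 12.
    // (The type parameter order <FloatType, Symbol, Probability, PRECISION> is
    // an assumption; see the `model` submodule for the authoritative signature.)
    let quantizer = LeakyQuantizer::<f64, i32, u16, 12>::new(-100..=100);
    let model = quantizer.quantize(Gaussian::new(0.0, 10.0));

    // A "default"-preset coder (Word = u32, State = u64). Using the "small"
    // model with the "default" coder is legal; the opposite combination is
    // rejected at compile time.
    let mut coder = DefaultAnsCoder::new();
    coder.encode_iid_symbols_reverse(&[5, -2, 7], &model).unwrap();
    let decoded = coder
        .decode_iid_symbols(3, &model)
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    assert_eq!(decoded, vec![5, -2, 7]);
}
```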

Customizations for Advanced Use Cases

Some advanced use cases may not be covered by the above presets. For such cases, the entropy coders and entropy models can be adjusted through the type parameters listed below.

Warning: Adjusting these parameters can severely impact compression effectiveness if not done with care. We strongly recommend always starting from one of the above presets, deviating from them only when necessary, and measuring the effect of any deviation (on compression effectiveness, computational performance, and memory consumption).

Type Parameters of Entropy Coders

  • Word: a BitArray specifying the smallest unit of compressed data that the entropy coder emits or reads in at a time; an entropy coder’s Word type has to be at least as large as the Probability type (see below) of any entropy model used with it.
    • The “default” preset sets Word = u32.
    • The “small” preset sets Word = u16.
  • State: a BitArray that parameterizes the size of the internal coder state (the actual coder state Code::State may differ from State, e.g., it may contain several fields of type State, as in the case of a Range Coder). Typically, State has to be at least twice as large as Word. You’ll want to use a small State if you’re planning to perform random accesses via the Seek trait because your container format will then typically contain some kind of jump table composed of indices and Code::States (see PosSeek::Position).
    • The “default” preset sets State = u64.
    • The “small” preset sets State = u32.
  • Backend: the source and/or sink of compressed data. See module backends. You’ll rarely have to specify Backend explicitly; it usually defaults to Vec<Word> for encoding and to either Vec<Word> or Cursor for decoding, depending on which entropy coder you use and who owns the compressed data. The ChainCoder has two backend type parameters because it keeps track of compressed data and remainders separately.
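For illustration, the two presets can be spelled out via explicit type parameters. The assumed parameter order ⟨Word, State⟩, with the backend defaulting to Vec<Word>, should be verified against the DefaultAnsCoder and SmallAnsCoder alias definitions:

```rust
use constriction::stream::stack::AnsCoder;

fn main() {
    // Assumed to be equivalent to `DefaultAnsCoder`: Word = u32, State = u64,
    // with the backend defaulting to `Vec<u32>`.
    let default_coder = AnsCoder::<u32, u64>::new();

    // Assumed to be equivalent to `SmallAnsCoder`: Word = u16, State = u32.
    // The smaller `State` halves the per-entry cost of a jump table for random
    // access decoding (see "Random access decoding" above).
    let small_coder = AnsCoder::<u16, u32>::new();

    assert!(default_coder.is_empty());
    assert!(small_coder.is_empty());
}
```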

Type Parameters of Entropy Models

  • Probability: a BitArray specifying the type used to represent probabilities (smaller than one) in fixed point arithmetic; must hold at least PRECISION bits (see below) and must not be larger than the Word type of the entropy coder that uses the model.
    • The “default” preset sets Probability = u32 (same as for Word above).
    • The “small” preset sets Probability = u16 (same as for Word above).
  • PRECISION: a usize const generic that defines the number of bits used when representing probabilities in fixed-point arithmetic. Must not be zero or larger than Probability::BITS. A small PRECISION will lead to compression overhead due to poor approximations of the true probability distribution. A large PRECISION will lead to a large memory overhead if you use a LookupDecoderModel, and it can make decoding with a LeakilyQuantizedDistribution slow.
    • The “default” preset sets PRECISION = 24.
    • The “small” preset sets PRECISION = 12.
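On the model side, a deviation from the presets might look as follows (again assuming the LeakyQuantizer parameter order used above; the trade-off noted in the comments paraphrases the list above):

```rust
use constriction::stream::model::LeakyQuantizer;
use probability::distribution::Gaussian;

fn main() {
    // Non-preset choice: Probability = u16 with PRECISION = 8. This satisfies
    // the constraint that PRECISION must not exceed Probability::BITS
    // (8 <= 16), but the coarse 8-bit fixed-point grid approximates the true
    // distribution poorly, which costs compression effectiveness.
    let coarse_quantizer = LeakyQuantizer::<f64, i32, u16, 8>::new(-100..=100);
    let _coarse_model = coarse_quantizer.quantize(Gaussian::new(0.0, 10.0));

    // The "default" preset for comparison: Probability = u32, PRECISION = 24.
    let fine_quantizer = LeakyQuantizer::<f64, i32, u32, 24>::new(-100..=100);
    let _fine_model = fine_quantizer.quantize(Gaussian::new(0.0, 10.0));
}
```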

Modules

  • chain: Experimental entropy coding algorithm for advanced variants of bits-back coding.
  • model: Probability distributions that can be used as entropy models for stream codes.
  • queue: Near-optimal compression on a queue (“first in first out”)
  • stack: Fast and near-optimal compression on a stack (“last in first out”)

Traits

  • AsDecoder: A trait for temporary conversion into a matching decoder type.
  • Code: Base trait for stream encoders and decoders.
  • Decode: A trait for stream decoders (i.e., decompressors).
  • Encode: A trait for stream encoders (i.e., compressors).
  • IntoDecoder: A trait for permanent conversion into a matching decoder type.