Module vbyte

Source
Expand description

Variable-length byte codes.

These codes represent a natural number as a sequence of bytes, the length of the sequence depends on the magnitude of the number. They are used in many contexts, and they are known under a plethora of different names such “vbyte”, “varint”, “variable-length quantity”, “LEB”, and so on.

There are several variants of their definition, but their implied distribution is always ≈ 1/x8/7

§Definition

Since there are a few slightly different variants used in production code and in the literature, before going into the details of this implementation we will try to define a clear taxonomy by explaining in detail the four three properties that define such variants.

The simplest variable-length byte code encodes a number with a binary representation of k bits using ⌈k / 7⌉ bytes. The binary representation is left-padded with zeros so to obtain exactly ⌈k / 7⌉ blocks of 7 bits. Each block is stored in a byte, prefixed with a continuation bit which is one for all blocks except for the last one.

§Endianness

The first property is the endianness of the bytes: in big-endian codes, the first byte contains the highest (most significant) bits, whereas in little-endian codes, the first byte contains the lowest (less significant) bits.

The advantage of the big-endian variant is that is lexicographical, that is, comparing lexicographically a stream of encoded natural numbers will give the same results as comparing lexicographically the encoded numbers, much like it happens for UTF-8 encoding.

§Completeness

This basic representation discussed above is not complete, as there are sequences that are not used. For example, zero can be written in many ways (e.g., 0x00 or 0x80 0x00 ), but we are using only the single-byte representation. Uncompleteness leads to a (small) loss in compression.

To have completeness, one can offset the representation in k bits by the maximum number representable using k − 1 bits. That is, we represent the interval [0..2⁷) with one byte, then the interval [2⁷..2⁷ + 2¹⁴] with two bytes, the interval [2⁷ + 2¹⁴..2⁷ + 2¹⁴ + 2²¹] with three bytes, and so on.

§Grouping

In the basic representation, the continuation bit is the most significant bit of each byte. However, one can gather all continuation bits in the first byte (as UTF-8 does). This approach makes it possible to compute the length of the code using a call to usize::leading_ones on the first negated byte, which usually maps to a negation and a call to a fast instruction for the detection of the most significant bit, improving branch prediction.

Note that if the code is grouped, choosing a code with the same endianness as your hardare can lead to a performance improvement, as after the first byte the rest of the code can be read with a read_exact.

§Sign

It is possible to extend the codes to represent signed integers. Two possible approaches are using a bijective mapping between the integers and the natural numbers, or defining a specialized code.

§Implementations

We provide two unsigned, ungrouped, complete representations, one big-endian and one little-endian. We recommend in general big-endian version as it is lexicographical. However, the big-endian version needs a small buffer when writing, so on some hardware the little-endian version might be faster.

Note that the endianness of the code is independent from the endianness of the underlying bit stream, as it just depend on the order in which bytes are written.

Since this code is byte-aligned, we provide also convenient, fast methods vbyte_write and vbyte_read that can be used on types implementing std::io::Read and std::io::Write. There are also endianness-specific methods vbyte_write_be, vbyte_write_le, vbyte_read_be, and vbyte_read_le.

§Examples

  • The LEB128 code used by LLVM is a little-endian incomplete ungrouped representation. There is both a signed and an unsigned version; the signed version represent negative numbers using two’s complement.

  • The code used by git is a big-endian complete ungrouped representation.

  • This implementation in folly is a little-endian incomplete ungrouped representation.

Traits§

VByteBeRead
Trait for reading big-endian variable-length byte codes.
VByteBeWrite
Trait for write big-endian variable-length byte codes.
VByteLeRead
Trait for reading little-endian variable-length byte codes.
VByteLeWrite
Trait for write little-endian variable-length byte codes.

Functions§

bit_len_vbyte
Return the length of the variable-length byte code for value in bits.
byte_len_vbyte
Return the length of the variable-length byte code for value in bytes.
vbyte_read
Decode a natural number from a byte stream using variable-length byte codes.
vbyte_read_be
Decode a natural number from a big-endian byte stream using variable-length byte codes.
vbyte_read_le
Decode a natural number from a little-endian byte stream using variable-length byte codes.
vbyte_write
Write a natural number to a byte stream using variable-length byte codes and return the number of bytes written.
vbyte_write_be
Encode a natural number to a big-endian byte stream using variable-length byte codes and return the number of bytes written.
vbyte_write_le
Encode a natural number to a little-endian byte stream using variable-length byte codes and return the number of bytes written.