Expand description
Variable-length byte codes.
These codes represent a natural number as a sequence of bytes, the length of the sequence depends on the magnitude of the number. They are used in many contexts, and they are known under a plethora of different names such “vbyte”, “varint”, “variable-length quantity”, “LEB”, and so on.
There are several variants of their definition, but their implied distribution is always ≈ 1/x8/7
§Definition
Since there are a few slightly different variants used in production code and in the literature, before going into the details of this implementation we will try to define a clear taxonomy by explaining in detail the four three properties that define such variants.
The simplest variable-length byte code encodes a number with a binary representation of k bits using ⌈k / 7⌉ bytes. The binary representation is left-padded with zeros so to obtain exactly ⌈k / 7⌉ blocks of 7 bits. Each block is stored in a byte, prefixed with a continuation bit which is one for all blocks except for the last one.
§Endianness
The first property is the endianness of the bytes: in big-endian codes, the first byte contains the highest (most significant) bits, whereas in little-endian codes, the first byte contains the lowest (less significant) bits.
The advantage of the big-endian variant is that is lexicographical, that is, comparing lexicographically a stream of encoded natural numbers will give the same results as comparing lexicographically the encoded numbers, much like it happens for UTF-8 encoding.
§Completeness
This basic representation discussed above is not complete, as there are
sequences that are not used. For example, zero can be written in many ways
(e.g., 0x00
or 0x80 0x00
), but we are using only the single-byte
representation. Uncompleteness leads to a (small) loss in compression.
To have completeness, one can offset the representation in k bits by the maximum number representable using k − 1 bits. That is, we represent the interval [0..2⁷) with one byte, then the interval [2⁷..2⁷ + 2¹⁴] with two bytes, the interval [2⁷ + 2¹⁴..2⁷ + 2¹⁴ + 2²¹] with three bytes, and so on.
§Grouping
In the basic representation, the continuation bit is the most significant
bit of each byte. However, one can gather all continuation bits in the first
byte (as UTF-8 does). This approach
makes it possible to compute the length of the code using a call to
usize::leading_ones
on the first negated byte, which usually maps to a
negation and a call to a fast instruction for the detection of the most
significant bit, improving branch prediction.
Note that if the code is grouped, choosing a code with the same endianness
as your hardare can lead to a performance improvement, as after the first
byte the rest of the code can be read with a
read_exact
.
§Sign
It is possible to extend the codes to represent signed integers. Two possible approaches are using a bijective mapping between the integers and the natural numbers, or defining a specialized code.
§Implementations
We provide two unsigned, ungrouped, complete representations, one big-endian and one little-endian. We recommend in general big-endian version as it is lexicographical. However, the big-endian version needs a small buffer when writing, so on some hardware the little-endian version might be faster.
Note that the endianness of the code is independent from the endianness of the underlying bit stream, as it just depend on the order in which bytes are written.
Since this code is byte-aligned, we provide also convenient, fast methods
vbyte_write
and vbyte_read
that can be used on types implementing
std::io::Read
and std::io::Write
. There are also endianness-specific
methods vbyte_write_be
, vbyte_write_le
, vbyte_read_be
, and
vbyte_read_le
.
§Examples
-
The LEB128 code used by LLVM is a little-endian incomplete ungrouped representation. There is both a signed and an unsigned version; the signed version represent negative numbers using two’s complement.
-
The code used by git is a big-endian complete ungrouped representation.
-
This implementation in folly is a little-endian incomplete ungrouped representation.
Traits§
- VByte
BeRead - Trait for reading big-endian variable-length byte codes.
- VByte
BeWrite - Trait for write big-endian variable-length byte codes.
- VByte
LeRead - Trait for reading little-endian variable-length byte codes.
- VByte
LeWrite - Trait for write little-endian variable-length byte codes.
Functions§
- bit_
len_ vbyte - Return the length of the variable-length byte code for
value
in bits. - byte_
len_ vbyte - Return the length of the variable-length byte code for
value
in bytes. - vbyte_
read - Decode a natural number from a byte stream using variable-length byte codes.
- vbyte_
read_ be - Decode a natural number from a big-endian byte stream using variable-length byte codes.
- vbyte_
read_ le - Decode a natural number from a little-endian byte stream using variable-length byte codes.
- vbyte_
write - Write a natural number to a byte stream using variable-length byte codes and return the number of bytes written.
- vbyte_
write_ be - Encode a natural number to a big-endian byte stream using variable-length byte codes and return the number of bytes written.
- vbyte_
write_ le - Encode a natural number to a little-endian byte stream using variable-length byte codes and return the number of bytes written.