dsi_bitstream/codes/
vbyte.rs

1/*
2 * SPDX-FileCopyrightText: 2023 Tommaso Fontana
3 * SPDX-FileCopyrightText: 2023 Sebastiano Vigna
4 *
5 * SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
6 */
7
8//! Variable-length byte codes.
9//!
10//! These codes represent a natural number as a sequence of bytes, the length of
11//! the sequence depends on the magnitude of the number. They are used in many
12//! contexts, and they are known under a plethora of different names such
13//! “vbyte”, “varint”, “[variable-length
14//! quantity](https://en.wikipedia.org/wiki/Variable-length_quantity)”, “LEB”,
15//! and so on.
16//!
17//! There are several variants of their definition, but their implied
18//! distribution is always ≈ 1/*x*<sup>8/7</sup>
19//!
20//! The supported range is [0 . . 2⁶⁴).
21//!
22//! # Definition
23//!
24//! Since there are a few slightly different variants used in production code
25//! and in the literature, before going into the details of this implementation
26//! we will try to define a clear taxonomy by explaining in detail the four
27//! three properties that define such variants.
28//!
29//! The simplest variable-length byte code encodes a number with a binary
30//! representation of *k* bits using ⌈*k* / 7⌉ bytes. The binary representation
31//! is left-padded with zeros so to obtain exactly ⌈*k* / 7⌉ blocks of 7 bits.
32//! Each block is stored in a byte, prefixed with a continuation bit which is
33//! one for all blocks except for the last one.
34//!
35//! ## Endianness
36//!
37//! The first property is the endianness of the bytes: in big-endian codes, the
38//! first byte contains the highest (most significant) bits, whereas in
39//! little-endian codes, the first byte contains the lowest (less significant)
40//! bits.
41//!
42//! The advantage of the big-endian variant is that is lexicographical, that is,
43//! comparing lexicographically a stream of encoded natural numbers will give
44//! the same results as comparing lexicographically the encoded numbers, much
45//! like it happens for [UTF-8 encoding](https://en.wikipedia.org/wiki/UTF-8).
46//!
47//! ## Completeness
48//!
49//! This basic representation discussed above is not *complete*, as there are
50//! sequences that are not used. For example, zero can be written in many ways
51//! (e.g., `0x00` or `0x80 0x00` ), but we are using only the single-byte
52//! representation. Incompleteness leads to a (small) loss in compression.
53//!
54//! To have completeness, one can offset the representation in *k* bits by the
55//! maximum number representable using *k* − 1 bits. That is, we represent the
56//! interval [0..2⁷) with one byte, then the interval [2⁷..2⁷ + 2¹⁴] with two
57//! bytes, the interval [2⁷ + 2¹⁴..2⁷ + 2¹⁴ + 2²¹] with three bytes, and so on.
58//!
59//! ## Grouping
60//!
61//! In the basic representation, the continuation bit is the most significant
62//! bit of each byte. However, one can gather all continuation bits in the first
63//! byte ([as UTF-8 does](https://en.wikipedia.org/wiki/UTF-8)). This approach
64//! makes it possible to compute the length of the code using a call to
65//! [`usize::leading_ones`] on the first negated byte, which usually maps to a
66//! negation and a call to a fast instruction for the detection of the most
67//! significant bit, improving branch prediction.
68//!
69//! Note that if the code is grouped, choosing a code with the same endianness
70//! as your hardware can lead to a performance improvement, as after the first
71//! byte the rest of the code can be read with a
72//![`read_exact`](std::io::Read::read_exact).
73//!
74//! ## Sign
75//!
76//! It is possible to extend the codes to represent signed integers. Two
77//! possible approaches are using a [bijective mapping](crate::codes::ToInt)
78//! between the integers and the natural numbers, or defining a specialized
79//! code.
80//!
81//! # Implementations
82//!
83//! We provide two unsigned, ungrouped, complete representations, one
84//! [big-endian](VByteBeRead) and one [little-endian](VByteLeRead). We recommend
85//! in general big-endian version as it is lexicographical. However, the
86//! big-endian version needs a small buffer when writing, so on some hardware
87//! the little-endian version might be faster.
88//!
89//! Note that the endianness of the code is independent from the endianness of
90//! the underlying bit stream, as it just depend on the order in which bytes are
91//! written.
92//!
93//! Since this code is byte-aligned, we provide also convenient, fast methods
94//! [`vbyte_write`] and [`vbyte_read`] that can be used on types implementing
95//! [`std::io::Read`] and [`std::io::Write`]. There are also endianness-specific
96//! methods [`vbyte_write_be`], [`vbyte_write_le`], [`vbyte_read_be`], and
97//! [`vbyte_read_le`].
98//!
99//! [`LEB128`]: https://en.wikipedia.org/wiki/LEB128
100//! [`VLQ`]: https://en.wikipedia.org/wiki/Variable-length_quantity
101//!
102//! # Examples
103//!
104//! - The [LEB128](https://en.wikipedia.org/wiki/LEB128) code used by LLVM is a
105//!   little-endian incomplete ungrouped representation. There is both a signed
106//!   and an unsigned version; the signed version represent negative numbers
107//!   using two's complement.
108//!
109//! - The [code used by
110//!   git](https://github.com/git/git/blob/7fb6aefd2aaffe66e614f7f7b83e5b7ab16d4806/varint.c)
111//!   is a big-endian complete ungrouped representation.
112//!
113//! - [This implementation in
114//!   folly](https://github.com/facebook/folly/blob/dd4a5eb057afbc3c7c7da71801df2ee3c61c47d1/folly/Varint.h#L109)
115//!   is a little-endian incomplete ungrouped representation.
116
117use crate::traits::*;
118
119/// Return the length of the variable-length byte code for `value` in bytes.
120#[must_use]
121#[inline(always)]
122pub fn byte_len_vbyte(mut value: u64) -> usize {
123    let mut len = 1;
124    loop {
125        value >>= 7;
126        if value == 0 {
127            return len;
128        }
129        value -= 1;
130        len += 1;
131    }
132}
133
134/// Return the length of the variable-length byte code for `value` in bits.
135#[must_use]
136#[inline(always)]
137pub fn bit_len_vbyte(value: u64) -> usize {
138    8 * byte_len_vbyte(value)
139}
140
141/// Trait for reading big-endian variable-length byte codes.
142///
143/// Note that the endianness of the code is independent
144/// from the endianness of the underlying bit stream.
145pub trait VByteBeRead<E: Endianness>: BitRead<E> {
146    fn read_vbyte_be(&mut self) -> Result<u64, Self::Error>;
147}
148
149/// Trait for reading little-endian variable-length byte codes.
150///
151/// Note that the endianness of the code is independent
152/// from the endianness of the underlying bit stream.
153pub trait VByteLeRead<E: Endianness>: BitRead<E> {
154    fn read_vbyte_le(&mut self) -> Result<u64, Self::Error>;
155}
156
157impl<E: Endianness, B: BitRead<E>> VByteBeRead<E> for B {
158    #[inline(always)]
159    fn read_vbyte_be(&mut self) -> Result<u64, Self::Error> {
160        let mut byte = self.read_bits(8)?;
161        let mut value = byte & 0x7F;
162        while (byte >> 7) != 0 {
163            value += 1;
164            byte = self.read_bits(8)?;
165            value = (value << 7) | (byte & 0x7F);
166        }
167        Ok(value)
168    }
169}
170
171impl<E: Endianness, B: BitRead<E>> VByteLeRead<E> for B {
172    #[inline(always)]
173    fn read_vbyte_le(&mut self) -> Result<u64, Self::Error> {
174        let mut result = 0;
175        let mut shift = 0;
176        loop {
177            let byte = self.read_bits(8)?;
178            result += (byte & 0x7F) << shift;
179            if (byte >> 7) == 0 {
180                break;
181            }
182            shift += 7;
183            result += 1 << shift;
184        }
185        Ok(result)
186    }
187}
188
189/// Trait for write big-endian variable-length byte codes.
190///
191/// Note that the endianness of the code is independent
192/// from the endianness of the underlying bit stream.
193pub trait VByteBeWrite<E: Endianness>: BitWrite<E> {
194    fn write_vbyte_be(&mut self, value: u64) -> Result<usize, Self::Error>;
195}
196/// Trait for write little-endian variable-length byte codes.
197///
198/// Note that the endianness of the code is independent
199/// from the endianness of the underlying bit stream.
200pub trait VByteLeWrite<E: Endianness>: BitWrite<E> {
201    fn write_vbyte_le(&mut self, value: u64) -> Result<usize, Self::Error>;
202}
203
204impl<E: Endianness, B: BitWrite<E>> VByteBeWrite<E> for B {
205    #[inline(always)]
206    fn write_vbyte_be(&mut self, mut value: u64) -> Result<usize, Self::Error> {
207        let mut buf = [0u8; 10];
208        let mut pos = buf.len() - 1;
209        buf[pos] = (value & 0x7F) as u8;
210        value >>= 7;
211        while value != 0 {
212            value -= 1;
213            pos -= 1;
214            buf[pos] = 0x80 | (value & 0x7F) as u8;
215            value >>= 7;
216        }
217        let bytes_to_write = buf.len() - pos;
218        for &byte in &buf[pos..] {
219            self.write_bits(byte as u64, 8)?;
220        }
221        Ok(bytes_to_write * 8)
222    }
223}
224
225impl<E: Endianness, B: BitWrite<E>> VByteLeWrite<E> for B {
226    #[inline(always)]
227    fn write_vbyte_le(&mut self, mut value: u64) -> Result<usize, Self::Error> {
228        let mut len = 1;
229        loop {
230            let byte = (value & 0x7F) as u8;
231            value >>= 7;
232            if value != 0 {
233                self.write_bits((byte | 0x80) as u64, 8)?;
234            } else {
235                self.write_bits(byte as u64, 8)?;
236                break;
237            }
238            value -= 1;
239            len += 1;
240        }
241        Ok(len * 8)
242    }
243}
244
245/// Write a natural number to a byte stream using variable-length byte codes and
246/// return the number of bytes written.
247///
248/// This method just delegates to the correct endianness-specific method.
249#[inline(always)]
250#[cfg(feature = "std")]
251pub fn vbyte_write<E: Endianness, W: std::io::Write>(
252    value: u64,
253    writer: &mut W,
254) -> std::io::Result<usize> {
255    if core::any::TypeId::of::<E>() == core::any::TypeId::of::<BigEndian>() {
256        vbyte_write_be(value, writer)
257    } else {
258        vbyte_write_le(value, writer)
259    }
260}
261
262/// Encode a natural number to a big-endian byte stream using variable-length
263/// byte codes and return the number of bytes written.
264#[inline(always)]
265#[cfg(feature = "std")]
266pub fn vbyte_write_be<W: std::io::Write>(mut value: u64, w: &mut W) -> std::io::Result<usize> {
267    let mut buf = [0u8; 10];
268    let mut pos = buf.len() - 1;
269    buf[pos] = (value & 0x7F) as u8;
270    value >>= 7;
271    while value != 0 {
272        value -= 1;
273        pos -= 1;
274        buf[pos] = 0x80 | (value & 0x7F) as u8;
275        value >>= 7;
276    }
277    let bytes_to_write = buf.len() - pos;
278    w.write_all(&buf[pos..])?;
279    Ok(bytes_to_write)
280}
281
282/// Encode a natural number to a little-endian byte stream using variable-length
283/// byte codes and return the number of bytes written.
284#[inline(always)]
285#[cfg(feature = "std")]
286pub fn vbyte_write_le<W: std::io::Write>(mut value: u64, writer: &mut W) -> std::io::Result<usize> {
287    let mut len = 1;
288    loop {
289        let byte = (value & 0x7F) as u8;
290        value >>= 7;
291        if value != 0 {
292            writer.write_all(&[byte | 0x80])?;
293        } else {
294            writer.write_all(&[byte])?;
295            break;
296        }
297        value -= 1;
298        len += 1;
299    }
300    Ok(len)
301}
302
303#[inline(always)]
304#[cfg(feature = "std")]
305/// Decode a natural number from a byte stream using variable-length byte codes.
306///
307/// This method just delegates to the correct endianness-specific method.
308pub fn vbyte_read<E: Endianness, R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
309    if core::any::TypeId::of::<E>() == core::any::TypeId::of::<BigEndian>() {
310        vbyte_read_be(reader)
311    } else {
312        vbyte_read_le(reader)
313    }
314}
315
316/// Decode a natural number from a big-endian byte stream using variable-length
317/// byte codes.
318#[inline(always)]
319#[cfg(feature = "std")]
320pub fn vbyte_read_be<R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
321    let mut buf = [0u8; 1];
322    let mut value: u64;
323    reader.read_exact(&mut buf)?;
324    value = (buf[0] & 0x7F) as u64;
325    while (buf[0] >> 7) != 0 {
326        value += 1;
327        reader.read_exact(&mut buf)?;
328        value = (value << 7) | ((buf[0] & 0x7F) as u64);
329    }
330    Ok(value)
331}
332
333/// Decode a natural number from a little-endian byte stream using
334/// variable-length byte codes.
335#[inline(always)]
336#[cfg(feature = "std")]
337pub fn vbyte_read_le<R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
338    let mut result = 0;
339    let mut shift = 0;
340    let mut buffer = [0; 1];
341    loop {
342        reader.read_exact(&mut buffer)?;
343        let byte = buffer[0];
344        result += ((byte & 0x7F) as u64) << shift;
345        if (byte >> 7) == 0 {
346            break;
347        }
348        shift += 7;
349        result += 1 << shift;
350    }
351    Ok(result)
352}
353
354#[cfg(test)]
355#[cfg(feature = "std")]
356mod test {
357    #[allow(unused_imports)]
358    use super::*;
359
360    const UPPER_BOUND_1: u64 = 128;
361    const UPPER_BOUND_2: u64 = 128_u64.pow(2) + UPPER_BOUND_1;
362    const UPPER_BOUND_3: u64 = 128_u64.pow(3) + UPPER_BOUND_2;
363    const UPPER_BOUND_4: u64 = 128_u64.pow(4) + UPPER_BOUND_3;
364    const UPPER_BOUND_5: u64 = 128_u64.pow(5) + UPPER_BOUND_4;
365    const UPPER_BOUND_6: u64 = 128_u64.pow(6) + UPPER_BOUND_5;
366    const UPPER_BOUND_7: u64 = 128_u64.pow(7) + UPPER_BOUND_6;
367    const UPPER_BOUND_8: u64 = 128_u64.pow(8) + UPPER_BOUND_7;
368
369    macro_rules! impl_tests {
370        ($test_name:ident, $E:ty) => {
371            #[test]
372            fn $test_name() {
373                const MAX: usize = 1 << 20;
374                const MIN: usize = 0;
375                let mut buffer = std::io::Cursor::new(Vec::with_capacity(128));
376                let mut lens = Vec::new();
377
378                for i in MIN..MAX {
379                    lens.push(vbyte_write::<$E, _>(i as _, &mut buffer).unwrap());
380                }
381                buffer.set_position(0);
382                for (i, l) in (MIN..MAX).zip(lens.iter()) {
383                    let j = vbyte_read::<$E, _>(&mut buffer).unwrap();
384                    assert_eq!(byte_len_vbyte(i as _), *l);
385                    assert_eq!(j, i as u64);
386                }
387
388                let values = [
389                    0,
390                    UPPER_BOUND_1 - 1,
391                    UPPER_BOUND_1 + 1,
392                    UPPER_BOUND_2 - 1,
393                    UPPER_BOUND_2 + 1,
394                    UPPER_BOUND_3 - 1,
395                    UPPER_BOUND_3 + 1,
396                    UPPER_BOUND_4 - 1,
397                    UPPER_BOUND_4 + 1,
398                    UPPER_BOUND_5 - 1,
399                    UPPER_BOUND_5 + 1,
400                    UPPER_BOUND_6 - 1,
401                    UPPER_BOUND_6 + 1,
402                    UPPER_BOUND_7 - 1,
403                    UPPER_BOUND_7 + 1,
404                    UPPER_BOUND_8 - 1,
405                    UPPER_BOUND_8 + 1,
406                    u64::MAX,
407                ];
408
409                let tell: u64 = buffer.position();
410                for &i in values.iter() {
411                    vbyte_write::<$E, _>(i, &mut buffer).unwrap();
412                }
413                buffer.set_position(tell);
414                for &i in values.iter() {
415                    assert_eq!(i, vbyte_read::<$E, _>(&mut buffer).unwrap());
416                }
417            }
418        };
419    }
420
421    impl_tests!(test_vbytes_be, BE);
422    impl_tests!(test_vbytes_le, LE);
423}