dsi-bitstream 0.9.2

A Rust implementation of read/write bit streams supporting several types of instantaneous codes
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
/*
 * SPDX-FileCopyrightText: 2023 Tommaso Fontana
 * SPDX-FileCopyrightText: 2023 Sebastiano Vigna
 *
 * SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
 */

//! Variable-length byte codes.
//!
//! These codes represent a natural number as a sequence of bytes, the length of
//! the sequence depends on the magnitude of the number. They are used in many
//! contexts, and they are known under a plethora of different names such as
//! “vbyte”, “varint”, “[variable-length
//! quantity](https://en.wikipedia.org/wiki/Variable-length_quantity)”, “LEB”,
//! and so on.
//!
//! There are several variants of their definition, but their implied
//! distribution is always ≈ 1/*x*<sup>8/7</sup>
//!
//! The supported range is [0 . . 2⁶⁴).
//!
//! # Definition
//!
//! Since there are a few slightly different variants used in production code
//! and in the literature, before going into the details of this implementation
//! we will try to define a clear taxonomy by explaining in detail the
//! three properties that define such variants.
//!
//! The simplest variable-length byte code encodes a number with a binary
//! representation of *k* bits using ⌈*k* / 7⌉ bytes. The binary representation
//! is left-padded with zeros so to obtain exactly ⌈*k* / 7⌉ blocks of 7 bits.
//! Each block is stored in a byte, prefixed with a continuation bit which is
//! one for all blocks except for the last one.
//!
//! ## Endianness
//!
//! The first property is the endianness of the bytes: in big-endian codes, the
//! first byte contains the highest (most significant) bits, whereas in
//! little-endian codes, the first byte contains the lowest (least significant)
//! bits.
//!
//! The advantage of the big-endian variant is that is lexicographical, that is,
//! comparing lexicographically a stream of encoded natural numbers will give
//! the same results as comparing lexicographically the encoded numbers, much
//! like it happens for [UTF-8 encoding](https://en.wikipedia.org/wiki/UTF-8).
//!
//! ## Completeness
//!
//! This basic representation discussed above is not *complete*, as there are
//! sequences that are not used. For example, zero can be written in many ways
//! (e.g., `0x00` or `0x80 0x00`), but we are using only the single-byte
//! representation. Incompleteness leads to a (small) loss in compression.
//!
//! To have completeness, one can offset the representation in *k* bits by the
//! maximum number representable using *k* − 1 bits. That is, we represent the
//! interval [0 . . 2⁷) with one byte, then the interval [2⁷ . . 2⁷ + 2¹⁴) with two
//! bytes, the interval [2⁷ + 2¹⁴ . . 2⁷ + 2¹⁴ + 2²¹) with three bytes, and so on.
//!
//! ## Grouping
//!
//! In the basic representation, the continuation bit is the most significant
//! bit of each byte. However, one can gather all continuation bits in the first
//! byte ([as UTF-8 does](https://en.wikipedia.org/wiki/UTF-8)). This approach
//! makes it possible to compute the length of the code using a call to
//! [`u8::leading_ones`] on the first negated byte, which usually maps to a
//! negation and a call to a fast instruction for the detection of the most
//! significant bit, improving branch prediction.
//!
//! Note that if the code is grouped, choosing a code with the same endianness
//! as your hardware can lead to a performance improvement, as after the first
//! byte the rest of the code can be read with a
//![`read_exact`](std::io::Read::read_exact).
//!
//! ## Sign
//!
//! It is possible to extend the codes to represent signed integers. Two
//! possible approaches are using a [bijective mapping](crate::codes::ToInt)
//! between the integers and the natural numbers, or defining a specialized
//! code.
//!
//! # Implementations
//!
//! We provide two unsigned, ungrouped, complete representations, one
//! [big-endian](VByteBeRead) and one [little-endian](VByteLeRead). We recommend
//! in general the big-endian version as it is lexicographical. However, the
//! big-endian version needs a small buffer when writing, so on some hardware
//! the little-endian version might be faster.
//!
//! Note that the endianness of the code is independent from the endianness of
//! the underlying bit stream, as it just depends on the order in which bytes are
//! written.
//!
//! Since this code is byte-aligned, we provide also convenient, fast methods
//! [`vbyte_write`] and [`vbyte_read`] that can be used on types implementing
//! [`std::io::Read`] and [`std::io::Write`]. There are also endianness-specific
//! methods [`vbyte_write_be`], [`vbyte_write_le`], [`vbyte_read_be`], and
//! [`vbyte_read_le`].
//!
//! # Examples
//!
//! - The [LEB128](https://en.wikipedia.org/wiki/LEB128) code used by LLVM is a
//!   little-endian incomplete ungrouped representation. There is both a signed
//!   and an unsigned version; the signed version represents negative numbers
//!   using two's complement.
//!
//! - The [code used by
//!   git](https://github.com/git/git/blob/7fb6aefd2aaffe66e614f7f7b83e5b7ab16d4806/varint.c)
//!   is a big-endian complete ungrouped representation.
//!
//! - [This implementation in
//!   folly](https://github.com/facebook/folly/blob/dd4a5eb057afbc3c7c7da71801df2ee3c61c47d1/folly/Varint.h#L109)
//!   is a little-endian incomplete ungrouped representation.

use crate::traits::*;

/// Returns the length of the variable-length byte code for `n` in bytes.
#[must_use]
#[inline(always)]
pub const fn byte_len_vbyte(mut n: u64) -> usize {
    let mut len = 1;
    loop {
        n >>= 7;
        if n == 0 {
            return len;
        }
        n -= 1;
        len += 1;
    }
}

/// Returns the length of the variable-length byte code for `n` in bits.
#[must_use]
#[inline(always)]
pub const fn bit_len_vbyte(n: u64) -> usize {
    8 * byte_len_vbyte(n)
}

/// Trait for reading big-endian variable-length byte codes.
///
/// Note that the endianness of the code is independent
/// from the endianness of the underlying bit stream.
pub trait VByteBeRead<E: Endianness>: BitRead<E> {
    fn read_vbyte_be(&mut self) -> Result<u64, Self::Error>;
}

/// Trait for reading little-endian variable-length byte codes.
///
/// Note that the endianness of the code is independent
/// from the endianness of the underlying bit stream.
pub trait VByteLeRead<E: Endianness>: BitRead<E> {
    fn read_vbyte_le(&mut self) -> Result<u64, Self::Error>;
}

impl<E: Endianness, B: BitRead<E>> VByteBeRead<E> for B {
    #[inline(always)]
    fn read_vbyte_be(&mut self) -> Result<u64, Self::Error> {
        let mut byte = self.read_bits(8)?;
        let mut value = byte & 0x7F;
        while (byte >> 7) != 0 {
            value += 1;
            byte = self.read_bits(8)?;
            value = (value << 7) | (byte & 0x7F);
        }
        Ok(value)
    }
}

impl<E: Endianness, B: BitRead<E>> VByteLeRead<E> for B {
    #[inline(always)]
    fn read_vbyte_le(&mut self) -> Result<u64, Self::Error> {
        let mut result = 0;
        let mut shift = 0;
        loop {
            let byte = self.read_bits(8)?;
            result += (byte & 0x7F) << shift;
            if (byte >> 7) == 0 {
                break;
            }
            shift += 7;
            result += 1 << shift;
        }
        Ok(result)
    }
}

/// Trait for writing big-endian variable-length byte codes.
///
/// Note that the endianness of the code is independent
/// from the endianness of the underlying bit stream.
pub trait VByteBeWrite<E: Endianness>: BitWrite<E> {
    fn write_vbyte_be(&mut self, n: u64) -> Result<usize, Self::Error>;
}
/// Trait for writing little-endian variable-length byte codes.
///
/// Note that the endianness of the code is independent
/// from the endianness of the underlying bit stream.
pub trait VByteLeWrite<E: Endianness>: BitWrite<E> {
    fn write_vbyte_le(&mut self, n: u64) -> Result<usize, Self::Error>;
}

impl<E: Endianness, B: BitWrite<E>> VByteBeWrite<E> for B {
    #[inline(always)]
    fn write_vbyte_be(&mut self, mut n: u64) -> Result<usize, Self::Error> {
        let mut buf = [0u8; 10];
        let mut pos = buf.len() - 1;
        buf[pos] = (n & 0x7F) as u8;
        n >>= 7;
        while n != 0 {
            n -= 1;
            pos -= 1;
            buf[pos] = 0x80 | (n & 0x7F) as u8;
            n >>= 7;
        }
        let bytes_to_write = buf.len() - pos;
        for &byte in &buf[pos..] {
            self.write_bits(byte as u64, 8)?;
        }
        Ok(bytes_to_write * 8)
    }
}

impl<E: Endianness, B: BitWrite<E>> VByteLeWrite<E> for B {
    #[inline(always)]
    fn write_vbyte_le(&mut self, mut n: u64) -> Result<usize, Self::Error> {
        let mut len = 1;
        loop {
            let byte = (n & 0x7F) as u8;
            n >>= 7;
            if n != 0 {
                self.write_bits((byte | 0x80) as u64, 8)?;
            } else {
                self.write_bits(byte as u64, 8)?;
                break;
            }
            n -= 1;
            len += 1;
        }
        Ok(len * 8)
    }
}

/// Writes a natural number to a byte stream using variable-length byte codes and
/// returns the number of bytes written.
///
/// This method just delegates to the correct endianness-specific method.
#[inline(always)]
#[cfg(feature = "std")]
pub fn vbyte_write<E: Endianness, W: std::io::Write>(
    n: u64,
    writer: &mut W,
) -> std::io::Result<usize> {
    if core::any::TypeId::of::<E>() == core::any::TypeId::of::<BigEndian>() {
        vbyte_write_be(n, writer)
    } else {
        vbyte_write_le(n, writer)
    }
}

/// Encode a natural number to a big-endian byte stream using variable-length
/// byte codes and return the number of bytes written.
#[inline(always)]
#[cfg(feature = "std")]
pub fn vbyte_write_be<W: std::io::Write>(mut n: u64, w: &mut W) -> std::io::Result<usize> {
    let mut buf = [0u8; 10];
    let mut pos = buf.len() - 1;
    buf[pos] = (n & 0x7F) as u8;
    n >>= 7;
    while n != 0 {
        n -= 1;
        pos -= 1;
        buf[pos] = 0x80 | (n & 0x7F) as u8;
        n >>= 7;
    }
    let bytes_to_write = buf.len() - pos;
    w.write_all(&buf[pos..])?;
    Ok(bytes_to_write)
}

/// Encode a natural number to a little-endian byte stream using variable-length
/// byte codes and return the number of bytes written.
#[inline(always)]
#[cfg(feature = "std")]
pub fn vbyte_write_le<W: std::io::Write>(mut n: u64, writer: &mut W) -> std::io::Result<usize> {
    let mut len = 1;
    loop {
        let byte = (n & 0x7F) as u8;
        n >>= 7;
        if n != 0 {
            writer.write_all(&[byte | 0x80])?;
        } else {
            writer.write_all(&[byte])?;
            break;
        }
        n -= 1;
        len += 1;
    }
    Ok(len)
}

/// Decode a natural number from a byte stream using variable-length byte codes.
///
/// This method just delegates to the correct endianness-specific method.
#[inline(always)]
#[cfg(feature = "std")]
pub fn vbyte_read<E: Endianness, R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
    if core::any::TypeId::of::<E>() == core::any::TypeId::of::<BigEndian>() {
        vbyte_read_be(reader)
    } else {
        vbyte_read_le(reader)
    }
}

/// Decode a natural number from a big-endian byte stream using variable-length
/// byte codes.
#[inline(always)]
#[cfg(feature = "std")]
pub fn vbyte_read_be<R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut buf = [0u8; 1];
    let mut value: u64;
    reader.read_exact(&mut buf)?;
    value = (buf[0] & 0x7F) as u64;
    while (buf[0] >> 7) != 0 {
        value += 1;
        reader.read_exact(&mut buf)?;
        value = (value << 7) | ((buf[0] & 0x7F) as u64);
    }
    Ok(value)
}

/// Decode a natural number from a little-endian byte stream using
/// variable-length byte codes.
#[inline(always)]
#[cfg(feature = "std")]
pub fn vbyte_read_le<R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
    let mut result = 0;
    let mut shift = 0;
    let mut buffer = [0; 1];
    loop {
        reader.read_exact(&mut buffer)?;
        let byte = buffer[0];
        result += ((byte & 0x7F) as u64) << shift;
        if (byte >> 7) == 0 {
            break;
        }
        shift += 7;
        result += 1 << shift;
    }
    Ok(result)
}

#[cfg(test)]
#[cfg(feature = "std")]
mod tests {
    #[allow(unused_imports)]
    use super::*;

    const UPPER_BOUND_1: u64 = 128;
    const UPPER_BOUND_2: u64 = 128_u64.pow(2) + UPPER_BOUND_1;
    const UPPER_BOUND_3: u64 = 128_u64.pow(3) + UPPER_BOUND_2;
    const UPPER_BOUND_4: u64 = 128_u64.pow(4) + UPPER_BOUND_3;
    const UPPER_BOUND_5: u64 = 128_u64.pow(5) + UPPER_BOUND_4;
    const UPPER_BOUND_6: u64 = 128_u64.pow(6) + UPPER_BOUND_5;
    const UPPER_BOUND_7: u64 = 128_u64.pow(7) + UPPER_BOUND_6;
    const UPPER_BOUND_8: u64 = 128_u64.pow(8) + UPPER_BOUND_7;

    macro_rules! impl_tests {
        ($test_name:ident, $E:ty) => {
            #[test]
            fn $test_name() -> std::io::Result<()> {
                const MAX: usize = 1 << 20;
                const MIN: usize = 0;
                let mut buffer = std::io::Cursor::new(Vec::with_capacity(128));
                let mut lens = Vec::new();

                for i in MIN..MAX {
                    lens.push(vbyte_write::<$E, _>(i as _, &mut buffer)?);
                }
                buffer.set_position(0);
                for (i, l) in (MIN..MAX).zip(lens.iter()) {
                    let j = vbyte_read::<$E, _>(&mut buffer)?;
                    assert_eq!(byte_len_vbyte(i as _), *l);
                    assert_eq!(j, i as u64);
                }

                let values = [
                    0,
                    UPPER_BOUND_1 - 1,
                    UPPER_BOUND_1 + 1,
                    UPPER_BOUND_2 - 1,
                    UPPER_BOUND_2 + 1,
                    UPPER_BOUND_3 - 1,
                    UPPER_BOUND_3 + 1,
                    UPPER_BOUND_4 - 1,
                    UPPER_BOUND_4 + 1,
                    UPPER_BOUND_5 - 1,
                    UPPER_BOUND_5 + 1,
                    UPPER_BOUND_6 - 1,
                    UPPER_BOUND_6 + 1,
                    UPPER_BOUND_7 - 1,
                    UPPER_BOUND_7 + 1,
                    UPPER_BOUND_8 - 1,
                    UPPER_BOUND_8 + 1,
                    u64::MAX,
                ];

                let tell: u64 = buffer.position();
                for &i in values.iter() {
                    vbyte_write::<$E, _>(i, &mut buffer)?;
                }
                buffer.set_position(tell);
                for &i in values.iter() {
                    assert_eq!(i, vbyte_read::<$E, _>(&mut buffer)?);
                }
                Ok(())
            }
        };
    }

    impl_tests!(test_vbytes_be, BE);
    impl_tests!(test_vbytes_le, LE);
}