vb 0.2.2

Variable byte encoding library / 变长字节编码库
Documentation

English | 中文


vb : Efficient compression for integer sequences

Table of Contents

Introduction

vb is a lightweight Rust library for Variable Byte (VByte) encoding. It provides an efficient way to compress u64 integers and lists of integers. By using fewer bytes for smaller numbers, it significantly reduces storage requirements for integer-heavy data.

Additionally, the library supports Differential Encoding (Delta Encoding), which is particularly effective for compressing strictly increasing sequences (e.g., sorted IDs, timestamps) by storing only the differences between consecutive values.

Features

  • Variable Byte Encoding: Compresses u64 integers into a variable-length byte sequence.
  • Differential Encoding: Optimizes storage for strictly increasing sequences (requires diff feature).
  • Zero-Copy Decoding: Efficient decoding directly from byte slices.
  • Simple API: Easy-to-use functions for single values and lists.
  • Error Handling: Robust error reporting using thiserror.

Usage

Add vb to your Cargo.toml:

[dependencies]
vb = "0.1.8"
# For differential encoding support
# vb = { version = "0.1.8", features = ["diff"] }

Basic Encoding

use vb::{e_li, d_li};

fn main() {
    let list = vec![0, 1, 127, 128, 300, 16383, 16384];

    // Encode list
    let encoded = e_li(&list);
    println!("Encoded length: {} bytes", encoded.len());

    // Decode list
    let decoded = d_li(&encoded).unwrap();
    assert_eq!(list, decoded);
}

Differential Encoding

Useful for sorted sequences like timestamps or primary keys.

#[cfg(feature = "diff")]
use vb::{e_diff, d_diff};

#[cfg(feature = "diff")]
fn main() {
    // Strictly increasing sequence
    let list = vec![10000, 10001, 10002, 10003, 10004];

    // Encode using differential encoding
    let encoded = e_diff(&list);

    // Decode back to original sequence
    let decoded = d_diff(&encoded).unwrap();
    assert_eq!(list, decoded);
}

API Reference

The library exports the following key functions:

  • d(input: impl AsRef<[u8]>) -> Result<(u64, usize)> Decodes a single variable-byte encoded integer. Returns the value and bytes consumed.

  • e_li(li: impl AsRef<[u64]>) -> Vec<u8> Encodes a list of u64 integers into a byte vector.

  • d_li(data: impl AsRef<[u8]>) -> Result<Vec<u64>> Decodes a byte vector back into a list of u64 integers.

  • e_diff(li: impl AsRef<[u64]>) -> Vec<u8> (feature: diff) Encodes a strictly increasing sequence using differential encoding combined with VByte.

  • d_diff(vs: impl AsRef<[u8]>) -> Result<Vec<u64>> (feature: diff) Decodes a differentially encoded sequence.

Design

The VByte format uses 7 bits of each byte for data and the Most Significant Bit (MSB) as a continuation flag.

  • MSB = 0: The last byte of the integer.
  • MSB = 1: More bytes follow.

For differential encoding (e_diff), the library first calculates the difference between adjacent elements ($x_i - x_{i-1}$) and then encodes these smaller differences using VByte. This results in significant compression for dense, increasing sequences.

Tech Stack

  • Rust: Core language.
  • thiserror: For ergonomic error handling.

Directory Structure

.
├── Cargo.toml      # Project configuration
├── readme          # Documentation
│   ├── en.md       # English README
│   └── zh.md       # Chinese README
├── src
│   └── lib.rs      # Source code
└── tests
    └── main.rs     # Integration tests

History

Variable Byte encoding (also known as VByte, Varint, or LEB128) has a rich history in computer science.

  • MIDI Standard: One of the earliest widespread uses was in the MIDI file format as "Variable-Length Quantity" (VLQ) to save space in music files.
  • Search Engines: In the late 90s, search engines like Google used VByte to compress inverted indexes (lists of document IDs), balancing high compression ratios with very fast decoding speeds.
  • DWARF: The DWARF debugging data format uses a variant called LEB128 (Little Endian Base 128).
  • Protocol Buffers: Google's Protocol Buffers heavily rely on "Varints" for efficient data serialization.

This simple yet powerful technique remains a cornerstone of data compression in systems where both space and speed matter.


About

This project is an open-source component of js0.site ⋅ Refactoring the Internet Plan.

We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:


vb : 高效的整数序列压缩方案

目录

简介

vb 是一个轻量级的 Rust 库,用于实现 变长字节编码 (Variable Byte Encoding)。它提供了一种高效的方法来压缩 u64 整数及其列表。通过对较小的数字使用更少的字节,它能显著减少整数密集型数据的存储需求。

此外,该库支持 差分编码 (Delta Encoding),通过仅存储连续值之间的差值,对严格递增的序列(如排序后的 ID、时间戳)具有极佳的压缩效果。

功能特性

  • 变长字节编码:将 u64 整数压缩为变长字节序列。
  • 差分编码:优化严格递增序列的存储(需开启 diff 特性)。
  • 零拷贝解码:直接从字节切片高效解码。
  • 简洁 API:提供针对单值和列表的易用函数。
  • 错误处理:使用 thiserror 提供健壮的错误报告。

使用指南

Cargo.toml 中添加 vb

[dependencies]
vb = "0.1.8"
# 如需差分编码支持
# vb = { version = "0.1.8", features = ["diff"] }

基础编码

use vb::{e_li, d_li};

fn main() {
    let list = vec![0, 1, 127, 128, 300, 16383, 16384];

    // 编码列表
    let encoded = e_li(&list);
    println!("编码后长度: {} 字节", encoded.len());

    // 解码列表
    let decoded = d_li(&encoded).unwrap();
    assert_eq!(list, decoded);
}

差分编码

适用于时间戳或主键等排序序列。

#[cfg(feature = "diff")]
use vb::{e_diff, d_diff};

#[cfg(feature = "diff")]
fn main() {
    // 严格递增序列
    let list = vec![10000, 10001, 10002, 10003, 10004];

    // 使用差分编码进行压缩
    let encoded = e_diff(&list);

    // 解码回原始序列
    let decoded = d_diff(&encoded).unwrap();
    assert_eq!(list, decoded);
}

API 参考

本库导出以下核心函数:

  • d(input: impl AsRef<[u8]>) -> Result<(u64, usize)> 解码单个变长编码整数。返回数值及消耗的字节数。

  • e_li(li: impl AsRef<[u64]>) -> Vec<u8>u64 整数列表编码为字节向量。

  • d_li(data: impl AsRef<[u8]>) -> Result<Vec<u64>> 将字节向量解码回 u64 整数列表。

  • e_diff(li: impl AsRef<[u64]>) -> Vec<u8> (特性: diff) 结合差分编码与变长编码,压缩严格递增序列。

  • d_diff(vs: impl AsRef<[u8]>) -> Result<Vec<u64>> (特性: diff) 解码差分编码序列。

设计思路

VByte 格式使用每个字节的低 7 位存储数据,最高有效位 (MSB) 作为延续标志。

  • MSB = 0:表示这是整数的最后一个字节。
  • MSB = 1:表示后续还有更多字节。

对于差分编码 (e_diff),库首先计算相邻元素之间的差值 ($x_i - x_{i-1}$),然后使用 VByte 对这些较小的差值进行编码。这种方法对于密集的递增序列能产生显著的压缩效果。

技术栈

  • Rust:核心开发语言。
  • thiserror:用于符合人体工程学的错误处理。

目录结构

.
├── Cargo.toml      # 项目配置
├── readme          # 文档目录
│   ├── en.md       # 英文说明
│   └── zh.md       # 中文说明
├── src
│   └── lib.rs      # 源代码
└── tests
    └── main.rs     # 集成测试

历史轶事

变长字节编码(也称为 VByteVarintLEB128)在计算机科学史上拥有悠久的历史。

  • MIDI 标准:最早的广泛应用之一是在 MIDI 文件格式中,被称为“变长数量”(VLQ),旨在节省音乐文件的存储空间。
  • 搜索引擎:90 年代末,Google 等搜索引擎利用 VByte 压缩倒排索引(文档 ID 列表),在保持高压缩率的同时实现了极快的解码速度。
  • DWARF:DWARF 调试数据格式使用了一种名为 LEB128 (Little Endian Base 128) 的变体。
  • Protocol Buffers:Google 的 Protocol Buffers 序列化协议也严重依赖 "Varints" 来实现高效的数据传输。

这项简单而强大的技术,在对空间和速度都有要求的系统中,至今仍是数据压缩的基石。


关于

本项目为 js0.site ⋅ 重构互联网计划 的开源组件。

我们正在以组件化的方式重新定义互联网的开发范式,欢迎关注: