vb 0.2.2

Variable byte encoding library / 变长字节编码库
Documentation
[English]#en | [中文]#zh

---

<a id="en"></a>

# vb : Efficient compression for integer sequences

## Table of Contents
- [Introduction]#introduction
- [Features]#features
- [Usage]#usage
  - [Basic Encoding]#basic-encoding
  - [Differential Encoding]#differential-encoding
- [API Reference]#api-reference
- [Design]#design
- [Tech Stack]#tech-stack
- [Directory Structure]#directory-structure
- [History]#history

## Introduction
`vb` is a lightweight Rust library for **Variable Byte (VByte)** encoding. It provides an efficient way to compress `u64` integers and lists of integers. By using fewer bytes for smaller numbers, it significantly reduces storage requirements for integer-heavy data.

Additionally, the library supports **Differential Encoding (Delta Encoding)**, which is particularly effective for compressing strictly increasing sequences (e.g., sorted IDs, timestamps) by storing only the differences between consecutive values.

## Features
- **Variable Byte Encoding**: Compresses `u64` integers into a variable-length byte sequence.
- **Differential Encoding**: Optimizes storage for strictly increasing sequences (requires `diff` feature).
- **Zero-Copy Decoding**: Efficient decoding directly from byte slices.
- **Simple API**: Easy-to-use functions for single values and lists.
- **Error Handling**: Robust error reporting using `thiserror`.

## Usage

Add `vb` to your `Cargo.toml`:

```toml
[dependencies]
vb = "0.1.8"
# For differential encoding support
# vb = { version = "0.1.8", features = ["diff"] }
```

### Basic Encoding

```rust
use vb::{e_li, d_li};

fn main() {
    let list = vec![0, 1, 127, 128, 300, 16383, 16384];

    // Encode list
    let encoded = e_li(&list);
    println!("Encoded length: {} bytes", encoded.len());

    // Decode list
    let decoded = d_li(&encoded).unwrap();
    assert_eq!(list, decoded);
}
```

### Differential Encoding

Useful for sorted sequences like timestamps or primary keys.

```rust
#[cfg(feature = "diff")]
use vb::{e_diff, d_diff};

#[cfg(feature = "diff")]
fn main() {
    // Strictly increasing sequence
    let list = vec![10000, 10001, 10002, 10003, 10004];

    // Encode using differential encoding
    let encoded = e_diff(&list);

    // Decode back to original sequence
    let decoded = d_diff(&encoded).unwrap();
    assert_eq!(list, decoded);
}
```

## API Reference

The library exports the following key functions:

- **`d(input: impl AsRef<[u8]>) -> Result<(u64, usize)>`**
  Decodes a single variable-byte encoded integer. Returns the value and bytes consumed.

- **`e_li(li: impl AsRef<[u64]>) -> Vec<u8>`**
  Encodes a list of `u64` integers into a byte vector.

- **`d_li(data: impl AsRef<[u8]>) -> Result<Vec<u64>>`**
  Decodes a byte vector back into a list of `u64` integers.

- **`e_diff(li: impl AsRef<[u64]>) -> Vec<u8>`** *(feature: `diff`)*
  Encodes a strictly increasing sequence using differential encoding combined with VByte.

- **`d_diff(vs: impl AsRef<[u8]>) -> Result<Vec<u64>>`** *(feature: `diff`)*
  Decodes a differentially encoded sequence.

## Design

The VByte format uses 7 bits of each byte for data and the Most Significant Bit (MSB) as a continuation flag.
- **MSB = 0**: The last byte of the integer.
- **MSB = 1**: More bytes follow.

For differential encoding (`e_diff`), the library first calculates the difference between adjacent elements ($x_i - x_{i-1}$) and then encodes these smaller differences using VByte. This results in significant compression for dense, increasing sequences.

## Tech Stack
- **Rust**: Core language.
- **thiserror**: For ergonomic error handling.

## Directory Structure

```
.
├── Cargo.toml      # Project configuration
├── readme          # Documentation
│   ├── en.md       # English README
│   └── zh.md       # Chinese README
├── src
│   └── lib.rs      # Source code
└── tests
    └── main.rs     # Integration tests
```

## History

Variable Byte encoding (also known as **VByte**, **Varint**, or **LEB128**) has a rich history in computer science.

- **MIDI Standard**: One of the earliest widespread uses was in the MIDI file format as "Variable-Length Quantity" (VLQ) to save space in music files.
- **Search Engines**: In the late 90s, search engines like Google used VByte to compress inverted indexes (lists of document IDs), balancing high compression ratios with very fast decoding speeds.
- **DWARF**: The DWARF debugging data format uses a variant called LEB128 (Little Endian Base 128).
- **Protocol Buffers**: Google's Protocol Buffers heavily rely on "Varints" for efficient data serialization.

This simple yet powerful technique remains a cornerstone of data compression in systems where both space and speed matter.

---

## About

This project is an open-source component of [js0.site ⋅ Refactoring the Internet Plan](https://js0.site).

We are redefining the development paradigm of the Internet in a componentized way. Welcome to follow us:

* [Google Group]https://groups.google.com/g/js0-site
* [js0site.bsky.social]https://bsky.app/profile/js0site.bsky.social

---

<a id="zh"></a>

# vb : 高效的整数序列压缩方案

## 目录
- [简介]#简介
- [功能特性]#功能特性
- [使用指南]#使用指南
  - [基础编码]#基础编码
  - [差分编码]#差分编码
- [API 参考]#api-参考
- [设计思路]#设计思路
- [技术栈]#技术栈
- [目录结构]#目录结构
- [历史轶事]#历史轶事

## 简介
`vb` 是一个轻量级的 Rust 库,用于实现 **变长字节编码 (Variable Byte Encoding)**。它提供了一种高效的方法来压缩 `u64` 整数及其列表。通过对较小的数字使用更少的字节,它能显著减少整数密集型数据的存储需求。

此外,该库支持 **差分编码 (Delta Encoding)**,通过仅存储连续值之间的差值,对严格递增的序列(如排序后的 ID、时间戳)具有极佳的压缩效果。

## 功能特性
- **变长字节编码**:将 `u64` 整数压缩为变长字节序列。
- **差分编码**:优化严格递增序列的存储(需开启 `diff` 特性)。
- **零拷贝解码**:直接从字节切片高效解码。
- **简洁 API**:提供针对单值和列表的易用函数。
- **错误处理**:使用 `thiserror` 提供健壮的错误报告。

## 使用指南

在 `Cargo.toml` 中添加 `vb`:

```toml
[dependencies]
vb = "0.1.8"
# 如需差分编码支持
# vb = { version = "0.1.8", features = ["diff"] }
```

### 基础编码

```rust
use vb::{e_li, d_li};

fn main() {
    let list = vec![0, 1, 127, 128, 300, 16383, 16384];

    // 编码列表
    let encoded = e_li(&list);
    println!("编码后长度: {} 字节", encoded.len());

    // 解码列表
    let decoded = d_li(&encoded).unwrap();
    assert_eq!(list, decoded);
}
```

### 差分编码

适用于时间戳或主键等排序序列。

```rust
#[cfg(feature = "diff")]
use vb::{e_diff, d_diff};

#[cfg(feature = "diff")]
fn main() {
    // 严格递增序列
    let list = vec![10000, 10001, 10002, 10003, 10004];

    // 使用差分编码进行压缩
    let encoded = e_diff(&list);

    // 解码回原始序列
    let decoded = d_diff(&encoded).unwrap();
    assert_eq!(list, decoded);
}
```

## API 参考

本库导出以下核心函数:

- **`d(input: impl AsRef<[u8]>) -> Result<(u64, usize)>`**
  解码单个变长编码整数。返回数值及消耗的字节数。

- **`e_li(li: impl AsRef<[u64]>) -> Vec<u8>`**
  `u64` 整数列表编码为字节向量。

- **`d_li(data: impl AsRef<[u8]>) -> Result<Vec<u64>>`**
  将字节向量解码回 `u64` 整数列表。

- **`e_diff(li: impl AsRef<[u64]>) -> Vec<u8>`** *(特性: `diff`)*
  结合差分编码与变长编码,压缩严格递增序列。

- **`d_diff(vs: impl AsRef<[u8]>) -> Result<Vec<u64>>`** *(特性: `diff`)*
  解码差分编码序列。

## 设计思路

VByte 格式使用每个字节的低 7 位存储数据,最高有效位 (MSB) 作为延续标志。
- **MSB = 0**:表示这是整数的最后一个字节。
- **MSB = 1**:表示后续还有更多字节。

对于差分编码 (`e_diff`),库首先计算相邻元素之间的差值 ($x_i - x_{i-1}$),然后使用 VByte 对这些较小的差值进行编码。这种方法对于密集的递增序列能产生显著的压缩效果。

## 技术栈
- **Rust**:核心开发语言。
- **thiserror**:用于符合人体工程学的错误处理。

## 目录结构

```
.
├── Cargo.toml      # 项目配置
├── readme          # 文档目录
│   ├── en.md       # 英文说明
│   └── zh.md       # 中文说明
├── src
│   └── lib.rs      # 源代码
└── tests
    └── main.rs     # 集成测试
```

## 历史轶事

变长字节编码(也称为 **VByte**、**Varint** 或 **LEB128**)在计算机科学史上拥有悠久的历史。

- **MIDI 标准**:最早的广泛应用之一是在 MIDI 文件格式中,被称为“变长数量”(VLQ),旨在节省音乐文件的存储空间。
- **搜索引擎**:90 年代末,Google 等搜索引擎利用 VByte 压缩倒排索引(文档 ID 列表),在保持高压缩率的同时实现了极快的解码速度。
- **DWARF**:DWARF 调试数据格式使用了一种名为 LEB128 (Little Endian Base 128) 的变体。
- **Protocol Buffers**:Google 的 Protocol Buffers 序列化协议也严重依赖 "Varints" 来实现高效的数据传输。

这项简单而强大的技术,在对空间和速度都有要求的系统中,至今仍是数据压缩的基石。

---

## 关于

本项目为 [js0.site ⋅ 重构互联网计划](https://js0.site) 的开源组件。

我们正在以组件化的方式重新定义互联网的开发范式,欢迎关注:

* [谷歌邮件列表]https://groups.google.com/g/js0-site
* [js0site.bsky.social]https://bsky.app/profile/js0site.bsky.social