svb 0.1.0

Pure-Rust StreamVByte: integer compression for u16/u32/u64 with SIMD decode (AVX2, SSSE3, NEON)

svb

Pure-Rust StreamVByte covering all major codec variants for u16, u32, and u64 integers. Delta and zigzag encoding are composable layers on top. SIMD back-ends are available for x86-64 (SSSE3, AVX2) and AArch64 (NEON).

Documentation | API reference

StreamVByte stores each integer in the minimum number of bytes its value requires — 1, 2, 3, or 4 bytes for a u32; 1 or 2 for a u16 — and keeps the per-integer width metadata (the control stream) separate from the integer bytes (the data stream). That two-stream layout is what makes SIMD decode fast: a single shuffle instruction can unpack 4–8 values at once, once the widths are known.
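To make the two-stream layout concrete, here is a minimal scalar sketch of the 1/2/3/4-byte scheme. It is not svb's implementation: the function names are hypothetical, and the control layout (two bits per value holding width minus one, four values per control byte, first value in the low bits, little-endian data bytes) is an assumption based on the classic StreamVByte format.

```rust
/// Number of data bytes needed for `v` in the 1/2/3/4 scheme.
fn byte_width(v: u32) -> usize {
    match v {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    }
}

/// Scalar encode: returns (control stream, data stream).
fn encode_scalar(values: &[u32]) -> (Vec<u8>, Vec<u8>) {
    let mut control = vec![0u8; values.len().div_ceil(4)];
    let mut data = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        let width = byte_width(v);
        // Two bits per value (width - 1), four values per control byte.
        control[i / 4] |= ((width - 1) as u8) << ((i % 4) * 2);
        // Only the low `width` bytes of the value go into the data stream.
        data.extend_from_slice(&v.to_le_bytes()[..width]);
    }
    (control, data)
}

/// Scalar decode of `count` values; returns None if either stream is too short.
fn decode_scalar(control: &[u8], data: &[u8], count: usize) -> Option<Vec<u32>> {
    let mut out = Vec::with_capacity(count);
    let mut pos = 0;
    for i in 0..count {
        let code = (*control.get(i / 4)? >> ((i % 4) * 2)) & 0b11;
        let width = code as usize + 1;
        let bytes = data.get(pos..pos + width)?;
        // Zero-extend the stored bytes back to a full u32.
        let mut buf = [0u8; 4];
        buf[..width].copy_from_slice(bytes);
        out.push(u32::from_le_bytes(buf));
        pos += width;
    }
    Some(out)
}
```

Because every width is known from the control stream before any data byte is touched, a SIMD decoder can turn each control byte into a shuffle mask instead of looping value by value as this sketch does.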

Delta encoding replaces each value with its difference from the previous one. For sequences where adjacent values are close — sorted data, slowly-drifting measurements, oscillating signals — the differences are much smaller than the raw values. Smaller values encode to fewer bytes.
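A minimal sketch of delta encoding, independent of svb's own API. The function names are hypothetical, and treating the value before the first element as 0 is an assumption made for illustration.

```rust
/// Replace each value with its (wrapping) difference from the previous one.
/// The "previous" value for the first element is assumed to be 0.
fn delta_encode(values: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Inverse of `delta_encode`: a running (wrapping) prefix sum.
fn delta_decode(deltas: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}
```

For the sorted input `[1000, 1003, 1010, 1012]` this yields `[1000, 3, 7, 2]`: after the first element, every value fits in one byte instead of two.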

Zigzag encoding maps signed integers to unsigned so that small absolute values stay small: 0→0, −1→1, 1→2, −2→3, 2→4. This matters when the data has signed deltas: without zigzag, a delta of −1 would encode as 4 bytes (0xFFFFFFFF) rather than 1. With zigzag it encodes as a single byte (0x01).
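The mapping above is the standard shift-and-xor zigzag transform. A minimal sketch for 32-bit values follows; the function names are hypothetical and not taken from svb.

```rust
/// Map a signed value to unsigned so small magnitudes stay small:
/// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
fn zigzag_encode(n: i32) -> u32 {
    // `n >> 31` is an arithmetic shift: all ones for negatives, zero otherwise.
    ((n << 1) ^ (n >> 31)) as u32
}

/// Inverse of `zigzag_encode`.
fn zigzag_decode(z: u32) -> i32 {
    // The low bit carries the sign; the remaining bits carry the magnitude.
    ((z >> 1) as i32) ^ -((z & 1) as i32)
}
```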

The three compose naturally: delta shrinks value magnitudes, zigzag keeps the result non-negative and compact, and StreamVByte encodes each small value as efficiently as possible. See the encoding guide for a full walkthrough.
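As a rough illustration of the composition, the sketch below applies delta and zigzag to i16 samples and counts how many data bytes a 1/2-byte scheme would need. The helper names are hypothetical, the initial previous value of 0 is an assumption, and the sketch makes no claim to match the byte layout produced by encode_vbz.

```rust
/// Delta (wrapping, starting from 0) followed by zigzag, on i16 samples.
fn delta_zigzag(samples: &[i16]) -> Vec<u16> {
    let mut prev = 0i16;
    samples
        .iter()
        .map(|&s| {
            let d = s.wrapping_sub(prev);
            prev = s;
            // Zigzag: fold the sign into the low bit.
            ((d << 1) ^ (d >> 15)) as u16
        })
        .collect()
}

/// Data-stream bytes needed under a 1/2-byte scheme (control bits not counted).
fn data_bytes(values: &[u16]) -> usize {
    values.iter().map(|&v| if v <= 0xFF { 1 } else { 2 }).sum()
}
```

For the samples `[1000, 1001, 1003, 1002, 998]`, the raw values need 2 bytes each (10 in total), while the delta-plus-zigzag output `[2000, 2, 4, 1, 7]` needs 6.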

Codec variants

| Variant | Element | Byte widths | Notes |
| --- | --- | --- | --- |
| Svb16 | u16 | 1/2 | ONT VBZ format |
| U32Classic | u32 | 1/2/3/4 | Lemire / C library compatible |
| U32Variant0124 | u32 | 0/1/2/4 | Better compression for sparse data |
| U64Coder1234 | u64 | 1/2/3/4 | Values up to u32::MAX |
| U64Coder1248 | u64 | 1/2/4/8 | Full u64 range |
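To see why a 0/1/2/4 width set helps on sparse data, the sketch below compares data-stream sizes under the two u32 width sets from the table. The function names are hypothetical, and the exact value boundaries for each width are an assumption (the smallest width that fits); they are not taken from svb's source.

```rust
/// Data bytes for `v` under the 1/2/3/4 width set.
fn width_1234(v: u32) -> usize {
    match v {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    }
}

/// Data bytes for `v` under the 0/1/2/4 width set: zeros cost nothing,
/// but anything above 0xFFFF takes a full 4 bytes.
fn width_0124(v: u32) -> usize {
    match v {
        0 => 0,
        1..=0xFF => 1,
        0x100..=0xFFFF => 2,
        _ => 4,
    }
}

/// Total data-stream length for a slice under a given width function.
fn data_len(values: &[u32], width: fn(u32) -> usize) -> usize {
    values.iter().map(|&v| width(v)).sum()
}
```

On the mostly-zero input `[0, 0, 0, 7, 0, 0, 300, 0]`, the 1/2/3/4 set needs 9 data bytes and the 0/1/2/4 set needs 3. The trade-off runs the other way for values in the 3-byte range: 70_000 takes 3 bytes under 1/2/3/4 but 4 under 0/1/2/4.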

Installation

[dependencies]
svb = { version = "0.1", features = ["simd-auto"] }

Quick start

use svb::u32::U32Classic;

let values: Vec<u32> = vec![1, 500, 70_000, 16_000_000];
let encoded = U32Classic.encode(&values);
let decoded = U32Classic.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);

For the VBZ pipeline (Oxford Nanopore POD5 signal data):

use svb::{encode_vbz, decode_vbz};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let encoded = encode_vbz(&samples);
let decoded = decode_vbz(&encoded, samples.len()).unwrap();
assert_eq!(decoded, samples);

It's also pretty damn fast

Benchmarked with simd-auto on an Intel i7-11800H (AVX2), 8192-element slices:

| Benchmark | svb | streamvbyte64 |
| --- | --- | --- |
| Svb16 encode | 4.91 GB/s | |
| Svb16 decode | 4.51 GB/s | |
| VBZ encode (delta + zigzag + SVB16) | 3.14 GB/s | |
| VBZ decode (3-pass) | 1.88 GB/s | |
| VBZ decode fused (single SIMD pass) | 2.77 GB/s | |
| VBZ2 decode fused (2-chain, single thread) | 3.00 GB/s | |
| U32Classic decode | 4.07 GB/s | 1.67 GB/s |
| U32Classic encode | 2.08 GB/s | 1.09 GB/s |
| U64Coder1248 decode | 1.90 GB/s | 1.32 GB/s |
| U64Coder1248 encode | 1.25 GB/s | 0.73 GB/s |

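VBZ decode (3-pass) is ~2.4x slower than SVB16 decode alone (1.88 vs 4.51 GB/s), and VBZ encode is ~1.6x slower than SVB16 encode (3.14 vs 4.91 GB/s). Breaking down the pipeline (8192 i16 elements):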

| Stage | encode | decode |
| --- | --- | --- |
| delta | 11.02 GB/s | 3.75 GB/s |
| zigzag | 18.75 GB/s | 14.83 GB/s |
| SVB16 | 4.91 GB/s | 4.51 GB/s |
| VBZ combined (3-pass) | 3.14 GB/s | 1.88 GB/s |
| VBZ fused decode | | 2.77 GB/s |
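One rough way to read the breakdown: if the three passes run back to back, and each stage's GB/s figure is measured against the same byte count, then per-stage times add and the combined throughput is the reciprocal of the sum of reciprocals. Neither assumption is confirmed by the benchmark description, so treat this as a sanity check rather than a model of the implementation. The function name is hypothetical.

```rust
/// Throughput of sequential passes over the same data, given each pass's
/// own throughput in GB/s: times add, so rates combine harmonically.
fn combined_throughput(stages_gbps: &[f64]) -> f64 {
    let time_per_gb: f64 = stages_gbps.iter().map(|s| 1.0 / s).sum();
    1.0 / time_per_gb
}
```

Feeding in the decode column (3.75, 14.83, 4.51) gives about 1.80 GB/s against the measured 1.88, and the encode column (11.02, 18.75, 4.91) gives about 2.88 GB/s against the measured 3.14. Under this reading, delta decode and SVB16 decode account for most of the 3-pass cost.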

svb is around 2x faster on average than streamvbyte64 across all variants and sizes (range: 1.4x–2.7x). Full stage-by-stage breakdowns, fused decoder analysis, and VBZ-K parallel decode numbers are in the Performance docs.

If you run the benchmarks on another system (especially ARM with NEON) I'd love to see the results. Run:

cargo bench --features simd-auto

and open an issue or drop the output in.

MSRV

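1.87 (edition 2024; the SIMD code relies on target_feature_11, stabilised in 1.86, and 1.87 is the release that made most std::arch intrinsics callable without unsafe inside #[target_feature] functions).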

License

MIT. See LICENSE. Copyright 2026 James Ferguson.