lbzip2 0.2.0

Pure Rust parallel bzip2 decompressor β€” SIMD block scanning, multi-core Burrows-Wheeler decode

lbzip2-rs

πŸ§™β€β™‚ Med Allfaderns visdom, kompression och korruptionsskydd. ⚑ Med hans blick ΓΆver varje bit.

Pure Rust parallel bzip2 decompressor. No C dependencies. Usable as a library or as a CLI tool:

# Decompress any bzip2 file (including pbzip2 concatenated streams)
cargo run --release --bin lbunzip2 -- planet-241021.osm.bz2 > planet.osm

What makes this crate unique: 100 % Rust (no C/FFI), in-process, zero-copy, and parallel block-boundary scanning β€” a chunk is split across N cores so that each core scans only its own n/N slice for the 48-bit block magic (where n is the chunk's byte count). With a 200 MB chunk and 12 cores, each core scans ~17 MB instead of a single thread walking all 200 MB.
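The magic scan itself can be sketched in isolation. The following is a minimal standalone illustration, not this crate's API: `find_block_magic` is a hypothetical helper, and the real scanner is SIMD-accelerated. bzip2 block headers begin with the 48-bit magic 0x314159265359, which may sit at any *bit* offset, so the scan keeps a 48-bit sliding window:

```rust
// bzip2 block headers start with this 48-bit magic (pi in BCD);
// it can appear at any bit offset in the stream.
const BLOCK_MAGIC: u64 = 0x3141_5926_5359;

/// Return the bit offset of the first block magic at or after `start_bit`,
/// or None if the slice contains no magic. (Hypothetical helper for
/// illustration; not this crate's actual scanner.)
fn find_block_magic(data: &[u8], start_bit: usize) -> Option<usize> {
    let total_bits = data.len() * 8;
    if total_bits < 48 {
        return None;
    }
    // Shift bits in one at a time, keeping the low 48 bits as a window.
    let mut window: u64 = 0;
    let mut bits_seen = 0usize;
    for bit in start_bit..total_bits {
        let byte = data[bit / 8];
        let b = (byte >> (7 - (bit % 8))) & 1; // bzip2 streams are MSB-first
        window = ((window << 1) | b as u64) & 0xFFFF_FFFF_FFFF;
        bits_seen += 1;
        if bits_seen >= 48 && window == BLOCK_MAGIC {
            return Some(bit + 1 - 48); // offset of the magic's first bit
        }
    }
    None
}
```

Each core runs this over its own slice (with a small overlap at the boundary, since a magic can straddle two slices).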

Part of the znippy group of software, designed for fast zero-copy integration with osm-katana β€” the parallel OSM-to-GeoParquet pipeline.

Why

  • In-process β€” no pipe, no process spawn. Decompressed segments go straight into the caller's memory.
  • Shared thread pool β€” the rayon pool is shared with the host application (e.g. VTD XML parse + PBF encode). No thread contention.
  • Zero dependency on C libbz2 β€” builds anywhere rustc does.

Performance

Test file                        C lbzip2            lbzip2-rs
Planet 1 GB slice (β†’ 9.86 GB)    30.5 s (323 MB/s)   30.3 s (325 MB/s), 0.6 % faster
Liechtenstein 3 MB (β†’ 60 MB)     0.15 s              0.22 s (startup overhead)

Matches or beats C lbzip2 on real workloads (8-core / 16-thread, NVMe, /dev/null output). 4Γ— oversplit work-stealing eliminates core idle time. Handles pbzip2 concatenated streams natively.
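The oversplitting itself is simple range arithmetic. A minimal sketch (the `oversplit` helper and its signature are hypothetical, not this crate's API):

```rust
use std::ops::Range;

/// Divide `len` bytes into roughly `threads * factor` ranges, so that a
/// work-stealing scheduler (e.g. rayon) has more tasks than cores and can
/// rebalance uneven per-block decode times. `factor = 4` gives the
/// 4x oversplit described above.
fn oversplit(len: usize, threads: usize, factor: usize) -> Vec<Range<usize>> {
    let parts = (threads * factor).max(1);
    let chunk = ((len + parts - 1) / parts).max(1); // ceil division, at least 1
    (0..len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(len))
        .collect()
}
```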

Current state: the block-level decompression (Huffman β†’ MTF β†’ inverse BWT β†’ RLE) is closely derived from β€” in part an AI-assisted port of β€” Julian Seward's original C bzip2 library. This crate therefore includes the bzip2 license (BSD-style) alongside MIT / Apache-2.0.

Usage

use lbzip2::chunk::ChunkDecoder;

let data: &[u8] = /* compressed chunk including BZhN header */;
let decoder = ChunkDecoder::from_header(&data[..4])?;

// Returns segments separately β€” no giant memcpy
let (segments, consumed) = decoder.decode_chunk_segments(data, true)?;
for seg in &segments {
    // each seg is a Vec<u8> of decompressed data, in order
}

Single-stream sequential API also available:

let output = lbzip2::stream::decompress(&compressed)?;

Backlog

Questions / wishes for the bzip2-rs crate author β€” API changes that would have made parallel decode possible without reimplementing the decoder:

1. pub fn decode_block(data: &[u8], bit_offset: usize, max_blocksize: u32)
       -> Result<(Vec<u8>, usize), Error>
   β€” Expose single-block decode from arbitrary bit offset.
   β€” Return (decompressed_bytes, bits_consumed).

2. Zero-copy input: accept &[u8] + bit_offset, not impl Write.
   β€” For mmap / ring-buffer use cases, borrowing is essential.

3. Expose block boundary scanning or document the 48-bit bit-aligned
   magic (0x314159265359) so callers can split the stream themselves.

4. Optional: fn decode_block_into(data: &[u8], bit_offset: usize,
                                   out: &mut [u8]) -> Result<usize, Error>
   β€” Write directly into caller-provided buffer.

Without (1) and (2), parallel decode requires reimplementing the full Huffman β†’ MTF β†’ BWT β†’ RLE pipeline from scratch (which is what this crate does).
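As a taste of what that pipeline involves, the inverse-BWT stage alone can be sketched like this. This is the textbook LF-mapping formulation, not the crate's actual routine (the C reference builds a forward `tt` vector instead); `inverse_bwt` is an illustrative helper:

```rust
/// Minimal inverse Burrows-Wheeler transform. `last` is the last column
/// of the sorted rotation matrix; `orig_ptr` is the row index of the
/// original string (bzip2 stores this as OrigPtr in each block header).
fn inverse_bwt(last: &[u8], orig_ptr: usize) -> Vec<u8> {
    let n = last.len();
    // Histogram of byte values in the last column.
    let mut counts = [0usize; 256];
    for &b in last {
        counts[b as usize] += 1;
    }
    // base[b] = number of symbols smaller than b, i.e. where b's run
    // starts in the (implicit) sorted first column.
    let mut base = [0usize; 256];
    let mut sum = 0usize;
    for b in 0..256 {
        base[b] = sum;
        sum += counts[b];
    }
    // lf[i]: the row whose rotation is one left-shift of row i.
    let mut seen = [0usize; 256];
    let mut lf = vec![0usize; n];
    for i in 0..n {
        let b = last[i] as usize;
        lf[i] = base[b] + seen[b];
        seen[b] += 1;
    }
    // Follow the LF mapping from orig_ptr, emitting the string back-to-front.
    let mut out = Vec::with_capacity(n);
    let mut p = orig_ptr;
    for _ in 0..n {
        out.push(last[p]);
        p = lf[p];
    }
    out.reverse();
    out
}
```

The real decoder additionally handles bzip2's RUNA/RUNB run-length symbols and the two RLE passes around the transform.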

License

MIT OR Apache-2.0, plus the original bzip2 license (BSD-style, Julian Seward) for the block-decode routines derived from the C reference implementation. See LICENSE-BZIP2 for the full text.