# rebgzf
Note: This code was generated by Claude (Anthropic's AI assistant) and has not been thoroughly reviewed or tested in production environments. Use at your own risk. Contributions and reviews are welcome.
Efficient gzip to BGZF transcoder using Puffin-style half-decompression.
## Motivation
BGZF (Blocked GNU Zip Format) is a variant of gzip that enables random access to compressed data. It's widely used in bioinformatics for BAM and BCF files and for bgzip-compressed VCF and FASTA.
Converting standard gzip files to BGZF traditionally requires:
- Full decompression - decompress the entire gzip stream
- Re-compression - compress the data into BGZF blocks
This is slow and CPU-intensive because compression is the expensive part.
## The Puffin Approach
This tool uses a half-decompression technique inspired by Puffin:
- Parse the DEFLATE stream to extract LZ77 tokens (literals, lengths, distances)
- Track only enough context to resolve cross-boundary references
- Re-encode the tokens directly into new BGZF blocks
This avoids the expensive compression step entirely: we are just re-serializing already-compressed data into a different block structure.
## Performance
Half-decompression is significantly faster than full decompress + recompress because:
- No dictionary lookups for compression
- No hash table maintenance
- Minimal memory footprint (just the sliding window)
- Parallel encoding of independent blocks
Benchmark Results (54 MB gzip → 80 MB BGZF, original data 208 MB uncompressed FASTQ):
| Threads | Time | Throughput | Speedup |
|---|---|---|---|
| 1 | 2.92s | 19.6 MB/s | 1.0x |
| 2 | 1.76s | 32.4 MB/s | 1.66x |
| 4 | 1.79s | 32.0 MB/s | 1.63x |
| 8 | 1.78s | 32.2 MB/s | 1.64x |
Throughput measured on compressed input size. Tested on Apple M1.
Output size: BGZF output is typically ~1.5x larger than the original gzip because:
- Each BGZF block has 26 bytes of overhead (header + footer)
- Cross-boundary LZ77 references are expanded to literals
- Fixed Huffman encoding (less optimal than original dynamic tables)
Why only 2 threads help:
The tool uses a producer-consumer architecture where the main thread sequentially parses DEFLATE blocks (this cannot be parallelized) while worker threads encode BGZF blocks in parallel. With 2 threads total:
- Main thread: parses DEFLATE, resolves cross-boundary references
- Worker thread: encodes tokens to BGZF, computes CRC32
Additional threads beyond 2 provide no benefit because the main thread's sequential parsing is the bottleneck. The single worker can easily keep up with the parsing rate.
Trade-off: BGZF files are larger than gzip but enable random access, so the tool is best suited for format conversion pipelines where random access is needed.
## Installation

### From source
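The build commands were lost in formatting; assuming the standard Cargo workflow (the repository URL is not stated here):

```sh
git clone <repository-url>
cd rebgzf
cargo install --path .   # or: cargo build --release
```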
### From crates.io (coming soon)

## Usage

### Command Line
```sh
# Basic transcoding (fastest, level 1)
# With progress display
# Better compression with dynamic Huffman (level 6)
# Best compression for FASTQ files (dynamic Huffman + record-aligned blocks)
# Generate GZI index for random access
# Parallel transcoding (auto-detect threads)
# Check if a file is already BGZF
# Strict validation (all blocks, works with stdin)
# Force transcoding even if already BGZF
# Verbose output with statistics
```
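The command lines themselves did not survive formatting; the following invocations are a reconstruction from the documented options below (file names are placeholders):

```sh
rebgzf -i reads.fastq.gz -o reads.fastq.bgz              # basic, level 1
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -p           # progress display
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -l 6         # dynamic Huffman
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -l 9 --format fastq
rebgzf -i reads.fastq.gz -o reads.fastq.bgz --index      # GZI index
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -t 0         # auto-detect threads
rebgzf -i reads.fastq.bgz --check                        # BGZF check
cat reads.fastq.bgz | rebgzf -i - --check --strict       # strict, via stdin
rebgzf -i reads.fastq.bgz -o copy.bgz --force            # force transcoding
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -v           # verbose statistics
```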
### CLI Options

```text
Convert gzip files to BGZF format efficiently

Usage: rebgzf [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>            Input gzip file (use - for stdin)
  -o, --output <OUTPUT>          Output BGZF file (use - for stdout)
  -t, --threads <THREADS>        Number of threads (0 = auto, 1 = single-threaded) [default: 1]
  -l, --level <LEVEL>            Compression level 1-9 (1-3: fixed Huffman, 4-6: dynamic,
                                 7-9: dynamic + smart boundaries) [default: 1]
      --format <FORMAT>          Input format profile: default, fastq, auto [default: default]
      --block-size <BLOCK_SIZE>  BGZF block size (default: 65280) [default: 65280]
  -v, --verbose                  Show verbose statistics
  -q, --quiet                    Quiet mode - suppress all output except errors
      --json                     Output results as JSON (for scripting)
      --check                    Check if input is BGZF and exit (0=BGZF, 1=not BGZF, 2=error)
      --strict                   Validate all BGZF blocks (slower, more thorough)
      --verify                   Verify BGZF by decompressing and checking CRC32
      --stats                    Show file statistics without transcoding
      --force                    Force transcoding even if input is already BGZF
  -p, --progress                 Show progress during transcoding
      --index [PATH]             Write GZI index file (enables random access)
  -h, --help                     Print help
  -V, --version                  Print version
```
### As a Library
#### BGZF Detection
## How It Works

### DEFLATE Token Extraction
A DEFLATE stream consists of:
- Literals: raw bytes (0-255)
- Back-references: (length, distance) pairs pointing to earlier data
- End-of-block: marker indicating block completion
We parse the Huffman-coded stream to extract these LZ77 tokens without fully reconstructing the original data. The key insight is that we only need the tokens themselves, not the decompressed bytes.
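As a sketch, the extracted stream can be modeled like this (the type and function names are illustrative, not the crate's actual API):

```rust
// Sketch of the LZ77 token stream a DEFLATE parser yields.
// Names are illustrative, not rebgzf's actual types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Literal(u8),                  // raw byte 0-255
    Copy { len: u16, dist: u16 }, // copy `len` bytes from `dist` bytes back
}

/// Replay tokens into plain bytes -- only needed when a CRC or a
/// cross-boundary resolution actually requires the decompressed data.
fn replay(tokens: &[Token]) -> Vec<u8> {
    let mut out = Vec::new();
    for &t in tokens {
        match t {
            Token::Literal(b) => out.push(b),
            Token::Copy { len, dist } => {
                // DEFLATE copies may overlap their own output (dist < len),
                // so copy byte by byte.
                let start = out.len() - dist as usize;
                for i in 0..len as usize {
                    let b = out[start + i];
                    out.push(b);
                }
            }
        }
    }
    out
}

fn main() {
    // "abcabcab" as three literals plus one overlapping back-reference.
    let tokens = [
        Token::Literal(b'a'),
        Token::Literal(b'b'),
        Token::Literal(b'c'),
        Token::Copy { len: 5, dist: 3 },
    ];
    assert_eq!(replay(&tokens), b"abcabcab");
    println!("{}", String::from_utf8(replay(&tokens)).unwrap());
}
```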
### Cross-Boundary Reference Resolution
When splitting into BGZF blocks, back-references may point across block boundaries. We maintain a 32KB sliding window to:
- Detect when a reference crosses the current block boundary
- Resolve cross-boundary references by emitting the referenced bytes as literals
- Preserve intra-block references as-is (they compress better)
### Re-encoding
Tokens are re-encoded using either:
- Fixed Huffman tables (levels 1-3): Fast encoding using pre-defined tables (RFC 1951 section 3.2.6)
- Dynamic Huffman tables (levels 4-9): Per-block optimal tables computed from token frequencies
At levels 7-9 with `--format fastq`, block boundaries are aligned to FASTQ record boundaries for better compression.
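For reference, the fixed-table code lengths from RFC 1951 section 3.2.6 that levels 1-3 rely on (the function is a sketch, not the crate's code):

```rust
// Fixed Huffman code lengths for literal/length symbols, per RFC 1951
// section 3.2.6. Levels 1-3 use these pre-defined lengths instead of
// computing per-block tables. (All 30 distance codes use 5 bits.)
fn fixed_litlen_bits(symbol: u16) -> u8 {
    match symbol {
        0..=143 => 8,   // literals 0-143
        144..=255 => 9, // literals 144-255
        256..=279 => 7, // end-of-block and short length codes
        280..=287 => 8, // long length codes
        _ => panic!("not a literal/length symbol"),
    }
}

fn main() {
    assert_eq!(fixed_litlen_bits(b'A' as u16), 8);
    assert_eq!(fixed_litlen_bits(256), 7); // end-of-block marker
    println!("ok");
}
```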
## Optimization Techniques

### Table-Based Huffman Decoding
Standard Huffman decoding reads one bit at a time, requiring multiple loop iterations per symbol. We use a lookup table approach:
- Build a 1024-entry table (10-bit lookup) at decoder construction time
- Peek 10 bits from the bitstream
- Single table lookup returns both the symbol and its actual code length
- Consume only the actual code length bits
This reduces most symbol decodes from ~9 loop iterations to a single table lookup, providing a ~30% speedup in parsing.
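The idea in miniature, with a toy three-symbol code (an illustrative sketch, not the crate's decoder):

```rust
// Table-based Huffman decoding: for each of the 2^BITS possible bit
// patterns, precompute which symbol the leading code decodes to and how
// many bits it actually consumes; decoding becomes one lookup.
const BITS: usize = 10;

/// `codes[i] = (code, len)`: canonical code for symbol i, MSB-first.
fn build_table(codes: &[(u16, u8)]) -> Vec<(u16, u8)> {
    let mut table = vec![(0u16, 0u8); 1 << BITS];
    for (sym, &(code, len)) in codes.iter().enumerate() {
        // Every pattern whose top `len` bits equal `code` maps to `sym`.
        let shift = BITS - len as usize;
        for fill in 0..(1usize << shift) {
            table[((code as usize) << shift) | fill] = (sym as u16, len);
        }
    }
    table
}

fn main() {
    // Toy code: symbol 0 = "0" (1 bit), 1 = "10" (2 bits), 2 = "11" (2 bits).
    let table = build_table(&[(0b0, 1), (0b10, 2), (0b11, 2)]);
    // Peek 10 bits of the stream "10 0 11 ..." -> pattern 0b1001100000.
    let (sym, len) = table[0b1001100000];
    assert_eq!((sym, len), (1, 2)); // symbol 1, consume only 2 bits
    println!("symbol {} uses {} bits", sym, len);
}
```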
### Conditional CRC Computation
CRC32 computation is handled differently based on thread count:
- Single-threaded: CRC computed inline during boundary resolution (cache-efficient, no duplicate work)
- Multi-threaded: Workers compute CRC in parallel by replaying tokens to bytes
This avoids adding sequential work to the main thread in parallel mode.
### Hardware-Accelerated CRC32

Uses the `crc32fast` crate, which automatically leverages:
- SSE4.2 CRC instructions on x86_64
- ARM CRC instructions on aarch64
- Optimized software fallback on other platforms
## Architecture

```text
┌───────────────────────────────────────────────────────────┐
│ Input (gzip)                                              │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ GzipHeader Parser                                         │
│ - Validates gzip magic bytes (1f 8b)                      │
│ - Skips optional fields (FEXTRA, FNAME, FCOMMENT, FHCRC)  │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ DEFLATE Parser (DeflateParser)                            │
│ - Parses block headers (BFINAL, BTYPE)                    │
│ - Decodes Huffman tables (fixed or dynamic)               │
│ - Uses table-based lookup for fast symbol decoding        │
│ - Extracts LZ77 tokens: Literal(u8), Copy{len, dist}      │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ Boundary Resolver (BoundaryResolver)                      │
│ - Maintains 32KB sliding window                           │
│ - Detects cross-boundary back-references                  │
│ - Resolves to literals when necessary                     │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ Huffman Encoder (HuffmanEncoder)                          │
│ - Re-encodes tokens to DEFLATE format                     │
│ - Fixed tables (levels 1-3) or dynamic (levels 4-9)       │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ BGZF Writer (BgzfWriter)                                  │
│ - Wraps DEFLATE data in BGZF block format                 │
│ - Adds BC extra field with block size                     │
│ - Computes CRC32 and writes footer                        │
│ - Appends EOF marker (28-byte empty block)                │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ Output (BGZF)                                             │
└───────────────────────────────────────────────────────────┘
```
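As a concrete illustration of the BGZF Writer stage, here is a sketch of the 18-byte block header defined by the BGZF specification (part of the SAM format spec). The byte layout is fixed by that spec; MTIME/XFL of zero and OS of 0xff are the conventional values. Header (18 bytes) plus footer (8 bytes) gives the 26 bytes of per-block overhead cited above.

```rust
// Build an 18-byte BGZF block header (a sketch, not rebgzf's code).
// `total_block_size` is the full block length including this header,
// the deflate payload, and the 8-byte CRC32/ISIZE footer.
fn bgzf_header(total_block_size: u16) -> [u8; 18] {
    let bsize = total_block_size - 1; // BSIZE stores total size minus one
    [
        0x1f, 0x8b, // gzip magic
        0x08,       // CM = deflate
        0x04,       // FLG: FEXTRA set
        0, 0, 0, 0, // MTIME (conventionally zero)
        0,          // XFL
        0xff,       // OS = unknown
        6, 0,       // XLEN = 6
        b'B', b'C', // "BC" subfield identifies BGZF
        2, 0,       // subfield data length = 2
        (bsize & 0xff) as u8, (bsize >> 8) as u8, // BSIZE, little-endian
    ]
}

fn main() {
    let h = bgzf_header(26 + 2); // e.g. a block with a 2-byte payload
    assert_eq!(&h[0..2], &[0x1f, 0x8b]);
    // Header (18) + footer (8) = 26 bytes of per-block overhead.
    assert_eq!(h.len() + 8, 26);
    println!("{:02x?}", h);
}
```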
### Parallel Processing
For multi-threaded transcoding, we use a producer-consumer pipeline:
- Main thread (Producer): Parses DEFLATE, accumulates tokens, resolves cross-boundary references
- Worker pool (Consumers): Replay tokens to bytes, compute CRC32, encode to BGZF blocks
- Output assembly: Blocks are reassembled in order using sequence numbers
The main thread is the bottleneck because DEFLATE parsing is inherently sequential (Huffman codes are variable-length and context-dependent). In practice, 2 threads is optimal: one for parsing, one for encoding. Additional workers have nothing to do.
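The pipeline shape can be sketched with standard-library channels (illustrative only; the tool's actual internals may differ):

```rust
// Producer-consumer sketch: a sequential producer hands out
// sequence-numbered chunks, workers "encode" them in parallel, and the
// assembler restores order using the sequence numbers.
use std::collections::BTreeMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    let (job_tx, job_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let (done_tx, done_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let job_rx = Arc::new(Mutex::new(job_rx)); // shared by all workers

    let mut handles = Vec::new();
    for _ in 0..2 {
        let rx = Arc::clone(&job_rx);
        let tx = done_tx.clone();
        handles.push(thread::spawn(move || loop {
            let job = rx.lock().unwrap().recv();
            match job {
                Ok((seq, chunk)) => {
                    // Stand-in for "encode tokens to a BGZF block".
                    let encoded = chunk.iter().map(|b| b.wrapping_add(1)).collect();
                    tx.send((seq, encoded)).unwrap();
                }
                Err(_) => break, // producer hung up, no more jobs
            }
        }));
    }
    drop(done_tx);

    // The sequential producer (DEFLATE parsing in the real tool).
    for seq in 0..4 {
        job_tx.send((seq, vec![seq as u8; 3])).unwrap();
    }
    drop(job_tx);

    // Reassemble blocks in order using the sequence numbers.
    let mut pending = BTreeMap::new();
    let mut next = 0;
    let mut output = Vec::new();
    for (seq, block) in done_rx {
        pending.insert(seq, block);
        while let Some(block) = pending.remove(&next) {
            output.extend(block);
            next += 1;
        }
    }
    for h in handles { h.join().unwrap(); }
    assert_eq!(output, [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]);
    println!("{:?}", output);
}
```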
## Benchmarks

Run benchmarks with:
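The command was lost in formatting; assuming the standard Cargo bench harness:

```sh
cargo bench
```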
Benchmarks compare:
- Single-threaded vs parallel transcoding
- Various block sizes
## Testing
```sh
# Run all tests
# Run integration tests
# Run with verbose output
```
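Assuming the standard Cargo test harness, these correspond to commands like (the integration-test target name is a placeholder):

```sh
cargo test                      # run all tests
cargo test --test integration   # run integration tests
cargo test -- --nocapture       # run with verbose output
```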
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
## License
MIT License - see LICENSE for details.