# rebgzf
Note: This code was generated by Claude (Anthropic's AI assistant) and has not been thoroughly reviewed or tested in production environments. Use at your own risk. Contributions and reviews are welcome.
Efficient gzip to BGZF transcoder using Puffin-style half-decompression.
## Motivation
BGZF (Blocked GNU Zip Format) is a variant of gzip that enables random access to compressed data. It's widely used in bioinformatics for BAM and BCF files and for bgzip-compressed VCF and FASTA.
Converting standard gzip files to BGZF traditionally requires:
- Full decompression - decompress the entire gzip stream
- Re-compression - compress the data into BGZF blocks
This is slow and CPU-intensive because compression is the expensive part.
## The Puffin Approach
This tool uses a half-decompression technique inspired by Puffin:
- Parse the DEFLATE stream to extract LZ77 tokens (literals, lengths, distances)
- Track only enough context to resolve cross-boundary references
- Re-encode the tokens directly into new BGZF blocks
This avoids the expensive compression step entirely: we are just re-serializing already-compressed data into a different block structure.
## Performance
Half-decompression is significantly faster than full decompress + recompress because:
- No dictionary lookups for compression
- No hash table maintenance
- Minimal memory footprint (just the sliding window)
- Parallel encoding of independent blocks
Benchmark Results (54 MB gzip → 80 MB BGZF, original data 208 MB uncompressed FASTQ):
| Threads | Time | Throughput | Speedup |
|---|---|---|---|
| 1 | 2.92s | 19.6 MB/s | 1.0x |
| 2 | 1.76s | 32.4 MB/s | 1.66x |
| 4 | 1.79s | 32.0 MB/s | 1.63x |
| 8 | 1.78s | 32.2 MB/s | 1.64x |
Throughput measured on compressed input size. Tested on Apple M1.
Output size: BGZF output is typically ~1.5x larger than the original gzip because:
- Each BGZF block has 26 bytes of overhead (header + footer)
- Cross-boundary LZ77 references are expanded to literals
- Fixed Huffman encoding (less optimal than original dynamic tables)
Why only 2 threads help:
The tool uses a producer-consumer architecture where the main thread sequentially parses DEFLATE blocks (this cannot be parallelized) while worker threads encode BGZF blocks in parallel. With 2 threads total:
- Main thread: parses DEFLATE, resolves cross-boundary references
- Worker thread: encodes tokens to BGZF, computes CRC32
Additional threads beyond 2 provide no benefit because the main thread's sequential parsing is the bottleneck. The single worker can easily keep up with the parsing rate.
Trade-off: BGZF files are larger than gzip but enable random access, so the tool is best suited for format conversion pipelines where random access is needed.
## Installation

### From source
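The build commands were lost in formatting; assuming the standard Cargo workflow (the repository URL is not stated here):

```sh
git clone <repository-url>
cd rebgzf
cargo install --path .   # or: cargo build --release
```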
### From crates.io (coming soon)

## Usage

### Command Line
```sh
# Basic transcoding (fastest, level 1)
# With progress display
# Better compression with dynamic Huffman (level 6)
# Best compression for FASTQ files (dynamic Huffman + record-aligned blocks)
# Generate GZI index for random access
# Parallel transcoding (auto-detect threads)
# Check if a file is already BGZF
# Strict validation (all blocks, works with stdin)
# Force transcoding even if already BGZF
# Verbose output with statistics
```
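The command lines themselves did not survive formatting; the following invocations are a reconstruction from the documented options below (file names are placeholders):

```sh
rebgzf -i reads.fastq.gz -o reads.fastq.bgz              # basic, level 1
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -p           # progress display
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -l 6         # dynamic Huffman
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -l 9 --format fastq
rebgzf -i reads.fastq.gz -o reads.fastq.bgz --index      # GZI index
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -t 0         # auto-detect threads
rebgzf -i reads.fastq.bgz --check                        # BGZF check
cat reads.fastq.bgz | rebgzf -i - --check --strict       # strict, via stdin
rebgzf -i reads.fastq.bgz -o copy.bgz --force            # force transcoding
rebgzf -i reads.fastq.gz -o reads.fastq.bgz -v           # verbose statistics
```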
### CLI Options

```text
Convert gzip files to BGZF format efficiently

Usage: rebgzf [OPTIONS] --input <INPUT>

Options:
  -i, --input <INPUT>            Input gzip file (use - for stdin)
  -o, --output <OUTPUT>          Output BGZF file (use - for stdout)
  -t, --threads <THREADS>        Number of threads (0 = auto, 1 = single-threaded) [default: 1]
  -l, --level <LEVEL>            Compression level 1-9 (1-3: fixed Huffman, 4-6: dynamic,
                                 7-9: dynamic + smart boundaries) [default: 1]
      --format <FORMAT>          Input format profile: default, fastq, auto [default: default]
      --block-size <BLOCK_SIZE>  BGZF block size (default: 65280) [default: 65280]
  -v, --verbose                  Show verbose statistics
  -q, --quiet                    Quiet mode - suppress all output except errors
      --json                     Output results as JSON (for scripting)
      --check                    Check if input is BGZF and exit (0=BGZF, 1=not BGZF, 2=error)
      --strict                   Validate all BGZF blocks (slower, more thorough)
      --verify                   Verify BGZF by decompressing and checking CRC32
      --stats                    Show file statistics without transcoding
      --force                    Force transcoding even if input is already BGZF
  -p, --progress                 Show progress during transcoding
      --index [PATH]             Write GZI index file (enables random access)
  -h, --help                     Print help
  -V, --version                  Print version
```
### As a Library
#### BGZF Detection
## How It Works

### DEFLATE Token Extraction
A DEFLATE stream consists of:
- Literals: raw bytes (0-255)
- Back-references: (length, distance) pairs pointing to earlier data
- End-of-block: marker indicating block completion
We parse the Huffman-coded stream to extract these LZ77 tokens without fully reconstructing the original data. The key insight is that we only need the tokens themselves, not the decompressed bytes.
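As a sketch, the extracted stream can be modeled like this (the type and function names are illustrative, not the crate's actual API):

```rust
// Sketch of the LZ77 token stream a DEFLATE parser yields.
// Names are illustrative, not rebgzf's actual types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Literal(u8),                  // raw byte 0-255
    Copy { len: u16, dist: u16 }, // copy `len` bytes from `dist` bytes back
}

/// Replay tokens into plain bytes -- only needed when a CRC or a
/// cross-boundary resolution actually requires the decompressed data.
fn replay(tokens: &[Token]) -> Vec<u8> {
    let mut out = Vec::new();
    for &t in tokens {
        match t {
            Token::Literal(b) => out.push(b),
            Token::Copy { len, dist } => {
                // DEFLATE copies may overlap their own output (dist < len),
                // so copy byte by byte.
                let start = out.len() - dist as usize;
                for i in 0..len as usize {
                    let b = out[start + i];
                    out.push(b);
                }
            }
        }
    }
    out
}

fn main() {
    // "abcabcab" as three literals plus one overlapping back-reference.
    let tokens = [
        Token::Literal(b'a'),
        Token::Literal(b'b'),
        Token::Literal(b'c'),
        Token::Copy { len: 5, dist: 3 },
    ];
    assert_eq!(replay(&tokens), b"abcabcab");
    println!("{}", String::from_utf8(replay(&tokens)).unwrap());
}
```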
### Cross-Boundary Reference Resolution
When splitting into BGZF blocks, back-references may point across block boundaries. We maintain a 32KB sliding window to:
- Detect when a reference crosses the current block boundary
- Resolve cross-boundary references by emitting the referenced bytes as literals
- Preserve intra-block references as-is (they compress better)
### Re-encoding
Tokens are re-encoded using either:
- Fixed Huffman tables (levels 1-3): Fast encoding using pre-defined tables (RFC 1951 section 3.2.6)
- Dynamic Huffman tables (levels 4-9): Per-block optimal tables computed from token frequencies
At levels 7-9 with `--format fastq`, block boundaries are aligned to FASTQ record boundaries for better compression.
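For reference, the fixed-table code lengths from RFC 1951 section 3.2.6 that levels 1-3 rely on (the function is a sketch, not the crate's code):

```rust
// Fixed Huffman code lengths for literal/length symbols, per RFC 1951
// section 3.2.6. Levels 1-3 use these pre-defined lengths instead of
// computing per-block tables. (All 30 distance codes use 5 bits.)
fn fixed_litlen_bits(symbol: u16) -> u8 {
    match symbol {
        0..=143 => 8,   // literals 0-143
        144..=255 => 9, // literals 144-255
        256..=279 => 7, // end-of-block and short length codes
        280..=287 => 8, // long length codes
        _ => panic!("not a literal/length symbol"),
    }
}

fn main() {
    assert_eq!(fixed_litlen_bits(b'A' as u16), 8);
    assert_eq!(fixed_litlen_bits(256), 7); // end-of-block marker
    println!("ok");
}
```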
## Optimization Techniques

### Table-Based Huffman Decoding
Standard Huffman decoding reads one bit at a time, requiring multiple loop iterations per symbol. We use a lookup table approach:
- Build a 1024-entry table (10-bit lookup) at decoder construction time
- Peek 10 bits from the bitstream
- Single table lookup returns both the symbol and its actual code length
- Consume only the actual code length bits
This reduces most symbol decodes from ~9 loop iterations to a single table lookup, providing a ~30% speedup in parsing.
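The idea in miniature, with a toy three-symbol code (an illustrative sketch, not the crate's decoder):

```rust
// Table-based Huffman decoding: for each of the 2^BITS possible bit
// patterns, precompute which symbol the leading code decodes to and how
// many bits it actually consumes; decoding becomes one lookup.
const BITS: usize = 10;

/// `codes[i] = (code, len)`: canonical code for symbol i, MSB-first.
fn build_table(codes: &[(u16, u8)]) -> Vec<(u16, u8)> {
    let mut table = vec![(0u16, 0u8); 1 << BITS];
    for (sym, &(code, len)) in codes.iter().enumerate() {
        // Every pattern whose top `len` bits equal `code` maps to `sym`.
        let shift = BITS - len as usize;
        for fill in 0..(1usize << shift) {
            table[((code as usize) << shift) | fill] = (sym as u16, len);
        }
    }
    table
}

fn main() {
    // Toy code: symbol 0 = "0" (1 bit), 1 = "10" (2 bits), 2 = "11" (2 bits).
    let table = build_table(&[(0b0, 1), (0b10, 2), (0b11, 2)]);
    // Peek 10 bits of the stream "10 0 11 ..." -> pattern 0b1001100000.
    let (sym, len) = table[0b1001100000];
    assert_eq!((sym, len), (1, 2)); // symbol 1, consume only 2 bits
    println!("symbol {} uses {} bits", sym, len);
}
```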
### Conditional CRC Computation
CRC32 computation is handled differently based on thread count:
- Single-threaded: CRC computed inline during boundary resolution (cache-efficient, no duplicate work)
- Multi-threaded: Workers compute CRC in parallel by replaying tokens to bytes
This avoids adding sequential work to the main thread in parallel mode.
### Hardware-Accelerated CRC32

Uses the `crc32fast` crate, which automatically leverages:
- SSE4.2 CRC instructions on x86_64
- ARM CRC instructions on aarch64
- Optimized software fallback on other platforms
## Architecture

```text
┌───────────────────────────────────────────────────────────┐
│ Input (gzip)                                              │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ GzipHeader Parser                                         │
│ - Validates gzip magic bytes (1f 8b)                      │
│ - Skips optional fields (FEXTRA, FNAME, FCOMMENT, FHCRC)  │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ DEFLATE Parser (DeflateParser)                            │
│ - Parses block headers (BFINAL, BTYPE)                    │
│ - Decodes Huffman tables (fixed or dynamic)               │
│ - Uses table-based lookup for fast symbol decoding        │
│ - Extracts LZ77 tokens: Literal(u8), Copy{len, dist}      │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ Boundary Resolver (BoundaryResolver)                      │
│ - Maintains 32KB sliding window                           │
│ - Detects cross-boundary back-references                  │
│ - Resolves to literals when necessary                     │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ Huffman Encoder (HuffmanEncoder)                          │
│ - Re-encodes tokens to DEFLATE format                     │
│ - Fixed tables (levels 1-3) or dynamic (levels 4-9)       │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ BGZF Writer (BgzfWriter)                                  │
│ - Wraps DEFLATE data in BGZF block format                 │
│ - Adds BC extra field with block size                     │
│ - Computes CRC32 and writes footer                        │
│ - Appends EOF marker (28-byte empty block)                │
└───────────────────────────────────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│ Output (BGZF)                                             │
└───────────────────────────────────────────────────────────┘
```
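As a concrete illustration of the BGZF Writer stage, here is a sketch of the 18-byte block header defined by the BGZF specification (part of the SAM format spec). The byte layout is fixed by that spec; MTIME/XFL of zero and OS of 0xff are the conventional values. Header (18 bytes) plus footer (8 bytes) gives the 26 bytes of per-block overhead cited above.

```rust
// Build an 18-byte BGZF block header (a sketch, not rebgzf's code).
// `total_block_size` is the full block length including this header,
// the deflate payload, and the 8-byte CRC32/ISIZE footer.
fn bgzf_header(total_block_size: u16) -> [u8; 18] {
    let bsize = total_block_size - 1; // BSIZE stores total size minus one
    [
        0x1f, 0x8b, // gzip magic
        0x08,       // CM = deflate
        0x04,       // FLG: FEXTRA set
        0, 0, 0, 0, // MTIME (conventionally zero)
        0,          // XFL
        0xff,       // OS = unknown
        6, 0,       // XLEN = 6
        b'B', b'C', // "BC" subfield identifies BGZF
        2, 0,       // subfield data length = 2
        (bsize & 0xff) as u8, (bsize >> 8) as u8, // BSIZE, little-endian
    ]
}

fn main() {
    let h = bgzf_header(26 + 2); // e.g. a block with a 2-byte payload
    assert_eq!(&h[0..2], &[0x1f, 0x8b]);
    // Header (18) + footer (8) = 26 bytes of per-block overhead.
    assert_eq!(h.len() + 8, 26);
    println!("{:02x?}", h);
}
```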
### Parallel Processing
For multi-threaded transcoding, we use a producer-consumer pipeline:
- Main thread (Producer): Parses DEFLATE, accumulates tokens, resolves cross-boundary references
- Worker pool (Consumers): Replay tokens to bytes, compute CRC32, encode to BGZF blocks
- Output assembly: Blocks are reassembled in order using sequence numbers
The main thread is the bottleneck because DEFLATE parsing is inherently sequential (Huffman codes are variable-length and context-dependent). In practice, 2 threads is optimal: one for parsing, one for encoding. Additional workers have nothing to do.
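The pipeline shape can be sketched with standard-library channels (illustrative only; the tool's actual internals may differ):

```rust
// Producer-consumer sketch: a sequential producer hands out
// sequence-numbered chunks, workers "encode" them in parallel, and the
// assembler restores order using the sequence numbers.
use std::collections::BTreeMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    let (job_tx, job_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let (done_tx, done_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let job_rx = Arc::new(Mutex::new(job_rx)); // shared by all workers

    let mut handles = Vec::new();
    for _ in 0..2 {
        let rx = Arc::clone(&job_rx);
        let tx = done_tx.clone();
        handles.push(thread::spawn(move || loop {
            let job = rx.lock().unwrap().recv();
            match job {
                Ok((seq, chunk)) => {
                    // Stand-in for "encode tokens to a BGZF block".
                    let encoded = chunk.iter().map(|b| b.wrapping_add(1)).collect();
                    tx.send((seq, encoded)).unwrap();
                }
                Err(_) => break, // producer hung up, no more jobs
            }
        }));
    }
    drop(done_tx);

    // The sequential producer (DEFLATE parsing in the real tool).
    for seq in 0..4 {
        job_tx.send((seq, vec![seq as u8; 3])).unwrap();
    }
    drop(job_tx);

    // Reassemble blocks in order using the sequence numbers.
    let mut pending = BTreeMap::new();
    let mut next = 0;
    let mut output = Vec::new();
    for (seq, block) in done_rx {
        pending.insert(seq, block);
        while let Some(block) = pending.remove(&next) {
            output.extend(block);
            next += 1;
        }
    }
    for h in handles { h.join().unwrap(); }
    assert_eq!(output, [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]);
    println!("{:?}", output);
}
```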
## Benchmarks

Run benchmarks with:
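The command was lost in formatting; assuming the standard Cargo bench harness:

```sh
cargo bench
```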
Benchmarks compare:
- Single-threaded vs parallel transcoding
- Various block sizes
## Testing
```sh
# Run all tests
# Run integration tests
# Run with verbose output
```
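Assuming the standard Cargo test harness, these correspond to commands like (the integration-test target name is a placeholder):

```sh
cargo test                      # run all tests
cargo test --test integration   # run integration tests
cargo test -- --nocapture       # run with verbose output
```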
## Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
## License
MIT License - see LICENSE for details.