rebgzf 0.2.0

Efficient gzip to BGZF transcoder using Puffin-style half-decompression
Documentation
# rebgzf Feature Implementation Status

**Last Updated:** 2024-12-19
**Status:** All planned phases complete

---

## Overview

A high-performance gzip-to-BGZF transcoder that re-encodes gzip streams into BGZF format without full decompression/recompression. Supports compression levels 1-9, FASTQ-aware block splitting, GZI index generation, and parallel processing.

## Completed Features

| Feature | Description | Status |
|---------|-------------|--------|
| Compression levels 1-9 | Configurable encoding quality | ✓ Complete |
| Format detection | `--format fastq/auto/default` | ✓ Complete |
| Dynamic Huffman encoding | Per-block optimal tables (levels 4+) | ✓ Complete |
| Bulk bit refill | Reads 8 bytes at once for efficiency | ✓ Complete |
| Bulk bit writing | 64-bit buffer for efficient output | ✓ Complete |
| Progress tracking | `--progress` with throughput display | ✓ Complete |
| FASTQ-aware splitting | Record-aligned block boundaries (levels 7+) | ✓ Complete |
| GZI index output | `--index` for random access support | ✓ Complete |
| Streaming validation | Validate BGZF from stdin/pipes | ✓ Complete |
| Deep verification | `--verify` with CRC32 checking | ✓ Complete |
| File statistics | `--stats` mode | ✓ Complete |
| Quiet mode | `-q/--quiet` flag | ✓ Complete |
| JSON output | `--json` for scripting | ✓ Complete |
| Parallel transcoding | Multi-threaded encoding | ✓ Complete |
| Concatenated gzip | Handle multi-member gzip files | ✓ Complete |

---

## Compression Level Design

| Level | Encoding | Boundaries | Use Case |
|-------|----------|------------|----------|
| 1-3 | Fixed Huffman | Simple size-based | Fastest transcoding |
| 4-6 | Dynamic Huffman | Simple size-based | Better compression |
| 7-9 | Dynamic Huffman | Smart FASTQ-aware | Best for FASTQ files |

- `--format fastq` implies at least level 6
- Default: Level 1 (fast, preserves original behavior)

---

## Architecture

### Core Components

```
src/
├── bgzf/
│   ├── constants.rs    # BGZF format constants
│   ├── detector.rs     # BGZF detection & validation
│   ├── index.rs        # GZI index builder
│   ├── writer.rs       # BGZF block writer
│   └── mod.rs
├── bits/
│   ├── reader.rs       # Bit-level input (bulk refill)
│   └── writer.rs       # Bit-level output
├── deflate/
│   ├── parser.rs       # DEFLATE stream parser
│   ├── tables.rs       # Length/distance tables
│   └── tokens.rs       # LZ77 token types
├── huffman/
│   ├── decoder.rs      # Huffman decoding
│   └── encoder.rs      # Fixed + dynamic encoding
├── transcoder/
│   ├── boundary.rs     # Cross-block reference resolution
│   ├── parallel.rs     # Multi-threaded transcoder
│   ├── single.rs       # Single-threaded transcoder
│   ├── splitter.rs     # Block split point detection
│   └── window.rs       # Sliding window for LZ77
├── gzip/
│   └── header.rs       # Gzip header parsing
├── error.rs
└── lib.rs              # Public API
```

### CLI Options

```
rebgzf [OPTIONS] -i <INPUT> -o <OUTPUT>

Options:
  -i, --input <FILE>      Input gzip file (- for stdin)
  -o, --output <FILE>     Output BGZF file (- for stdout)
  -t, --threads <N>       Number of threads (0=auto, 1=single)
  -l, --level <1-9>       Compression level (default: 1)
      --format <TYPE>     Format profile: default, fastq, auto
      --block-size <N>    BGZF block size (default: 65280)
      --index [PATH]      Write GZI index file
      --check             Check if input is BGZF
      --strict            Validate all BGZF blocks
      --force             Force transcoding even if already BGZF
  -p, --progress          Show progress during transcoding
  -v, --verbose           Show detailed statistics
```

---

## Test Coverage

- **63 unit tests** covering all modules
- **36 integration tests** including:
  - Single/parallel transcoding
  - Various data sizes and patterns
  - BGZF detection and validation
  - Index generation verification
  - Streaming validation
  - Concatenated gzip handling

---

## Performance Characteristics

- **Transcoding**: ~100-300 MB/s single-threaded (data dependent)
- **Parallel scaling**: Near-linear up to 8 threads
- **Memory**: Bounded by block size + thread count
- **Overhead**: Minimal vs full decompress/recompress

---

## Potential Future Enhancements

### Performance
- [ ] SIMD-accelerated newline detection for FASTQ splitting
- [ ] Buffer pooling in parallel mode to reduce allocations
- [ ] Memory-mapped I/O option for large files

### Features
- [ ] GZI index reader for random access extraction
- [ ] BAM-aware block splitting
- [ ] Deep validation mode (decompress + verify CRC)
- [ ] Re-blocking existing BGZF files
- [ ] Compression ratio statistics per block

### CLI
- [ ] Multiple input files (batch processing)
- [ ] JSON stats output (`--json`)
- [ ] Quiet mode (`-q`)

---

## Commit History (Recent)

```
1e7a004 Add GZI index generation and streaming BGZF validation
c192b50 Add FASTQ-aware smart boundary splitting for levels 7+
d83bf93 Add I/O optimizations: bulk refill and progress tracking
3b87431 Implement dynamic Huffman encoding for compression levels 4+
397db16 Add compression level and format profile support
223a2d8 Optimize transcoder performance and clarify documentation
```

---

## Usage Examples

```bash
# Basic transcoding (fastest)
rebgzf -i input.gz -o output.bgzf

# With progress display
rebgzf -i input.gz -o output.bgzf --progress

# Best compression for FASTQ
rebgzf -i reads.fastq.gz -o reads.bgzf --format fastq --level 9

# Generate index for random access
rebgzf -i data.gz -o data.bgzf --index

# Parallel transcoding
rebgzf -i large.gz -o large.bgzf -t 8 --progress

# Check if file is BGZF
rebgzf -i file.gz --check

# Strict validation (all blocks)
rebgzf -i file.bgzf --check --strict
```