bamslice
Extract specific byte ranges from BAM/CRAM files and convert to interleaved FASTQ format. Designed for parallel processing across compute nodes without requiring pre-indexing.
Features
- No pre-indexing required - accepts approximate byte offsets
- Auto-aligns to block boundaries - finds the next valid BGZF block at or after the start offset
- Byte-range based - process arbitrary byte ranges for easy parallelization
- No overlap - using contiguous byte ranges guarantees no duplicate reads
- Interleaved FASTQ output - same format as samtools fastq
- Parallel-ready - designed for distributed processing
Installation
Build with cargo build --release. The binary is written to target/release/bamslice
Usage
Arguments
- --input, -i: Input BAM
- --start-offset, -s: Starting byte offset (will find next BGZF block at or after this offset)
- --end-offset, -e: Ending byte offset (will stop when reaching a block at or after this offset)
- --output, -o: Output FASTQ file (default: stdout)
Examples
Extract first half of file:
FILE_SIZE=$(stat -f%z input.bam) # macOS
# FILE_SIZE=$(stat -c%s input.bam) # Linux
HALF=$((FILE_SIZE / 2))
Extract second half (no overlap!):
Output to stdout:
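As a sketch of what the invocations for the cases above could look like, the helper below prints (rather than runs) one command per case, built only from the flags listed under Arguments. The helper name, the output file names, the gzip stage, and the assumption that bamslice is on the PATH are illustrative and not taken from the project.

```shell
# Hypothetical helper: print, without running, example bamslice commands
# for an input file of SIZE bytes. Output file names are placeholders.
example_commands() (
  input=$1
  size=$2
  half=$((size / 2))
  # First half: byte range [0, half)
  echo "bamslice --input $input --start-offset 0 --end-offset $half --output first_half.fastq"
  # Second half: starts exactly where the first range ended
  echo "bamslice --input $input --start-offset $half --end-offset $size --output second_half.fastq"
  # No --output: FASTQ goes to stdout and can be piped onward
  echo "bamslice --input $input --start-offset 0 --end-offset $half | gzip > first_half.fastq.gz"
)

example_commands input.bam 1000
```

Because the second range begins at the same byte where the first one ends, the two commands together cover the file once.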
Parallel Processing
The tool works on byte ranges, so jobs can be parallelized without any coordination between them.
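One way to derive the ranges is to cut the file size into N contiguous pieces, where each piece starts at the previous piece's end and the last piece absorbs the remainder. The sketch below is an assumption about how a caller might do this, not part of bamslice; byte_ranges is a hypothetical helper.

```shell
# Hypothetical helper: print "START END" for N contiguous ranges covering
# [0, SIZE). The last range absorbs the remainder of the division.
byte_ranges() (
  size=$1
  n=$2
  chunk=$((size / n))
  i=0
  while [ "$i" -lt "$n" ]; do
    start=$((i * chunk))
    if [ "$i" -eq $((n - 1)) ]; then
      end=$size
    else
      end=$((start + chunk))
    fi
    echo "$start $end"
    i=$((i + 1))
  done
)

byte_ranges 1000 3
# 0 333
# 333 666
# 666 1000
```

Each output line maps to one independent bamslice invocation through --start-offset and --end-offset, so the jobs can run on different nodes without talking to each other.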
Nextflow Example
See example.nf for a pipeline that pipes bamslice output through fastp for QC/filtering.
How It Works
- BGZF Structure: BAM files use BGZF (Blocked GZIP) - a series of independent compressed blocks
- Block Discovery: Given a start offset, scans forward to find the next valid BGZF block (magic: 0x1f 0x8b 0x08)
- Range Processing: Processes all reads from blocks starting before end_offset
- No Overlap: Each block is processed by exactly one job when using contiguous byte ranges
- FASTQ Output: Converts BAM records to interleaved FASTQ format
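The block discovery step can be approximated from the shell for inspection purposes. The sketch below looks for the three magic bytes at or after a given offset using od and awk. It is not how bamslice itself is implemented, and next_block_candidate is a hypothetical name. Matching three bytes only yields a candidate: the same bytes can occur by chance inside compressed data, so a robust scanner would presumably also check the rest of the BGZF header before accepting the offset.

```shell
# Hypothetical helper: print the first offset >= OFFSET in FILE where the
# bytes 1f 8b 08 occur. Only a 64 KiB window (plus 3 bytes for a magic at
# the window edge) is read, since a BGZF block is at most 64 KiB long.
# Prints nothing if no match is found in that window.
next_block_candidate() (
  file=$1
  offset=$2
  od -An -v -tx1 -j "$offset" -N 65539 "$file" |
    tr -s ' ' '\n' |
    awk -v base="$offset" '
      NF {
        # n is the 0-based index of the current byte within the window
        if (p2 == "1f" && p1 == "8b" && $1 == "08") { print base + n - 2; exit }
        p2 = p1; p1 = $1; n++
      }'
)

# Usage: next_block_candidate input.bam 1000000
```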
Why Byte Ranges?
- No indexing overhead: Don't need to scan the entire file first
- Trivial parallelization: Just choose your start/end offsets (see the Nextflow example)
- No coordination: Each process works independently
- Guaranteed coverage: Contiguous ranges ensure no reads are skipped
- No duplication: Block alignment ensures no reads are processed twice
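The coverage and no-duplication claims follow from a simple ownership rule implied by the description above: a block belongs to the job whose range contains the block's start offset. The sketch below applies that rule to a made-up list of block start offsets; owned_blocks and the offsets are hypothetical and only illustrate the rule.

```shell
# Hypothetical helper: given START, END and a list of block start offsets,
# print each offset b with START <= b < END, i.e. the blocks that a job
# with range [START, END) would process.
owned_blocks() (
  start=$1
  end=$2
  shift 2
  for b in "$@"; do
    if [ "$b" -ge "$start" ] && [ "$b" -lt "$end" ]; then
      echo "$b"
    fi
  done
)

# Made-up block starts, split at byte 500 into two contiguous ranges:
owned_blocks 0 500 0 180 420 610 900      # prints 0, 180, 420 (one per line)
owned_blocks 500 1000 0 180 420 610 900   # prints 610, 900 (one per line)
```

The block starting at 420 may extend past byte 500, but it is still owned by the first job only; the second job starts scanning at 500 and picks up at 610.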
Testing
Run the test suite to verify correctness:
Development Commands
Run a coverage analysis:
&&
Build a flamegraph for performance profiling:
&&
License
AGPLv3 - See LICENSE file for details