bamslice

Extract specific byte ranges from BAM/CRAM files and convert to interleaved FASTQ format. Designed for parallel processing across compute nodes without requiring pre-indexing.

Features

No pre-indexing required - accepts approximate byte offsets
Auto-aligns to block boundaries - finds the next valid BGZF block at or after the start offset
Byte-range based - process arbitrary byte ranges for easy parallelization
No overlap - using contiguous byte ranges guarantees no duplicate reads
Interleaved FASTQ output - same format as samtools fastq
Parallel-ready - designed for distributed processing

Installation

cargo build --release

Binary: target/release/bamslice

Usage

bamslice \
  --input input.bam \
  --start-offset 0 \
  --end-offset 10000000 \
  --output output.fastq

Arguments

--input, -i: Input BAM file
--start-offset, -s: Starting byte offset (will find next BGZF block at or after this offset)
--end-offset, -e: Ending byte offset (exclusive: blocks that start at or after this offset are not processed)
--output, -o: Output FASTQ file (default: stdout)

Examples

Extract first half of file:

FILE_SIZE=$(stat -f%z input.bam)  # macOS
# FILE_SIZE=$(stat -c%s input.bam)  # Linux
HALF=$((FILE_SIZE / 2))

bamslice -i input.bam -s 0 -e $HALF -o first_half.fastq

Extract second half (no overlap!):

bamslice -i input.bam -s $HALF -e $FILE_SIZE -o second_half.fastq

Output to stdout:

bamslice -i input.bam -s 0 -e 1000000 | head -n 4

Parallel Processing

The tool operates on byte ranges, so work can be parallelized without coordination: each job needs only its own start and end offsets.
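
As a rough illustration of how the ranges could be chosen, the sketch below splits a file of known size into contiguous half-open ranges and renders one bamslice command per range. The helper names chunk_ranges and command_for and the chunk_N.fastq output names are hypothetical; only the -i, -s, -e, and -o flags come from the Arguments section.

```rust
/// Split `file_size` bytes into contiguous half-open ranges [start, end).
/// Hypothetical helper for illustration; not part of bamslice itself.
fn chunk_ranges(file_size: u64, chunk_size: u64) -> Vec<(u64, u64)> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut ranges = Vec::new();
    let mut start = 0u64;
    while start < file_size {
        // Clamp the final range to the end of the file.
        let end = start.saturating_add(chunk_size).min(file_size);
        ranges.push((start, end));
        // The next range begins exactly where this one ends.
        start = end;
    }
    ranges
}

/// Render the bamslice command line for one range.
/// The output file naming scheme is an assumption made for this example.
fn command_for(input: &str, index: usize, (start, end): (u64, u64)) -> String {
    format!("bamslice -i {input} -s {start} -e {end} -o chunk_{index}.fastq")
}
```

For a 250-byte file and 100-byte chunks, chunk_ranges returns (0, 100), (100, 200), (200, 250). Each range ends where the next one starts, which is what keeps the ranges contiguous.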

Nextflow Example

See example.nf for a pipeline that pipes bamslice output through fastp for QC/filtering.

nextflow run example.nf --bam input.bam --chunk_size 104857600

How It Works

BGZF Structure: BAM files use BGZF (Blocked GZIP) - a series of independent compressed blocks
Block Discovery: Given a start offset, scans forward to find the next valid BGZF block (magic: 0x1f 0x8b 0x08)
Range Processing: Processes all reads from blocks starting before end_offset
No Overlap: Each block is processed by exactly one job when using contiguous byte ranges
FASTQ Output: Converts BAM records to interleaved FASTQ format
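
The block discovery and range rules above can be sketched as follows. This is a minimal illustration, not bamslice's actual implementation: find_block_start and owns_block are hypothetical names, the scan runs over an in-memory buffer, and it matches only the three magic bytes quoted above. A robust scanner would presumably validate more of the header, since those bytes can also occur by chance inside compressed data.

```rust
/// First three bytes of a BGZF block header (gzip ID1, ID2, CM = deflate),
/// as quoted in the list above.
const BGZF_MAGIC: [u8; 3] = [0x1f, 0x8b, 0x08];

/// Return the offset of the first candidate block header at or after `from`,
/// or None if no candidate is found (or `from` lies past the end of `buf`).
/// Hypothetical helper: it checks the magic bytes only.
fn find_block_start(buf: &[u8], from: usize) -> Option<usize> {
    buf.get(from..)?
        .windows(BGZF_MAGIC.len())
        .position(|w| w == &BGZF_MAGIC[..])
        .map(|pos| from + pos)
}

/// A block belongs to the job whose half-open range [start, end) contains
/// the block's first byte, so contiguous ranges never share a block.
fn owns_block(block_offset: u64, start: u64, end: u64) -> bool {
    start <= block_offset && block_offset < end
}
```

With this rule, a block that starts exactly at offset 200 is skipped by the job for range 0 to 200 and picked up by the job for range 200 to 400.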

Why Byte Ranges?

No indexing overhead: No need to scan the entire file first
Trivial parallelization: Just choose your start/end offsets (see the Nextflow example above)
No coordination: Each process works independently
Guaranteed coverage: Contiguous ranges ensure no reads are skipped
No duplication: Block alignment ensures no reads are processed twice

Testing

Run the test suite to verify correctness:

cargo test

Development Commands

Run a coverage analysis (open is the macOS command; use xdg-open on Linux):

bash ./coverage.sh && open target/coverage/html/index.html

Build a flamegraph for performance profiling:

bash ./flamegraph.sh && open flamegraph.svg

License

AGPLv3 - See LICENSE file for details

bamslice 0.1.2