bqtools
A command-line utility for working with BINSEQ files.
Overview
bqtools provides tools to encode, decode, manipulate, and analyze BINSEQ files.
It supports both BQ (*.bq) and VBQ (*.vbq) files and is built on the binseq library.
BINSEQ is a binary file format family designed for high-performance processing of DNA sequences. It currently has two variants: BQ and VBQ.
- BQ (*.bq): Optimized for fixed-length DNA sequences without quality scores.
- VBQ (*.vbq): Optimized for variable-length DNA sequences with optional quality scores.
Both variants support single and paired sequences. They use two-bit or four-bit encoding for efficient nucleotide packing (via bitnuc) and rely on paraseq for efficient parallel FASTX processing.
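To give a feel for why two-bit packing is compact, here is a minimal Python sketch. It is not the bitnuc implementation: the code table (A=0, C=1, G=2, T=3), the bit order, and the function names `pack_2bit` and `unpack_2bit` are assumptions made for illustration only.

```python
# Hypothetical 2-bit code table; the actual bitnuc layout may differ.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_2bit(seq):
    """Pack a string of A/C/G/T into an integer, 2 bits per base."""
    packed = 0
    for i, base in enumerate(seq):
        # Base i occupies bits 2i and 2i+1.
        packed |= CODES[base] << (2 * i)
    return packed


def unpack_2bit(packed, length):
    """Recover the sequence; the length must be stored separately."""
    return "".join(BASES[(packed >> (2 * i)) & 0b11] for i in range(length))


seq = "ACGTTGCA"
assert unpack_2bit(pack_2bit(seq), len(seq)) == seq
```

With this layout, 32 bases fit in a single 64-bit word, compared with 32 bytes as ASCII text. Two bits can only distinguish four symbols, which is why non-ATCG characters need special handling under 2-bit encoding.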
For more information about BINSEQ, see our preprint where we describe the format family and its applications.
Features
- Encode: Convert FASTA or FASTQ files to a BINSEQ format
- Decode: Convert a BINSEQ file back to FASTA, FASTQ, or TSV format
- Cat: Concatenate multiple BINSEQ files
- Count: Count records in a BINSEQ file
- Grep: Search for fixed-string or regex patterns in BINSEQ files
Installation
From Cargo
bqtools can be installed using cargo, the Rust package manager:
To install cargo you can follow the instructions on the official Rust website.
From Source
# Clone the repository
# Install
# Check installation
Usage
# Get help information
# Get help for specific commands
Encoding
bqtools accepts input from stdin or from file paths.
It automatically detects the input format and compression status.
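One common way to detect format and compression is to inspect the first few bytes of the input. The Python sketch below illustrates that general idea and is not taken from bqtools; the function names are hypothetical. The magic numbers for gzip (`1f 8b`) and zstd (`28 b5 2f fd`) and the leading `>` (FASTA) and `@` (FASTQ) characters are standard.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def sniff_compression(head):
    """Guess the compression from the first bytes of a stream."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"


def sniff_format(first_byte):
    """Guess FASTA vs FASTQ from the first byte of the decompressed text."""
    if first_byte == b">":
        return "fasta"
    if first_byte == b"@":
        return "fastq"
    return "unknown"


# Demonstration on an in-memory gzip-compressed FASTQ record.
raw = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
compression = sniff_compression(raw[:4])
text = gzip.decompress(raw) if compression == "gzip" else raw
print(compression, sniff_format(text[:1]))
```

Because only a short prefix is needed, this kind of check also works on stdin, where the file name and extension are not available.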
Convert FASTA/FASTQ files to BINSEQ:
# Encode a single file to bq
# Encode a single file to vbq
# Encode a single file to vbq with 4bit encoding
# Encode a file stream to bq (auto-determine input format and compression status)
# Encode paired-end reads
# Encode paired-end reads to vbq
# Encode a SAM/BAM/CRAM file to BINSEQ
# Encode a paired-end CRAM file to BINSEQ (sorted by read name)
# Specify a policy for handling non-ATCG nucleotides (2-bit only)
# Set threads for parallel processing
# Include sequencing headers in the encoding (unused by .bq)
# Encode with ARCHIVE mode (useful for genomes, cDNA libraries, and larger sequences)
# where Ns are common, sequences are large, and headers are important
Available policies for handling non-ATCG nucleotides:
- i: Ignore sequences with non-ATCG characters
- p: Break on invalid sequences
- r: Randomly draw a nucleotide for each N (default)
- a: Set all Ns to A
- c: Set all Ns to C
- g: Set all Ns to G
- t: Set all Ns to T
Note: These are only applied when encoding with 2-bit.
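The effect of each policy can be sketched in Python as follows. This is an illustration of the behavior described above, not the bqtools implementation: `apply_policy` and `InvalidNucleotide` are hypothetical names, and the sketch assumes that every non-ATCG character is treated like an N, which the list above does not state explicitly.

```python
import random


class InvalidNucleotide(ValueError):
    """Raised by the 'p' policy when a sequence contains a non-ATCG character."""


def apply_policy(seq, policy="r", rng=None):
    """Return a sequence safe for 2-bit encoding, or None if it should be skipped."""
    rng = rng or random.Random()
    if all(base in "ACGT" for base in seq):
        return seq  # nothing to fix
    if policy == "i":
        return None  # ignore (skip) the whole sequence
    if policy == "p":
        raise InvalidNucleotide(f"non-ATCG character in {seq!r}")
    if policy == "r":
        # Assumption: any non-ATCG character is replaced, not only N.
        return "".join(b if b in "ACGT" else rng.choice("ACGT") for b in seq)
    if policy in ("a", "c", "g", "t"):
        fill = policy.upper()
        return "".join(b if b in "ACGT" else fill for b in seq)
    raise ValueError(f"unknown policy: {policy!r}")


print(apply_policy("ACNNT", "a"))
print(apply_policy("ACNNT", "i"))
```

Note that the random policy makes encoding non-deterministic unless the random source is seeded, so a fixed replacement (a, c, g, or t) may be preferable when reproducible output matters.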
Recursive Encoding
You might have a directory or nested subdirectories with multiple FASTX files or FASTX file pairs.
bqtools makes use of the efficient walkdir crate to recursively identify all FASTX files with various compression formats.
It then distributes the discovered files or file pairs across the thread pool so that they are encoded in parallel.
All options provided by bqtools encode will be passed through to the sub-encoders.
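The discovery and balancing steps can be pictured with the Python sketch below. It is a simplified stand-in, not the bqtools code: it uses `os.walk` instead of the walkdir crate, the list of recognized extensions is an assumption, round-robin assignment is only one possible balancing strategy, and `max_depth` here means "how many directory levels below the root to descend", which may not match the exact semantics of the bqtools option.

```python
import os

# Assumed extension lists for illustration; bqtools may recognize others.
FASTX_SUFFIXES = (".fa", ".fasta", ".fq", ".fastq")
COMPRESSION_SUFFIXES = (".gz", ".zst")


def is_fastx(name):
    """True if the file name looks like a (possibly compressed) FASTX file."""
    lower = name.lower()
    for ext in COMPRESSION_SUFFIXES:
        if lower.endswith(ext):
            lower = lower[: -len(ext)]
            break
    return lower.endswith(FASTX_SUFFIXES)


def find_fastx(root, max_depth=None):
    """Recursively collect FASTX files, descending at most max_depth levels."""
    root = os.path.abspath(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune: do not descend any further
        found.extend(os.path.join(dirpath, f) for f in filenames if is_fastx(f))
    return sorted(found)


def assign_round_robin(paths, n_workers):
    """Spread paths over n_workers buckets, one at a time."""
    buckets = [[] for _ in range(n_workers)]
    for i, path in enumerate(paths):
        buckets[i % n_workers].append(path)
    return buckets


print(assign_round_robin(["a.fq", "b.fq", "c.fq", "d.fq", "e.fq"], 2))
```

Round-robin ignores file size, so a real scheduler might instead hand out files from a shared queue or sort them by size first; the sketch only shows the overall shape of the approach.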
# Encode all FASTX files as BQ
# Encode all paired FASTX files as VBQ and index their output
# Encode recursively with a max-subdirectory depth of 2
Decoding
Convert BINSEQ files back to FASTA/FASTQ/TSV:
# Decode to FASTQ (default)
# Decode to compressed FASTQ (gzip/zstd)
# Decode to FASTA
# Decode paired-end reads into separate files
# Creates output_R1.fastq and output_R2.fastq
# Specify which read of a pair to output
# Specify output format
Concatenating
Combine multiple BINSEQ files:
Counting
Count records in a BINSEQ file:
Grep
You can easily search for specific subsequences or regular expressions within BINSEQ files:
By default, multiple patterns are combined with AND logic (i.e. all patterns must match).
This can be changed to OR logic (i.e. at least one pattern must match) with the --or-logic option.
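The difference between the two modes can be shown with a short Python sketch. The `matches` function and its `or_logic` parameter are hypothetical names that mirror the description above; Python's `re` syntax may also differ in details from the regex engine bqtools uses.

```python
import re


def matches(seq, patterns, or_logic=False):
    """Return True if seq satisfies the patterns under AND (default) or OR logic."""
    hits = [re.search(p, seq) is not None for p in patterns]
    return any(hits) if or_logic else all(hits)


seq = "ACGTACGT"
patterns = ["ACG", "TTT"]  # the first matches, the second does not

print(matches(seq, patterns))                 # AND: every pattern must match
print(matches(seq, patterns, or_logic=True))  # OR: one match is enough
```

With AND logic, adding patterns narrows the result set; with OR logic, adding patterns widens it.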
# See full options list
# Search for a specific regex in either sequence
# Search for a specific subsequence (in primary sequence)
# Search for a regular expression (in extended)
# Search for multiple regular expressions in either
# Search for multiple regular expressions (OR-logic)