bqtools
A command-line utility for working with BINSEQ files.
Overview
bqtools provides tools to encode, decode, manipulate, and analyze BINSEQ files.
It supports all BINSEQ variants (*.bq, *.cbq, *.vbq) and makes use of the binseq library.
BINSEQ is a binary file format family designed for high-performance processing of DNA sequences. It currently has three variants: BQ, VBQ, and CBQ.
- BQ (*.bq): Optimized for fixed-length DNA sequences without quality scores (2bit/4bit).
- VBQ (*.vbq): Optimized for variable-length DNA sequences with optional quality scores and headers (2bit/4bit).
- CBQ (*.cbq): Optimized for variable-length DNA sequences with optional quality scores and headers (2bit + N).
All support single and paired sequences and make use of two-bit or four-bit encoding for efficient nucleotide packing using bitnuc and efficient parallel FASTX processing using paraseq.
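To make the idea of two-bit nucleotide packing concrete, here is a minimal Python sketch. The bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for illustration; the actual layout used by bitnuc and BINSEQ may differ.

```python
# Hypothetical 2-bit mapping for illustration; bitnuc's real layout may differ.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack_2bit(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base (first base in the lowest bits)."""
    packed = 0
    for i, base in enumerate(seq):
        # Raises KeyError for anything outside ACGT (for example N).
        packed |= ENCODE[base] << (2 * i)
    return packed


def unpack_2bit(packed: int, length: int) -> str:
    """Recover `length` bases from a packed integer."""
    return "".join(DECODE[(packed >> (2 * i)) & 0b11] for i in range(length))


if __name__ == "__main__":
    read = "GATTACA"
    packed = pack_2bit(read)
    print(f"{read}: {2 * len(read)} bits packed vs {8 * len(read)} bits as ASCII")
    assert unpack_2bit(packed, len(read)) == read
```

The sequence length has to be stored alongside the packed value, because trailing `A` bases are indistinguishable from padding zeros. A two-bit alphabet also has no room for `N`, which is why a policy for non-ATCG bases is needed when encoding with 2-bit.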
For more information about BINSEQ, see our preprint where we describe the format family and its applications.
Description of variants
TL;DR:
*.cbq is the recommended format for most applications.
For most applications the BINSEQ variant of choice is *.cbq.
This format is lossless by default and supports variable-length sequences.
It achieves better compression than *.vbq and *.bq by using blocked-columnar compression of sequence attributes.
It can optionally exclude quality scores and headers (but they are included by default).
For an overview of the format check out the BINSEQ docs.
If your application only requires sequences and has fixed-length reads then *.bq is the best choice.
It is the fastest variant but is lossy by design.
Note:
*.vbq was originally designed for variable-length sequences with quality scores and headers, but it is now deprecated in favor of *.cbq, which is more compressible, lossless, and faster to decode.
Features
- Encode: Convert FASTA or FASTQ files to a BINSEQ format
- Decode: Convert a BINSEQ file back to FASTA, FASTQ, or TSV format
- Cat: Concatenate multiple BINSEQ files
- Info: Show information and statistics about a BINSEQ file.
- Grep: Search for fixed-string, regex, or fuzzy matches in BINSEQ files.
- Pipe: Create named-pipes for efficient data processing with legacy tools that don't support BINSEQ.
Installation
From Cargo
bqtools can be installed using cargo, the Rust package manager:
To install cargo you can follow the instructions on the official Rust website.
From Source
# Clone the repository
# Install
# Check installation
Feature Flags
bqtools supports the following feature flags:
- htslib: Enable support for reading SAM/BAM/CRAM files using the htslib library (default).
- gcs: Enable support for reading Google Cloud Storage files (default).
- fuzzy: Enable fuzzy matching in the grep command using the sassy library.
To enable fuzzy matching, bqtools must be compiled using a native target cpu:
# Install from source
# Or install from crates but enforce native target cpu
To selectively enable/disable feature flags:
# (for fuzzy matching support sassy requires native target cpu)
# Install bqtools without htslib/gcs but with fuzzy matching
# Install bqtools without htslib but with fuzzy matching and gcs
Usage
# Get help information
# Get help for specific commands
Encoding
bqtools accepts input from stdin or from file paths.
It will auto-determine the input format and compression status.
Convert FASTA/FASTQ files to BINSEQ:
# Encode a single file to bq
# Encode a single file to vbq
# Encode a single file to vbq with 4bit encoding
# Encode a file stream to bq (auto-determine input format and compression status)
# Encode paired-end reads
# Encode paired-end reads to vbq
# Encode a SAM/BAM/CRAM file to BINSEQ
# Encode a paired-end CRAM file to BINSEQ (sorted by read name)
# Specify a policy for handling non-ATCG nucleotides (2-bit only)
# Set threads for parallel processing
# Include sequencing headers in the encoding (unused by .bq)
# Encode with ARCHIVE mode (useful for genomes, cDNA libraries, and larger sequences)
# where there are common Ns, large sequence sizes, and headers are important
Available policies for handling non-ATCG nucleotides:
- i: Ignore sequences with non-ATCG characters
- p: Break on invalid sequences
- r: Randomly draw a nucleotide for each N (default)
- a: Set all Ns to A
- c: Set all Ns to C
- g: Set all Ns to G
- t: Set all Ns to T
Note: These are only applied when encoding with 2-bit.
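The policies above can be sketched in Python as follows. This is an illustration of the described behavior, not the bqtools implementation: the function name `apply_policy` is hypothetical, and for simplicity the sketch treats every non-ACGT character as an N.

```python
import random

VALID = set("ACGT")


def apply_policy(seq, policy="r", rng=None):
    """Return a 2-bit-encodable sequence, or None if the record should be skipped.

    Hypothetical helper; treats every non-ACGT character like an N.
    """
    if set(seq) <= VALID:
        return seq  # nothing to fix
    if policy == "i":
        return None  # ignore (skip) this sequence
    if policy == "p":
        raise ValueError(f"invalid nucleotide in sequence: {seq!r}")  # break
    if policy == "r":
        rng = rng or random.Random()
        return "".join(b if b in VALID else rng.choice("ACGT") for b in seq)
    if policy in ("a", "c", "g", "t"):
        fill = policy.upper()
        return "".join(b if b in VALID else fill for b in seq)
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    for policy in ("i", "r", "a", "c", "g", "t"):
        print(policy, apply_policy("ACNNGT", policy, rng=random.Random(0)))
```

Every policy except `i` and `p` changes the stored bases, so 2-bit encoding of reads containing Ns cannot be reversed exactly.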
Encoding multiple files at the same time
Encoding FASTX files into BINSEQ is often IO-bound per-file and won't benefit much from parallelism.
However, file-level parallelism is still possible.
bqtools provides some options for making use of file-level parallelism by encoding into separate BINSEQ files or encoding many FASTX files into a single BINSEQ file.
bqtools will automatically find the pairs in the input files and respect pairing if the --paired flag is used.
To encode everything into a single BINSEQ file you can use the --collate flag.
# encodes all FASTX files into separate BINSEQ files
# encodes all paired FASTX files into separated paired-BINSEQ files
# encodes all FASTX files into a single BINSEQ file
# encodes all FASTX files into a single paired-BINSEQ file
Recursive Encoding
You might have a directory or nested subdirectories with multiple FASTX files or FASTX file pairs.
bqtools makes use of the efficient walkdir crate to recursively identify all FASTX files with various compression formats.
It will then balance the provided file/file pairs among the thread pool to ensure efficient parallel encoding.
All options provided by bqtools encode will be passed through to the sub-encoders.
# Encode all FASTX files as BQ
# Encode all paired FASTX files as VBQ and index their output
# Encode recursively with a max-subdirectory depth of 2
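As a rough illustration of recursive discovery with a depth limit, the Python sketch below walks a directory tree, keeps files that look like FASTX, and splits them round-robin across workers. The suffix lists, the depth convention (the root directory counts as depth 0), and the function names are assumptions for this example; bqtools itself uses the walkdir crate and may count depth and balance work differently.

```python
import os
from pathlib import Path

# Assumed suffixes for illustration; bqtools may recognize a different set.
FASTX_SUFFIXES = (".fa", ".fasta", ".fq", ".fastq")
COMPRESSION_SUFFIXES = ("", ".gz", ".zst")


def is_fastx(name: str) -> bool:
    """True if the file name ends with a FASTX suffix, optionally compressed."""
    name = name.lower()
    return any(name.endswith(s + c) for s in FASTX_SUFFIXES for c in COMPRESSION_SUFFIXES)


def find_fastx(root, max_depth=None):
    """Collect FASTX files under `root`, descending at most `max_depth` subdirectory levels."""
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)  # root itself is depth 0
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune: do not descend any further
        found.extend(Path(dirpath) / f for f in filenames if is_fastx(f))
    return sorted(found)


def balance(files, n_workers):
    """Split files round-robin so each worker gets a similar number of inputs."""
    return [files[i::n_workers] for i in range(n_workers)]
```

Round-robin assignment ignores file size; a scheduler that hands out files from a shared queue would balance uneven inputs better.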
Decoding
Convert BINSEQ files back to FASTA/FASTQ/TSV:
# Decode to FASTQ (default)
# Decode to compressed FASTQ (gzip/zstd)
# Decode to FASTA
# Decode paired-end reads into separate files
# Creates output_R1.fastq and output_R2.fastq
# Specify which read of a pair to output
# Specify output format
Concatenating
Combine multiple BINSEQ files:
Information and Statistics
Show information and statistics about a BINSEQ file.
# print out the VBQ index
# print out the CBQ block headers
Grep
You can easily search for specific subsequences or regular expressions within BINSEQ files:
By default the multiple pattern logic is AND (i.e. all patterns must match).
The logic can be changed to OR (i.e. any pattern must match) with the --or-logic option.
# See full options list
# Search for a specific regex in either sequence
# Search for a specific subsequence (in primary sequence)
# Search for a regular expression (in extended)
# Search for multiple regular expressions in either
# Search for multiple regular expressions (OR-logic)
# Only search for patterns within a specified range per sequence (basepairs 30-80)
# Only search for patterns within a specified range per sequence (basepairs 0-80)
# Only search for patterns within a specified range per sequence (basepairs 80-max)
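The difference between the default AND logic and `--or-logic` can be sketched in a few lines of Python. The function `record_matches` is a hypothetical stand-in that searches each pattern in either the primary or the extended sequence; it is not how bqtools is implemented.

```python
import re


def record_matches(primary, extended, patterns, or_logic=False):
    """Check one record against several regex patterns.

    A pattern counts as matched if it is found in the primary or the extended
    sequence. AND logic (default) requires every pattern to match; OR logic
    requires at least one.
    """
    hits = []
    for pattern in patterns:
        rx = re.compile(pattern)
        in_primary = rx.search(primary) is not None
        in_extended = extended is not None and rx.search(extended) is not None
        hits.append(in_primary or in_extended)
    return any(hits) if or_logic else all(hits)


if __name__ == "__main__":
    primary, extended = "ACGTTTGA", "GGGCCCAT"
    print(record_matches(primary, extended, ["ACGT", "CCC"]))         # AND: both found
    print(record_matches(primary, extended, ["ACGT", "AAAA"]))        # AND: one missing
    print(record_matches(primary, extended, ["ACGT", "AAAA"], True))  # OR: one is enough
```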
bqtools also supports fuzzy matching by making use of sassy.
This requires installing using the fuzzy feature flag (see installation above):
# Run grep with fuzzy matching (-z)
# Run fuzzy matching with an edit distance of 2
# Run fuzzy matching but only write inexact matches
bqtools can also handle a large collection of patterns which can be provided on the CLI as a file.
You can provide files for either primary/extended, just primary, or just extended patterns with the relevant flags.
Notably this will match solely with OR logic.
This can also be used with fuzzy matching and with the pattern counting described below.
Regex is also fully supported and files can be additionally paired with CLI arguments.
If your patterns are all fixed strings (and not regex), you can improve performance by using the -x/--fixed flag.
This will use the more efficient Aho-Corasick algorithm to match patterns.
# Run grep with patterns from a file
# Run grep with patterns from a file (primary)
# Run grep with patterns from a file (extended)
# Run grep with fixed-string patterns from a file
You can count the number of matching records with -C or get the fraction of matching records with --frac:
# Count the number of matching records
# Count matching records and show fraction of total
The output of --frac is a TSV with three columns: [Count, Total, Fraction]
bqtools also introduces a new feature for counting the occurrences of individual patterns.
This is useful for seeing how many times each pattern occurs across a sequencing dataset without having to iterate over the dataset multiple times using traditional methods.
Some important notes are:
- A pattern will only be counted once across a sequencing record (primary and secondary)
- A sequencing record may contribute to multiple pattern occurrences
- Providing multiple patterns will match records with OR logic (this is different behavior from the bqtools grep default, which uses AND logic when multiple patterns are provided)
- Regular expressions are supported and treated as a single pattern (e.g. ACGT|TCGA will return a single output row but match on both ACGT and TCGA).
- Invert is supported for counting patterns and will return the number of records a pattern does not occur in.
If your patterns are all fixed strings (and not regex), you can improve performance by using the -x/--fixed flag.
This will use the more efficient Aho-Corasick algorithm to match patterns.
The throughput gains for this can be massive for pattern counting, especially when dealing with high numbers of patterns.
# Count the number of occurrences for each of three expressions
# Count the number of occurrences for each of three patterns with fuzzy matching
# Count the number of records a pattern does not occur in
# Count the number of occurrences for each pattern from a file
# Count the number of occurrences for each pattern from a file (fixed strings)
The output of pattern count is a TSV with three columns: [Pattern, Count, Fraction of Total]
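The counting rules described above (one count per pattern per record, OR logic across patterns, a regex treated as a single pattern) can be sketched in Python in a single pass over the records. The function name `count_patterns`, the record representation, and the choice of total record count as the denominator are assumptions for illustration.

```python
import re


def count_patterns(records, patterns):
    """Count, for each pattern, how many records contain it at least once.

    `records` is an iterable of (primary, extended) pairs; `extended` may be None.
    Returns rows of (pattern, count, fraction of all records).
    """
    compiled = [(p, re.compile(p)) for p in patterns]
    counts = {p: 0 for p in patterns}
    total = 0
    for primary, extended in records:
        total += 1
        for pattern, rx in compiled:
            # A record adds at most 1 per pattern, however often the pattern occurs.
            if rx.search(primary) or (extended and rx.search(extended)):
                counts[pattern] += 1
    return [(p, c, c / total if total else 0.0) for p, c in counts.items()]


if __name__ == "__main__":
    records = [("ACGTACGT", "TTTT"), ("GGGG", "ACGT"), ("CCCC", None)]
    for pattern, count, fraction in count_patterns(records, ["ACGT", "TTTT|GGGG", "AAAA"]):
        print(f"{pattern}\t{count}\t{fraction:.4f}")
```

All patterns are evaluated while each record is in hand, so the dataset is read once regardless of how many patterns are supplied.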
Pipe
Stream BINSEQ data to legacy tools through named pipes for parallel processing.
Because BINSEQ is a new format, many tools don't support it yet.
bqtools pipe creates a server that splits a BINSEQ file into multiple named pipes,
enabling parallel processing with tools that expect FASTQ/FASTA files.
Importantly, if your tool supports multiple parallel threads (i.e. parallelizes input files), you can make use of this feature to significantly improve performance.
# Create 4 named pipes (8 files for paired-end data, 4 files for single-end data)
# Pipes (single): fifo_[1234].fq
# Pipes (paired): fifo_[0123]_R[12].fq
# Process in parallel with tools that don't support BINSEQ
Key features:
- Each pipe streams a portion of the BINSEQ file sequentially
- No disk I/O for intermediate files - data flows through memory
- Automatic paired-end handling (_R1/_R2 pairs)
- Blocks until all pipes are fully read (prevents data loss)
- Auto-scales to CPU count with -p0 (default)
- Pipes can be read sequentially or in parallel without blocking.
Note: This feature is not available on Windows.
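For readers unfamiliar with named pipes, the Python sketch below shows the general mechanism on a POSIX system: a server creates several FIFOs, streams a contiguous portion of the records into each, and consumers read them in parallel without any intermediate files on disk. The pipe names (`part_0.fq`, ...), the function names, and the chunking strategy are hypothetical and do not describe how bqtools pipe is implemented.

```python
import os
import tempfile
import threading


def serve_pipes(records, n_pipes, directory):
    """Create `n_pipes` FIFOs and stream a contiguous chunk of records into each."""
    paths = [os.path.join(directory, f"part_{i}.fq") for i in range(n_pipes)]
    for path in paths:
        os.mkfifo(path)  # POSIX only
    chunk = -(-len(records) // n_pipes)  # ceiling division

    def write(path, part):
        # open() blocks until a reader opens the same pipe.
        with open(path, "w") as fifo:
            fifo.writelines(part)

    writers = [
        threading.Thread(target=write, args=(p, records[i * chunk:(i + 1) * chunk]))
        for i, p in enumerate(paths)
    ]
    for w in writers:
        w.start()
    return paths, writers


def read_all(paths):
    """Stand-in for downstream tools: read every pipe in its own thread."""
    results = [None] * len(paths)

    def read(i, path):
        with open(path) as fifo:
            results[i] = fifo.read()

    readers = [threading.Thread(target=read, args=(i, p)) for i, p in enumerate(paths)]
    for r in readers:
        r.start()
    for r in readers:
        r.join()
    return results


if __name__ == "__main__":
    fastq = [f"@read{i}\nACGT\n+\nIIII\n" for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        paths, writers = serve_pipes(fastq, 4, tmp)
        chunks = read_all(paths)
        for w in writers:
            w.join()
    print([c.count("@read") for c in chunks])
```

Each writer blocks until its pipe has a reader and finishes only once its data has been consumed, which mirrors the "blocks until all pipes are fully read" behavior listed above.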