§BINSEQ Format Specification
§Overview
BINSEQ is a family of binary file formats designed for efficient storage and processing of DNA sequences. The formats use two-bit encoding for nucleotides and are optimized for high-performance parallel processing.
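To make the two-bit idea concrete, here is a minimal sketch that packs up to 32 nucleotides into a single `u64`. The mapping (A=0, C=1, G=2, T=3), the bit order, and the function names `encode_base`, `pack_2bit`, and `unpack_2bit` are assumptions chosen for illustration. They are not taken from `bitnuc` or from the BINSEQ specification.

```rust
/// Map one nucleotide to a 2-bit code.
/// The mapping A=0, C=1, G=2, T=3 is an assumption for this sketch.
fn encode_base(base: u8) -> Option<u64> {
    match base {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None, // e.g. 'N' has no 2-bit representation
    }
}

/// Pack up to 32 nucleotides into one u64, first base in the lowest bits.
/// Returns `None` for sequences longer than 32 bases or with invalid bases.
fn pack_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        packed |= encode_base(base)? << (2 * i);
    }
    Some(packed)
}

/// Unpack `len` nucleotides (capped at 32) from a packed u64.
fn unpack_2bit(packed: u64, len: usize) -> Vec<u8> {
    (0..len.min(32))
        .map(|i| b"ACGT"[((packed >> (2 * i)) & 0b11) as usize])
        .collect()
}
```

Under these assumptions a 32-base read occupies 8 bytes instead of 32, and any base outside A/C/G/T has to be handled before packing.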
BINSEQ has three variants:
- BQ (*.bq): fixed-length records without quality scores.
- VBQ (*.vbq): variable-length records with optional quality scores and headers.
- CBQ (*.cbq): columnar variable-length records with optional quality scores and headers.
All variants support both single and paired sequences.
Note: For most use cases, CBQ, the newest variant, is recommended for its flexibility, storage efficiency, and decoding speed. It improves on VBQ in both performance and storage efficiency, at a small cost in encoding speed. VBQ remains supported, but new projects should prefer CBQ. For information on the structure of CBQ files, see the documentation.
§Getting Started
This is a library for reading and writing BINSEQ files. For a command-line interface, see bqtools.
To get started, please refer to our documentation. For example programs that use the library, see our examples directory.
For more information about the BINSEQ file family, please refer to our preprint.
§BINSEQ
The binseq library provides efficient APIs for working with the BINSEQ file format family.
It offers methods to read and write BINSEQ files, providing:
- Compact multi-bit encoding and decoding of nucleotide sequences through bitnuc
- Support for both single and paired-end sequences
- Abstract BinseqRecord trait for representing records from all variants
- Abstract BinseqReader enum for processing records from all variants
- Abstract BinseqWriter enum for writing records to all variants
- Parallel processing capabilities for arbitrary tasks through the ParallelProcessor trait
- Configurable Policy for handling invalid nucleotides (BQ/VBQ only; CBQ natively supports N nucleotides)
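As a rough illustration of what a policy for invalid nucleotides can look like, the sketch below either drops a sequence or substitutes a fixed base before two-bit encoding. `SketchPolicy`, `is_valid`, and `apply_policy` are hypothetical names invented for this example. They are not the library's `Policy` type or its variants.

```rust
/// Hypothetical policy enum for this sketch; NOT the library's `Policy` type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SketchPolicy {
    /// Drop the whole sequence if it contains an invalid base.
    SkipSequence,
    /// Replace each invalid base with the given valid base.
    ReplaceWith(u8),
}

/// A base is valid for 2-bit encoding only if it is A, C, G, or T.
fn is_valid(base: u8) -> bool {
    matches!(base, b'A' | b'C' | b'G' | b'T')
}

/// Returns the sequence to encode, or `None` when it should be skipped.
fn apply_policy(seq: &[u8], policy: SketchPolicy) -> Option<Vec<u8>> {
    // Fast path: nothing to fix.
    if seq.iter().copied().all(is_valid) {
        return Some(seq.to_vec());
    }
    match policy {
        SketchPolicy::SkipSequence => None,
        SketchPolicy::ReplaceWith(fill) => Some(
            seq.iter()
                .map(|&b| if is_valid(b) { b } else { fill })
                .collect(),
        ),
    }
}
```

The choice matters downstream: skipping changes record counts, while substitution keeps counts stable but alters the stored bases.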
§Recent additions (v0.9.0):
§New variant: CBQ
CBQ is a new variant of BINSEQ that addresses many of VBQ's pain points.
The CBQ format is a columnar-block-based format that offers improved compression and faster processing speeds compared to VBQ.
It natively supports N nucleotides and avoids the need for additional 4-bit encoding.
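For readers unfamiliar with columnar storage, the sketch below contrasts a row-oriented record with a column-oriented block in which each field of every record is stored contiguously. All type names, field names, and the layout are hypothetical and meant only to illustrate the general idea. The actual CBQ block structure is defined in the CBQ documentation and may differ substantially.

```rust
/// Row-oriented view: one struct per record.
struct RowRecord {
    seq: Vec<u8>,
    qual: Vec<u8>,
}

/// Column-oriented block: each field of every record is stored contiguously.
/// Field names and layout are hypothetical, not the CBQ on-disk format.
struct ColumnBlock {
    lengths: Vec<usize>,
    seqs: Vec<u8>,
    quals: Vec<u8>,
}

/// Transpose a slice of rows into a single columnar block.
fn to_columns(rows: &[RowRecord]) -> ColumnBlock {
    let mut block = ColumnBlock {
        lengths: Vec::new(),
        seqs: Vec::new(),
        quals: Vec::new(),
    };
    for row in rows {
        block.lengths.push(row.seq.len());
        block.seqs.extend_from_slice(&row.seq);
        block.quals.extend_from_slice(&row.qual);
    }
    block
}

/// Recover the sequence of record `idx` by summing the preceding lengths.
fn sequence_at(block: &ColumnBlock, idx: usize) -> Option<&[u8]> {
    let len = *block.lengths.get(idx)?;
    let start: usize = block.lengths[..idx].iter().sum();
    block.seqs.get(start..start + len)
}
```

Grouping similar data together like this tends to help general-purpose compressors, and it lets a reader that needs only sequences skip the quality column entirely.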
§Improved interface for writing records
BinseqWriter provides a unified interface for writing records generically to BINSEQ files.
It is built on the new SequencingRecord, which provides a cleaner builder API for writing records to BINSEQ files.
§Recent VBQ Format Changes (v0.7.0+)
The VBQ format has undergone significant improvements:
- Embedded Index: VBQ files now contain their index data embedded at the end of the file, improving portability.
- Headers Support: Optional sequence identifiers/headers can be stored with each record.
- Extended Capacity: u64 indexing supports files with more than 4 billion records.
- Multi-bit Encoding: Support for both 2-bit and 4-bit nucleotide encodings.
Legacy VBQ files are automatically migrated to the new format when accessed.
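The capacity and encoding points above come down to simple arithmetic, sketched here. The helper names `packed_bytes` and `fits_in_u32` are hypothetical. The comparison against `u32::MAX` assumes the earlier limit came from 32-bit indexing, which the text implies but does not state.

```rust
/// Bytes needed to store `n_bases` nucleotides at `bits_per_base` bits each,
/// rounded up to whole bytes.
fn packed_bytes(n_bases: u64, bits_per_base: u64) -> u64 {
    (n_bases * bits_per_base + 7) / 8
}

/// Whether a record count can be addressed with a 32-bit index.
/// u32::MAX is 4_294_967_295, i.e. roughly 4.29 billion.
fn fits_in_u32(n_records: u64) -> bool {
    n_records <= u32::MAX as u64
}
```

For a 150-base read, `packed_bytes(150, 2)` is 38 bytes and `packed_bytes(150, 4)` is 75 bytes, so under this simple model the 4-bit encoding roughly doubles sequence storage in exchange for a larger alphabet.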
§Example: Memory-mapped Access
use binseq::Result;
use binseq::prelude::*;

#[derive(Clone, Default)]
pub struct Processor {
    // Define fields here
}

impl ParallelProcessor for Processor {
    fn process_record<B: BinseqRecord>(&mut self, record: B) -> Result<()> {
        // Implement per-record logic here
        Ok(())
    }

    fn on_batch_complete(&mut self) -> Result<()> {
        // Implement per-batch logic here
        Ok(())
    }
}

fn main() -> Result<()> {
    // provide an input path (*.bq or *.vbq)
    let path = "./data/subset.bq";
    // open a reader
    let reader = BinseqReader::new(path)?;
    // initialize a processor
    let processor = Processor::default();
    // process the records in parallel with 8 threads
    reader.process_parallel(processor, 8)?;
    Ok(())
}
Re-exports§
- pub use error::Error;
- pub use error::IntoBinseqError;
- pub use error::Result;
- pub use write::BinseqWriter;
- pub use write::BinseqWriterBuilder;
Modules§
- bq: BQ - fixed length records, no quality scores
- cbq: CBQ - Columnar variable length records, optional quality scores and headers
- error: Error definitions
- prelude: Prelude - Commonly used types and traits
- utils: Utilities for working with BINSEQ files
- vbq: VBQ - Variable length records, optional quality scores, compressed blocks
- write: Unified writer interface for BINSEQ formats; write operations are generic over the BINSEQ variant
Structs§
- SequencingRecord: A zero-copy record used to write sequences to binary sequence files.
- SequencingRecordBuilder: A convenience builder struct for creating a SequencingRecord
Enums§
- BinseqReader: An enum abstraction for BINSEQ readers that can process records in parallel
- BitSize: Re-export of bitnuc::BitSize
- Policy: Policy for handling invalid nucleotide sequences during encoding
Constants§
- RNG_SEED: A global seed for the random number generator used in randomized policies
Traits§
- BinseqRecord: Record trait shared between BINSEQ variants.
- ParallelProcessor: Trait for types that can process records in parallel.
- ParallelReader: Trait for BINSEQ readers that can process records in parallel