§BINSEQ Format Specification
§Overview
BINSEQ is a family of binary file formats designed for efficient storage and processing of DNA sequences. The formats use two-bit encoding for nucleotides and are optimized for high-performance parallel processing.
BINSEQ currently has two flavors:
- BQ (*.bq): files for fixed-length records without quality scores.
- VBQ (*.vbq): files for variable-length records with optional quality scores.
Both flavors support single and paired sequences.
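To make the two-bit encoding mentioned above concrete, here is a minimal, std-only sketch. It is not the bitnuc implementation: the function names `encode_2bit` and `decode_2bit`, the mapping A=00, C=01, G=10, T=11, and the choice to pack 32 bases per `u64` starting from the lowest bits are all assumptions made for illustration.

```rust
/// Pack an ASCII nucleotide sequence into 2-bit codes, 32 bases per u64.
/// Assumed mapping for this sketch: A=0b00, C=0b01, G=0b10, T=0b11.
/// Base `i` occupies bits `2*(i % 32)` and up of word `i / 32`.
/// Returns None if any byte is not A, C, G, or T (either case).
pub fn encode_2bit(seq: &[u8]) -> Option<Vec<u64>> {
    let mut words = vec![0u64; (seq.len() + 31) / 32];
    for (i, &base) in seq.iter().enumerate() {
        let code: u64 = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None, // invalid nucleotide, e.g. N
        };
        words[i / 32] |= code << (2 * (i % 32));
    }
    Some(words)
}

/// Unpack `len` bases from 2-bit codes back into uppercase ASCII.
/// Panics if `words` holds fewer than `len` bases.
pub fn decode_2bit(words: &[u64], len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| BASES[((words[i / 32] >> (2 * (i % 32))) & 0b11) as usize])
        .collect()
}
```

Under these assumptions a sequence of n bases needs ceil(n / 32) 64-bit words, roughly a quarter of its ASCII size. The sequence length has to be stored separately, because trailing zero bits are indistinguishable from a run of `A`.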
§Getting Started
This is a library for reading and writing BINSEQ files. For a command-line interface, see bqtools.
To get started, please refer to our documentation. For example programs that use the library, check out our examples directory.
For more information about the BINSEQ file family, please refer to our preprint.
§BINSEQ
The binseq library provides efficient APIs for working with the BINSEQ file format family.
It offers methods to read and write BINSEQ files, providing:
- Compact 2-bit encoding and decoding of nucleotide sequences through bitnuc
- Memory-mapped file access for efficient reading (bq::MmapReader and vbq::MmapReader)
- Parallel processing capabilities for arbitrary tasks through the ParallelProcessor trait
- Configurable Policy for handling invalid nucleotides
- Support for both single and paired-end sequences
- Abstract BinseqRecord trait for representing records from both .bq and .vbq files
- Abstract BinseqReader enum for processing records from both .bq and .vbq files
§Crate Organization
This library is split into three major parts.
The bq and vbq modules provide tools for reading and writing BQ and VBQ files, respectively.
The traits and utilities used throughout the library are available at the top level of the crate.
§Example: Memory-mapped Access
use binseq::Result;
use binseq::prelude::*;

#[derive(Clone, Default)]
pub struct Processor {
    // Define fields here
}

impl ParallelProcessor for Processor {
    fn process_record<B: BinseqRecord>(&mut self, record: B) -> Result<()> {
        // Implement per-record logic here
        Ok(())
    }

    fn on_batch_complete(&mut self) -> Result<()> {
        // Implement per-batch logic here
        Ok(())
    }
}

fn main() -> Result<()> {
    // provide an input path (*.bq or *.vbq)
    let path = "./data/subset.bq";

    // open a reader
    let reader = BinseqReader::new(path)?;

    // initialize a processor
    let processor = Processor::default();

    // process the records in parallel with 8 threads
    reader.process_parallel(processor, 8)?;

    Ok(())
}
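
The example above leaves the processor body empty. One common way to collect a result across threads is to keep a cheap per-thread tally and fold it into shared state when a batch completes. The following std-only sketch illustrates that pattern under the assumption that each worker thread operates on its own clone of the processor. It does not use the binseq API: `RecordProcessor`, `BaseCounter`, and `run_parallel` are hypothetical stand-ins, and records are simplified to raw byte vectors.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a record-processing trait (NOT the binseq trait).
pub trait RecordProcessor: Clone + Send + 'static {
    fn process_record(&mut self, record: &[u8]);
    fn on_batch_complete(&mut self);
}

/// Counts bases: `local` is private to each clone, `total` is shared by all.
#[derive(Clone, Default)]
pub struct BaseCounter {
    local: usize,
    total: Arc<AtomicUsize>,
}

impl BaseCounter {
    pub fn total(&self) -> usize {
        self.total.load(Ordering::Relaxed)
    }
}

impl RecordProcessor for BaseCounter {
    fn process_record(&mut self, record: &[u8]) {
        // Hot path: plain integer add, no synchronization.
        self.local += record.len();
    }

    fn on_batch_complete(&mut self) {
        // Publish the per-thread tally once per batch, then reset it.
        self.total.fetch_add(self.local, Ordering::Relaxed);
        self.local = 0;
    }
}

/// Toy driver (an assumption, not the library's scheduler): splits the
/// records into contiguous batches and hands each batch to one thread.
pub fn run_parallel<P: RecordProcessor>(records: Vec<Vec<u8>>, processor: P, n_threads: usize) {
    let n_threads = n_threads.max(1);
    let batch_size = ((records.len() + n_threads - 1) / n_threads).max(1);
    let handles: Vec<_> = records
        .chunks(batch_size)
        .map(|batch| {
            let batch = batch.to_vec();
            let mut worker = processor.clone(); // each thread gets its own clone
            thread::spawn(move || {
                for record in &batch {
                    worker.process_record(record);
                }
                worker.on_batch_complete();
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
}
```

Because every clone shares the same `Arc<AtomicUsize>`, the caller can keep one clone, pass another to the driver, and read the combined count from the retained clone after the workers have finished.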
Modules§
- bq: BQ - fixed length records, no quality scores
- error: Error definitions
- prelude: Prelude - Commonly used types and traits
- vbq: VBQ - Variable length records, optional quality scores, compressed blocks
Enums§
- BinseqReader: An enum abstraction for BINSEQ readers that can process records in parallel
- Policy: Policy for handling invalid nucleotide sequences during encoding
Constants§
- RNG_SEED: A global seed for the random number generator used in randomized policies
Traits§
- BinseqRecord: Record trait shared between BINSEQ variants.
- ParallelProcessor: Trait for types that can process records in parallel.
- ParallelReader: Trait for BINSEQ readers that can process records in parallel.