Crate binseq

§BINSEQ Format Specification

§Overview

BINSEQ is a family of binary file formats designed for efficient storage and processing of DNA sequences. The formats use two-bit encoding for nucleotides and are optimized for high-performance parallel processing.
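
To make the two-bit idea concrete, here is a minimal, standalone sketch of packing nucleotides into a 64-bit word. It does not use bitnuc or binseq. The function names are hypothetical, and the bit assignment (A=0, C=1, G=2, T=3, first base in the lowest bits) is an assumption made for illustration rather than a statement about the on-disk layout.

```rust
/// Map one nucleotide to a 2-bit code. The A=0, C=1, G=2, T=3 assignment
/// is an assumption of this sketch, not a documented property of BINSEQ.
fn base_to_bits(base: u8) -> Option<u64> {
    match base {
        b'A' | b'a' => Some(0b00),
        b'C' | b'c' => Some(0b01),
        b'G' | b'g' => Some(0b10),
        b'T' | b't' => Some(0b11),
        _ => None, // e.g. N has no representation in two bits
    }
}

/// Pack up to 32 nucleotides into one u64, first base in the lowest bits.
/// Returns None for longer inputs or for any base outside A/C/G/T.
fn encode_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        packed |= base_to_bits(base)? << (2 * i);
    }
    Some(packed)
}

/// Unpack `len` nucleotides (at most 32) from a packed u64.
fn decode_2bit(packed: u64, len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..len.min(32))
        .map(|i| BASES[((packed >> (2 * i)) & 0b11) as usize])
        .collect()
}
```

Four bases fit in a single byte under this scheme, which is where the storage savings over one-byte-per-base text come from. The sequence length has to be stored separately, because trailing A bases are indistinguishable from padding.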

BINSEQ has three variants:

  1. BQ: (*.bq) files are for fixed-length records without quality scores.
  2. VBQ: (*.vbq) files are for variable-length records with optional quality scores and headers.
  3. CBQ: (*.cbq) files are for columnar variable-length records with optional quality scores and headers.

All variants support both single and paired sequences.

Note: For most use cases, the newest variant, CBQ, is recommended due to its flexibility, storage efficiency, and decoding speed. It supersedes VBQ in performance and storage efficiency, at a small cost in encoding speed. VBQ will still be supported, but new projects should consider using CBQ instead. For information on the structure of CBQ files, see the documentation.

§Getting Started

This is a library for reading and writing BINSEQ files. For a command-line interface, see bqtools.

To get started, please refer to our documentation. For example programs that use the library, check out our examples directory.

For more information about the BINSEQ file family, please refer to our preprint.

§BINSEQ

The binseq library provides efficient APIs for working with the BINSEQ file format family.

It offers methods to read and write BINSEQ files, providing:

  • Compact multi-bit encoding and decoding of nucleotide sequences through bitnuc
  • Support for both single and paired-end sequences
  • Abstract BinseqRecord trait for representing records from all variants
  • Abstract BinseqReader enum for processing records from all variants
  • Abstract BinseqWriter enum for writing records to all variants
  • Parallel processing capabilities for arbitrary tasks through the ParallelProcessor trait
  • Configurable Policy for handling invalid nucleotides (applies to BQ and VBQ; CBQ natively supports N nucleotides)
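
As an illustration of what an invalid-nucleotide policy has to decide, the sketch below applies three possible strategies to a sequence containing a base outside A/C/G/T. The enum, its variant names, and the `sanitize` function are hypothetical stand-ins written for this example; they are not the crate's `Policy` type, whose actual variants should be taken from its own documentation.

```rust
/// Hypothetical stand-in for an invalid-nucleotide policy. The variant
/// names are illustrative and are not taken from the binseq crate.
#[derive(Clone, Copy, Debug, PartialEq)]
enum InvalidBasePolicy {
    /// Drop the whole sequence.
    SkipSequence,
    /// Report an error to the caller.
    Fail,
    /// Substitute a fixed nucleotide for every invalid one.
    ReplaceWith(u8),
}

/// True for the four bases that a 2-bit encoding can represent.
fn is_valid_base(base: u8) -> bool {
    matches!(base, b'A' | b'C' | b'G' | b'T')
}

/// Apply the policy to one sequence. Returns Ok(None) when the sequence
/// is skipped, and Ok(Some(..)) with an encodable sequence otherwise.
fn sanitize(seq: &[u8], policy: InvalidBasePolicy) -> Result<Option<Vec<u8>>, String> {
    if seq.iter().all(|&b| is_valid_base(b)) {
        return Ok(Some(seq.to_vec()));
    }
    match policy {
        InvalidBasePolicy::SkipSequence => Ok(None),
        InvalidBasePolicy::Fail => Err("invalid nucleotide in sequence".to_string()),
        InvalidBasePolicy::ReplaceWith(fill) => Ok(Some(
            seq.iter()
                .map(|&b| if is_valid_base(b) { b } else { fill })
                .collect(),
        )),
    }
}
```

Whichever strategy is chosen, the decision has to happen before 2-bit packing, since the packed form has no spare code for N.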

§Recent additions (v0.9.0):

§New variant: CBQ

cbq is a new variant of BINSEQ that solves many of the pain points around VBQ. The CBQ format is a columnar-block-based format that offers improved compression and faster processing speeds compared to VBQ. It natively supports N nucleotides and avoids the need for additional 4-bit encoding.

§Improved interface for writing records

BinseqWriter provides a unified interface for writing records generically to BINSEQ files. It makes use of the new SequencingRecord, which provides a cleaner builder API for writing records to BINSEQ files.

§Recent VBQ Format Changes (v0.7.0+)

The VBQ format has undergone significant improvements:

  • Embedded Index: VBQ files now contain their index data embedded at the end of the file, improving portability.
  • Headers Support: Optional sequence identifiers/headers can be stored with each record.
  • Extended Capacity: u64 indexing supports files with more than 4 billion records.
  • Multi-bit Encoding: Support for both 2-bit and 4-bit nucleotide encodings.

Legacy VBQ files are automatically migrated to the new format when accessed.
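
Two of the points above come down to simple arithmetic, sketched here in standalone Rust. The helper names are hypothetical. Packing bases into 64-bit words is an assumption of this sketch, and the comparison against `u32::MAX` is only meant to show where the "4 billion" figure comes from, not to describe the earlier index layout.

```rust
/// Number of u64 words needed to hold `len` nucleotides at the given bit
/// width. Packing into u64 words is an assumption of this sketch.
fn words_needed(len: usize, bits_per_base: usize) -> usize {
    assert!(
        bits_per_base == 2 || bits_per_base == 4,
        "only 2-bit and 4-bit widths are sketched here"
    );
    let bases_per_word = 64 / bits_per_base; // 32 bases at 2 bits, 16 at 4 bits
    (len + bases_per_word - 1) / bases_per_word // ceiling division
}

/// Whether a record count still fits in a 32-bit index.
/// u32::MAX is 4_294_967_295, i.e. just over 4 billion.
fn fits_in_u32(n_records: u64) -> bool {
    n_records <= u32::MAX as u64
}
```

For a 150-base read, this gives 5 words at 2 bits per base and 10 words at 4 bits per base, so the 4-bit encoding doubles the sequence payload in exchange for representing more symbols.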

§Example: Memory-mapped Access

use binseq::Result;
use binseq::prelude::*;

#[derive(Clone, Default)]
pub struct Processor {
    // Define fields here
}

impl ParallelProcessor for Processor {
    fn process_record<B: BinseqRecord>(&mut self, record: B) -> Result<()> {
        // Implement per-record logic here
        Ok(())
    }

    fn on_batch_complete(&mut self) -> Result<()> {
        // Implement per-batch logic here
        Ok(())
    }
}

fn main() -> Result<()> {
    // provide an input path (*.bq, *.vbq, or *.cbq)
    let path = "./data/subset.bq";

    // open a reader
    let reader = BinseqReader::new(path)?;

    // initialize a processor
    let processor = Processor::default();

    // process the records in parallel with 8 threads
    reader.process_parallel(processor, 8)?;
    Ok(())
}
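
The processor in the example above is left empty. To show how per-record and per-batch hooks typically divide the work, here is a standard-library-only sketch of the same pattern: each thread gets its own clone of the processor, tallies locally in the per-record hook, and publishes to shared state in the per-batch hook. The `RecordProcessor` trait, `BaseCounter`, and this `process_parallel` function are simplified stand-ins written for illustration. They are not the binseq traits and make no claim about how the crate schedules batches.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Simplified stand-in for the processor pattern; not the binseq trait.
trait RecordProcessor: Clone + Send {
    fn process_record(&mut self, record: &[u8]);
    fn on_batch_complete(&mut self);
}

#[derive(Clone, Default)]
struct BaseCounter {
    /// Per-thread tally, reset after each batch.
    local: usize,
    /// Shared across all clones of the processor.
    total: Arc<AtomicUsize>,
}

impl RecordProcessor for BaseCounter {
    fn process_record(&mut self, record: &[u8]) {
        // Cheap, lock-free work on thread-local state.
        self.local += record.len();
    }

    fn on_batch_complete(&mut self) {
        // Touch shared state once per batch instead of once per record.
        self.total.fetch_add(self.local, Ordering::Relaxed);
        self.local = 0;
    }
}

/// Split `records` into one batch per thread and run a cloned processor
/// over each batch. All threads are joined before this function returns.
fn process_parallel<P: RecordProcessor>(records: &[Vec<u8>], processor: P, n_threads: usize) {
    let n_threads = n_threads.max(1);
    let batch_size = ((records.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|scope| {
        for batch in records.chunks(batch_size) {
            let mut worker = processor.clone();
            scope.spawn(move || {
                for record in batch {
                    worker.process_record(record);
                }
                worker.on_batch_complete();
            });
        }
    });
}
```

Keeping a handle to the shared state before handing the processor over (for example `let total = Arc::clone(&counter.total);`) lets the caller read the aggregate once processing finishes.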

Re-exports§

pub use error::Error;
pub use error::IntoBinseqError;
pub use error::Result;
pub use write::BinseqWriter;
pub use write::BinseqWriterBuilder;

Modules§

bq
BQ - fixed length records, no quality scores
cbq
CBQ - Columnar variable length records, optional quality scores and headers
error
Error definitions
prelude
Prelude - Commonly used types and traits
utils
Utilities for working with BINSEQ files
vbq
VBQ - Variable length records, optional quality scores, compressed blocks
write
Unified writer interface for BINSEQ formats, generic over the variant

Structs§

SequencingRecord
A zero-copy record used to write sequences to binary sequence files.
SequencingRecordBuilder
A convenience builder struct for creating a SequencingRecord

Enums§

BinseqReader
An enum abstraction for BINSEQ readers that can process records in parallel
BitSize
Re-export bitnuc::BitSize
Policy
Policy for handling invalid nucleotide sequences during encoding

Constants§

RNG_SEED
A global seed for the random number generator used in randomized policies

Traits§

BinseqRecord
Record trait shared between BINSEQ variants.
ParallelProcessor
Trait for types that can process records in parallel.
ParallelReader
Trait for BINSEQ readers that can process records in parallel