Crate binseq

§BINSEQ Format Specification

§Overview

BINSEQ is a family of binary file formats designed for efficient storage and processing of DNA sequences. The formats use two-bit encoding for nucleotides and are optimized for high-performance parallel processing.

BINSEQ currently has two flavors:

  1. BQ: (*.bq) files are for fixed-length records without quality scores.
  2. VBQ: (*.vbq) files are for variable-length records with optional quality scores and headers.

Both flavors support single and paired sequences.
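To make the two-bit encoding concrete, here is a minimal, self-contained sketch that packs up to 32 nucleotides into a single `u64`. The function names (`encode_2bit`, `decode_2bit`), the base-to-bits mapping, and the bit order are assumptions chosen for illustration; they are not taken from the BINSEQ specification or from bitnuc, which may lay out bits differently.

```rust
/// Pack up to 32 nucleotides into a u64, two bits per base.
/// Assumed mapping for this sketch: A=0b00, C=0b01, G=0b10, T=0b11,
/// with the first base stored in the lowest two bits.
/// Returns None for sequences longer than 32 bases or containing other characters.
fn encode_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        let bits: u64 = match base {
            b'A' | b'a' => 0b00,
            b'C' | b'c' => 0b01,
            b'G' | b'g' => 0b10,
            b'T' | b't' => 0b11,
            _ => return None, // e.g. 'N' has no two-bit representation
        };
        packed |= bits << (2 * i);
    }
    Some(packed)
}

/// Unpack `len` nucleotides (at most 32) from a value produced by `encode_2bit`.
fn decode_2bit(packed: u64, len: usize) -> Vec<u8> {
    assert!(len <= 32, "a u64 holds at most 32 two-bit bases");
    const BASES: [u8; 4] = *b"ACGT";
    (0..len)
        .map(|i| BASES[((packed >> (2 * i)) & 0b11) as usize])
        .collect()
}

fn main() {
    let seq = b"GATTACA";
    let packed = encode_2bit(seq).expect("only A/C/G/T and at most 32 bases");
    // Round-trip: decoding gives back the original bases.
    assert_eq!(decode_2bit(packed, seq.len()), seq.to_vec());
    println!("packed = {packed:#x}");
}
```

Because every base takes two bits instead of the eight bits of an ASCII character, a packed sequence needs roughly a quarter of the space, and fixed-length records map onto a predictable number of machine words.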

§Getting Started

This is a library for reading and writing BINSEQ files. For a command-line interface, see bqtools.

To get started, please refer to our documentation. For example programs that use the library, see our examples directory.

For more information about the BINSEQ file family, please refer to our preprint.

§BINSEQ

The binseq library provides efficient APIs for working with the BINSEQ file format family.

It offers methods to read and write BINSEQ files, providing:

  • Compact multi-bit encoding and decoding of nucleotide sequences through bitnuc
  • Memory-mapped file access for efficient reading (bq::MmapReader and vbq::MmapReader)
  • Parallel processing capabilities for arbitrary tasks through the ParallelProcessor trait
  • Configurable Policy for handling invalid nucleotides
  • Support for both single and paired-end sequences
  • Optional sequence headers/identifiers (VBQ format)
  • Abstract BinseqRecord trait for representing records from both .bq and .vbq files
  • Abstract BinseqReader enum for processing records from both .bq and .vbq files
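Since two-bit encoding has no slot for ambiguous bases such as `N`, an encoder has to decide what to do when it meets one. The sketch below illustrates the general idea of a configurable policy. The `SketchPolicy` enum, its variants, and the `sanitize` function are hypothetical names invented for this example; they are not the crate's `Policy` type, whose actual variants should be looked up in its documentation.

```rust
/// Hypothetical policy for sequences that contain non-ACGT bases.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SketchPolicy {
    /// Silently drop any sequence containing an invalid base.
    Skip,
    /// Report an error for any sequence containing an invalid base.
    Fail,
    /// Substitute the given base for every invalid one.
    Replace(u8),
}

fn is_valid(base: u8) -> bool {
    matches!(base, b'A' | b'C' | b'G' | b'T')
}

/// Apply the policy to one sequence.
/// Ok(Some(seq)) means "encode this", Ok(None) means "skip this record".
fn sanitize(seq: &[u8], policy: SketchPolicy) -> Result<Option<Vec<u8>>, String> {
    if seq.iter().all(|&b| is_valid(b)) {
        return Ok(Some(seq.to_vec()));
    }
    match policy {
        SketchPolicy::Skip => Ok(None),
        SketchPolicy::Fail => Err(format!(
            "invalid nucleotide in {:?}",
            String::from_utf8_lossy(seq)
        )),
        SketchPolicy::Replace(sub) => Ok(Some(
            seq.iter()
                .map(|&b| if is_valid(b) { b } else { sub })
                .collect(),
        )),
    }
}

fn main() {
    let read = b"ACNT";
    for policy in [SketchPolicy::Skip, SketchPolicy::Fail, SketchPolicy::Replace(b'A')] {
        println!("{policy:?}: {:?}", sanitize(read, policy));
    }
}
```

Choosing the policy once, up front, keeps the per-record encoding loop free of ad hoc special cases.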

§Recent VBQ Format Changes (v0.7.0+)

The VBQ format has undergone significant improvements:

  • Embedded Index: VBQ files now contain their index data embedded at the end of the file, eliminating separate .vqi index files and improving portability.
  • Headers Support: Optional sequence identifiers/headers can be stored with each record.
  • Extended Capacity: u64 indexing supports files with more than 4 billion records.
  • Multi-bit Encoding: Support for both 2-bit and 4-bit nucleotide encodings.

Legacy VBQ files are automatically migrated to the new format when accessed.

§Crate Organization

This library is split into three major parts.

The bq and vbq modules provide tools for reading and writing BQ and VBQ files, respectively. Traits and utilities shared across the library are available at the top level of the crate.

§Example: Memory-mapped Access

use binseq::Result;
use binseq::prelude::*;

#[derive(Clone, Default)]
pub struct Processor {
    // Define fields here
}

impl ParallelProcessor for Processor {
    fn process_record<B: BinseqRecord>(&mut self, record: B) -> Result<()> {
        // Implement per-record logic here
        Ok(())
    }

    fn on_batch_complete(&mut self) -> Result<()> {
        // Implement per-batch logic here
        Ok(())
    }
}

fn main() -> Result<()> {
    // provide an input path (*.bq or *.vbq)
    let path = "./data/subset.bq";

    // open a reader
    let reader = BinseqReader::new(path)?;

    // initialize a processor
    let processor = Processor::default();

    // process the records in parallel with 8 threads
    reader.process_parallel(processor, 8)?;
    Ok(())
}
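The processor above derives Clone, which suggests a common design for parallel readers: each worker thread owns its own copy of the processor and only touches shared state at batch boundaries. The following standard-library-only sketch shows that pattern in isolation. It assumes this clone-per-thread model and does not use the binseq API; `CountingProcessor`, `run_parallel`, and the batch size are hypothetical names and values chosen for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical processor: counts records locally, then flushes to a shared total.
#[derive(Clone, Default)]
struct CountingProcessor {
    local: u64,              // private to each clone, updated without locking
    global: Arc<Mutex<u64>>, // shared by all clones
}

impl CountingProcessor {
    fn process_record(&mut self, _record: &[u8]) {
        self.local += 1;
    }

    /// Take the lock once per batch instead of once per record.
    fn on_batch_complete(&mut self) {
        *self.global.lock().unwrap() += self.local;
        self.local = 0;
    }

    fn total(&self) -> u64 {
        *self.global.lock().unwrap()
    }
}

/// Split `records` across up to `n_threads` workers, each with its own processor clone.
fn run_parallel(
    records: &[Vec<u8>],
    processor: CountingProcessor,
    n_threads: usize,
    batch_size: usize,
) {
    let n = n_threads.max(1);
    let per_thread = ((records.len() + n - 1) / n).max(1);
    let batch_size = batch_size.max(1);
    thread::scope(|s| {
        for part in records.chunks(per_thread) {
            let mut p = processor.clone();
            s.spawn(move || {
                for batch in part.chunks(batch_size) {
                    for record in batch {
                        p.process_record(record);
                    }
                    p.on_batch_complete();
                }
            });
        }
    });
}

fn main() {
    let records: Vec<Vec<u8>> = vec![b"ACGT".to_vec(); 1_000];
    let processor = CountingProcessor::default();
    run_parallel(&records, processor.clone(), 8, 64);
    println!("processed {} records", processor.total());
}
```

Keeping per-record work lock-free and synchronizing only in the per-batch hook is what keeps contention low as the thread count grows.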

Re-exports§

pub use error::Error;
pub use error::Result;

Modules§

bq
BQ - fixed-length records, no quality scores
error
Error definitions
prelude
Prelude - Commonly used types and traits
vbq
VBQ - Variable-length records, optional quality scores, compressed blocks

Enums§

BinseqReader
An enum abstraction for BINSEQ readers that can process records in parallel
BitSize
Re-export bitnuc::BitSize
Policy
Policy for handling invalid nucleotide sequences during encoding

Constants§

RNG_SEED
A global seed for the random number generator used in randomized policies

Traits§

BinseqRecord
Record trait shared between BINSEQ variants.
ParallelProcessor
Trait for types that can process records in parallel.
ParallelReader
Trait for BINSEQ readers that can process records in parallel