Crate binseq

§BINSEQ Format Specification

§Overview

BINSEQ is a family of binary file formats designed for efficient storage and processing of DNA sequences. The formats use two-bit encoding for nucleotides and are optimized for high-performance parallel processing.

BINSEQ currently has two flavors:

  1. BQ (*.bq): fixed-length records without quality scores.
  2. VBQ (*.vbq): variable-length records with optional quality scores.

Both flavors support single and paired sequences.

§Getting Started

This is a library for reading and writing BINSEQ files. For a command-line interface, see bqtools.

To get started, please refer to our documentation. For example programs that use the library, see our examples directory.

For more information about the BINSEQ file family, please refer to our preprint.

§BINSEQ

The binseq library provides efficient APIs for working with the BINSEQ file format family.

It offers methods to read and write BINSEQ files, providing:

  • Compact 2-bit encoding and decoding of nucleotide sequences through bitnuc
  • Memory-mapped file access for efficient reading (bq::MmapReader and vbq::MmapReader)
  • Parallel processing capabilities for arbitrary tasks through the ParallelProcessor trait
  • Configurable Policy for handling invalid nucleotides
  • Support for both single and paired-end sequences
  • Abstract BinseqRecord trait for representing records from both .bq and .vbq files
  • Abstract BinseqReader enum for processing records from both .bq and .vbq files
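
To make the 2-bit encoding idea concrete, here is a minimal, self-contained sketch that packs up to 32 nucleotides into a single u64 and unpacks them again. It does not call bitnuc. The function names, the mapping (A=0, C=1, G=2, T=3), and the bit order (first base in the lowest bits) are assumptions chosen for illustration and may not match the layout the library actually writes.

```rust
/// Map one nucleotide to a 2-bit code.
/// Assumed mapping for illustration: A=0, C=1, G=2, T=3.
fn encode_base(base: u8) -> Option<u64> {
    match base {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None, // e.g. 'N'; a real encoder needs a policy for this case
    }
}

/// Pack up to 32 nucleotides into one u64, first base in the lowest two bits.
/// Returns None if the sequence is too long or contains a non-ACGT byte.
fn encode(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        packed |= encode_base(base)? << (2 * i);
    }
    Some(packed)
}

/// Unpack `len` nucleotides (at most 32) from a u64 produced by `encode`.
fn decode(packed: u64, len: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    assert!(len <= 32, "a u64 holds at most 32 two-bit bases");
    (0..len)
        .map(|i| BASES[((packed >> (2 * i)) & 0b11) as usize])
        .collect()
}

fn main() {
    let seq = b"ACGTTGCA";
    let packed = encode(seq).expect("only A, C, G, T are encodable");
    // Round trip: decoding recovers the original bases.
    assert_eq!(decode(packed, seq.len()), seq.to_vec());
    println!("{} bases packed into {:#x}", seq.len(), packed);
}
```

Because the packed word does not record how many bases it holds, the length has to be stored separately. The `None` branch for bytes such as `N` is the case that a configurable policy for invalid nucleotides has to resolve.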

§Crate Organization

This library is split into three major parts: the bq module, which provides tools for reading and writing BQ files; the vbq module, which does the same for VBQ files; and the traits and utilities shared across the library, which are available at the top level of the crate.

§Example: Memory-mapped Access

use binseq::Result;
use binseq::prelude::*;

#[derive(Clone, Default)]
pub struct Processor {
    // Define fields here
}

impl ParallelProcessor for Processor {
    fn process_record<B: BinseqRecord>(&mut self, record: B) -> Result<()> {
        // Implement per-record logic here
        Ok(())
    }

    fn on_batch_complete(&mut self) -> Result<()> {
        // Implement per-batch logic here
        Ok(())
    }
}

fn main() -> Result<()> {
    // provide an input path (*.bq or *.vbq)
    let path = "./data/subset.bq";

    // open a reader
    let reader = BinseqReader::new(path)?;

    // initialize a processor
    let processor = Processor::default();

    // process the records in parallel with 8 threads
    reader.process_parallel(processor, 8)?;
    Ok(())
}
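
The split between a per-record hook and a per-batch hook suggests a common pattern: each thread works on its own clone of the processor, accumulates into thread-local state, and touches shared state only when a batch is done. The sketch below illustrates that pattern using only the standard library. `RecordProcessor`, `BaseCounter`, `process_parallel`, and `count_bases` are hypothetical stand-ins, not binseq types, and the library's actual scheduling and batching may differ.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a per-record processor (not the binseq trait).
trait RecordProcessor: Clone + Send + 'static {
    fn process_record(&mut self, record: &[u8]);
    fn on_batch_complete(&mut self);
}

#[derive(Clone, Default)]
struct BaseCounter {
    local: u64,              // per-thread tally, updated without locking
    global: Arc<Mutex<u64>>, // shared total, locked once per batch
}

impl RecordProcessor for BaseCounter {
    fn process_record(&mut self, record: &[u8]) {
        self.local += record.len() as u64;
    }

    fn on_batch_complete(&mut self) {
        // Fold the local tally into the shared total, then reset it.
        *self.global.lock().unwrap() += self.local;
        self.local = 0;
    }
}

/// Split the records into one batch per thread; each thread gets its own clone.
fn process_parallel<P: RecordProcessor>(records: &[Vec<u8>], processor: P, n_threads: usize) {
    let n_threads = n_threads.max(1);
    let batch = ((records.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|scope| {
        for chunk in records.chunks(batch) {
            let mut worker = processor.clone();
            scope.spawn(move || {
                for record in chunk {
                    worker.process_record(record);
                }
                worker.on_batch_complete();
            });
        }
    });
}

/// Count the total number of bases across all records.
fn count_bases(records: &[Vec<u8>], n_threads: usize) -> u64 {
    let counter = BaseCounter::default();
    let total = Arc::clone(&counter.global);
    process_parallel(records, counter, n_threads);
    let n = *total.lock().unwrap();
    n
}

fn main() {
    let records = vec![b"ACGT".to_vec(), b"AC".to_vec(), b"GGG".to_vec()];
    println!("total bases: {}", count_bases(&records, 2));
}
```

Cloning an `Arc` shares the underlying counter, so every clone of `BaseCounter` reports into the same total while keeping its own `local` field. Locking once per batch rather than once per record keeps contention low.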

Re-exports§

pub use error::Error;
pub use error::Result;

Modules§

bq
BQ - fixed length records, no quality scores
error
Error definitions
prelude
Prelude - Commonly used types and traits
vbq
VBQ - Variable length records, optional quality scores, compressed blocks

Enums§

BinseqReader
An enum abstraction for BINSEQ readers that can process records in parallel
Policy
Policy for handling invalid nucleotide sequences during encoding

Constants§

RNG_SEED
A global seed for the random number generator used in randomized policies

Traits§

BinseqRecord
Record trait shared between BINSEQ variants.
ParallelProcessor
Trait for types that can process records in parallel.
ParallelReader
Trait for BINSEQ readers that can process records in parallel