Module vbq

Module vbq 

Source
Expand description

VBQ - Variable length records, optional quality scores, compressed blocks

§VBINSEQ Format

VBINSEQ is a high-performance binary format for variable-length nucleotide sequences that optimizes both storage efficiency and parallel processing capabilities.

For more information on the format, please refer to our preprint.

§Overview

VBINSEQ extends the core principles of BINSEQ to accommodate:

  • Variable-length sequences: Unlike BINSEQ which requires fixed-length reads, VBINSEQ can store sequences of any length, making it suitable for technologies like PacBio and Oxford Nanopore.

  • Quality scores: Optional storage of quality scores alongside nucleotide data when needed.

  • Sequence headers: Optional storage of sequence identifiers/headers with each record.

  • Block-based organization: Data is organized into fixed-size independent record blocks for efficient parallel processing.

  • Compression: Optional ZSTD compression of individual blocks balances storage efficiency with processing speed.

  • Paired-end support: Native support for paired sequences without needing multiple files.

  • Multi-bit encoding: Support for 2-bit and 4-bit nucleotide encodings.

  • Embedded index: Self-contained files with embedded index data for efficient random access.

§File Structure

A VBINSEQ file consists of a 32-byte header followed by record blocks and an embedded index:

┌───────────────────┐
│    File Header    │ 32 bytes
├───────────────────┤
│   Block Header    │ 32 bytes
├───────────────────┤
│                   │
│   Block Records   │ Variable size
│                   │
├───────────────────┤
│       ...         │ More blocks
├───────────────────┤
│ Compressed Index  │ Variable size
├───────────────────┤
│   Index Size      │ 8 bytes (u64)
├───────────────────┤
│ Index End Magic   │ 8 bytes
└───────────────────┘

§Record Format

Each record contains the following fields in order:

  • Flag field (8 bytes)
  • Primary sequence length (8 bytes)
  • Extended sequence length (8 bytes, 0 if not paired)
  • Primary sequence data (2-bit or 4-bit encoded)
  • Extended sequence data (optional, for paired-end)
  • Primary quality scores (optional, if qual flag set)
  • Extended quality scores (optional, if paired and qual flag set)
  • Primary header length (8 bytes, if headers flag set)
  • Primary header data (UTF-8 string, if headers flag set)
  • Extended header length (8 bytes, if paired and headers flag set)
  • Extended header data (UTF-8 string, if paired and headers flag set)

§Recent Format Changes (v0.7.0+)

  • Embedded Index: Index data is now stored within the VBQ file itself, eliminating separate .vqi files and improving portability.
  • Headers Support: Optional sequence identifiers can be stored with each record.
  • Extended Capacity: u64 indexing supports files with more than 4 billion records.
  • Multi-bit Encoding: Support for both 2-bit and 4-bit nucleotide encodings.

§Performance Characteristics

VBINSEQ is designed for high-throughput parallel processing:

  • Independent blocks enable true parallel processing without synchronization
  • Memory-mapped access provides efficient I/O
  • Embedded index enables fast random access without auxiliary files
  • Multi-bit encoding (2-bit/4-bit) optimizes storage for different use cases
  • Optional ZSTD compression reduces file size with minimal performance impact

§Usage Example

use std::fs::File;
use std::io::BufWriter;
use binseq::vbq::{VBinseqHeaderBuilder, VBinseqWriterBuilder, MmapReader};
use binseq::BinseqRecord;

/*
   WRITING
*/

// Create a header for sequences with quality scores and headers
let header = VBinseqHeaderBuilder::new()
    .qual(true)
    .compressed(true)
    .headers(true)
    .build();

// Create a writer
let file = File::create("example.vbq").unwrap();
let mut writer = VBinseqWriterBuilder::default()
    .header(header)
    .build(BufWriter::new(file))
    .unwrap();

// Write a sequence with quality scores and header
let sequence = b"ACGTACGT";
let quality = b"IIIIFFFF";
let header_str = b"sequence_001";
writer.write_record(None, Some(header_str), sequence, Some(quality)).unwrap();
writer.finish().unwrap();

/*
   READING
*/

// Read the sequences back
let mut reader = MmapReader::new("example.vbq").unwrap();
let mut block = reader.new_block();

// Process blocks one at a time
let mut seq_buffer = Vec::new();
let mut header_buffer = Vec::new();
while reader.read_block_into(&mut block).unwrap() {
    for record in block.iter() {
        record.decode_s(&mut seq_buffer).unwrap();
        record.sheader(&mut header_buffer);
        println!("Header: {}", std::str::from_utf8(&header_buffer).unwrap());
        println!("Sequence: {}", std::str::from_utf8(&seq_buffer).unwrap());
        println!("Quality: {}", std::str::from_utf8(record.squal()).unwrap());
        seq_buffer.clear();
        header_buffer.clear();
    }
}

Structs§

BlockHeader
Block header for VBINSEQ block data
BlockIndex
Complete index for a VBINSEQ file
BlockRange
Descriptor of the dimensions of a block in a VBINSEQ file
MmapReader
Memory-mapped reader for VBINSEQ files
RecordBlock
A container for a block of VBINSEQ records
RecordBlockIter
RefRecord
A reference to a record in a VBINSEQ file
VBinseqHeader
File header for VBINSEQ files
VBinseqHeaderBuilder
VBinseqWriter
Writer for VBINSEQ format files
VBinseqWriterBuilder
A builder for creating configured VBinseqWriter instances