Expand description
VBQ - Variable length records, optional quality scores, compressed blocks
§VBINSEQ Format
VBINSEQ is a high-performance binary format for variable-length nucleotide sequences that optimizes both storage efficiency and parallel processing capabilities.
For more information on the format, please refer to our preprint.
§Overview
VBINSEQ extends the core principles of BINSEQ to accommodate:
-
Variable-length sequences: Unlike BINSEQ which requires fixed-length reads, VBINSEQ can store sequences of any length, making it suitable for technologies like PacBio and Oxford Nanopore.
-
Quality scores: Optional storage of quality scores alongside nucleotide data when needed.
-
Sequence headers: Optional storage of sequence identifiers/headers with each record.
-
Block-based organization: Data is organized into fixed-size independent record blocks for efficient parallel processing.
-
Compression: Optional ZSTD compression of individual blocks balances storage efficiency with processing speed.
-
Paired-end support: Native support for paired sequences without needing multiple files.
-
Multi-bit encoding: Support for 2-bit and 4-bit nucleotide encodings.
-
Embedded index: Self-contained files with embedded index data for efficient random access.
§File Structure
A VBINSEQ file consists of a 32-byte header followed by record blocks and an embedded index:
┌───────────────────┐
│ File Header │ 32 bytes
├───────────────────┤
│ Block Header │ 32 bytes
├───────────────────┤
│ │
│ Block Records │ Variable size
│ │
├───────────────────┤
│ ... │ More blocks
├───────────────────┤
│ Compressed Index │ Variable size
├───────────────────┤
│ Index Size │ 8 bytes (u64)
├───────────────────┤
│ Index End Magic │ 8 bytes
└───────────────────┘§Record Format
Each record contains the following fields in order:
- Flag field (8 bytes)
- Primary sequence length (8 bytes)
- Extended sequence length (8 bytes, 0 if not paired)
- Primary sequence data (2-bit or 4-bit encoded)
- Extended sequence data (optional, for paired-end)
- Primary quality scores (optional, if
qualflag set) - Extended quality scores (optional, if paired and
qualflag set) - Primary header length (8 bytes, if
headersflag set) - Primary header data (UTF-8 string, if
headersflag set) - Extended header length (8 bytes, if paired and
headersflag set) - Extended header data (UTF-8 string, if paired and
headersflag set)
§Recent Format Changes (v0.7.0+)
- Embedded Index: Index data is now stored within the VBQ file itself, eliminating
separate
.vqifiles and improving portability. - Headers Support: Optional sequence identifiers can be stored with each record.
- Extended Capacity: u64 indexing supports files with more than 4 billion records.
- Multi-bit Encoding: Support for both 2-bit and 4-bit nucleotide encodings.
§Performance Characteristics
VBINSEQ is designed for high-throughput parallel processing:
- Independent blocks enable true parallel processing without synchronization
- Memory-mapped access provides efficient I/O
- Embedded index enables fast random access without auxiliary files
- Multi-bit encoding (2-bit/4-bit) optimizes storage for different use cases
- Optional ZSTD compression reduces file size with minimal performance impact
§Usage Example
use std::fs::File;
use std::io::BufWriter;
use binseq::vbq::{VBinseqHeaderBuilder, VBinseqWriterBuilder, MmapReader};
use binseq::BinseqRecord;
/*
WRITING
*/
// Create a header for sequences with quality scores and headers
let header = VBinseqHeaderBuilder::new()
.qual(true)
.compressed(true)
.headers(true)
.build();
// Create a writer
let file = File::create("example.vbq").unwrap();
let mut writer = VBinseqWriterBuilder::default()
.header(header)
.build(BufWriter::new(file))
.unwrap();
// Write a sequence with quality scores and header
let sequence = b"ACGTACGT";
let quality = b"IIIIFFFF";
let header_str = b"sequence_001";
writer.write_record(None, Some(header_str), sequence, Some(quality)).unwrap();
writer.finish().unwrap();
/*
READING
*/
// Read the sequences back
let mut reader = MmapReader::new("example.vbq").unwrap();
let mut block = reader.new_block();
// Process blocks one at a time
let mut seq_buffer = Vec::new();
let mut header_buffer = Vec::new();
while reader.read_block_into(&mut block).unwrap() {
for record in block.iter() {
record.decode_s(&mut seq_buffer).unwrap();
record.sheader(&mut header_buffer);
println!("Header: {}", std::str::from_utf8(&header_buffer).unwrap());
println!("Sequence: {}", std::str::from_utf8(&seq_buffer).unwrap());
println!("Quality: {}", std::str::from_utf8(record.squal()).unwrap());
seq_buffer.clear();
header_buffer.clear();
}
}Structs§
- Block
Header - Block header for VBINSEQ block data
- Block
Index - Complete index for a VBINSEQ file
- Block
Range - Descriptor of the dimensions of a block in a VBINSEQ file
- Mmap
Reader - Memory-mapped reader for VBINSEQ files
- Record
Block - A container for a block of VBINSEQ records
- Record
Block Iter - RefRecord
- A reference to a record in a VBINSEQ file
- VBinseq
Header - File header for VBINSEQ files
- VBinseq
Header Builder - VBinseq
Writer - Writer for VBINSEQ format files
- VBinseq
Writer Builder - A builder for creating configured
VBinseqWriterinstances