//! # VBQ Format
//!
//! VBQ is a high-performance binary format for variable-length nucleotide sequences
//! that optimizes both storage efficiency and parallel processing capabilities.
//!
//! For more information on the format, please refer to our [preprint](https://www.biorxiv.org/content/10.1101/2025.04.08.647863v1).
//!
//! ## Overview
//!
//! VBQ extends the core principles of BINSEQ to accommodate:
//!
//! * **Variable-length sequences**: Unlike BINSEQ, which requires fixed-length reads, VBQ can store
//! sequences of any length, making it suitable for technologies like PacBio and Oxford Nanopore.
//!
//! * **Quality scores**: Optional storage of quality scores alongside nucleotide data when needed.
//!
//! * **Sequence headers**: Optional storage of sequence identifiers/headers with each record.
//!
//! * **Block-based organization**: Data is organized into fixed-size independent record blocks
//! for efficient parallel processing.
//!
//! * **Compression**: Optional ZSTD compression of individual blocks balances storage
//! efficiency with processing speed.
//!
//! * **Paired-end support**: Native support for paired sequences without needing multiple files.
//!
//! * **Multi-bit encoding**: Support for 2-bit and 4-bit nucleotide encodings.
//!
//! * **Embedded index**: Self-contained files with embedded index data for efficient random access.
//!
//! ## File Structure
//!
//! A VBQ file consists of a 32-byte header followed by record blocks and an embedded index:
//!
//! ```text
//! ┌───────────────────┐
//! │    File Header    │ 32 bytes
//! ├───────────────────┤
//! │   Block Header    │ 32 bytes
//! ├───────────────────┤
//! │                   │
//! │   Block Records   │ Variable size
//! │                   │
//! ├───────────────────┤
//! │        ...        │ More blocks
//! ├───────────────────┤
//! │ Compressed Index  │ Variable size
//! ├───────────────────┤
//! │    Index Size     │ 8 bytes (u64)
//! ├───────────────────┤
//! │  Index End Magic  │ 8 bytes
//! └───────────────────┘
//! ```
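//!
//! Because the index size and end magic occupy the last 16 bytes, a reader can locate the
//! embedded index by seeking from the end of the file. The following is a minimal sketch
//! using only the standard library, not the crate's own reader. The helper name
//! `read_index_footer` is hypothetical, and the little-endian byte order of the size field
//! is an assumption that the layout above does not state.
//!
//! ```rust
//! use std::io::{self, Read, Seek, SeekFrom};
//!
//! /// Hypothetical helper: reads the trailing 16 bytes of a VBQ stream and
//! /// returns `(index size in bytes, raw 8-byte end magic)`.
//! fn read_index_footer<R: Read + Seek>(src: &mut R) -> io::Result<(u64, [u8; 8])> {
//!     // The footer is the final 16 bytes: 8-byte index size, then 8-byte magic.
//!     src.seek(SeekFrom::End(-16))?;
//!     let mut size = [0u8; 8];
//!     let mut magic = [0u8; 8];
//!     src.read_exact(&mut size)?;
//!     src.read_exact(&mut magic)?;
//!     // ASSUMPTION: the size is stored little-endian.
//!     Ok((u64::from_le_bytes(size), magic))
//! }
//! ```
//!
//! Under the same assumption, the compressed index would start at
//! `file_len - 16 - index_size`, directly after the last record block.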
//!
//! ## Record Format
//!
//! Each record contains the following fields in order:
//!
//! * Flag field (8 bytes)
//! * Primary sequence length (8 bytes)
//! * Extended sequence length (8 bytes, 0 if not paired)
//! * Primary sequence data (2-bit or 4-bit encoded)
//! * Extended sequence data (optional, for paired-end)
//! * Primary quality scores (optional, if `qual` flag set)
//! * Extended quality scores (optional, if paired and `qual` flag set)
//! * Primary header length (8 bytes, if `headers` flag set)
//! * Primary header data (UTF-8 string, if `headers` flag set)
//! * Extended header length (8 bytes, if paired and `headers` flag set)
//! * Extended header data (UTF-8 string, if paired and `headers` flag set)
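//!
//! To illustrate what "2-bit encoded" means for the sequence fields, the sketch below packs
//! four nucleotides into each byte. It is an illustration only: the helper names, the code
//! assignment (A=0, C=1, G=2, T=3), and the bit order within a byte are assumptions and may
//! differ from the packing VBQ actually uses.
//!
//! ```rust
//! /// Hypothetical helper: packs ASCII nucleotides at 2 bits per base.
//! /// Returns `None` if a base other than A, C, G, or T is encountered.
//! fn pack_2bit(seq: &[u8]) -> Option<Vec<u8>> {
//!     let mut out = vec![0u8; (seq.len() + 3) / 4];
//!     for (i, &base) in seq.iter().enumerate() {
//!         // ASSUMPTION: illustrative code assignment, not taken from the format.
//!         let code = match base {
//!             b'A' | b'a' => 0u8,
//!             b'C' | b'c' => 1,
//!             b'G' | b'g' => 2,
//!             b'T' | b't' => 3,
//!             _ => return None,
//!         };
//!         // Base `i` occupies bits `2 * (i % 4)` and up within byte `i / 4`.
//!         out[i / 4] |= code << ((i % 4) * 2);
//!     }
//!     Some(out)
//! }
//!
//! /// Hypothetical helper: recovers `len` uppercase bases from packed bytes.
//! fn unpack_2bit(packed: &[u8], len: usize) -> Vec<u8> {
//!     (0..len)
//!         .map(|i| b"ACGT"[((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as usize])
//!         .collect()
//! }
//! ```
//!
//! A 2-bit alphabet has no room for ambiguous bases such as `N`, which is why the sketch
//! rejects them; a 4-bit encoding trades twice the space for a larger alphabet.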
//!
//! ## Recent Format Changes (v0.7.0+)
//!
//! * **Embedded Index**: Index data is now stored within the VBQ file itself, eliminating
//! the need for a separate index file and improving portability.
//! * **Headers Support**: Optional sequence identifiers can be stored with each record.
//! * **Extended Capacity**: u64 indexing supports files with more than 4 billion records.
//! * **Multi-bit Encoding**: Support for both 2-bit and 4-bit nucleotide encodings.
//!
//! ## Performance Characteristics
//!
//! VBQ is designed for high-throughput parallel processing:
//!
//! * Independent blocks enable true parallel processing without synchronization
//! * Memory-mapped access provides efficient I/O
//! * Embedded index enables fast random access without auxiliary files
//! * Multi-bit encoding (2-bit/4-bit) optimizes storage for different use cases
//! * Optional ZSTD compression reduces file size with minimal performance impact
//!
//! ## Usage Example
//!
//! ```
//! use std::fs::File;
//! use std::io::BufWriter;
//! use binseq::vbq::{FileHeaderBuilder, WriterBuilder, MmapReader};
//! use binseq::{BinseqRecord, SequencingRecordBuilder};
//!
//! /*
//! WRITING
//! */
//!
//! // Create a header for sequences with quality scores and headers
//! let header = FileHeaderBuilder::new()
//!     .qual(true)
//!     .compressed(true)
//!     .headers(true)
//!     .build();
//!
//! // Create a writer
//! let file = File::create("example.vbq").unwrap();
//! let mut writer = WriterBuilder::default()
//!     .header(header)
//!     .build(BufWriter::new(file))
//!     .unwrap();
//!
//! // Write a sequence with quality scores and header
//! let record = SequencingRecordBuilder::default()
//!     .s_seq(b"ACGTACGT")
//!     .s_qual(b"IIIIFFFF")
//!     .s_header(b"sequence_001")
//!     .build()
//!     .unwrap();
//! writer.push(record).unwrap();
//! writer.finish().unwrap();
//!
//! /*
//! READING
//! */
//!
//! // Read the sequences back
//! let mut reader = MmapReader::new("example.vbq").unwrap();
//! let mut block = reader.new_block();
//!
//! // Process blocks one at a time
//! let mut seq_buffer = Vec::new();
//! while reader.read_block_into(&mut block).unwrap() {
//!     for record in block.iter() {
//!         record.decode_s(&mut seq_buffer).unwrap();
//!         let header = record.sheader();
//!         println!("Header: {}", std::str::from_utf8(header).unwrap());
//!         println!("Sequence: {}", std::str::from_utf8(&seq_buffer).unwrap());
//!         println!("Quality: {}", std::str::from_utf8(record.squal()).unwrap());
//!         seq_buffer.clear();
//!     }
//! }
//! # std::fs::remove_file("example.vbq").unwrap_or(());
//! ```
pub use ;
pub use ;
pub use ;
pub use ;