CBQ - Columnar variable length records, optional quality scores and headers
§CBQ Format
CBQ is a high-performance binary format built around blocked columnar storage. It optimizes for storage efficiency and parallel processing of records.
§Overview
CBQ was built to solve the rough edges of VBQ. It keeps the blocked structure of VBQ, but instead of interleaving the internal data of all records in the block, it stores each attribute in a separate column. Each column is then ZSTD-compressed independently and can be decompressed on demand when reading.
It was built to be performant, efficient, and lossless by default.
This design has a few advantages over VBQ:
- Better compression ratios for each individual attribute.
- Significantly faster read throughput (simpler decompression, and only the columns a reader uses are decompressed).
- Simple record parsing and manipulation.
Notably, this format only performs two-bit encoding of sequences.
However, it tracks the positions of all ambiguous nucleotides (N) within each sequence.
When the two-bit encoded sequence is decoded back to nucleotides, those positions are backfilled with N.
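The round trip described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the function names, the bit order within each byte, the A=0, C=1, G=2, T=3 mapping, and the choice to store N as the code for A are all assumptions made for the example.

```rust
/// Pack an uppercase nucleotide sequence into 2 bits per base and record
/// the positions of every N.
/// Assumed mapping (hypothetical): A=0, C=1, G=2, T=3; N is packed as 0.
/// Returns `None` if the sequence contains any other byte.
fn encode_2bit(seq: &[u8]) -> Option<(Vec<u8>, Vec<u32>)> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    let mut n_positions = Vec::new();
    for (i, &base) in seq.iter().enumerate() {
        let code = match base {
            b'A' => 0u8,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            b'N' => {
                // Remember where the N was; the packed data holds a placeholder.
                n_positions.push(i as u32);
                0
            }
            _ => return None,
        };
        // Four bases per byte, lowest bits first (an assumption of this sketch).
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    Some((packed, n_positions))
}

/// Unpack `len` bases and backfill the recorded N positions.
fn decode_2bit(packed: &[u8], len: usize, n_positions: &[u32]) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    let mut seq: Vec<u8> = (0..len)
        .map(|i| BASES[((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as usize])
        .collect();
    for &pos in n_positions {
        seq[pos as usize] = b'N';
    }
    seq
}
```

Encoding `b"ACGTNNAC"` with this sketch yields two packed bytes plus the N positions 4 and 5, and decoding with those positions restores the original bytes.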
Because N positions tend to be sparse but clustered, they are stored with an Elias-Fano encoding.
This encoding allows the positions of Ns within a sequence to be stored compactly and retrieved efficiently.
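To make the idea concrete, here is a small sketch of Elias-Fano encoding for a sorted list of positions. It is illustrative only and does not reflect how the crate lays out the `z_npos` column: the `EliasFano` struct and both function names are hypothetical, and the low and high parts are kept in plain `Vec`s rather than packed bit arrays so the logic stays readable.

```rust
/// Elias-Fano representation of a sorted list of positions (hypothetical type).
struct EliasFano {
    /// Number of low bits stored verbatim per element.
    low_bits: u32,
    /// Low parts, one per element (a real encoder would bit-pack these).
    lows: Vec<u64>,
    /// High parts in unary: element i sets the bit at index (high_i + i).
    highs: Vec<bool>,
}

/// Encode `positions`, which must be sorted and strictly less than `universe`.
fn ef_encode(positions: &[u64], universe: u64) -> EliasFano {
    let n = positions.len() as u64;
    // low_bits = floor(log2(universe / n)), or 0 when that ratio is below 1.
    let low_bits = if n == 0 || universe / n == 0 {
        0
    } else {
        63 - (universe / n).leading_zeros()
    };
    let mask = (1u64 << low_bits) - 1;
    let mut highs = vec![false; (n + (universe >> low_bits) + 1) as usize];
    let mut lows = Vec::with_capacity(positions.len());
    for (i, &p) in positions.iter().enumerate() {
        lows.push(p & mask);
        highs[(p >> low_bits) as usize + i] = true;
    }
    EliasFano { low_bits, lows, highs }
}

/// Decode all positions by scanning the unary high bits.
fn ef_decode(ef: &EliasFano) -> Vec<u64> {
    let mut out = Vec::with_capacity(ef.lows.len());
    for (bit_pos, &bit) in ef.highs.iter().enumerate() {
        if bit {
            let i = out.len();
            // The number of zero bits before this one is the high part.
            let high = (bit_pos - i) as u64;
            out.push((high << ef.low_bits) | ef.lows[i]);
        }
    }
    out
}
```

With n positions drawn from a universe of size u, a bit-packed version of this scheme needs roughly n * (2 + log2(u / n)) bits, which is why it suits a small number of N positions in a long sequence.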
§File Structure
A CBQ file consists of a FileHeader, followed by record blocks and an embedded Index.
Each record block is composed of a BlockHeader which provides metadata about the block, and a ColumnarBlock containing the actual data.
The IndexHeader and IndexFooter are used to locate and access the data within the file when it is read through a memory map.
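As a rough sketch of how the fixed sizes shown in the layout diagram that follows translate into byte offsets, the snippet below locates the index region from the end of the file. The constant names, the `IndexLayout` struct, and both functions are hypothetical. In particular, the compressed index length is passed in as a parameter because this page does not say which field records it.

```rust
// Fixed sizes taken from the layout diagram; the constant names are hypothetical.
const FILE_HEADER_SIZE: u64 = 64;
const BLOCK_HEADER_SIZE: u64 = 96;
const INDEX_HEADER_SIZE: u64 = 24;
const INDEX_FOOTER_SIZE: u64 = 16;

/// Byte offsets of the three index regions at the end of the file.
#[derive(Debug, PartialEq)]
struct IndexLayout {
    header_start: u64,
    index_start: u64,
    footer_start: u64,
}

/// Work backwards from the end of the file to find the index regions.
/// Returns `None` if the file is too small to hold them after the file header.
fn locate_index(file_len: u64, compressed_index_len: u64) -> Option<IndexLayout> {
    let footer_start = file_len.checked_sub(INDEX_FOOTER_SIZE)?;
    let index_start = footer_start.checked_sub(compressed_index_len)?;
    let header_start = index_start.checked_sub(INDEX_HEADER_SIZE)?;
    if header_start < FILE_HEADER_SIZE {
        return None;
    }
    Some(IndexLayout { header_start, index_start, footer_start })
}

/// Offset of the first block's column data: file header, then one block header.
fn first_block_payload_start() -> u64 {
    FILE_HEADER_SIZE + BLOCK_HEADER_SIZE
}
```

For a file holding one block with 1000 bytes of column data and a 50-byte compressed index, this places the index header at offset 1160, directly after the block.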
┌───────────────────┐
│    File Header    │ 64 bytes
├───────────────────┤
│   Block Header    │ 96 bytes
├───────────────────┤
│                   │
│   Block Records   │ Variable size
│                   │
├───────────────────┤
│        ...        │ More blocks
├───────────────────┤
│   Index Header    │ 24 bytes
├───────────────────┤
│ Compressed Index  │ Variable size
├───────────────────┤
│   Index Footer    │ 16 bytes
└───────────────────┘
§Block Format
On disk, each block's columns are individually ZSTD-compressed and stored contiguously, one after another.
The BlockHeader records the compressed size of each column, along with the information needed to determine their uncompressed sizes.
[BlockHeader][col1][col2][col3]...[BlockHeader][col1][col2][col3]...
The order of columns in the block is as follows:
- z_seq_len - sequence lengths
- z_header_len - header lengths (optional)
- z_npos - Elias-Fano encoded positions of N’s (optional)
- z_seq - sequence data (2-bit encoded)
- z_flags - flags (optional)
- z_headers - sequence headers (optional)
- z_qual - sequence quality scores (optional)
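Because the columns are stored back to back in a fixed order, the byte range of each column within a block follows from the compressed sizes alone. The sketch below illustrates this. The `COLUMN_ORDER` constant and `column_ranges` function are hypothetical, and it assumes (without confirmation from this page) that an absent optional column is recorded with a compressed size of zero.

```rust
use std::ops::Range;

/// Column names in on-disk order, as listed above.
const COLUMN_ORDER: [&str; 7] = [
    "z_seq_len",
    "z_header_len",
    "z_npos",
    "z_seq",
    "z_flags",
    "z_headers",
    "z_qual",
];

/// Compute the byte range of each present column, relative to the end of the
/// block header. Assumption: a compressed size of 0 means the column is absent.
fn column_ranges(compressed_sizes: [u64; 7]) -> Vec<(&'static str, Range<u64>)> {
    let mut offset = 0u64;
    let mut ranges = Vec::new();
    for (name, size) in COLUMN_ORDER.iter().zip(compressed_sizes) {
        if size > 0 {
            ranges.push((*name, offset..offset + size));
        }
        offset += size;
    }
    ranges
}
```

A reader that only needs sequences could use ranges like these to decompress `z_seq_len`, `z_npos`, and `z_seq` while skipping the bytes of `z_qual` and `z_headers` entirely.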
Structs§
- BlockHeader - A block header for a ColumnarBlock
- BlockRange - A struct representing a block range in a CBQ file and stored in the Index
- ColumnarBlock - A block of records where all data is stored in separate columns.
- ColumnarBlockWriter - Writer for CBQ files operating on generic writers (streaming).
- FileHeader - The file header for a CBQ file.
- FileHeaderBuilder - A convenience struct for building a FileHeader using a builder pattern.
- Index - An index of block ranges for quick lookups
- IndexFooter - The footer for a compressed index.
- IndexHeader - The header for a compressed index.
- MmapReader - A memory-mapped reader for CBQ files.
- Reader - A reader for CBQ files operating on generic readers (streaming).
- RefRecord - A reference to a record in a ColumnarBlock that implements the BinseqRecord trait
- RefRecordIter - A zero-copy iterator over RefRecords in a ColumnarBlock
Constants§
- BLOCK_MAGIC - The magic number for CBQ blocks.
- DEFAULT_BLOCK_SIZE - The default block size.
- DEFAULT_COMPRESSION_LEVEL - The default compression level.
- FILE_MAGIC - The magic number for CBQ files.
- FILE_VERSION - The current file version.
- INDEX_MAGIC - The magic number for CBQ index files.