Module cbq


CBQ - Columnar variable length records, optional quality scores and headers

§CBQ Format

CBQ is a high-performance binary format built around blocked columnar storage. It optimizes for storage efficiency and parallel processing of records.

§Overview

CBQ was built to smooth over the rough edges of VBQ. It keeps VBQ's blocked structure, but instead of interleaving the data of all records in a block, it stores each attribute in a separate column. Each column is then ZSTD compressed independently and can be decompressed selectively when reading.

It was built to be performant, efficient, and lossless by default.

This layout has a few advantages over VBQ:

  1. Better compression ratios for each individual attribute.
  2. Significantly faster read throughput (simpler decompression, and only the columns that are actually used need to be decompressed).
  3. Simple record parsing and manipulation.

Notably, this format only performs two-bit encoding of sequences, but it also tracks the positions of all ambiguous nucleotides (N) within each sequence. On decoding, the two-bit data is expanded back to nucleotides and the recorded N positions are backfilled with N.
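The round trip described above can be sketched as follows. This is an illustrative sketch only, not the crate's implementation: the function names `encode_2bit` and `decode_2bit`, the A/C/G/T bit assignment, the bit order within a byte, and the choice to store N as the placeholder code for A are all assumptions.

```rust
/// Pack a sequence into 2 bits per base and record where the Ns were.
/// Assumption: N is written as the placeholder code 0 (same as A) and
/// recovered later from the recorded positions.
fn encode_2bit(seq: &[u8]) -> Option<(Vec<u8>, Vec<usize>)> {
    let mut packed = vec![0u8; (seq.len() + 3) / 4];
    let mut n_positions = Vec::new();
    for (i, &base) in seq.iter().enumerate() {
        let code = match base {
            b'A' => 0u8,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            b'N' => {
                n_positions.push(i);
                0 // placeholder, overwritten with N on decode
            }
            _ => return None, // anything else is not representable here
        };
        // Four bases per byte, lowest bits first (an arbitrary choice).
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    Some((packed, n_positions))
}

/// Expand the 2-bit data back to nucleotides, then backfill the Ns.
fn decode_2bit(packed: &[u8], len: usize, n_positions: &[usize]) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    let mut seq: Vec<u8> = (0..len)
        .map(|i| BASES[((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as usize])
        .collect();
    for &pos in n_positions {
        seq[pos] = b'N';
    }
    seq
}
```

With this sketch, encoding `b"ACGTNNACGN"` records the N positions `[4, 5, 9]`, and decoding with those positions reproduces the original bytes exactly.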

To exploit the sparse-but-clustered nature of the N positions, they are stored with an Elias-Fano encoding, which keeps them compact while still allowing efficient retrieval.
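For readers unfamiliar with Elias-Fano, here is a minimal textbook-style sketch of the idea: each sorted position is split into low bits, stored verbatim, and high bits, stored in a unary-coded bit vector. The `EliasFano` struct, `ef_encode`, and `ef_decode` are hypothetical names for illustration; the on-disk layout CBQ actually uses for `z_npos` is not specified here and may differ.

```rust
/// Hypothetical in-memory Elias-Fano representation (not the CBQ layout).
struct EliasFano {
    len: usize,      // number of encoded positions
    low_bits: u32,   // low bits kept verbatim per position
    lows: Vec<u64>,  // packed low parts, `low_bits` bits each
    highs: Vec<u64>, // high parts: bit (high_i + i) is set for position i
}

fn set_bit(bits: &mut [u64], i: usize) {
    bits[i / 64] |= 1u64 << (i % 64);
}

fn get_bit(bits: &[u64], i: usize) -> bool {
    (bits[i / 64] >> (i % 64)) & 1 == 1
}

/// Encode ascending `positions`, each strictly less than `universe`.
fn ef_encode(positions: &[u64], universe: u64) -> EliasFano {
    let n = positions.len() as u64;
    // low_bits = floor(log2(universe / n)); 0 for empty or dense sets.
    let low_bits = if n == 0 || universe / n == 0 {
        0
    } else {
        63 - (universe / n).leading_zeros()
    };
    let w = low_bits as usize;
    let mut lows = vec![0u64; (positions.len() * w + 63) / 64];
    let high_len = positions.len() + (universe >> low_bits) as usize + 1;
    let mut highs = vec![0u64; (high_len + 63) / 64];
    for (i, &p) in positions.iter().enumerate() {
        for b in 0..w {
            if (p >> b) & 1 == 1 {
                set_bit(&mut lows, i * w + b);
            }
        }
        set_bit(&mut highs, (p >> low_bits) as usize + i);
    }
    EliasFano { len: positions.len(), low_bits, lows, highs }
}

/// Recover all positions by scanning the set bits of the high vector.
fn ef_decode(ef: &EliasFano) -> Vec<u64> {
    let w = ef.low_bits as usize;
    let mut out = Vec::with_capacity(ef.len);
    let mut bit = 0usize;
    for i in 0..ef.len {
        while !get_bit(&ef.highs, bit) {
            bit += 1;
        }
        let high = (bit - i) as u64; // the i-th set bit sits at high_i + i
        bit += 1;
        let mut low = 0u64;
        for b in 0..w {
            if get_bit(&ef.lows, i * w + b) {
                low |= 1u64 << b;
            }
        }
        out.push((high << ef.low_bits) | low);
    }
    out
}
```

The sketch decodes by a linear scan for simplicity. Practical implementations usually add select structures over the high bits so that individual positions can be retrieved without scanning.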

§File Structure

A CBQ file consists of a FileHeader, followed by record blocks and an embedded Index. Each record block is composed of a BlockHeader which provides metadata about the block, and a ColumnarBlock containing the actual data.

The IndexHeader and IndexFooter are used to locate and access data within the file when it is read through a memory map.

┌───────────────────┐
│    File Header    │ 64 bytes
├───────────────────┤
│   Block Header    │ 96 bytes
├───────────────────┤
│                   │
│   Block Records   │ Variable size
│                   │
├───────────────────┤
│       ...         │ More blocks
├───────────────────┤
│    Index Header   │ 24 bytes
├───────────────────┤
│ Compressed Index  │ Variable size
├───────────────────┤
│    Index Footer   │ 16 bytes
└───────────────────┘
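The fixed sizes in the diagram are enough to locate the trailing index structures from the file length alone. The sketch below shows that arithmetic. The constant and function names are hypothetical, and it assumes the compressed index length is known to the caller (for example, read from the footer), which the text above does not confirm.

```rust
// Sizes taken from the layout diagram; the names are illustrative only.
const FILE_HEADER_SIZE: u64 = 64;
const INDEX_HEADER_SIZE: u64 = 24;
const INDEX_FOOTER_SIZE: u64 = 16;

/// Offset of the first BlockHeader: directly after the file header.
fn first_block_offset() -> u64 {
    FILE_HEADER_SIZE
}

/// Offset of the IndexFooter, which occupies the last 16 bytes of the file.
/// Returns None if the file is too short to hold the fixed-size parts.
fn index_footer_offset(file_len: u64) -> Option<u64> {
    let min_len = FILE_HEADER_SIZE + INDEX_HEADER_SIZE + INDEX_FOOTER_SIZE;
    if file_len < min_len {
        None
    } else {
        Some(file_len - INDEX_FOOTER_SIZE)
    }
}

/// Offset of the IndexHeader, given the compressed index length.
/// Assumption: the caller already knows `compressed_index_len`.
fn index_header_offset(file_len: u64, compressed_index_len: u64) -> Option<u64> {
    index_footer_offset(file_len)?
        .checked_sub(compressed_index_len)?
        .checked_sub(INDEX_HEADER_SIZE)
        .filter(|&off| off >= FILE_HEADER_SIZE)
}
```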

§Block Format

Blocks are stored on disk as ZSTD compressed data. Each column is compressed separately, and the compressed columns are stored contiguously.

The BlockHeader contains the compressed sizes of each of the columns as well as the relevant information for their uncompressed sizes.

[BlockHeader][col1][col2][col3]...[BlockHeader][col1][col2][col3]...

The order of columns in the block is as follows:

  1. z_seq_len - sequence lengths
  2. z_header_len - header lengths (optional)
  3. z_npos - Elias-Fano encoded positions of N’s (optional)
  4. z_seq - sequence data (2-bit encoded)
  5. z_flags - flags (optional)
  6. z_headers - sequence headers (optional)
  7. z_qual - sequence quality scores (optional)
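Because the columns are contiguous and the BlockHeader records their compressed sizes, the byte range of each column follows from a running sum in the order listed above. The sketch below illustrates this. `column_ranges` is a hypothetical helper, and it assumes that an absent optional column is recorded with a compressed size of 0, which the text above does not state.

```rust
use std::ops::Range;

// Column order as listed above.
const COLUMN_NAMES: [&str; 7] = [
    "z_seq_len",
    "z_header_len",
    "z_npos",
    "z_seq",
    "z_flags",
    "z_headers",
    "z_qual",
];

/// Compute the byte range of each compressed column within the file.
/// `data_start` is the offset of the first byte after the BlockHeader.
/// Assumption: an absent optional column has a compressed size of 0.
fn column_ranges(
    data_start: u64,
    compressed_sizes: [u64; 7],
) -> Vec<(&'static str, Range<u64>)> {
    let mut ranges = Vec::with_capacity(COLUMN_NAMES.len());
    let mut offset = data_start;
    for (name, size) in COLUMN_NAMES.iter().zip(compressed_sizes) {
        ranges.push((*name, offset..offset + size));
        offset += size;
    }
    ranges
}
```

A reader that only needs sequences could then seek to the `z_seq_len`, `z_npos`, and `z_seq` ranges and skip the rest, which is the kind of selective decompression the columnar layout is meant to allow.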

Structs§

BlockHeader
A block header for a ColumnarBlock
BlockRange
A struct representing a block range in a CBQ file and stored in the Index
ColumnarBlock
A block of records where all data is stored in separate columns.
ColumnarBlockWriter
Writer for CBQ files operating on generic writers (streaming).
FileHeader
The file header for a CBQ file.
FileHeaderBuilder
A convenience struct for building a FileHeader using a builder pattern.
Index
An index of block ranges for quick lookups
IndexFooter
The footer for a compressed index.
IndexHeader
The header for a compressed index.
MmapReader
A memory-mapped reader for CBQ files.
Reader
A reader for CBQ files operating on generic readers (streaming).
RefRecord
A reference to a record in a ColumnarBlock that implements the BinseqRecord trait
RefRecordIter
A zero-copy iterator over RefRecords in a ColumnarBlock

Constants§

BLOCK_MAGIC
The magic number for CBQ blocks.
DEFAULT_BLOCK_SIZE
The default block size.
DEFAULT_COMPRESSION_LEVEL
The default compression level.
FILE_MAGIC
The magic number for CBQ files.
FILE_VERSION
The current file version.
INDEX_MAGIC
The magic number for CBQ index files.