
Module reference

Reference genome FASTA reading with all sequences loaded into memory.

This module provides thread-safe access to reference genome sequences, which is needed for tasks like NM/UQ/MD tag calculation and variant calling.

Following fgbio’s approach, the entire reference is loaded into memory at startup to ensure O(1) lookup performance for each read during tag regeneration.

Sequences are read as raw bytes located through the FAI index (htsjdk-style) rather than parsed line by line.
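To illustrate what FAI-based raw-byte reading involves, here is a minimal sketch, not this module's actual code. It assumes the standard five-column `.fai` layout from samtools faidx (name, length, offset, bases per line, bytes per line). The names `FaiEntry` and `read_sequence_bytes` are hypothetical, and the sketch assumes a non-zero `line_bases`.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// One record of a `.fai` index (hypothetical struct; fields follow the faidx format).
#[derive(Debug, Clone, Copy)]
pub struct FaiEntry {
    pub length: u64,     // number of bases in the sequence
    pub offset: u64,     // byte offset of the first base in the FASTA file
    pub line_bases: u64, // bases per line (assumed non-zero)
    pub line_width: u64, // bytes per line, including the line terminator
}

impl FaiEntry {
    /// Byte offset in the file of the 0-based base position `pos`.
    pub fn byte_offset(&self, pos: u64) -> u64 {
        self.offset + (pos / self.line_bases) * self.line_width + pos % self.line_bases
    }

    /// Bytes on disk from the first base through the last base, inclusive.
    pub fn byte_span(&self) -> u64 {
        if self.length == 0 {
            0
        } else {
            self.byte_offset(self.length - 1) + 1 - self.offset
        }
    }
}

/// Read a whole sequence with one seek and one read, then strip line terminators.
pub fn read_sequence_bytes<R: Read + Seek>(
    reader: &mut R,
    entry: &FaiEntry,
) -> io::Result<Vec<u8>> {
    reader.seek(SeekFrom::Start(entry.offset))?;
    let mut raw = vec![0u8; entry.byte_span() as usize];
    reader.read_exact(&mut raw)?;
    // Drop newline bytes in place; what remains is exactly `entry.length` bases.
    raw.retain(|b| *b != b'\n' && *b != b'\r');
    Ok(raw)
}
```

The point of this approach is that the index gives the exact byte range of each sequence, so the reader can fetch it in a single bulk read instead of scanning for line boundaries.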

§Memory Usage

Reference sequences are stored as raw bytes, matching htsjdk’s approach. For a typical human reference (~3 GB), this uses approximately 3 GB of memory in exchange for the fastest load times measured (~9 s vs ~22 s with compressed storage).

§Future improvement

The custom FAI-based raw-byte reading (read_sequence_raw) could be replaced with noodles’ built-in indexed reader once https://github.com/zaeleus/noodles/pull/365 is merged and released, which adds the same optimization to noodles.

Structs§

ReferenceReader
A thread-safe reference genome reader with all sequences preloaded into memory.
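One common way to get thread-safe access to preloaded sequences is to put them behind an `Arc` and never mutate them after loading. The sketch below shows that pattern under stated assumptions. `InMemoryReference`, `from_sequences`, and `fetch` are hypothetical names, not the API of `ReferenceReader`, and the 0-based half-open coordinates are an illustrative choice.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical in-memory reference: immutable after construction, cheap to clone.
#[derive(Clone)]
pub struct InMemoryReference {
    // Shared, read-only map from sequence name to raw bases.
    sequences: Arc<HashMap<String, Vec<u8>>>,
}

impl InMemoryReference {
    /// Build the reference from already-loaded (name, bases) pairs.
    pub fn from_sequences(seqs: impl IntoIterator<Item = (String, Vec<u8>)>) -> Self {
        Self {
            sequences: Arc::new(seqs.into_iter().collect()),
        }
    }

    /// Return bases in the 0-based, half-open range [start, end), or `None`
    /// if the sequence is unknown or the range is out of bounds.
    pub fn fetch(&self, name: &str, start: usize, end: usize) -> Option<&[u8]> {
        self.sequences.get(name)?.get(start..end)
    }
}
```

Because the data is never modified after construction, clones can be handed to worker threads without locks. Each lookup is a hash lookup followed by a slice of memory that is already resident.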

Functions§

find_dict_path
Find sequence dictionary path for a FASTA file.
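The lookup rules used by `find_dict_path` are not documented here. As a rough sketch of how such a search could work, the code below assumes two naming conventions seen in practice: replacing the FASTA extension with `.dict` (`ref.fa` to `ref.dict`) and appending `.dict` (`ref.fa` to `ref.fa.dict`). The helper names `candidate_dict_paths` and `first_existing_dict` are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Candidate dictionary paths for a FASTA file, in the order they would be tried.
/// The two conventions here are assumptions, not the documented behavior.
pub fn candidate_dict_paths(fasta: &Path) -> Vec<PathBuf> {
    // Convention 1: replace the final extension, e.g. ref.fa -> ref.dict
    let replaced = fasta.with_extension("dict");

    // Convention 2: append to the full file name, e.g. ref.fa -> ref.fa.dict
    let mut appended = fasta.as_os_str().to_owned();
    appended.push(".dict");

    vec![replaced, PathBuf::from(appended)]
}

/// Return the first candidate that exists on disk as a regular file, if any.
pub fn first_existing_dict(fasta: &Path) -> Option<PathBuf> {
    candidate_dict_paths(fasta).into_iter().find(|p| p.is_file())
}
```

Note that `with_extension` replaces only the last extension, so a compressed input such as `ref.fa.gz` would yield `ref.fa.dict` under the first convention.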