Reference genome FASTA reading with all sequences loaded into memory.
This module provides thread-safe access to reference genome sequences, which is needed for tasks like NM/UQ/MD tag calculation and variant calling.
Following fgbio’s approach, the entire reference is loaded into memory at startup to ensure O(1) lookup performance for each read during tag regeneration.
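The preloading strategy described above can be illustrated with a minimal sketch. It assumes a hypothetical `InMemoryReference` type (not this module's actual `ReferenceReader`) that maps contig names to raw bases and is shared across threads through an `Arc`; because the data is immutable after loading, no locking is needed.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Hypothetical in-memory store (an assumption for illustration):
/// contig name -> raw bases, filled once at startup.
struct InMemoryReference {
    sequences: HashMap<String, Vec<u8>>,
}

impl InMemoryReference {
    /// Returns the bases in the 0-based, half-open range [start, end),
    /// or None if the contig is unknown or the range is out of bounds.
    fn fetch(&self, name: &str, start: usize, end: usize) -> Option<&[u8]> {
        self.sequences.get(name)?.get(start..end)
    }
}

/// Spawns four worker threads that each look up a different window
/// of the same shared, read-only reference.
fn demo_shared_lookup() -> Vec<Vec<u8>> {
    let mut sequences = HashMap::new();
    sequences.insert("chr1".to_string(), b"ACGTACGTAC".to_vec());
    let reference = Arc::new(InMemoryReference { sequences });

    let handles: Vec<_> = (0..4usize)
        .map(|i| {
            let reference = Arc::clone(&reference);
            thread::spawn(move || reference.fetch("chr1", i, i + 4).map(|s| s.to_vec()))
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().unwrap().unwrap())
        .collect()
}
```

Each lookup is a hash-map access followed by a slice operation, so its cost does not depend on the position within the contig. Calling `demo_shared_lookup()` yields one four-base window per thread.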
It uses the FAI index to read each sequence as raw bytes (htsjdk-style) instead of parsing the FASTA line by line.
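As a rough sketch of how FAI-based raw-byte reading can work, the example below parses one `.fai` line (name, length, offset, bases per line, bytes per line), seeks to the sequence's first base, reads the whole byte span in one call, and strips the line terminators. The names `FaiRecord`, `parse_fai_line`, `base_offset`, and `read_sequence_bytes` are hypothetical and are not this module's API; the actual `read_sequence_raw` may differ.

```rust
use std::io::{Read, Seek, SeekFrom};

/// One FAI index entry (hypothetical type for illustration).
struct FaiRecord {
    length: u64,     // number of bases in the sequence
    offset: u64,     // byte offset of the first base in the FASTA file
    line_bases: u64, // bases per full line
    line_width: u64, // bytes per full line, including the line terminator
}

/// Parses a tab-separated FAI line: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
fn parse_fai_line(line: &str) -> Option<(String, FaiRecord)> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() < 5 {
        return None;
    }
    let num = |i: usize| fields[i].parse::<u64>().ok();
    let rec = FaiRecord {
        length: num(1)?,
        offset: num(2)?,
        line_bases: num(3)?,
        line_width: num(4)?,
    };
    // Reject entries that would cause division by zero or underflow below.
    if rec.line_bases == 0 || rec.line_width < rec.line_bases {
        return None;
    }
    Some((fields[0].to_string(), rec))
}

/// Byte offset in the file of the base at 0-based position `pos`.
fn base_offset(rec: &FaiRecord, pos: u64) -> u64 {
    rec.offset + (pos / rec.line_bases) * rec.line_width + pos % rec.line_bases
}

/// Reads the whole sequence with a single seek and read, then drops newlines.
fn read_sequence_bytes<R: Read + Seek>(
    reader: &mut R,
    rec: &FaiRecord,
) -> std::io::Result<Vec<u8>> {
    if rec.length == 0 {
        return Ok(Vec::new());
    }
    // Bytes per line terminator, times the number of line breaks inside the sequence.
    let terminator = rec.line_width - rec.line_bases;
    let span = rec.length + (rec.length - 1) / rec.line_bases * terminator;

    reader.seek(SeekFrom::Start(rec.offset))?;
    let mut raw = vec![0u8; span as usize];
    reader.read_exact(&mut raw)?;
    raw.retain(|&b| b != b'\n' && b != b'\r');
    Ok(raw)
}
```

Any `Read + Seek` source works here, for example a `std::fs::File` opened on the FASTA or a `std::io::Cursor` over an in-memory buffer for testing. The point of the design is that the index gives the exact byte span up front, so the reader avoids per-line buffering and string handling.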
§Memory Usage
Reference sequences are stored as raw, uncompressed bytes, matching htsjdk’s approach. For a typical human reference (~3 GB), this uses approximately 3 GB of memory, trading a larger footprint for faster load times (~9 s vs. ~22 s with compressed storage).
§Future improvement
The custom FAI-based raw-byte reading (read_sequence_raw) could be replaced with
noodles’ built-in indexed reader once https://github.com/zaeleus/noodles/pull/365
is merged and released, which adds the same optimization to noodles.
Structs§
- ReferenceReader - A thread-safe reference genome reader with all sequences preloaded into memory.
Functions§
- find_dict_path - Find sequence dictionary path for a FASTA file.