Module fasta

Indexed reference genome reader backed by a flat memory-mapped binary.

Thin wrapper around memmap2::Mmap that serves 0-based half-open sequences by UCSC-style chromosome name on the plus strand. All three conventions match crate::types::TranscriptModel so downstream consumers never need to translate between coordinate systems.

§On-disk format

The reader expects a pair of files produced by vareffect-cli setup:

  • GRCh38.bin: flat binary genome. Concatenated, newline-stripped, uppercased chromosome sequences. One byte per base, standard IUPAC nucleotide codes (A/C/G/T/N plus ambiguity codes R/Y/S/W/K/M/B/D/H/V). No headers, no line breaks, no padding between contigs. ~3.1 GB for GRCh38.p14 (primary + patches).
  • GRCh38.bin.idx: MessagePack-serialized GenomeBinIndex mapping each contig name to its (offset, length) in the .bin file. ~10 KB.

Use write_genome_binary to produce these files from raw contig data (used by vareffect-cli setup and by unit tests).
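
The layout described above can be illustrated with a minimal in-memory sketch. It is not the crate's implementation: SketchEntry, build_flat, and fetch are hypothetical names, a Vec<u8> stands in for the memory-mapped .bin file, and a HashMap stands in for the MessagePack index.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one index entry: where a contig starts in the
/// flat buffer and how many bases it holds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SketchEntry {
    offset: u64,
    length: u64,
}

/// Concatenate contigs into one flat buffer and record (offset, length) per
/// name. Newlines are stripped and bases uppercased, with no padding between
/// contigs, as the format description states.
fn build_flat(contigs: &[(&str, &[u8])]) -> (Vec<u8>, HashMap<String, SketchEntry>) {
    let mut bin = Vec::new();
    let mut index = HashMap::new();
    for (name, seq) in contigs {
        let offset = bin.len() as u64;
        bin.extend(
            seq.iter()
                .filter(|b| !matches!(**b, b'\n' | b'\r'))
                .map(|b| b.to_ascii_uppercase()),
        );
        let length = bin.len() as u64 - offset;
        index.insert(name.to_string(), SketchEntry { offset, length });
    }
    (bin, index)
}

/// Slice the 0-based half-open range `[start, end)` of a contig out of the
/// flat buffer. Returns `None` when the contig is unknown or the range falls
/// outside the contig.
fn fetch<'a>(
    bin: &'a [u8],
    index: &HashMap<String, SketchEntry>,
    name: &str,
    start: u64,
    end: u64,
) -> Option<&'a [u8]> {
    let entry = index.get(name)?;
    if start > end || end > entry.length {
        return None;
    }
    let lo = (entry.offset + start) as usize;
    let hi = (entry.offset + end) as usize;
    bin.get(lo..hi)
}
```

Because contigs are stored back to back, the bounds check against the contig's own length is what keeps a read from spilling into the next contig.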

§Coordinate convention

All coordinates exposed by FastaReader are 0-based half-open [start, end), identical to TranscriptModel::tx_start / tx_end. A 1-based VCF position maps to the 0-based index vcf_pos - 1, so the single base at VCF position p is the interval [p - 1, p).
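
As a small worked sketch of that conversion, the helper below turns a 1-based VCF position plus the REF allele length into a 0-based half-open interval. The function name vcf_to_half_open is hypothetical and not part of this module.

```rust
/// Convert a 1-based VCF position and REF allele length into a 0-based
/// half-open interval `[start, end)`.
///
/// Returns `None` for position 0 (treated as "not a base coordinate" in this
/// sketch) and on arithmetic overflow.
fn vcf_to_half_open(vcf_pos: u64, ref_len: u64) -> Option<(u64, u64)> {
    let start = vcf_pos.checked_sub(1)?;
    let end = start.checked_add(ref_len)?;
    Some((start, end))
}
```

For a single-base REF at VCF position 7674221 this yields (7674220, 7674221); a three-base REF at position 100 yields (99, 102).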

§Chromosome name translation

Callers always use UCSC-style names (chr1, chr17, chrX, chrY, chrM, chr9_KN196479v1_fix, …) to stay consistent with TranscriptModel::chrom. The on-disk binary may use any of three naming conventions (inherited from the source FASTA); the reader detects which one at open time by scanning the index entries:

  • NCBI RefSeq (NC_000001.11, NW_*, …) — produced by vareffect-cli setup from the NCBI GRCh38.p14 assembly. The reader translates primary chroms via crate::chrom::ucsc_to_refseq and patch contigs via an optional runtime alias CSV loaded through FastaReader::open_with_patch_aliases.
  • UCSC-prefixed (chr1, chrM, …) — pass-through translation.
  • Ensembl bare (1, MT, …) — the reader strips the chr prefix and maps chrM -> MT.

Patch contigs (chr9_KN196479v1_fix, chr22_KI270879v1_alt, …) can only be served against an NCBI-naming binary when a patch_chrom_aliases.csv is supplied via FastaReader::open_with_patch_aliases.
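
The detection and translation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the reader's actual logic: Naming, detect_naming, and translate are hypothetical names, the prefix heuristics are guesses at what "scanning the index entries" might look like, and the RefSeq branch is left unresolved because it needs the lookup table behind crate::chrom::ucsc_to_refseq and the optional alias CSV.

```rust
/// Hypothetical label for the naming convention found in the index.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Naming {
    RefSeq,
    Ucsc,
    EnsemblBare,
}

/// Guess the naming convention from the contig names in the index.
/// Assumption: the first name with a recognisable prefix decides the outcome.
fn detect_naming<'a>(names: impl IntoIterator<Item = &'a str>) -> Naming {
    for name in names {
        if name.starts_with("NC_") || name.starts_with("NW_") {
            return Naming::RefSeq;
        }
        if name.starts_with("chr") {
            return Naming::Ucsc;
        }
    }
    Naming::EnsemblBare
}

/// Translate a caller-supplied UCSC-style name into the on-disk name.
fn translate(ucsc: &str, naming: Naming) -> Option<String> {
    match naming {
        // UCSC-prefixed binaries use the caller's name unchanged.
        Naming::Ucsc => Some(ucsc.to_string()),
        // Ensembl bare names drop the `chr` prefix; the mitochondrion is `MT`.
        Naming::EnsemblBare => {
            let bare = ucsc.strip_prefix("chr")?;
            Some(if bare == "M" { "MT".to_string() } else { bare.to_string() })
        }
        // RefSeq accessions need a lookup table, which is outside this sketch.
        Naming::RefSeq => None,
    }
}
```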

§Thread safety

FastaReader is inherently Send + Sync — the underlying memmap2::Mmap derefs to &[u8] with no Mutex required. All threads can read from the same reader concurrently with zero contention. FastaReader::try_clone is retained for API compatibility but simply clones a handful of Arcs — it is no longer needed for parallel workloads.
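
The lock-free sharing pattern can be illustrated with a plain byte slice standing in for the memory map. This is a sketch only: count_base_parallel is a hypothetical helper, and it relies on std::thread::scope (Rust 1.63+) rather than on anything FastaReader provides.

```rust
use std::thread;

/// Count occurrences of `base` in each `[start, end)` window, one scoped
/// thread per window. Every thread reads the same immutable slice, so no
/// lock or clone is needed. Panics if a window is out of bounds.
fn count_base_parallel(genome: &[u8], windows: &[(usize, usize)], base: u8) -> Vec<usize> {
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = windows
            .iter()
            .map(|&(start, end)| {
                s.spawn(move || genome[start..end].iter().filter(|&&b| b == base).count())
            })
            .collect();
        // Join in window order so results line up with the input.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

The same shape should apply to a shared reader: pass a reference (or an Arc) to each worker and read without coordination.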

§Soft-masking

The flat binary stores uppercase IUPAC nucleotide codes. Soft-mask information (Ensembl lowercase = repeat region) is destroyed at build time. This matches VEP’s internal behavior (which uppercases all fetched bases) and GA4GH refget v2.0 (specifies uppercase IUPAC). FastaReader::fetch_sequence_raw is retained for API compatibility but returns the same uppercase bytes as FastaReader::fetch_sequence.
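
A build-time normalization of this kind might look like the sketch below. The names is_iupac_upper and normalize are hypothetical (the module's own is_iupac_nucleotide and write_genome_binary may behave differently), and the accepted alphabet is the one listed in the on-disk format section.

```rust
/// True for the uppercase codes listed in the on-disk format section:
/// A/C/G/T/N plus the ambiguity codes R/Y/S/W/K/M/B/D/H/V.
fn is_iupac_upper(b: u8) -> bool {
    matches!(
        b,
        b'A' | b'C' | b'G' | b'T' | b'N'
            | b'R' | b'Y' | b'S' | b'W' | b'K' | b'M'
            | b'B' | b'D' | b'H' | b'V'
    )
}

/// Uppercase every base, discarding soft-mask information, and reject any
/// byte that is not a listed IUPAC code. On failure, returns the index and
/// the original byte of the first offender.
fn normalize(seq: &[u8]) -> Result<Vec<u8>, (usize, u8)> {
    seq.iter()
        .enumerate()
        .map(|(i, &b)| {
            let up = b.to_ascii_uppercase();
            if is_iupac_upper(up) {
                Ok(up)
            } else {
                Err((i, b))
            }
        })
        .collect()
}
```

Once a sequence has passed through a step like this, a "raw" fetch and a regular fetch have nothing left to distinguish, which is consistent with both returning the same uppercase bytes.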

§Usage

use std::path::Path;
use vareffect::FastaReader;

// Open the flat binary genome produced by `vareffect-cli setup`.
let reader = FastaReader::open(Path::new("data/vareffect/GRCh38.bin"))?;

// TP53 c.742C>T lives at chr17:7674221 (1-based VCF) = chr17:7674220 (0-based).
// TP53 is on the minus strand, so the plus-strand reference base here is G.
let base = reader.fetch_base("chr17", 7674220)?;
assert_eq!(base, b'G');

// Patch-contig reads against an NCBI-source binary need the alias CSV.
let reader = FastaReader::open_with_patch_aliases(
    Path::new("data/vareffect/GRCh38.bin"),
    Some(Path::new("data/vareffect/patch_chrom_aliases.csv")),
)?;

Structs§

ContigEntry
One contig in the GenomeBinIndex.
FastaReader
Memory-mapped reference genome reader for random-access sequence retrieval.
GenomeBinIndex
Flat binary genome index.

Constants§

GENOME_BIN_INDEX_VERSION
Current format version for GenomeBinIndex. Increment on breaking changes to the on-disk layout. Public so vareffect-cli uses the same constant — there must be a single source of truth for reader/writer version agreement.

Functions§

is_iupac_nucleotide
Check whether a byte is a valid uppercase IUPAC nucleotide code.
write_genome_binary
Build a flat binary genome from raw contig sequences.