Indexed reference genome reader backed by a flat memory-mapped binary.
Thin wrapper around memmap2::Mmap that serves 0-based half-open
sequences by UCSC-style chromosome name on the plus strand. All
three conventions match crate::types::TranscriptModel so downstream
consumers never need to translate between coordinate systems.
§On-disk format
The reader expects a pair of files produced by vareffect-cli setup:
- GRCh38.bin: flat binary genome. Concatenated, newline-stripped, uppercased chromosome sequences. One byte per base, standard IUPAC nucleotide codes (A/C/G/T/N plus ambiguity codes R/Y/S/W/K/M/B/D/H/V). No headers, no line breaks, no padding between contigs. ~3.1 GB for GRCh38.p14 (primary + patches).
- GRCh38.bin.idx: MessagePack-serialized GenomeBinIndex mapping each contig name to its (offset, length) in the .bin file. ~10 KB.
Use write_genome_binary to produce these files from raw contig data
(used by vareffect-cli setup and by unit tests).
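The layout described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `ContigSpan`, `build_flat`, and `slice_contig` are hypothetical names, the index is a plain `HashMap` rather than a MessagePack-serialized `GenomeBinIndex`, and the input sequences are assumed to be already uppercased and newline-free.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one index entry: where a contig starts in the
/// flat binary and how many bases it holds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ContigSpan {
    offset: u64,
    length: u64,
}

/// Concatenate contigs back to back (no headers, no padding) and record the
/// (offset, length) of each one.
fn build_flat(contigs: &[(&str, &str)]) -> (Vec<u8>, HashMap<String, ContigSpan>) {
    let mut bin = Vec::new();
    let mut index = HashMap::new();
    for (name, seq) in contigs {
        let offset = bin.len() as u64;
        bin.extend_from_slice(seq.as_bytes());
        let length = bin.len() as u64 - offset;
        index.insert(name.to_string(), ContigSpan { offset, length });
    }
    (bin, index)
}

/// Return the bases of `name` in the 0-based half-open range [start, end),
/// or None if the contig is unknown or the range is out of bounds.
fn slice_contig<'a>(
    bin: &'a [u8],
    index: &HashMap<String, ContigSpan>,
    name: &str,
    start: u64,
    end: u64,
) -> Option<&'a [u8]> {
    let span = index.get(name)?;
    if start > end || end > span.length {
        return None;
    }
    let lo = (span.offset + start) as usize;
    let hi = (span.offset + end) as usize;
    bin.get(lo..hi)
}
```

Because every base occupies exactly one byte, a lookup reduces to adding the contig offset to the requested coordinates, which is what makes a memory-mapped flat file suitable for random access.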
§Coordinate convention
All coordinates exposed by FastaReader are 0-based half-open
[start, end) — identical to TranscriptModel::tx_start / tx_end.
To convert a 1-based VCF position to this convention, subtract one: vcf_pos - 1.
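As a small worked example of the conversion, the helpers below map a VCF position, and a VCF position plus REF allele length, onto 0-based half-open coordinates. Both function names are hypothetical and are not part of the crate's API.

```rust
/// Convert a 1-based VCF position to the 0-based index of the same base.
/// Returns None for position 0, which has no 0-based counterpart.
fn vcf_pos_to_zero_based(vcf_pos: u64) -> Option<u64> {
    vcf_pos.checked_sub(1)
}

/// Convert a VCF record (1-based POS plus REF allele length) to the 0-based
/// half-open interval [start, end) covering the REF allele.
fn vcf_ref_interval(vcf_pos: u64, ref_len: u64) -> Option<(u64, u64)> {
    let start = vcf_pos.checked_sub(1)?;
    let end = start.checked_add(ref_len)?;
    Some((start, end))
}
```

For a single-base REF at VCF position 7674221, this gives the interval [7674220, 7674221).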
§Chromosome name translation
Callers always use UCSC-style names (chr1, chr17, chrX, chrY,
chrM, chr9_KN196479v1_fix, …) to stay consistent with
TranscriptModel::chrom. The on-disk binary may use any of three naming
conventions (inherited from the source FASTA); the reader detects which
one at open time by scanning the index entries:
- NCBI RefSeq (NC_000001.11, NW_*, …): produced by vareffect-cli setup from the NCBI GRCh38.p14 assembly. The reader translates primary chroms via crate::chrom::ucsc_to_refseq and patch contigs via an optional runtime alias CSV loaded through FastaReader::open_with_patch_aliases.
- UCSC-prefixed (chr1, chrM, …): pass-through translation.
- Ensembl bare (1, MT, …): the reader strips the chr prefix and maps chrM -> MT.
Patch contigs (chr9_KN196479v1_fix, chr22_KI270879v1_alt, …) can
only be served against an NCBI-naming binary when a
patch_chrom_aliases.csv is supplied via
FastaReader::open_with_patch_aliases.
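The detection and translation steps might look roughly like the sketch below. It is an assumption-laden illustration: the `Naming` enum, `detect_naming`, and `ucsc_to_ensembl` are hypothetical names, the heuristics are deliberately simple, and the RefSeq direction is omitted because it needs an accession lookup table (the crate's crate::chrom::ucsc_to_refseq) rather than string manipulation.

```rust
/// Naming conventions a contig index might use (illustrative, not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Naming {
    RefSeq,
    Ucsc,
    Ensembl,
}

/// Guess the convention from the first index entry that matches a known pattern.
fn detect_naming<'a>(names: impl IntoIterator<Item = &'a str>) -> Option<Naming> {
    for name in names {
        if name.starts_with("NC_") {
            return Some(Naming::RefSeq);
        }
        if name.starts_with("chr") {
            return Some(Naming::Ucsc);
        }
        // Bare Ensembl names: a small integer, or X / Y / MT.
        if name.parse::<u8>().is_ok() || matches!(name, "X" | "Y" | "MT") {
            return Some(Naming::Ensembl);
        }
    }
    None
}

/// Translate a UCSC-style name to the bare Ensembl form:
/// strip the `chr` prefix and map `chrM` to `MT`.
fn ucsc_to_ensembl(ucsc: &str) -> Option<String> {
    let bare = ucsc.strip_prefix("chr")?;
    Some(if bare == "M" { "MT".to_string() } else { bare.to_string() })
}
```

Detecting the convention once at open time keeps the per-fetch cost to a single name translation instead of probing several spellings on every lookup.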
§Thread safety
FastaReader is inherently Send + Sync — the underlying
memmap2::Mmap derefs to &[u8] with no Mutex required. All threads
can read from the same reader concurrently with zero contention.
FastaReader::try_clone is retained for API compatibility but simply
clones a handful of Arcs — it is no longer needed for parallel
workloads.
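To illustrate why no lock is needed, the sketch below shares one read-only byte slice across scoped threads. A plain `&[u8]` stands in for the memory-mapped genome, and `count_base_parallel` is a hypothetical helper, not a crate function; it assumes Rust 1.63 or later for `std::thread::scope`.

```rust
use std::thread;

/// Count occurrences of `base` in each half-open window, one thread per window.
/// `genome` stands in for the memory-mapped bytes; a shared `&[u8]` is
/// `Send + Sync`, so read-only access needs no lock.
/// Panics if a window falls outside `genome`.
fn count_base_parallel(genome: &[u8], windows: &[(usize, usize)], base: u8) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = windows
            .iter()
            .map(|&(start, end)| {
                scope.spawn(move || genome[start..end].iter().filter(|&&b| b == base).count())
            })
            .collect();
        // Then collect results in window order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

The same pattern should apply to a shared reader reference: each thread borrows it immutably, and nothing has to be cloned per worker.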
§Soft-masking
The flat binary stores uppercase IUPAC nucleotide codes. Soft-mask
information (Ensembl lowercase = repeat region) is destroyed at build
time. This matches VEP’s internal behavior (which uppercases all fetched
bases) and GA4GH refget v2.0 (specifies uppercase IUPAC).
FastaReader::fetch_sequence_raw is retained for API compatibility but
returns the same uppercase bytes as FastaReader::fetch_sequence.
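The build-time normalization that discards soft-masking can be sketched as follows. This is an illustration under assumptions: `is_upper_iupac` is a local stand-in for the crate's is_iupac_nucleotide (whose exact signature is not shown on this page), and `normalize_sequence` is a hypothetical helper rather than part of write_genome_binary.

```rust
/// Local stand-in for an uppercase IUPAC nucleotide check.
fn is_upper_iupac(b: u8) -> bool {
    matches!(
        b,
        b'A' | b'C' | b'G' | b'T' | b'N'
            | b'R' | b'Y' | b'S' | b'W' | b'K' | b'M'
            | b'B' | b'D' | b'H' | b'V'
    )
}

/// Uppercase raw FASTA sequence bytes and drop line breaks.
/// Lowercase (soft-masked) bases become uppercase, so the mask is lost.
/// On the first byte that is not an IUPAC code, returns its index and value.
fn normalize_sequence(raw: &[u8]) -> Result<Vec<u8>, (usize, u8)> {
    let mut out = Vec::with_capacity(raw.len());
    for (i, &b) in raw.iter().enumerate() {
        if b == b'\n' || b == b'\r' {
            continue;
        }
        let up = b.to_ascii_uppercase();
        if !is_upper_iupac(up) {
            return Err((i, b));
        }
        out.push(up);
    }
    Ok(out)
}
```

After this step a soft-masked `acgt` and an unmasked `ACGT` are indistinguishable, which is why a raw fetch and a normal fetch return identical bytes.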
§Usage
use std::path::Path;
use vareffect::FastaReader;
// Open the flat binary genome produced by `vareffect-cli setup`.
let reader = FastaReader::open(Path::new("data/vareffect/GRCh38.bin"))?;
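// TP53 c.742C>T lives at chr17:7674221 (1-based VCF) = chr17:7674220 (0-based).
// TP53 is transcribed from the minus strand, so the plus-strand reference
// base returned here is G (the complement of the coding-strand C).
let base = reader.fetch_base("chr17", 7674220)?;
assert_eq!(base, b'G');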
// Patch-contig reads against an NCBI-source binary need the alias CSV.
let reader = FastaReader::open_with_patch_aliases(
    Path::new("data/vareffect/GRCh38.bin"),
    Some(Path::new("data/vareffect/patch_chrom_aliases.csv")),
)?;
Structs§
- ContigEntry: One contig in the GenomeBinIndex.
- FastaReader: Memory-mapped reference genome reader for random-access sequence retrieval.
- GenomeBinIndex: Flat binary genome index.
Constants§
- GENOME_BIN_INDEX_VERSION: Current format version for GenomeBinIndex. Increment on breaking changes to the on-disk layout. Public so vareffect-cli uses the same constant: there must be a single source of truth for reader/writer version agreement.
Functions§
- is_iupac_nucleotide: Check whether a byte is a valid uppercase IUPAC nucleotide code.
- write_genome_binary: Build a flat binary genome from raw contig sequences.