Crate vareffect


vareffect — Variant consequence prediction and HGVS notation, targeting near-100% concordance with Ensembl VEP (release 115/116).

Consumers point the store loaders at whatever transcript and genome files their build pipeline produces — vareffect ships no embedded reference data and has no runtime dependency on an orchestrator CLI.

§Transcript model store

An in-memory store of MANE transcript models indexed by genomic interval for O(log n + k) overlap queries. Each TranscriptModel carries per-exon CdsSegments with the GFF3 column-8 phase captured, so downstream codon walks and frameshift detection don’t have to re-derive phase from scratch.

use std::path::Path;
use vareffect::{Biotype, TranscriptStore};

let store = TranscriptStore::load_from_path(
    Path::new("data/vareffect/transcript_models.bin"),
)?;

// Overlap query: all transcripts whose tx_start..tx_end intersects the interval.
for (tx, _idx) in store.query_overlap("chr6", 33_409_450, 33_409_451) {
    println!(
        "{} ({}): cds [{:?}, {:?}), {} segments, biotype={:?}",
        tx.accession,
        tx.gene_symbol,
        tx.cds_genomic_start,
        tx.cds_genomic_end,
        tx.cds_segments.len(),
        tx.biotype,
    );

    // Walk CDS segments in transcript 5'→3' order (reversed for minus strand):
    for seg in &tx.cds_segments {
        println!(
            "  segment in exon[{}], phase {}: [{}, {})",
            seg.exon_index, seg.phase, seg.genomic_start, seg.genomic_end
        );
    }
}

// `biotype` is an enum with `Other(String)` for unknown upstream labels.
let total_protein_coding = store
    .transcripts()
    .iter()
    .filter(|t| matches!(t.biotype, Biotype::ProteinCoding))
    .count();
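
To make the stored phase concrete: in GFF3, phase is the number of bases to skip at the start of a CDS feature to reach the first base of the next complete codon. The sketch below recomputes that chain from segment lengths alone, which is the work the captured phase saves downstream code from repeating. `next_phase` and `derive_phases` are hypothetical helpers written for illustration; they are not part of the vareffect API.

```rust
/// Hypothetical helper (not part of vareffect): given one CDS segment's
/// length and GFF3 phase, return the phase of the segment that follows it
/// in transcript 5'->3' order.
fn next_phase(len: u64, phase: u8) -> u8 {
    // Bases left over after the last complete codon in this segment.
    let leftover = (len as i64 - phase as i64).rem_euclid(3);
    // The next segment must contribute (3 - leftover) bases to finish that
    // codon; when nothing is left over it starts on a codon boundary.
    ((3 - leftover) % 3) as u8
}

/// Hypothetical helper: derive the phase of every segment from lengths
/// alone, assuming the first segment starts on a codon boundary (phase 0).
fn derive_phases(lengths: &[u64]) -> Vec<u8> {
    let mut phases = Vec::with_capacity(lengths.len());
    let mut phase = 0u8;
    for &len in lengths {
        phases.push(phase);
        phase = next_phase(len, phase);
    }
    phases
}
```

With segment lengths 10, 5, and 7, the first segment ends one base into a codon, so the second segment has phase 2, and the third starts on a codon boundary again.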

§Reference genome reader

Memory-mapped random access to the reference genome via FastaReader. Pair it with TranscriptStore to extract codons, verify REF alleles, and walk downstream for frameshift termination. See the fasta module for the on-disk format, coordinate conventions, and chromosome-name handling.

The flat binary format stores uppercase IUPAC nucleotide codes, matching GA4GH refget v2.0 conventions. Most bases are A/C/G/T/N; the NCBI GRCh38.p14 assembly also uses ambiguity codes (M, R, Y, etc.) in some patch-scaffold regions. Soft-mask information is not preserved.
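
As an illustration of what "uppercase IUPAC, soft-mask not preserved" implies for whoever builds the binary, here is a minimal sketch of a normalization step. It is an assumption about how such a step could look, not vareffect's builder code; `normalize_base` and `normalize_sequence` are hypothetical names.

```rust
/// Hypothetical normalizer (not vareffect's builder code): uppercase one
/// FASTA byte, which discards soft-masking, and accept only IUPAC DNA codes.
fn normalize_base(b: u8) -> Option<u8> {
    let up = b.to_ascii_uppercase();
    match up {
        // Unambiguous bases plus N.
        b'A' | b'C' | b'G' | b'T' | b'N'
        // Two-base ambiguity codes.
        | b'R' | b'Y' | b'S' | b'W' | b'K' | b'M'
        // Three-base ambiguity codes.
        | b'B' | b'D' | b'H' | b'V' => Some(up),
        _ => None,
    }
}

/// Normalize a whole sequence; returns None if any byte is not an IUPAC
/// DNA code.
fn normalize_sequence(seq: &[u8]) -> Option<Vec<u8>> {
    seq.iter().map(|&b| normalize_base(b)).collect()
}
```

Because lowercase input is folded to uppercase before storage, a reader of the flat binary cannot tell which regions were soft-masked in the source FASTA.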

§Variant consequence assignment

VarEffect::annotate takes a variant’s position and alleles, locates it within every overlapping transcript, extracts the reference codon(s) from FASTA, translates ref and alt codons, and assigns SO consequence term(s) with VEP-concordant IMPACT ratings. The codon module provides the standard and mitochondrial genetic code translation tables.

use std::path::Path;
use vareffect::VarEffect;

let ve = VarEffect::open(
    Path::new("data/vareffect/transcript_models.bin"),
    Path::new("data/vareffect/GRCh38.bin"),
)?;

// Annotate TP53 c.743G>A (p.R248Q): plus-strand C>T at chr17, 0-based position 7,674,219.
let results = ve.annotate("chr17", 7_674_219, b"C", b"T")?;
for r in &results {
    for csq in &r.consequences {
        println!("{} ({})", csq.as_str(), r.impact);
    }
}

For lower-level building blocks (per-transcript annotation when you already hold a &TranscriptModel), see annotate_snv, annotate_deletion, and annotate_insertion.
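
The translate-and-compare step described in this section can be sketched in isolation. The block below is a simplified illustration using the standard genetic code (NCBI translation table 1); it ignores start codons, splice regions, and everything else a real annotator handles. `translate` and `classify_snv` are hypothetical names, not the signatures of vareffect's codon or consequence modules, and codons are assumed to be given on the coding strand.

```rust
/// NCBI translation table 1 (standard code), bases ordered T, C, A, G.
const STANDARD_CODE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

fn base_index(b: u8) -> Option<usize> {
    match b {
        b'T' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None, // ambiguity codes are not translated in this sketch
    }
}

/// Hypothetical helper: translate one coding-strand codon to a one-letter
/// amino acid (b'*' for stop).
fn translate(codon: [u8; 3]) -> Option<u8> {
    let i = base_index(codon[0])? * 16
        + base_index(codon[1])? * 4
        + base_index(codon[2])?;
    Some(STANDARD_CODE[i])
}

/// Hypothetical helper: substitute `alt` at `offset` (0..=2) in the
/// reference codon and name the resulting SO term.
fn classify_snv(ref_codon: [u8; 3], offset: usize, alt: u8) -> Option<&'static str> {
    let mut alt_codon = ref_codon;
    *alt_codon.get_mut(offset)? = alt;
    let ref_aa = translate(ref_codon)?;
    let alt_aa = translate(alt_codon)?;
    Some(match (ref_aa, alt_aa) {
        (b'*', b'*') => "stop_retained_variant",
        (r, a) if r == a => "synonymous_variant",
        (_, b'*') => "stop_gained",
        (b'*', _) => "stop_lost",
        _ => "missense_variant",
    })
}
```

For a minus-strand transcript, the genomic alleles would need to be reverse-complemented before being placed into the codon.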

§Coordinate convention

All coordinates in TranscriptModel are 0-based, half-open (BED/UCSC style). GFF3 input (1-based, fully-closed) is converted at build time by vareffect-cli. See transcript for the interval-tree indexing details.

cds_genomic_start / cds_genomic_end are the minimum and maximum genomic coordinates across all CDS segments; they are not transcript-relative. For a minus-strand gene, cds_genomic_start therefore marks the 3’ end of the coding sequence (the stop-codon end) in transcript order. Walk cds_segments (ordered 5’→3’ on the transcript) when you need the true coding walk.
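
The conversion between the two conventions is small but easy to get wrong by one. The following sketch shows the arithmetic and the overlap test that half-open intervals allow. It is illustrative only: `gff3_to_half_open` and `overlaps` are hypothetical helpers, not functions exported by vareffect or vareffect-cli.

```rust
/// Hypothetical helper: convert a GFF3 interval (1-based, fully closed) to
/// a 0-based, half-open interval. Only the start shifts; the end keeps its
/// numeric value because closed 1-based `end` equals exclusive 0-based `end`.
/// Returns None for start == 0 or start > end, which are invalid in GFF3.
fn gff3_to_half_open(start: u64, end: u64) -> Option<(u64, u64)> {
    if start == 0 || start > end {
        return None;
    }
    Some((start - 1, end))
}

/// Hypothetical helper: two half-open intervals overlap exactly when each
/// one starts before the other ends. Adjacent intervals do not overlap.
fn overlaps(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 < b.1 && b.0 < a.1
}
```

A single base at 1-based position 7,674,220 becomes the interval [7_674_219, 7_674_220), and the length of any half-open interval is simply `end - start`.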

§Thread safety

Both TranscriptStore and FastaReader are Send + Sync (proven by a compile-time assertion at the bottom of this file). TranscriptStore is lock-free for reads. FastaReader is backed by a memory-mapped &[u8] — inherently Send + Sync with zero contention. All threads can read from the same FastaReader concurrently without cloning.
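
A compile-time Send + Sync check is a common Rust idiom and can be sketched as follows. This is not the assertion that ships in vareffect; `ReadOnlyBytes` is a hypothetical stand-in for a reader over immutable bytes, and `count_n_in_parallel` only demonstrates sharing one instance across threads by reference.

```rust
/// Compiles only if `T` is both `Send` and `Sync`.
const fn assert_send_sync<T: Send + Sync>() {}

/// Hypothetical stand-in for a read-only reader (not vareffect's FastaReader).
struct ReadOnlyBytes {
    data: Vec<u8>,
}

// Evaluated at compile time: the build fails if the bound is not met.
const _: () = assert_send_sync::<ReadOnlyBytes>();

/// Count `N` bases using several threads that all borrow the same reader.
/// No clone and no lock are needed because the data is never mutated.
fn count_n_in_parallel(reader: &ReadOnlyBytes, workers: usize) -> usize {
    let workers = workers.max(1);
    // Ceiling division, clamped to 1 because `chunks(0)` panics.
    let chunk = ((reader.data.len() + workers - 1) / workers).max(1);
    std::thread::scope(|s| {
        let handles: Vec<_> = reader
            .data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().filter(|&&b| b == b'N').count()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Scoped threads let every worker borrow the reader directly, which mirrors the "no cloning" property described above.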

§Re-exports

pub use consequence::Consequence;
pub use consequence::ConsequenceResult;
pub use consequence::Impact;
pub use consequence::annotate_deletion;
pub use consequence::annotate_insertion;
pub use consequence::annotate_snv;
pub use error::VarEffectError;
pub use fasta::FastaReader;
pub use hgvs_reverse::GenomicVariant;
pub use locate::IndelLocation;
pub use locate::IndelRegion;
pub use locate::LocateIndex;
pub use locate::SpliceOverlapDetail;
pub use locate::SpliceSide;
pub use locate::VariantLocation;
pub use locate::locate_indel;
pub use locate::locate_variant;
pub use transcript::TranscriptStore;
pub use types::Biotype;
pub use types::CdsSegment;
pub use types::Exon;
pub use types::Strand;
pub use types::TranscriptModel;
pub use types::TranscriptTier;

§Modules

chrom
Chromosome name conversion.
codon
Codon translation, DNA complement, and VEP-style display formatting.
consequence
Variant consequence assignment – determines SO consequence terms for SNVs, simple indels, boundary-spanning deletions, complex delins, and MNVs against transcript models.
error
Error type for vareffect.
fasta
Indexed reference genome reader backed by a flat memory-mapped binary.
hgvs_c
HGVS coding DNA notation (c. / n.) generation.
hgvs_p
HGVS protein notation (p.) for variant consequences.
hgvs_reverse
HGVS c. reverse mapper: parse HGVS c. notation and resolve to genomic VCF-style coordinates.
locate
Variant locator: classify where a genomic position falls within a transcript.
transcript
In-memory transcript store with per-chromosome interval tree indexing.
types
Runtime types for the transcript model store.

§Structs

VarEffect
Stateful entrypoint to vareffect: bundles a TranscriptStore and a FastaReader so callers don’t have to thread both handles through every annotation call.