vareffect 0.1.2

Variant consequence prediction and HGVS notation, concordant with Ensembl VEP.
Documentation

vareffect

Rust variant consequence prediction and HGVS notation, concordant with Ensembl VEP.

crates.io docs.rs CI status

vareffect takes a variant (chromosome, position, reference allele, alternate allele) and tells you what it does to every transcript that overlaps it: which protein residues change, whether it disrupts a splice site, whether it introduces a premature termination codon and is likely to trigger nonsense-mediated decay, and what the Sequence Ontology consequence terms are — the same questions VEP answers, returned as strongly typed Rust values instead of TSV.

The crate is deliberately small. It does one thing — assign SO consequences and HGVS notation — and leaves regulatory layers, frequency lookups, and plugin ecosystems to downstream code. If you need a fast, embeddable consequence predictor inside a Rust pipeline, this is for you. If you need the full VEP experience including gnomAD enrichment and custom plugins, use VEP.

Quick start

Add the library to your project and install the companion CLI that provisions the reference data:

# Cargo.toml
[dependencies]
vareffect = "0.1.2"
# One-time data setup: downloads GRCh38, builds the transcript store,
# and writes everything to `data/vareffect/`. Takes ~10 minutes and
# ~3 GB of disk, then you never have to touch it again.
cargo install vareffect-cli
vareffect init
vareffect setup

Then annotate a variant against every overlapping transcript:

use std::path::Path;
use vareffect::{Consequence, VarEffect};

# fn main() -> Result<(), vareffect::VarEffectError> {
let ve = VarEffect::open(
    Path::new("data/vareffect/transcript_models.bin"),
    Path::new("data/vareffect/GRCh38.bin"),
)?;

// TP53 c.742C>T (p.Arg248Trp) — a well-known hotspot missense variant.
// chr17:7674220 is 0-based (BED / UCSC style).
let results = ve.annotate("chr17", 7_674_220, b"G", b"A")?;

for result in &results {
    println!(
        "{} {}: {} ({})",
        result.transcript,
        result.hgvs_p.as_deref().unwrap_or("-"),
        result
            .consequences
            .iter()
            .map(Consequence::as_str)
            .collect::<Vec<_>>()
            .join(","),
        result.impact,
    );
}
# Ok(())
# }

Expected output (one line per overlapping transcript):

NM_000546.6 p.Arg248Trp: missense_variant (MODERATE)

For sharing across threads, wrap the loaded VarEffect in a std::sync::Arc and clone the Arc into each worker — the underlying stores are Send + Sync with zero interior mutability.

Performance

Single-threaded throughput, measured on a modern x86_64 laptop. All numbers exclude startup cost (genome load, transcript store parse).

Tool Language Variants / sec (1 thread)
VEP (--cache) Perl ~200 – 500
SnpEff Java ~2,000 – 10,000
Nirvana C# ~5,000 – 15,000
bcftools csq C ~10,000 – 50,000
vareffect Rust ~50,000 – 200,000

The gap is almost entirely I/O: VEP reads BGZF-compressed FASTA through block decompression per base; vareffect memory-maps a flat uppercase binary and reads bytes directly.

Features

Variant consequence prediction

  • 24 Sequence Ontology termsmissense_variant, synonymous_variant, stop_gained, stop_lost, start_lost, start_retained_variant, stop_retained_variant, frameshift_variant, inframe_insertion, inframe_deletion, splice_donor_variant, splice_acceptor_variant, splice_region_variant, incomplete_terminal_codon_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, intron_variant, non_coding_transcript_exon_variant, upstream_gene_variant, downstream_gene_variant, intergenic_variant, coding_sequence_variant, transcript_ablation, protein_altering_variant.
  • IMPACT rankingHIGH / MODERATE / LOW / MODIFIER, matching VEP's severity scale. Consequence derives Ord for sorting.
  • Multi-transcript annotation — one ConsequenceResult per overlapping transcript; the caller picks a canonical isoform if desired (every result carries MANE Select / MANE Plus Clinical / RefSeq Select tier metadata).
  • NMD prediction — the 50-nucleotide rule applied to truncating variants (stop_gained, frameshift_variant).
  • Standard and mitochondrial genetic codeschrM variants translate with NCBI table 2 automatically.

HGVS nomenclature

  • Forward c. / n. — substitutions, deletions, insertions, duplications, delins, intronic offsets (c.672+1), 5'/3' UTR offsets (c.-15, c.*42), combined forms (c.-15+1).
  • Forward p. — missense, synonymous, stop gain / loss, start loss (p.Met1?), frameshift with extension walk (p.Glu23ValfsTer17), incomplete terminal codon, stop extension (p.Ter394CysextTer9). Matches VEP's default capitalisation and three-letter code convention.
  • 3' normalization — shift indels to the most 3' equivalent position on the coding strand for HGVS notation (matches VEP's --shift_hgvs).
  • Reverse c. — parse an HGVS c. string back to plus-strand 0-based genomic coordinates and alleles, round-tripped through the transcript store and verified against the genome.

Variant localization

  • Classify variants as CDS exon, intron, 5' / 3' UTR, splice donor / acceptor, splice region, upstream / downstream, intergenic.
  • Multi-exon indel handling with exon-boundary-spanning logic.
  • Reference-allele verification against the loaded genome before annotation, so VCFs built against the wrong build fail fast instead of silently producing wrong calls.

Transcript models

  • MANE Select, MANE Plus Clinical, and RefSeq Select tiers — your build pipeline decides which to ingest.
  • Full exon and CDS segment layout with GFF3 phase preserved, so codon walks across exon boundaries and frameshift detection are O(1) lookups.
  • Interval-tree indexed per chromosome for O(log n + k) overlap queries.
  • Strand-aware: correct reverse-complement handling on minus-strand genes.

Runtime characteristics

  • Memory-mapped reference genome — base fetches are a pointer dereference (~5 ns), not a BGZF block decompression.
  • Thread-safeVarEffect, TranscriptStore, and FastaReader are all Send + Sync with zero interior mutability. A compile-time assertion in lib.rs guarantees this for every release.
  • Zero external runtime — no network, no database, no background workers. Load two files at startup, share an Arc<VarEffect> across your worker pool, call .annotate(...).

What vareffect does not do

  • No regulatory / TFBS / motif annotation layer.
  • No co-located variant lookup (no gnomAD, ClinVar, dbSNP).
  • No allele frequency or population annotation.
  • No canonical transcript selection — every overlapping transcript is returned with its tier metadata; callers decide which to keep.
  • No multi-allele VCF splitting — the caller must split comma-separated ALTs before invoking annotate.
  • No plugin system.
  • No alternate genome builds out of the box — GRCh37 or CHM13 require regenerating the transcript and genome binaries with your own build.

See VEP_DIVERGENCES.md for the complete list of intentional divergences from VEP and features that are not yet implemented.

Setting up the data files

vareffect does not ship reference data. You provide two files at runtime:

  1. transcript_models.bin — a MessagePack-serialised Vec<TranscriptModel> built from a MANE / RefSeq Select GFF3.
  2. GRCh38.bin (or whatever build you use) — a flat uppercase-IUPAC binary plus a .bin.idx MessagePack sidecar produced by the builder.

Recommended: use vareffect-cli

The companion crate vareffect-cli ships a vareffect binary that handles the entire provisioning flow in one command. It downloads the GRCh38 reference FASTA from NCBI, writes the flat-binary genome + index, fetches the latest MANE release GFF3 + summary, builds the transcript store, and writes NCBI patch-contig aliases — all under data/vareffect/.

cargo install vareffect-cli

# Scaffold a config file (one-time):
vareffect init

# Full provisioning (GRCh38 genome + MANE transcript models).
vareffect setup

# Validate everything is in place:
vareffect check

# Write runtime files to a custom directory instead of the config default:
vareffect setup --output /data/genomes/vareffect

# Only the reference genome:
vareffect setup --fasta-only

# Only rebuild the transcript model store (reuses an existing genome):
vareffect setup --models-only

setup is idempotent — source archives are cached in data/raw/, the genome binary is skipped if it already exists, and transcript models are rebuilt on every run so a new MANE release picks up automatically.

After vareffect setup finishes, you have the layout the library expects:

data/vareffect/
  GRCh38.bin               # flat-binary reference genome
  GRCh38.bin.idx           # MessagePack contig index
  transcript_models.bin    # serialised Vec<TranscriptModel>
  patch_chrom_aliases.csv  # UCSC <-> RefSeq patch-contig map

Alternative: roll your own

If you're building a custom store (a different transcript source, a non-human genome, a subset of the human transcriptome), the [fasta::write_genome_binary] function is public so you can generate the flat binary yourself:

use std::path::Path;
use vareffect::fasta::write_genome_binary;

# fn main() -> Result<(), vareffect::VarEffectError> {
// Uppercase ASCII bytes, one contig per tuple.
let chr_toy: &[u8] = b"ACGTACGTNNN";
let contigs: &[(&str, &[u8])] = &[("chrToy", chr_toy)];

write_genome_binary(
    contigs,
    "toy",                               // build label, stored in the index
    Path::new("out/toy.bin"),            // flat binary
    Path::new("out/toy.bin.idx"),        // MessagePack index sidecar
)?;
# Ok(())
# }

Building the transcript store by hand means producing a Vec<vareffect::TranscriptModel> and MessagePack-serialising it with rmp-serde; see crates/vareffect-cli/src/builders/ in the source tree for a worked example.

HGVS reverse resolution

Take an HGVS c. string, resolve it to plus-strand genomic coordinates, and feed the result straight back into annotate — useful when your input is transcript-relative rather than coordinate-based:

# use vareffect::VarEffect;
# fn example(ve: &VarEffect) -> Result<(), vareffect::VarEffectError> {
let gv = ve.resolve_hgvs_c("NM_000546.6:c.742C>T")?;

let results = ve.annotate(&gv.chrom, gv.pos, &gv.ref_allele, &gv.alt_allele)?;
# let _ = results;
# Ok(())
# }

resolve_hgvs_c supports substitutions, deletions, insertions, duplications, and delins across CDS, 5' UTR, 3' UTR, and intronic positions.

Threading model

VarEffect, TranscriptStore, and FastaReader are all Send + Sync — construct one VarEffect at startup, wrap it in std::sync::Arc, and share it across every worker thread or async task. There is no interior mutability, no contention, and no hidden cloning cost; every thread reads from the same memory-mapped bytes.

Coordinate conventions

  • 0-based, half-open (BED / UCSC) for every coordinate in the public API. GFF3's 1-based-fully-closed input is converted at build time.
  • UCSC chromosome names (chr17, chrM). The FastaReader has an alias table so it can transparently accept "17", "NC_000017.11", etc. Patch-contig lookups additionally need open_with_patch_aliases and a refseq,ucsc alias CSV (written automatically by vareffect setup).
  • Uppercase ASCII allele bytes — no case coercion at call time, so lowercase input is a bug in your code, not something we silently fix.

VEP feature parity

vareffect targets near-100% concordance with VEP release 115 / 116 on the core coding and transcript consequence layer. Every intentional divergence, not-yet-implemented feature, and out-of-scope design decision is catalogued in VEP_DIVERGENCES.md. The validation suite lives in tests/vep_concordance_*.rs.

Testing

The unit and fast integration tests run with no external data:

cargo test -p vareffect

The VEP concordance tests are #[ignore]-gated because they need the transcript store and reference genome on disk (run vareffect setup first if you haven't already):

FASTA_PATH="$(pwd)/data/vareffect/GRCh38.bin" CONCORDANCE_THREADS=1 \
    cargo test -p vareffect --release -- --ignored vep_concordance

License

Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)

Contributing

Contributions are welcome. The most valuable areas for outside help are:

  • Adding SO terms currently listed as not-yet-implemented in VEP_DIVERGENCES.md (splice polypyrimidine-tract, donor region, donor 5th base).
  • Broadening the VEP concordance corpus in tests/vep_concordance_*.rs with variants that stress a part of the pipeline that isn't already covered.
  • Alternate genome build support (GRCh37, CHM13) in the vareffect-cli provisioning flow.

Open an issue before starting on anything larger than a bug fix so we can agree on scope.