vareffect
Rust variant consequence prediction and HGVS notation, concordant with Ensembl VEP.
vareffect takes a variant (chromosome, position, reference allele,
alternate allele) and tells you what it does to every transcript that
overlaps it: which protein residues change, whether it disrupts a splice
site, whether it introduces a premature termination codon and is likely to
trigger nonsense-mediated decay, and what the Sequence Ontology consequence
terms are — the same questions VEP answers, returned as strongly typed Rust
values instead of TSV.
The crate is deliberately small. It does one thing — assign SO consequences and HGVS notation — and leaves regulatory layers, frequency lookups, and plugin ecosystems to downstream code. If you need a fast, embeddable consequence predictor inside a Rust pipeline, this is for you. If you need the full VEP experience including gnomAD enrichment and custom plugins, use VEP.
Quick start
Add the library to your project and install the companion CLI that provisions the reference data:
# Cargo.toml
[]
= "0.1.2"
# One-time data setup: downloads GRCh38, builds the transcript store,
# and writes everything to `data/vareffect/`. Takes ~10 minutes and
# ~3 GB of disk, then you never have to touch it again.
Then annotate a variant against every overlapping transcript:
use Path;
use ;
#
Expected output (one line per overlapping transcript):
NM_000546.6 p.Arg248Trp: missense_variant (MODERATE)
For sharing across threads, wrap the loaded VarEffect in a
std::sync::Arc and clone the Arc into each worker — the underlying
stores are Send + Sync with zero interior mutability.
Performance
Single-threaded throughput, measured on a modern x86_64 laptop. All numbers exclude startup cost (genome load, transcript store parse).
| Tool | Language | Variants / sec (1 thread) |
|---|---|---|
VEP (--cache) |
Perl | ~200 – 500 |
| SnpEff | Java | ~2,000 – 10,000 |
| Nirvana | C# | ~5,000 – 15,000 |
| bcftools csq | C | ~10,000 – 50,000 |
| vareffect | Rust | ~50,000 – 200,000 |
The gap is almost entirely I/O: VEP reads BGZF-compressed FASTA through
block decompression per base; vareffect memory-maps a flat uppercase
binary and reads bytes directly.
Features
Variant consequence prediction
- 24 Sequence Ontology terms —
missense_variant,synonymous_variant,stop_gained,stop_lost,start_lost,start_retained_variant,stop_retained_variant,frameshift_variant,inframe_insertion,inframe_deletion,splice_donor_variant,splice_acceptor_variant,splice_region_variant,incomplete_terminal_codon_variant,5_prime_UTR_variant,3_prime_UTR_variant,intron_variant,non_coding_transcript_exon_variant,upstream_gene_variant,downstream_gene_variant,intergenic_variant,coding_sequence_variant,transcript_ablation,protein_altering_variant. - IMPACT ranking —
HIGH/MODERATE/LOW/MODIFIER, matching VEP's severity scale.ConsequencederivesOrdfor sorting. - Multi-transcript annotation — one
ConsequenceResultper overlapping transcript; the caller picks a canonical isoform if desired (every result carries MANE Select / MANE Plus Clinical / RefSeq Select tier metadata). - NMD prediction — the 50-nucleotide rule applied to truncating variants
(
stop_gained,frameshift_variant). - Standard and mitochondrial genetic codes —
chrMvariants translate with NCBI table 2 automatically.
HGVS nomenclature
- Forward c. / n. — substitutions, deletions, insertions, duplications,
delins, intronic offsets (
c.672+1), 5'/3' UTR offsets (c.-15,c.*42), combined forms (c.-15+1). - Forward p. — missense, synonymous, stop gain / loss, start loss
(
p.Met1?), frameshift with extension walk (p.Glu23ValfsTer17), incomplete terminal codon, stop extension (p.Ter394CysextTer9). Matches VEP's default capitalisation and three-letter code convention. - 3' normalization — shift indels to the most 3' equivalent position on
the coding strand for HGVS notation (matches VEP's
--shift_hgvs). - Reverse c. — parse an HGVS c. string back to plus-strand 0-based genomic coordinates and alleles, round-tripped through the transcript store and verified against the genome.
Variant localization
- Classify variants as CDS exon, intron, 5' / 3' UTR, splice donor / acceptor, splice region, upstream / downstream, intergenic.
- Multi-exon indel handling with exon-boundary-spanning logic.
- Reference-allele verification against the loaded genome before annotation, so VCFs built against the wrong build fail fast instead of silently producing wrong calls.
Transcript models
- MANE Select, MANE Plus Clinical, and RefSeq Select tiers — your build pipeline decides which to ingest.
- Full exon and CDS segment layout with GFF3 phase preserved, so codon walks across exon boundaries and frameshift detection are O(1) lookups.
- Interval-tree indexed per chromosome for O(log n + k) overlap queries.
- Strand-aware: correct reverse-complement handling on minus-strand genes.
Runtime characteristics
- Memory-mapped reference genome — base fetches are a pointer dereference (~5 ns), not a BGZF block decompression.
- Thread-safe —
VarEffect,TranscriptStore, andFastaReaderare allSend + Syncwith zero interior mutability. A compile-time assertion inlib.rsguarantees this for every release. - Zero external runtime — no network, no database, no background
workers. Load two files at startup, share an
Arc<VarEffect>across your worker pool, call.annotate(...).
What vareffect does not do
- No regulatory / TFBS / motif annotation layer.
- No co-located variant lookup (no gnomAD, ClinVar, dbSNP).
- No allele frequency or population annotation.
- No canonical transcript selection — every overlapping transcript is returned with its tier metadata; callers decide which to keep.
- No multi-allele VCF splitting — the caller must split comma-separated
ALTs before invokingannotate. - No plugin system.
- No alternate genome builds out of the box — GRCh37 or CHM13 require regenerating the transcript and genome binaries with your own build.
See VEP_DIVERGENCES.md for the complete list of
intentional divergences from VEP and features that are not yet implemented.
Setting up the data files
vareffect does not ship reference data. You provide two files at runtime:
transcript_models.bin— a MessagePack-serialisedVec<TranscriptModel>built from a MANE / RefSeq Select GFF3.GRCh38.bin(or whatever build you use) — a flat uppercase-IUPAC binary plus a.bin.idxMessagePack sidecar produced by the builder.
Recommended: use vareffect-cli
The companion crate vareffect-cli ships a vareffect binary that handles
the entire provisioning flow in one command. It downloads the GRCh38
reference FASTA from NCBI, writes the flat-binary genome + index, fetches
the latest MANE release GFF3 + summary, builds the transcript store, and
writes NCBI patch-contig aliases — all under data/vareffect/.
# Scaffold a config file (one-time):
# Full provisioning (GRCh38 genome + MANE transcript models).
# Validate everything is in place:
# Write runtime files to a custom directory instead of the config default:
# Only the reference genome:
# Only rebuild the transcript model store (reuses an existing genome):
setup is idempotent — source archives are cached in data/raw/, the
genome binary is skipped if it already exists, and transcript models are
rebuilt on every run so a new MANE release picks up automatically.
After vareffect setup finishes, you have the layout the library expects:
data/vareffect/
GRCh38.bin # flat-binary reference genome
GRCh38.bin.idx # MessagePack contig index
transcript_models.bin # serialised Vec<TranscriptModel>
patch_chrom_aliases.csv # UCSC <-> RefSeq patch-contig map
Alternative: roll your own
If you're building a custom store (a different transcript source, a
non-human genome, a subset of the human transcriptome), the
[fasta::write_genome_binary] function is public so you can generate the
flat binary yourself:
use Path;
use write_genome_binary;
#
Building the transcript store by hand means producing a
Vec<vareffect::TranscriptModel> and MessagePack-serialising it with
rmp-serde; see crates/vareffect-cli/src/builders/ in the source tree
for a worked example.
HGVS reverse resolution
Take an HGVS c. string, resolve it to plus-strand genomic coordinates, and
feed the result straight back into annotate — useful when your input is
transcript-relative rather than coordinate-based:
# use VarEffect;
#
resolve_hgvs_c supports substitutions, deletions, insertions,
duplications, and delins across CDS, 5' UTR, 3' UTR, and intronic
positions.
Threading model
VarEffect, TranscriptStore, and FastaReader are all Send + Sync —
construct one VarEffect at startup, wrap it in std::sync::Arc, and
share it across every worker thread or async task. There is no interior
mutability, no contention, and no hidden cloning cost; every thread reads
from the same memory-mapped bytes.
Coordinate conventions
- 0-based, half-open (BED / UCSC) for every coordinate in the public API. GFF3's 1-based-fully-closed input is converted at build time.
- UCSC chromosome names (
chr17,chrM). TheFastaReaderhas an alias table so it can transparently accept"17","NC_000017.11", etc. Patch-contig lookups additionally needopen_with_patch_aliasesand arefseq,ucscalias CSV (written automatically byvareffect setup). - Uppercase ASCII allele bytes — no case coercion at call time, so lowercase input is a bug in your code, not something we silently fix.
VEP feature parity
vareffect targets near-100% concordance with VEP release 115 / 116 on the
core coding and transcript consequence layer. Every intentional divergence,
not-yet-implemented feature, and out-of-scope design decision is catalogued
in VEP_DIVERGENCES.md. The validation suite lives
in tests/vep_concordance_*.rs.
Testing
The unit and fast integration tests run with no external data:
The VEP concordance tests are #[ignore]-gated because they need the
transcript store and reference genome on disk (run vareffect setup first
if you haven't already):
FASTA_PATH="/data/vareffect/GRCh38.bin" CONCORDANCE_THREADS=1 \
License
Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
Contributing
Contributions are welcome. The most valuable areas for outside help are:
- Adding SO terms currently listed as not-yet-implemented in
VEP_DIVERGENCES.md(splice polypyrimidine-tract, donor region, donor 5th base). - Broadening the VEP concordance corpus in
tests/vep_concordance_*.rswith variants that stress a part of the pipeline that isn't already covered. - Alternate genome build support (GRCh37, CHM13) in the
vareffect-cliprovisioning flow.
Open an issue before starting on anything larger than a bug fix so we can agree on scope.