§vareffect
Rust variant consequence prediction and HGVS notation, concordant with Ensembl VEP.
vareffect takes a variant (chromosome, position, reference allele,
alternate allele) and tells you what it does to every transcript that
overlaps it: which protein residues change, whether it disrupts a splice
site, whether it introduces a premature termination codon and is likely to
trigger nonsense-mediated decay, and what the Sequence Ontology consequence
terms are — the same questions VEP answers, returned as strongly typed Rust
values instead of TSV.
The crate is deliberately small. It does one thing — assign SO consequences and HGVS notation — and leaves regulatory layers, frequency lookups, and plugin ecosystems to downstream code. If you need a fast, embeddable consequence predictor inside a Rust pipeline, this is for you. If you need the full VEP experience including gnomAD enrichment and custom plugins, use VEP.
§Quick start
Add the library to your project and install the companion CLI that provisions the reference data:
# Cargo.toml
[dependencies]
vareffect = "0.1.2"

# One-time data setup: downloads GRCh38, builds the transcript store,
# and writes everything to `data/vareffect/`. Takes ~10 minutes and
# ~3 GB of disk, then you never have to touch it again.
cargo install vareffect-cli
vareffect init
vareffect setup

Then annotate a variant against every overlapping transcript:
use std::path::Path;
use vareffect::{Consequence, VarEffect};

let ve = VarEffect::open(
    Path::new("data/vareffect/transcript_models.bin"),
    Path::new("data/vareffect/GRCh38.bin"),
)?;

// TP53 c.742C>T (p.Arg248Trp), a well-known hotspot missense variant.
// chr17:7674220 is 0-based (BED / UCSC style). TP53 is on the minus strand,
// so the plus-strand alleles passed here are G>A.
let results = ve.annotate("chr17", 7_674_220, b"G", b"A")?;
for result in &results {
    println!(
        "{} {}: {} ({})",
        result.transcript,
        result.hgvs_p.as_deref().unwrap_or("-"),
        result
            .consequences
            .iter()
            .map(Consequence::as_str)
            .collect::<Vec<_>>()
            .join(","),
        result.impact,
    );
}

Expected output (one line per overlapping transcript):

NM_000546.6 p.Arg248Trp: missense_variant (MODERATE)

For sharing across threads, wrap the loaded VarEffect in a std::sync::Arc and clone the Arc into each worker. The underlying stores are Send + Sync with zero interior mutability.
§Performance
Single-threaded throughput, measured on a modern x86_64 laptop. All numbers exclude startup cost (genome load, transcript store parse).
| Tool | Language | Variants / sec (1 thread) |
|---|---|---|
| VEP (--cache) | Perl | ~200 – 500 |
| SnpEff | Java | ~2,000 – 10,000 |
| Nirvana | C# | ~5,000 – 15,000 |
| bcftools csq | C | ~10,000 – 50,000 |
| vareffect | Rust | ~50,000 – 200,000 |
The gap is almost entirely I/O: VEP reads BGZF-compressed FASTA through
block decompression per base; vareffect memory-maps a flat uppercase
binary and reads bytes directly.
§Features
§Variant consequence prediction
- 24 Sequence Ontology terms: missense_variant, synonymous_variant, stop_gained, stop_lost, start_lost, start_retained_variant, stop_retained_variant, frameshift_variant, inframe_insertion, inframe_deletion, splice_donor_variant, splice_acceptor_variant, splice_region_variant, incomplete_terminal_codon_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, intron_variant, non_coding_transcript_exon_variant, upstream_gene_variant, downstream_gene_variant, intergenic_variant, coding_sequence_variant, transcript_ablation, protein_altering_variant.
- IMPACT ranking: HIGH / MODERATE / LOW / MODIFIER, matching VEP’s severity scale. Consequence derives Ord for sorting.
- Multi-transcript annotation: one ConsequenceResult per overlapping transcript; the caller picks a canonical isoform if desired (every result carries MANE Select / MANE Plus Clinical / RefSeq Select tier metadata).
- NMD prediction: the 50-nucleotide rule applied to truncating variants (stop_gained, frameshift_variant).
- Standard and mitochondrial genetic codes: chrM variants translate with NCBI table 2 automatically.
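The 50-nucleotide rule from the list above can be sketched in a few lines of plain Rust. This is an illustration under stated assumptions, not vareffect's implementation: the function name `predicts_nmd`, the exon-length input, and the exact boundary handling are all hypothetical.

```rust
/// Hypothetical helper (not part of vareffect): apply the 50-nucleotide
/// rule to a premature termination codon (PTC).
///
/// `exon_lengths` are exon lengths in transcript 5'->3' order and
/// `ptc_tx_pos` is the 0-based transcript offset of the PTC's first base.
fn predicts_nmd(exon_lengths: &[u32], ptc_tx_pos: u32) -> bool {
    // A single-exon transcript has no exon-exon junction, so the rule
    // cannot apply.
    if exon_lengths.len() < 2 {
        return false;
    }
    // Offset of the last exon-exon junction: the combined length of
    // every exon except the last one.
    let last_junction: u32 = exon_lengths[..exon_lengths.len() - 1].iter().sum();
    // NMD is predicted only when the stop lies more than 50 nt upstream
    // of that junction (boundary convention assumed for this sketch).
    ptc_tx_pos + 50 < last_junction
}

fn main() {
    let exons = [200, 150, 300]; // last junction at transcript offset 350
    println!("{}", predicts_nmd(&exons, 100)); // stop far upstream of the junction
    println!("{}", predicts_nmd(&exons, 320)); // stop within 50 nt of the junction
    println!("{}", predicts_nmd(&exons, 400)); // stop inside the last exon
}
```

Real predictors usually layer further exceptions on top of this rule, so treat the sketch as the core idea only.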
§HGVS nomenclature
- Forward c. / n.: substitutions, deletions, insertions, duplications, delins, intronic offsets (c.672+1), 5’ / 3’ UTR offsets (c.-15, c.*42), combined forms (c.-15+1).
- Forward p.: missense, synonymous, stop gain / loss, start loss (p.Met1?), frameshift with extension walk (p.Glu23ValfsTer17), incomplete terminal codon, stop extension (p.Ter394CysextTer9). Matches VEP’s default capitalisation and three-letter code convention.
- 3’ normalization: shifts indels to the most 3’ equivalent position on the coding strand for HGVS notation (matches VEP’s --shift_hgvs).
- Reverse c.: parses an HGVS c. string back to plus-strand 0-based genomic coordinates and alleles, round-tripped through the transcript store and verified against the genome.
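To make the 3’ normalization item concrete, here is a minimal sketch of shifting a deletion to its 3’-most equivalent position. It assumes the sequence is already oriented along the coding strand; the function name `shift_deletion_3prime` is hypothetical and the crate's own implementation may differ.

```rust
/// Hypothetical helper: shift the deletion `[start, end)` as far 3'
/// (rightwards) along `seq` as possible without changing the sequence
/// that remains after the deletion.
fn shift_deletion_3prime(seq: &[u8], mut start: usize, mut end: usize) -> (usize, usize) {
    assert!(start < end && end <= seq.len(), "expected a non-empty deletion inside seq");
    // Sliding the window one base to the right is neutral when the base
    // leaving on the left equals the base entering on the right.
    while end < seq.len() && seq[start] == seq[end] {
        start += 1;
        end += 1;
    }
    (start, end)
}

fn main() {
    // Deleting one "A" from the run at offsets 2..5 yields the same
    // result wherever it is placed; HGVS reports the 3'-most placement.
    let seq = b"GCAAAT";
    let (start, end) = shift_deletion_3prime(seq, 2, 3);
    println!("{start}..{end}");
}
```

For a minus-strand transcript the coding-strand 3’ direction runs towards lower genomic coordinates, so a genome-space version would shift left instead.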
§Variant localization
- Classify variants as CDS exon, intron, 5’ / 3’ UTR, splice donor / acceptor, splice region, upstream / downstream, intergenic.
- Multi-exon indel handling with exon-boundary-spanning logic.
- Reference-allele verification against the loaded genome before annotation, so VCFs built against the wrong build fail fast instead of silently producing wrong calls.
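The reference-allele check in the last item amounts to a byte comparison against the genome. The sketch below uses a plain byte slice as a stand-in for one contig; the function name `verify_ref` and the error strings are assumptions made for illustration.

```rust
/// Hypothetical helper: confirm that `ref_allele` matches the contig
/// bytes starting at 0-based `pos`.
fn verify_ref(contig: &[u8], pos: usize, ref_allele: &[u8]) -> Result<(), String> {
    // Reject intervals that run past the end of the contig.
    let end = pos
        .checked_add(ref_allele.len())
        .filter(|&e| e <= contig.len())
        .ok_or_else(|| format!("position {pos} is out of range"))?;
    let observed = &contig[pos..end];
    if observed == ref_allele {
        Ok(())
    } else {
        Err(format!(
            "REF mismatch at {pos}: input says {}, genome has {}",
            String::from_utf8_lossy(ref_allele),
            String::from_utf8_lossy(observed),
        ))
    }
}

fn main() {
    let contig = b"ACGTACGT";
    println!("{:?}", verify_ref(contig, 2, b"GT")); // matches the genome
    println!("{:?}", verify_ref(contig, 2, b"GA")); // wrong build or wrong allele
}
```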
§Transcript models
- MANE Select, MANE Plus Clinical, and RefSeq Select tiers — your build pipeline decides which to ingest.
- Full exon and CDS segment layout with GFF3 phase preserved, so codon walks across exon boundaries and frameshift detection are O(1) lookups.
- Interval-tree indexed per chromosome for O(log n + k) overlap queries.
- Strand-aware: correct reverse-complement handling on minus-strand genes.
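Strand-aware handling boils down to reverse-complementing plus-strand bytes before reading codons on a minus-strand gene. A minimal stand-alone sketch follows; the function names are illustrative, and IUPAC ambiguity codes other than N would need their own mapping, which this version does not provide.

```rust
/// Complement a single uppercase base; anything else passes through.
fn complement(base: u8) -> u8 {
    match base {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        other => other, // N is its own complement; other codes are left as-is
    }
}

/// Reverse-complement a plus-strand sequence.
fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| complement(b)).collect()
}

fn main() {
    // Plus-strand bytes "CCG" read on the minus strand give the codon CGG;
    // a plus-strand G>A at the last base turns the codon into TGG.
    let ref_codon = reverse_complement(b"CCG");
    let alt_codon = reverse_complement(b"CCA");
    println!(
        "{} -> {}",
        String::from_utf8_lossy(&ref_codon),
        String::from_utf8_lossy(&alt_codon)
    );
}
```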
§Runtime characteristics
- Memory-mapped reference genome: base fetches are a pointer dereference (~5 ns), not a BGZF block decompression.
- Thread-safe: VarEffect, TranscriptStore, and FastaReader are all Send + Sync with zero interior mutability. A compile-time assertion in lib.rs guarantees this for every release.
- Zero external runtime: no network, no database, no background workers. Load two files at startup, share an Arc<VarEffect> across your worker pool, call .annotate(...).
§What vareffect does not do
- No regulatory / TFBS / motif annotation layer.
- No co-located variant lookup (no gnomAD, ClinVar, dbSNP).
- No allele frequency or population annotation.
- No canonical transcript selection: every overlapping transcript is returned with its tier metadata; callers decide which to keep.
- No multi-allele VCF splitting: the caller must split comma-separated ALTs before invoking annotate.
- No plugin system.
- No alternate genome builds out of the box: GRCh37 or CHM13 require regenerating the transcript and genome binaries with your own build.
See VEP_DIVERGENCES.md for the complete list of
intentional divergences from VEP and features that are not yet implemented.
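Because multi-allelic records must be split by the caller, a small pre-processing step is needed before each annotation call. The sketch below shows one way to do it on the raw ALT column; the function name `split_alts` is hypothetical, and skipping the `.` (missing) and `*` (spanning deletion) placeholders is an assumption about what a caller would want, not documented vareffect behaviour.

```rust
/// Hypothetical helper: split a VCF ALT column into individual alleles,
/// dropping placeholders that do not describe a concrete allele.
fn split_alts(alt_column: &str) -> Vec<&str> {
    alt_column
        .split(',')
        .filter(|a| !a.is_empty() && *a != "." && *a != "*")
        .collect()
}

fn main() {
    // Each returned allele would be passed to the annotator separately.
    for alt in split_alts("A,T,*") {
        println!("{alt}");
    }
}
```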
§Setting up the data files
vareffect does not ship reference data. You provide two files at runtime:
- transcript_models.bin: a MessagePack-serialised Vec<TranscriptModel> built from a MANE / RefSeq Select GFF3.
- GRCh38.bin (or whatever build you use): a flat uppercase-IUPAC binary plus a .bin.idx MessagePack sidecar produced by the builder.
§Recommended: use vareffect-cli
The companion crate vareffect-cli ships a vareffect binary that handles
the entire provisioning flow in one command. It downloads the GRCh38
reference FASTA from NCBI, writes the flat-binary genome + index, fetches
the latest MANE release GFF3 + summary, builds the transcript store, and
writes NCBI patch-contig aliases — all under data/vareffect/.
cargo install vareffect-cli
# Scaffold a config file (one-time):
vareffect init
# Full provisioning (GRCh38 genome + MANE transcript models).
vareffect setup
# Validate everything is in place:
vareffect check
# Write runtime files to a custom directory instead of the config default:
vareffect setup --output /data/genomes/vareffect
# Only the reference genome:
vareffect setup --fasta-only
# Only rebuild the transcript model store (reuses an existing genome):
vareffect setup --models-only

setup is idempotent: source archives are cached in data/raw/, the genome binary is skipped if it already exists, and transcript models are rebuilt on every run so a new MANE release is picked up automatically.
After vareffect setup finishes, you have the layout the library expects:
data/vareffect/
    GRCh38.bin               # flat-binary reference genome
    GRCh38.bin.idx           # MessagePack contig index
    transcript_models.bin    # serialised Vec<TranscriptModel>
    patch_chrom_aliases.csv  # UCSC <-> RefSeq patch-contig map

§Alternative: roll your own
If you’re building a custom store (a different transcript source, a
non-human genome, a subset of the human transcriptome), the
fasta::write_genome_binary function is public so you can generate the
flat binary yourself:
use std::path::Path;
use vareffect::fasta::write_genome_binary;

// Uppercase ASCII bytes, one contig per tuple.
let chr_toy: &[u8] = b"ACGTACGTNNN";
let contigs: &[(&str, &[u8])] = &[("chrToy", chr_toy)];
write_genome_binary(
    contigs,
    "toy",                        // build label, stored in the index
    Path::new("out/toy.bin"),     // flat binary
    Path::new("out/toy.bin.idx"), // MessagePack index sidecar
)?;

Building the transcript store by hand means producing a Vec<vareffect::TranscriptModel> and MessagePack-serialising it with rmp-serde; see crates/vareffect-cli/src/builders/ in the source tree for a worked example.
§HGVS reverse resolution
Take an HGVS c. string, resolve it to plus-strand genomic coordinates, and
feed the result straight back into annotate — useful when your input is
transcript-relative rather than coordinate-based:
let gv = ve.resolve_hgvs_c("NM_000546.6:c.742C>T")?;
let results = ve.annotate(&gv.chrom, gv.pos, &gv.ref_allele, &gv.alt_allele)?;

resolve_hgvs_c supports substitutions, deletions, insertions, duplications, and delins across CDS, 5’ UTR, 3’ UTR, and intronic positions.
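To show what the string side of reverse resolution involves, here is a deliberately narrow parser for the simplest case, a plain CDS substitution such as `NM_000546.6:c.742C>T`. It is a sketch only: `CdsSubstitution` and `parse_cds_substitution` are hypothetical names, it rejects intronic and UTR offsets, and it does nothing about mapping the CDS position back to the genome, which needs the transcript model.

```rust
/// Hypothetical parsed form of a simple `c.` substitution.
#[derive(Debug, PartialEq)]
struct CdsSubstitution {
    accession: String,
    cds_pos: u32, // 1-based, as written in HGVS
    ref_base: char,
    alt_base: char,
}

/// Parse `ACCESSION:c.<pos><ref>><alt>`; returns None for anything else.
fn parse_cds_substitution(hgvs: &str) -> Option<CdsSubstitution> {
    let (accession, change) = hgvs.split_once(":c.")?;
    let (lhs, alt) = change.split_once('>')?;
    // The reference base is the last character before '>'.
    let ref_base = lhs.chars().last()?;
    let digits = &lhs[..lhs.len() - ref_base.len_utf8()];
    let mut alt_chars = alt.chars();
    let alt_base = alt_chars.next()?;
    let is_base = |c: char| matches!(c, 'A' | 'C' | 'G' | 'T');
    if alt_chars.next().is_some() || !is_base(ref_base) || !is_base(alt_base) {
        return None;
    }
    // Only plain digits are accepted, which rules out "-15", "*42" and "672+1".
    if accession.is_empty() || digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let cds_pos: u32 = digits.parse().ok()?;
    if cds_pos == 0 {
        return None; // HGVS numbering has no position 0
    }
    Some(CdsSubstitution { accession: accession.to_string(), cds_pos, ref_base, alt_base })
}

fn main() {
    println!("{:?}", parse_cds_substitution("NM_000546.6:c.742C>T"));
    println!("{:?}", parse_cds_substitution("NM_000546.6:c.672+1G>A"));
}
```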
§Threading model
VarEffect, TranscriptStore, and FastaReader are all Send + Sync —
construct one VarEffect at startup, wrap it in std::sync::Arc, and
share it across every worker thread or async task. There is no interior
mutability, no contention, and no hidden cloning cost; every thread reads
from the same memory-mapped bytes.
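The sharing pattern described above looks like this with standard-library threads. `Annotator` is a stand-in type invented for the sketch so that it compiles without reference data; with the real crate the same shape would apply to an `Arc<VarEffect>`.

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in for a loaded, read-only annotator (hypothetical type).
struct Annotator {
    genome: Vec<u8>,
}

impl Annotator {
    /// Read-only lookup; takes `&self`, so no locking is needed.
    fn base_at(&self, pos: usize) -> Option<u8> {
        self.genome.get(pos).copied()
    }
}

/// Look up each position on its own thread, sharing one annotator.
fn bases_in_parallel(annotator: Arc<Annotator>, positions: Vec<usize>) -> Vec<Option<u8>> {
    let handles: Vec<_> = positions
        .into_iter()
        .map(|pos| {
            // Cloning the Arc copies a pointer, not the underlying data.
            let shared = Arc::clone(&annotator);
            thread::spawn(move || shared.base_at(pos))
        })
        .collect();
    // Join in spawn order so results line up with the input positions.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let annotator = Arc::new(Annotator { genome: b"ACGTN".to_vec() });
    let bases = bases_in_parallel(annotator, vec![0, 3, 9]);
    println!("{bases:?}");
}
```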
§Coordinate conventions
- 0-based, half-open (BED / UCSC) for every coordinate in the public API. GFF3’s 1-based-fully-closed input is converted at build time.
- UCSC chromosome names (chr17, chrM). The FastaReader has an alias table so it can transparently accept "17", "NC_000017.11", etc. Patch-contig lookups additionally need open_with_patch_aliases and a refseq,ucsc alias CSV (written automatically by vareffect setup).
- Uppercase ASCII allele bytes: no case coercion at call time, so lowercase input is a bug in your code, not something we silently fix.
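Input from VCF or GFF3 is 1-based and fully closed, so it has to be converted before it reaches a 0-based, half-open API. The helpers below sketch that conversion and the uppercase check; both function names are hypothetical and not part of the crate.

```rust
/// Convert a 1-based, fully-closed interval (VCF / GFF3 style) into a
/// 0-based, half-open one (BED / UCSC style).
fn to_zero_based_half_open(start_1based: u64, end_1based: u64) -> Option<(u64, u64)> {
    if start_1based == 0 || end_1based < start_1based {
        return None; // not a valid 1-based closed interval
    }
    // Only the start moves: a closed end at N equals an exclusive end at N.
    Some((start_1based - 1, end_1based))
}

/// True when every allele byte is an uppercase ASCII letter.
fn is_uppercase_allele(allele: &[u8]) -> bool {
    !allele.is_empty() && allele.iter().all(|b| b.is_ascii_uppercase())
}

fn main() {
    // A single-base variant at 1-based position 7,674,221 becomes the
    // 0-based interval [7674220, 7674221).
    println!("{:?}", to_zero_based_half_open(7_674_221, 7_674_221));
    println!("{}", is_uppercase_allele(b"G"));
    println!("{}", is_uppercase_allele(b"g"));
}
```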
§VEP feature parity
vareffect targets near-100% concordance with VEP release 115 / 116 on the
core coding and transcript consequence layer. Every intentional divergence,
not-yet-implemented feature, and out-of-scope design decision is catalogued
in VEP_DIVERGENCES.md. The validation suite lives
in tests/vep_concordance_*.rs.
§Testing
The unit and fast integration tests run with no external data:
cargo test -p vareffect

The VEP concordance tests are #[ignore]-gated because they need the transcript store and reference genome on disk (run vareffect setup first if you haven’t already):

FASTA_PATH="$(pwd)/data/vareffect/GRCh38.bin" CONCORDANCE_THREADS=1 \
    cargo test -p vareffect --release -- --ignored vep_concordance

§License
Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
§Contributing
Contributions are welcome. The most valuable areas for outside help are:
- Adding SO terms currently listed as not-yet-implemented in VEP_DIVERGENCES.md (splice polypyrimidine-tract, donor region, donor 5th base).
- Broadening the VEP concordance corpus in tests/vep_concordance_*.rs with variants that stress a part of the pipeline that isn’t already covered.
- Alternate genome build support (GRCh37, CHM13) in the vareffect-cli provisioning flow.
Open an issue before starting on anything larger than a bug fix so we can
agree on scope.
vareffect — Variant consequence prediction and HGVS notation, targeting
near-100% concordance with Ensembl VEP (release 115/116).
Consumers point the store loaders at whatever transcript and genome files
their build pipeline produces — vareffect ships no embedded reference
data and has no runtime dependency on an orchestrator CLI.
§Transcript model store
An in-memory store of MANE transcript models indexed by genomic interval
for O(log n + k) overlap queries. Each TranscriptModel carries
per-exon CdsSegments with the GFF3 column-8 phase captured, so
downstream codon walks and frameshift detection don’t have to re-derive
phase from scratch.
use std::path::Path;
use vareffect::{Biotype, TranscriptStore};

let store = TranscriptStore::load_from_path(
    Path::new("data/vareffect/transcript_models.bin"),
)?;

// Overlap query: all transcripts whose tx_start..tx_end intersects the interval.
for (tx, _idx) in store.query_overlap("chr6", 33_409_450, 33_409_451) {
    println!(
        "{} ({}): cds [{:?}, {:?}), {} segments, biotype={:?}",
        tx.accession,
        tx.gene_symbol,
        tx.cds_genomic_start,
        tx.cds_genomic_end,
        tx.cds_segments.len(),
        tx.biotype,
    );
    // Walk CDS segments in transcript 5'→3' order (reversed for minus strand):
    for seg in &tx.cds_segments {
        println!(
            "  segment in exon[{}], phase {}: [{}, {})",
            seg.exon_index, seg.phase, seg.genomic_start, seg.genomic_end
        );
    }
}

// `biotype` is an enum with `Other(String)` for unknown upstream labels.
let total_protein_coding = store
    .transcripts()
    .iter()
    .filter(|t| matches!(t.biotype, Biotype::ProteinCoding))
    .count();

§Reference genome reader
Memory-mapped random access to the reference genome via FastaReader.
Pair it with TranscriptStore to extract codons, verify REF alleles, and
walk downstream for frameshift termination. See the fasta module for
the on-disk format, coordinate conventions, and chromosome-name handling.
The flat binary format stores uppercase IUPAC nucleotide codes, matching
GA4GH refget v2.0 conventions. Most bases are A/C/G/T/N; the
NCBI GRCh38.p14 assembly also uses ambiguity codes (M, R, Y, etc.)
in some patch-scaffold regions. Soft-mask information is not preserved.
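The idea behind a flat binary plus index can be sketched as a byte slice with a per-contig offset table. The layout below (contigs packed back to back, an index of offset and length per name) is an assumption made for illustration; the real on-disk format is documented in the fasta module and may differ. `FlatGenome` and `toy_genome` are hypothetical names.

```rust
use std::collections::HashMap;

/// Toy flat genome: all contigs packed into one byte slice.
struct FlatGenome<'a> {
    /// In a real reader this would be the memory-mapped file.
    bytes: &'a [u8],
    /// Contig name -> (offset into `bytes`, contig length).
    index: HashMap<&'a str, (usize, usize)>,
}

impl<'a> FlatGenome<'a> {
    /// Fetch the 0-based, half-open interval `[start, end)` of a contig.
    fn fetch(&self, contig: &str, start: usize, end: usize) -> Option<&'a [u8]> {
        let &(offset, len) = self.index.get(contig)?;
        if start > end || end > len {
            return None; // interval falls outside the contig
        }
        // No decompression step: the answer is a sub-slice of the buffer.
        self.bytes.get(offset + start..offset + end)
    }
}

/// Build a two-contig toy genome for the demonstration.
fn toy_genome() -> FlatGenome<'static> {
    let bytes: &'static [u8] = b"ACGTACGTNNNNGGCC";
    let index = HashMap::from([("chrToy1", (0, 12)), ("chrToy2", (12, 4))]);
    FlatGenome { bytes, index }
}

fn main() {
    let genome = toy_genome();
    let bases = genome.fetch("chrToy1", 4, 8).unwrap_or(b"");
    println!("{}", String::from_utf8_lossy(bases));
}
```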
§Variant consequence assignment
VarEffect::annotate takes a variant’s position and alleles, locates it
within every overlapping transcript, extracts the reference codon(s) from
FASTA, translates ref and alt codons, and assigns SO consequence term(s)
with VEP-concordant IMPACT ratings. The codon module provides the
standard and mitochondrial genetic code translation tables.
use std::path::Path;
use vareffect::VarEffect;

let ve = VarEffect::open(
    Path::new("data/vareffect/transcript_models.bin"),
    Path::new("data/vareffect/GRCh38.bin"),
)?;

// Annotate TP53 c.742C>T (p.R248W): chr17, 0-based position 7,674,220.
// TP53 is on the minus strand, so the plus-strand alleles are G>A.
let results = ve.annotate("chr17", 7_674_220, b"G", b"A")?;
for r in &results {
    for csq in &r.consequences {
        println!("{} ({})", csq.as_str(), r.impact);
    }
}

For lower-level building blocks (per-transcript annotation when you already hold a &TranscriptModel), see annotate_snv, annotate_deletion, and annotate_insertion.
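The translate-and-compare step at the heart of SNV annotation can be illustrated with the standard genetic code alone. This is a simplified sketch, not the crate's logic: the names `translate` and `classify_snv` are hypothetical, only four outcomes plus stop retention are modelled, and start-codon, splice, and mitochondrial handling are left out.

```rust
/// Base order used to index the codon table.
const BASES: &[u8; 4] = b"TCAG";
/// Standard genetic code (NCBI table 1) in TCAG order; '*' marks a stop.
const AMINO_ACIDS: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Translate one uppercase codon; None if it contains a non-ACGT base.
fn translate(codon: &[u8; 3]) -> Option<u8> {
    let mut idx = 0;
    for base in codon {
        idx = idx * 4 + BASES.iter().position(|b| b == base)?;
    }
    Some(AMINO_ACIDS[idx])
}

/// Assign a coarse SO term by comparing ref and alt amino acids.
fn classify_snv(ref_codon: &[u8; 3], alt_codon: &[u8; 3]) -> Option<&'static str> {
    let (ref_aa, alt_aa) = (translate(ref_codon)?, translate(alt_codon)?);
    Some(if ref_aa == alt_aa {
        if ref_aa == b'*' { "stop_retained_variant" } else { "synonymous_variant" }
    } else if alt_aa == b'*' {
        "stop_gained"
    } else if ref_aa == b'*' {
        "stop_lost"
    } else {
        "missense_variant"
    })
}

fn main() {
    // Codon CGG (Arg) changed to TGG (Trp), as in an Arg-to-Trp substitution.
    println!("{:?}", classify_snv(b"CGG", b"TGG"));
    println!("{:?}", classify_snv(b"CGA", b"TGA"));
}
```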
§Coordinate convention
All coordinates in TranscriptModel are 0-based, half-open (BED/UCSC
style). GFF3 input (1-based, fully-closed) is converted at build time by
vareffect-cli. See transcript for the interval-tree indexing details.
cds_genomic_start / cds_genomic_end are the genomic min / max
coordinates across all CDS segments, not transcript-relative. For a
minus-strand gene, cds_genomic_start is biologically the 3’ end of the
protein in transcript order. Walk cds_segments (ordered 5’→3’ on the
transcript) when you need the true coding walk.
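Walking the segments in transcript order is what turns a CDS offset into a genomic position, and it is where the minus-strand subtlety shows up. The sketch below uses trimmed-down stand-ins (`Strand`, `Segment`, `cds_offset_to_genomic`) that are local to the example and only mirror the shape of the crate's types.

```rust
/// Local stand-in for transcript orientation.
#[derive(Clone, Copy, PartialEq)]
enum Strand {
    Plus,
    Minus,
}

/// Local stand-in for a CDS segment: 0-based, half-open genomic interval.
struct Segment {
    genomic_start: u64,
    genomic_end: u64,
}

/// Map a 0-based CDS offset to a 0-based genomic position by walking
/// segments in transcript 5'->3' order.
fn cds_offset_to_genomic(segments: &[Segment], strand: Strand, mut offset: u64) -> Option<u64> {
    for seg in segments {
        let len = seg.genomic_end - seg.genomic_start;
        if offset < len {
            return Some(match strand {
                Strand::Plus => seg.genomic_start + offset,
                // On the minus strand a segment's first coding base is
                // its highest genomic coordinate.
                Strand::Minus => seg.genomic_end - 1 - offset,
            });
        }
        offset -= len;
    }
    None // offset lies beyond the end of the CDS
}

fn main() {
    // Minus-strand CDS: the first segment in transcript order sits at
    // the higher genomic coordinates.
    let segments = [
        Segment { genomic_start: 100, genomic_end: 106 },
        Segment { genomic_start: 50, genomic_end: 53 },
    ];
    println!("{:?}", cds_offset_to_genomic(&segments, Strand::Minus, 0));
    println!("{:?}", cds_offset_to_genomic(&segments, Strand::Minus, 8));
}
```

In this toy layout the smallest genomic coordinate (50) is reached by the last CDS offset, which is the situation the paragraph above describes for minus-strand genes.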
§Thread safety
Both TranscriptStore and FastaReader are Send + Sync (proven by
a compile-time assertion at the bottom of this file). TranscriptStore
is lock-free for reads. FastaReader is backed by a memory-mapped
&[u8] — inherently Send + Sync with zero contention. All threads
can read from the same FastaReader concurrently without cloning.
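A compile-time Send + Sync assertion of the kind mentioned above can be written with a generic const fn. The sketch below shows one common form of the idiom on a stand-in `Store` type; the exact form used in the crate's own source is not shown here and may differ.

```rust
/// Stand-in for a read-only store (hypothetical type).
struct Store {
    models: Vec<String>,
}

/// Compiles only when `T` is both Send and Sync; does nothing at runtime.
const fn assert_send_sync<T: Send + Sync>() {}

// Evaluated during compilation: the build breaks if `Store` ever gains a
// field that is not Send + Sync (for example an `Rc` or a `RefCell`).
const _: () = assert_send_sync::<Store>();

// Uncommenting the next line fails to compile, because `Rc` is not Send:
// const _: () = assert_send_sync::<std::rc::Rc<u8>>();

fn main() {
    let store = Store { models: vec!["NM_000546.6".to_string()] };
    println!("{} model(s) loaded", store.models.len());
}
```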
Re-exports§
pub use consequence::Consequence;
pub use consequence::ConsequenceResult;
pub use consequence::Impact;
pub use consequence::annotate_deletion;
pub use consequence::annotate_insertion;
pub use consequence::annotate_snv;
pub use error::VarEffectError;
pub use fasta::FastaReader;
pub use hgvs_reverse::GenomicVariant;
pub use locate::IndelLocation;
pub use locate::IndelRegion;
pub use locate::LocateIndex;
pub use locate::SpliceOverlapDetail;
pub use locate::SpliceSide;
pub use locate::VariantLocation;
pub use locate::locate_indel;
pub use locate::locate_variant;
pub use transcript::TranscriptStore;
pub use types::Biotype;
pub use types::CdsSegment;
pub use types::Exon;
pub use types::Strand;
pub use types::TranscriptModel;
pub use types::TranscriptTier;
Modules§
- chrom
- Chromosome name conversion.
- codon
- Codon translation, DNA complement, and VEP-style display formatting.
- consequence
- Variant consequence assignment – determines SO consequence terms for SNVs, simple indels, boundary-spanning deletions, complex delins, and MNVs against transcript models.
- error
- Error type for vareffect.
- fasta
- Indexed reference genome reader backed by a flat memory-mapped binary.
- hgvs_c
- HGVS coding DNA notation (c. / n.) generation.
- hgvs_p
- HGVS protein notation (p.) for variant consequences.
- hgvs_reverse
- HGVS c. reverse mapper: parse HGVS c. notation and resolve to genomic VCF-style coordinates.
- locate
- Variant locator: classify where a genomic position falls within a transcript.
- transcript
- In-memory transcript store with per-chromosome interval tree indexing.
- types
- Runtime types for the transcript model store.
Structs§
- VarEffect
- Stateful entrypoint to vareffect: bundles a TranscriptStore and a FastaReader so callers don’t have to thread both handles through every annotation call.