ferro-hgvs
A high-performance HGVS variant nomenclature parser and normalizer written in Rust.
WARNING: ALPHA SOFTWARE - USE AT YOUR OWN RISK
This software is currently in ALPHA. While we have extensively tested it across a wide variety of HGVS patterns, no guarantees are made regarding correctness or stability.
Features
- Full HGVS Parsing: All coordinate systems (g/c/n/r/p/m/o) and edit types
- Variant Normalization: 3'/5' shifting per HGVS specification
- High Performance: ~2.5M variants/sec parsing, zero-copy with nom
- Type-Safe: Leverages Rust's type system for correctness
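To make the 3'/5' shifting idea concrete, here is a minimal, self-contained sketch for a deletion on a plain forward-strand sequence. It is not ferro-hgvs's implementation: the function names `shift_3_prime` and `shift_5_prime` are hypothetical, coordinates are 0-based half-open, and real normalization also has to deal with transcript strand, exon boundaries, and other edit types.

```rust
/// Slide a deleted range `start..end` (0-based, half-open) as far toward the
/// 3' end as possible without changing the resulting sequence.
fn shift_3_prime(reference: &[u8], mut start: usize, mut end: usize) -> (usize, usize) {
    assert!(start < end && end <= reference.len(), "range must be non-empty and in bounds");
    // The range can move one base right when the base leaving on the left
    // equals the base entering on the right.
    while end < reference.len() && reference[start] == reference[end] {
        start += 1;
        end += 1;
    }
    (start, end)
}

/// Mirror image of `shift_3_prime`: slide the range toward the 5' end.
fn shift_5_prime(reference: &[u8], mut start: usize, mut end: usize) -> (usize, usize) {
    assert!(start < end && end <= reference.len(), "range must be non-empty and in bounds");
    while start > 0 && reference[start - 1] == reference[end - 1] {
        start -= 1;
        end -= 1;
    }
    (start, end)
}

fn main() {
    // Deleting any one of the three T's in ACTTTG gives the same result,
    // so the 3' rule picks the last T (0-based 4..5, i.e. 1-based position 5).
    let reference = b"ACTTTG";
    let (start, end) = shift_3_prime(reference, 2, 3);
    println!("3' shifted range: {}..{}", start, end);
    let (start, end) = shift_5_prime(reference, 4, 5);
    println!("5' shifted range: {}..{}", start, end);
}
```

The same loop works for multi-base deletions inside a repeat, for example deleting `CG` from `AACGCGT`.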
Installation
Add to your Cargo.toml:
[dependencies]
ferro-hgvs = "0.1"
Or install the CLI:
Quick Start
CLI
# Parse a variant
# Parse from file
# Prepare reference data (downloads RefSeq, genome, cdot)
# Verify reference data is ready
# Normalize with reference
Library
use ;
Supported HGVS Syntax
| Type | Prefix | Example |
|---|---|---|
| Genomic | g. | NC_000001.11:g.12345A>G |
| Coding DNA | c. | NM_000088.3:c.459A>G |
| Non-coding | n. | NR_000001.1:n.100A>G |
| RNA | r. | NM_000088.3:r.459a>g |
| Protein | p. | NP_000079.2:p.Val600Glu |
| Mitochondrial | m. | NC_012920.1:m.3243A>G |
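Every example in the table has the shape `accession:prefix.description`. The sketch below splits a string along those lines using only the standard library. It is an illustration of the syntax, not the ferro-hgvs API: the names `CoordinateSystem` and `split_hgvs` are hypothetical, and the description part is left unparsed.

```rust
/// Coordinate systems from the table, plus `o.` from the feature list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CoordinateSystem {
    Genomic,       // g.
    CodingDna,     // c.
    NonCoding,     // n.
    Rna,           // r.
    Protein,       // p.
    Mitochondrial, // m.
    Circular,      // o.
}

/// Split "ACCESSION:x.DESCRIPTION" into (accession, system, description).
/// Returns None when the input does not have that overall shape.
fn split_hgvs(input: &str) -> Option<(&str, CoordinateSystem, &str)> {
    // Split on ':' first, because the accession itself contains a '.'.
    let (accession, rest) = input.split_once(':')?;
    let (prefix, description) = rest.split_once('.')?;
    let system = match prefix {
        "g" => CoordinateSystem::Genomic,
        "c" => CoordinateSystem::CodingDna,
        "n" => CoordinateSystem::NonCoding,
        "r" => CoordinateSystem::Rna,
        "p" => CoordinateSystem::Protein,
        "m" => CoordinateSystem::Mitochondrial,
        "o" => CoordinateSystem::Circular,
        _ => return None,
    };
    if accession.is_empty() || description.is_empty() {
        return None;
    }
    Some((accession, system, description))
}

fn main() {
    for input in ["NC_000001.11:g.12345A>G", "NP_000079.2:p.Val600Glu", "not-hgvs"] {
        println!("{} -> {:?}", input, split_hgvs(input));
    }
}
```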
Edit Types
- Substitution: A>G, Val600Glu
- Deletion: del, 100_200del
- Insertion: 100_101insATG
- Deletion-Insertion: 100_102delinsATG
- Duplication: 100_102dup
- Inversion: 100_200inv
- Repeat: 100CAG[20]
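As a rough illustration of how the nucleotide edit types above differ on the page, the following sketch classifies a description by its keyword. This is a substring heuristic written for this README's examples only; `EditKind` and `classify_dna_edit` are hypothetical names, protein edits such as Val600Glu are not handled, and a real parser (ferro-hgvs uses nom) would parse positions and sequences properly.

```rust
#[derive(Debug, PartialEq, Eq)]
enum EditKind {
    Substitution,
    Deletion,
    Insertion,
    DeletionInsertion,
    Duplication,
    Inversion,
    Repeat,
}

/// Classify a nucleotide edit description such as "100_102delinsATG".
fn classify_dna_edit(description: &str) -> Option<EditKind> {
    // "delins" must be checked before "del" and "ins", since it contains both.
    if description.contains("delins") {
        Some(EditKind::DeletionInsertion)
    } else if description.contains("del") {
        Some(EditKind::Deletion)
    } else if description.contains("ins") {
        Some(EditKind::Insertion)
    } else if description.contains("dup") {
        Some(EditKind::Duplication)
    } else if description.contains("inv") {
        Some(EditKind::Inversion)
    } else if description.contains('[') && description.ends_with(']') {
        Some(EditKind::Repeat)
    } else if description.contains('>') {
        Some(EditKind::Substitution)
    } else {
        None
    }
}

fn main() {
    for d in ["459A>G", "100_200del", "100_101insATG", "100_102delinsATG", "100CAG[20]"] {
        println!("{} -> {:?}", d, classify_dna_edit(d));
    }
}
```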
CLI Commands
The ferro CLI provides commands beyond parsing and normalization:
| Command | Description |
|---|---|
| prepare | Download and prepare reference data for normalization |
| check | Verify reference data setup |
| parse | Parse and validate HGVS variants |
| normalize | Normalize HGVS variants (3'/5' shifting) |
| explain | Explain error/warning codes (e.g., ferro explain W1001) |
| annotate-vcf | Annotate VCF files with HGVS notation |
| vcf-to-hgvs | Convert VCF records to HGVS |
| hgvs-to-vcf | Convert HGVS to VCF format |
| liftover | Liftover coordinates between genome builds |
| describe | Generate HGVS from reference/observed sequences |
| effect | Predict protein effect from variant |
| backtranslate | Reverse translate protein to DNA variants |
| convert-gff | Convert GFF3/GTF to transcripts.json |
| generate | Generate HGVS descriptions from components |
| extract-hgvs | Extract HGVS from VEP-annotated VCFs |
Error Handling
ferro-hgvs provides configurable error handling with three modes:
| Mode | Behavior |
|---|---|
| strict | Reject non-conformant input (default) |
| lenient | Auto-correct with warnings |
| silent | Auto-correct silently |
# Use lenient mode to auto-correct common issues
# Ignore specific warnings
# Get help on any error/warning code
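The three modes can be pictured as one decision applied whenever a correction is available. The sketch below is a standalone model of that behavior, not the library's types: `Mode`, `Outcome`, and `resolve` are hypothetical names, and pairing the code W1001 with a lowercase-base correction is an assumption made purely for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Strict,
    Lenient,
    Silent,
}

#[derive(Debug, PartialEq, Eq)]
struct Outcome {
    value: String,
    warning: Option<String>,
}

/// Decide what to do with `input` when a corrected form is known.
fn resolve(mode: Mode, code: &str, input: &str, corrected: &str) -> Result<Outcome, String> {
    // Conformant input passes through unchanged in every mode.
    if input == corrected {
        return Ok(Outcome { value: input.to_string(), warning: None });
    }
    match mode {
        // strict: reject non-conformant input
        Mode::Strict => Err(format!("{}: non-conformant input {}", code, input)),
        // lenient: auto-correct and report a warning
        Mode::Lenient => Ok(Outcome {
            value: corrected.to_string(),
            warning: Some(format!("{}: corrected {} to {}", code, input, corrected)),
        }),
        // silent: auto-correct without reporting
        Mode::Silent => Ok(Outcome { value: corrected.to_string(), warning: None }),
    }
}

fn main() {
    for mode in [Mode::Strict, Mode::Lenient, Mode::Silent] {
        println!("{:?}: {:?}", mode, resolve(mode, "W1001", "c.459a>g", "c.459A>G"));
    }
}
```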
Configuration File
Create .ferro.toml in your project directory:
[]
= "lenient"
= ["W1001", "W2001"] # Silently correct these
= ["W4002"] # Always reject these
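One way to read the configuration above is as a global mode plus per-code overrides. The sketch below models that lookup. The field names `mode`, `ignore`, and `reject` are hypothetical (the actual keys in .ferro.toml are not shown here), the precedence of reject over ignore is an assumption, and W9999 is a made-up code used only to show the fallback.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Strict,
    Lenient,
    Silent,
}

/// Hypothetical in-memory form of the configuration file.
struct Config {
    mode: Mode,
    ignore: HashSet<String>, // codes to correct silently
    reject: HashSet<String>, // codes to always reject
}

impl Config {
    /// Mode that applies to a single warning code.
    fn effective_mode(&self, code: &str) -> Mode {
        if self.reject.contains(code) {
            Mode::Strict // assumption: an explicit reject wins over everything else
        } else if self.ignore.contains(code) {
            Mode::Silent
        } else {
            self.mode
        }
    }
}

fn codes(list: &[&str]) -> HashSet<String> {
    list.iter().map(|c| c.to_string()).collect()
}

fn main() {
    let config = Config {
        mode: Mode::Lenient,
        ignore: codes(&["W1001", "W2001"]),
        reject: codes(&["W4002"]),
    };
    for code in ["W1001", "W4002", "W9999"] {
        println!("{} -> {:?}", code, config.effective_mode(code));
    }
}
```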
Why ferro-hgvs?
ferro-hgvs normalizes every pattern type in the comparison below, including intronic c. and protein p. variants, and in the benchmarks below it processes patterns roughly 200,000x to 2,000,000x faster than the alternatives running locally.
Normalization Capabilities Comparison
| Pattern Type | ferro | mutalyzer | biocommons | hgvs-rs |
|---|---|---|---|---|
| Genomic (g.) | ✓ | ✓ | ✓ | ✓ |
| Coding (c.) exonic | ✓ | ✓ | ✓ | ✓ |
| Coding (c.) intronic | ✓ | ✓* | ✗ | ✗ |
| Non-coding (n.) | ✓ | ✓ | ✓ | ✓ |
| RNA (r.) | ✓ | ✓ | ✓ | ✓ |
| Protein (p.) | ✓ | Net** | ✗ | ✓ |
* mutalyzer intronic support requires genomic context rewriting (enabled by default)

** mutalyzer protein normalization requires network access for NP_→NM_ lookups
Performance Comparison
| Tool | Speed (local) | Speed (network) | ferro Speedup |
|---|---|---|---|
| ferro-hgvs | ~4M patterns/sec | N/A (offline) | — |
| mutalyzer | ~20 patterns/sec | ~1 pattern/sec | 200,000x |
| biocommons/hgvs | ~20 patterns/sec | ~0.2 patterns/sec | 200,000x |
| hgvs-rs | ~2 patterns/sec | ~0.2 patterns/sec | 2,000,000x |
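The speedup column is simply the ratio of the local throughput figures. The short sketch below reproduces that arithmetic from the table's approximate rates and also converts them into wall-clock time for a batch; the batch size of one million patterns is an arbitrary choice for illustration, not a benchmark result.

```rust
/// Ratio of two throughputs (patterns per second).
fn speedup(ferro_rate: f64, other_rate: f64) -> f64 {
    ferro_rate / other_rate
}

/// Seconds needed to process `patterns` at `rate` patterns per second.
fn seconds_for(patterns: f64, rate: f64) -> f64 {
    patterns / rate
}

fn main() {
    let ferro_rate: f64 = 4_000_000.0; // ~4M patterns/sec (local), from the table
    let batch: f64 = 1_000_000.0; // illustrative batch size
    println!("ferro-hgvs: {:.2} s for the batch", seconds_for(batch, ferro_rate));
    for (tool, rate) in [("mutalyzer", 20.0), ("biocommons/hgvs", 20.0), ("hgvs-rs", 2.0)] {
        println!(
            "{}: {:.0}x slower, {:.1} hours for the batch",
            tool,
            speedup(ferro_rate, rate),
            seconds_for(batch, rate) / 3600.0
        );
    }
}
```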
Reference Data: What ferro Prepares
The ferro prepare command downloads and organizes all reference data needed for comprehensive normalization. This data is then shared with other tools (mutalyzer, biocommons, hgvs-rs) to enable their local operation.
| Data Type | Source | Size | Enables |
|---|---|---|---|
| RefSeq transcripts | NCBI | ~1GB | NM_/NR_/XM_ normalization |
| cdot metadata | MANE | ~200MB | Transcript-to-genome mappings |
| GRCh38 + GRCh37 genomes | NCBI | ~4GB | NC_ genomic normalization |
| RefSeqGene | NCBI | ~600MB | NG_ gene region normalization |
| LRG sequences | EBI | ~50MB | LRG_ stable reference normalization |
| Protein sequences | Derived from CDS | ~200MB | NP_/XP_ protein normalization |
| Legacy transcript versions | NCBI | ~50MB | Historical ClinVar variants |
Key insight: Without ferro's reference preparation, other tools require network access for each variant lookup (adding 100-1000ms latency per variant). With ferro's cached reference data, all tools can operate fully offline with consistent, reproducible results.
Benchmark: Reference Data & Tool Comparison
The main ferro binary includes commands to prepare reference data (ferro prepare) and check its status (ferro check). The ferro-benchmark tool (build with --features benchmark) extends this for tool comparison benchmarks.
| Command | Description |
|---|---|
| prepare <tool> | Prepare reference data for a tool |
| check <tool> | Verify tool configuration and dependencies |
| parse <tool> | Parse HGVS patterns with specified tool |
| normalize <tool> | Normalize HGVS patterns with specified tool |
| compare results | Compare parse/normalize results between tools |
| extract | Extract patterns from ClinVar, VCFs, or create samples |
| setup | Set up UTA database, SeqRepo, and other services |
| generate | Generate summary reports and configs |
| collate | Aggregate sharded results |
Quick Start
# Prepare ferro reference (main binary - no special features needed)
# Check reference data
# Normalize with ferro
# For tool comparison, build with benchmark support
# Prepare other tools (uses ferro reference for transcript data)
# Compare results between tools
Supported tools: ferro-hgvs, mutalyzer, biocommons/hgvs, hgvs-rs
Note: The pixi.toml and pixi.lock files in this repository define a pixi environment for the Python-based external tools (mutalyzer, biocommons/hgvs, seqrepo) used in benchmarking. Run pixi shell to activate it.
See docs/BENCHMARK_GUIDE.md for detailed usage.
Development
License
Licensed under the MIT License. See LICENSE for details.
Disclaimer
This software is under active development. While we make a best effort to test this software and to fix issues as they are reported, this software is provided as-is without any warranty (see the license for details). Please submit an issue, and better yet a pull request as well, if you discover a bug or identify a missing feature. Please contact Fulcrum Genomics if you are considering using this software or are interested in sponsoring its development.
Contributing
See CONTRIBUTING.md for guidelines.