ref-solver
Identify which human reference genome was used to align a BAM/SAM/CRAM file.
Visit us at Fulcrum Genomics to learn more about how we can power your Bioinformatics with ref-solver and beyond.
The Problem
When working with BAM files from external sources (collaborators, public repositories, sequencing vendors), it's often unclear exactly which reference genome was used for alignment. While the reference might be labeled "GRCh38" or "hg19", there are dozens of variations:
- Naming conventions: `chr1` vs `1` vs `NC_000001.11`
- Contig sets: With or without ALT contigs, decoys, HLA alleles
- Mitochondrial sequences: rCRS (16,569 bp) vs old Cambridge (16,571 bp)
- Sources: UCSC, NCBI, Broad, Illumina DRAGEN, 1000 Genomes
ref-solver solves this by matching the sequence dictionary from your BAM file against a catalog of known human reference genomes, providing:
- Exact matches when possible
- Detailed diagnostics when differences exist
- Actionable suggestions for fixing mismatches
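For orientation, the sequence dictionary is the set of `@SQ` lines in a SAM/BAM/CRAM header, each carrying a contig name (`SN`), a length (`LN`), and optionally an MD5 checksum (`M5`). The following is a minimal standalone sketch of extracting those fields using only the Rust standard library. The `Contig` struct and `parse_sq_lines` function are hypothetical names chosen for illustration and are not part of the ref-solver API.

```rust
/// One contig taken from a SAM header `@SQ` line (illustrative type, not ref-solver's).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Contig {
    pub name: String,
    pub length: u64,
    pub md5: Option<String>,
}

/// Parse the `@SQ` lines of SAM header text into contigs.
/// Lines that are not `@SQ`, or that lack `SN` or a numeric `LN`, are skipped.
pub fn parse_sq_lines(header: &str) -> Vec<Contig> {
    header
        .lines()
        .filter(|line| line.starts_with("@SQ\t"))
        .filter_map(|line| {
            let mut name = None;
            let mut length = None;
            let mut md5 = None;
            // Fields are tab-separated TAG:VALUE pairs; skip the leading "@SQ".
            for field in line.split('\t').skip(1) {
                match field.split_once(':') {
                    Some(("SN", v)) => name = Some(v.to_string()),
                    Some(("LN", v)) => length = v.parse::<u64>().ok(),
                    Some(("M5", v)) => md5 = Some(v.to_ascii_lowercase()),
                    _ => {}
                }
            }
            Some(Contig { name: name?, length: length?, md5 })
        })
        .collect()
}
```

Calling `parse_sq_lines` on the output of `samtools view -H` yields one `Contig` per `@SQ` line, which is the information a dictionary comparison needs.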
Quickstart
Installation
# From crates.io (when published)
# From source
Basic Usage
# Identify reference from a BAM file
# From stdin (pipe samtools header)
|
# JSON output for scripting
# Compare two files/references
# Score one file against another directly
# List all known references
# Start interactive web UI
Example Output
#1 hg38 (UCSC) (EXACT)
ID: hg38_ucsc
Assembly: GRCh38
Source: UCSC
Match Type: Exact
Score: 100.0%
Contigs: 25 exact, 0 renamed, 0 by name+length, 0 unmatched, 0 conflicts
Suggestions:
- Safe to use as-is
Download: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Features
- Multiple input formats: BAM, SAM, CRAM, Picard `.dict`, TSV
- MD5-based matching: Uses sequence checksums when available for exact identification
- Fuzzy matching: Falls back to name+length matching when MD5s are missing
- Rename detection: Identifies when files differ only in contig naming (chr1 vs 1)
- Order detection: Detects when contigs are reordered vs. reference
- Conflict detection: Identifies problematic differences (e.g., wrong mitochondrial sequence)
- Actionable suggestions: Provides commands to fix issues using fgbio/Picard tools
- Web interface: Interactive browser-based UI for pasting headers
- Embedded catalog: 15+ common human reference genomes built-in
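To make the MD5, fuzzy, rename, and conflict features concrete, here is a rough sketch of how a single query contig could be classified against a reference. The categories mirror the counts in the example output (exact, renamed, by name+length, unmatched, conflicts), but the decision rules are an assumption for illustration and may differ from ref-solver's actual matching engine. All type and function names are hypothetical.

```rust
/// Illustrative contig record (not ref-solver's own type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Contig {
    pub name: String,
    pub length: u64,
    pub md5: Option<String>,
}

/// Convenience constructor used in the examples.
pub fn contig(name: &str, length: u64, md5: Option<&str>) -> Contig {
    Contig { name: name.to_string(), length, md5: md5.map(str::to_string) }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContigMatch {
    Exact,         // same name, length, and MD5
    Renamed,       // same length and MD5 under a different name
    NameAndLength, // same name and length, MD5 missing on at least one side
    Conflict,      // same name, but length or MD5 disagree
    Unmatched,     // nothing comparable in the reference
}

/// Classify one query contig against a reference (assumed rules, see lead-in).
pub fn classify(query: &Contig, reference: &[Contig]) -> ContigMatch {
    // First look for a contig with the same name.
    if let Some(r) = reference.iter().find(|r| r.name == query.name) {
        if r.length != query.length {
            return ContigMatch::Conflict;
        }
        return match (&query.md5, &r.md5) {
            (Some(q), Some(m)) if q == m => ContigMatch::Exact,
            (Some(_), Some(_)) => ContigMatch::Conflict,
            _ => ContigMatch::NameAndLength,
        };
    }
    // No name match: a rename can only be claimed when checksums agree.
    if let Some(q) = &query.md5 {
        let renamed = reference
            .iter()
            .any(|r| r.length == query.length && r.md5.as_ref() == Some(q));
        if renamed {
            return ContigMatch::Renamed;
        }
    }
    ContigMatch::Unmatched
}
```

In this sketch a rename is only reported when checksums are present on both sides, since name and length alone cannot distinguish a renamed contig from an unrelated one of equal length.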
Supported References
The built-in catalog includes 15 human reference genomes:
GRCh38/hg38 Family
| ID | Name | Source | Notes |
|---|---|---|---|
| `hg38_ucsc` | hg38 (UCSC) | UCSC | Standard UCSC reference, chr-prefixed naming |
| `grch38_ncbi` | GRCh38 (NCBI) | NCBI | NCBI numeric naming (1, 2, ..., X, Y, MT) |
| `grch38_broad_analysis_set` | GRCh38 Broad Analysis Set | Broad | GATK Best Practices reference |
| `hs38` | hs38 (no-ALT) | lh3/ref-gen | Recommended. Primary + unplaced/unlocalized (195 seq) |
| `hs38DH` | hs38DH | lh3/bwakit | Full set: ALT (261) + decoy (2386) + HLA (525) = 3155 seq |
| `grch38_1kg_analysis` | GRCh38 1KG Analysis Set | 1000 Genomes | 1000 Genomes Project analysis set with decoy/HLA |
| `grch38_gdc` | GRCh38.d1.vd1 | NCI GDC | TCGA/TARGET reference with 10 viral genomes (2779 seq) |
| `grch38_dragen` | GRCh38 DRAGEN | Illumina | Standard Illumina DRAGEN reference |
| `grch38_dragen_altmasked` | GRCh38 DRAGEN ALT-masked | Illumina | DRAGEN v3.9+ with ALT regions N-masked |
GRCh37/hg19 Family
| ID | Name | Source | Notes |
|---|---|---|---|
| `hg19_ucsc` | hg19 (UCSC) | UCSC | ⚠️ Old Cambridge chrM (16571bp), not rCRS |
| `grch37_ncbi` | GRCh37 (NCBI) | NCBI | NCBI naming with rCRS mitochondrial |
| `hs37` | hs37 (minimal) | lh3/ref-gen | Recommended. 25 primary sequences only |
| `hs37d5` | hs37d5 | 1000 Genomes | With hs37d5 decoy sequence |
| `b37_broad` | b37 (Broad) | Broad | Legacy GATK Best Practices |
T2T-CHM13
| ID | Name | Source | Notes |
|---|---|---|---|
| `chm13v2` | T2T-CHM13v2.0 | T2T Consortium | Complete gapless assembly, all centromeres resolved |
Quick Reference Selection Guide
| Use Case | GRCh38 | GRCh37 |
|---|---|---|
| Standard analysis (BWA-MEM, GATK) | `hs38` | `hs37` |
| ALT-aware alignment (BWA-MEM2) | `hs38DH` | `hs37d5` |
| Illumina DRAGEN | `grch38_dragen_altmasked` | n/a |
| TCGA/GDC compatibility | `grch38_gdc` | n/a |
| Legacy pipelines | `hg38_ucsc` | `hg19_ucsc` |
Commands
identify
Identify the reference genome from a BAM/SAM/CRAM file or another supported header format.
<INPUT> Input )
compare
Compare two headers or a header against a known reference.
<INPUT_A> First
<INPUT_B> Second
catalog
Manage the reference catalog.
score
Compare two files directly without using the catalog. Useful for comparing arbitrary files. By default, scoring is asymmetric: it measures how well the query matches the reference.
<QUERY> Query )
<REFERENCE> Reference
Example:
# Compare a BAM against a reference FASTA index
# Compare in both directions
# Custom scoring weights (emphasize coverage)
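The asymmetry of `score` can be illustrated with a simple coverage measure: the fraction of query contigs that also appear in the reference. This is only a sketch of why the two directions can differ. It assumes contigs are compared by name and length, and the `coverage` function is a hypothetical helper, not ref-solver's actual scoring formula or weights.

```rust
use std::collections::HashSet;

/// Fraction of `query` contigs (name, length) that also occur in `reference`.
/// Illustrative only: ref-solver's real score combines more signals than this.
pub fn coverage<'a>(query: &[(&'a str, u64)], reference: &[(&'a str, u64)]) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    let known: HashSet<(&str, u64)> = reference.iter().copied().collect();
    let hits = query.iter().filter(|c| known.contains(*c)).count();
    hits as f64 / query.len() as f64
}
```

With a query that is a strict subset of the reference, `coverage(query, reference)` is 1.0 while `coverage(reference, query)` is lower, which is the kind of difference that comparing in both directions exposes.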
serve
Start the web interface.
Output Formats
Use --format to control output:
- `text` (default): Human-readable tabular output
- `json`: Structured JSON for scripting
- `tsv`: Tab-separated values
Understanding Results
Match Types
| Type | Meaning |
|---|---|
| `Exact` | All contigs match exactly (name, length, MD5) |
| `Renamed` | Same sequences, different naming convention |
| `Reordered` | Same contigs, different order |
| `Partial` | Most contigs match, some differences |
| `Mixed` | Contigs appear to come from multiple references |
| `NoMatch` | No good match found |
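As an illustration of the `Reordered` case, the check below reports whether two dictionaries contain the same contig names in a different order. It is a simplified sketch that compares names only and assumes names are unique within each dictionary. The `is_reordered` function is a hypothetical helper rather than ref-solver's implementation.

```rust
/// True when both lists hold the same contig names but in a different order.
/// Simplified sketch: compares names only and assumes no duplicate names.
pub fn is_reordered(query: &[&str], reference: &[&str]) -> bool {
    // Identical order, or different sizes, is not a reordering.
    if query.len() != reference.len() || query == reference {
        return false;
    }
    let mut q = query.to_vec();
    let mut r = reference.to_vec();
    q.sort_unstable();
    r.sort_unstable();
    q == r
}
```

A lexicographically sorted dictionary (chr1, chr10, chr11, ...) against a karyotypically ordered one (chr1, chr2, chr3, ...) is a typical input for which a check like this would return true.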
Confidence Levels
| Level | Score Range | Meaning |
|---|---|---|
| `Exact` | 100% | Perfect match |
| `High` | ≥95% | Very confident |
| `Medium` | ≥80% | Likely match |
| `Low` | <80% | Uncertain |
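For scripting against the JSON or TSV output, the thresholds in the table can be applied with a small helper like the one below. This is a sketch that assumes the score is expressed as a percentage between 0 and 100. The `Confidence` enum and `confidence` function are illustrative names and are not taken from the ref-solver crate.

```rust
/// Confidence levels as listed in the table above (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    Exact,
    High,
    Medium,
    Low,
}

/// Map a score given as a percentage (0.0 to 100.0) to a confidence level.
pub fn confidence(score_percent: f64) -> Confidence {
    if score_percent >= 100.0 {
        Confidence::Exact
    } else if score_percent >= 95.0 {
        Confidence::High
    } else if score_percent >= 80.0 {
        Confidence::Medium
    } else {
        // Also reached for NaN, since every comparison above is false.
        Confidence::Low
    }
}
```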
Common Issues
Wrong Mitochondrial Sequence
The UCSC hg19 chrM is 16,571 bp (old Cambridge sequence), while most modern references use rCRS (16,569 bp). This is a real sequence difference, not just naming.
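A quick way to spot this issue in a header is to look at the mitochondrial contig's length. The sketch below flags the two lengths mentioned above. It recognizes only the names `chrM` and `MT`, and it treats length as a hint, since only an MD5 comparison can confirm the actual sequence. `MitoKind` and `classify_mito` are hypothetical names for illustration.

```rust
/// Likely identity of a mitochondrial contig, judged by length alone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MitoKind {
    Rcrs,         // 16,569 bp
    OldCambridge, // 16,571 bp, as in UCSC hg19 chrM
    Unknown,      // any other length
}

/// Returns `None` for non-mitochondrial contigs, otherwise a length-based guess.
/// Length is only a hint: confirm with the M5 checksum when it is available.
pub fn classify_mito(name: &str, length: u64) -> Option<MitoKind> {
    if !matches!(name, "chrM" | "MT") {
        return None;
    }
    Some(match length {
        16_569 => MitoKind::Rcrs,
        16_571 => MitoKind::OldCambridge,
        _ => MitoKind::Unknown,
    })
}
```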
Chr Prefix Mismatch
UCSC uses chr1, NCBI/Ensembl use 1. Use fgbio to rename:
Contig Order Differences
Some tools are sensitive to contig order. Use Picard to reorder:
Custom Catalogs
Export and modify the built-in catalog:
# Export current catalog
# Edit to add custom references...
# Use custom catalog
UCSC-Style Naming for Patches
ref-solver automatically generates UCSC-style names for fix-patches and novel-patches in GRCh38 assembly reports. This is particularly important for assembly reports prior to p13, where the UCSC-style-name column shows "na" for patches.
Naming Convention
UCSC uses the following format for patch contigs:
| Patch Type | Format | Example |
|---|---|---|
| Fix patches | `chr{chr}_{accession}v{version}_fix` | `chr1_KN196472v1_fix` |
| Novel patches | `chr{chr}_{accession}v{version}_alt` | `chr1_KQ458382v1_alt` |
The transformation converts NCBI GenBank accessions (e.g., KN196472.1) to UCSC-style names by:
- Replacing `.` with `v` in the accession
- Prepending `chr{chromosome}_`
- Appending `_fix` or `_alt` based on patch type
Examples
| NCBI Accession | Patch Type | Chromosome | UCSC Name |
|---|---|---|---|
| `KN196472.1` | fix-patch | 1 | `chr1_KN196472v1_fix` |
| `KQ458382.1` | novel-patch | 1 | `chr1_KQ458382v1_alt` |
| `KN196487.1` | fix-patch | Y | `chrY_KN196487v1_fix` |
| `KV766199.1` | novel-patch | X | `chrX_KV766199v1_alt` |
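The three transformation steps can be sketched as a small function. This is an illustrative reimplementation of the naming rule described in this section, not the code ref-solver uses internally. `PatchType` and `ucsc_patch_name` are hypothetical names, and the input validation (requiring a numeric version after the dot) is an added assumption.

```rust
/// Patch categories found in GRCh38 assembly reports (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PatchType {
    Fix,
    Novel,
}

/// Build a UCSC-style patch name from a GenBank accession such as "KN196472.1".
/// Returns `None` when the accession has no ".<numeric version>" suffix.
pub fn ucsc_patch_name(accession: &str, chromosome: &str, patch: PatchType) -> Option<String> {
    // Step 1: split "KN196472.1" into base and version so "." can become "v".
    let (base, version) = accession.split_once('.')?;
    if base.is_empty() || version.is_empty() || !version.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    // Step 3: fix patches get "_fix", novel patches get "_alt".
    let suffix = match patch {
        PatchType::Fix => "fix",
        PatchType::Novel => "alt",
    };
    // Step 2 happens here: prepend "chr{chromosome}_".
    Some(format!("chr{}_{}v{}_{}", chromosome, base, version, suffix))
}
```

Applied to the rows of the examples table, `ucsc_patch_name("KN196487.1", "Y", PatchType::Fix)` produces `chrY_KN196487v1_fix`.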
Disabling UCSC Name Generation
When building custom catalogs, you can disable automatic UCSC name generation:
This is useful when you want strict adherence to names in the assembly report.
Official Documentation
Library Usage
use ;
use parse_header_text;
// Load catalog
let catalog = load_embedded?;
// Parse a header
let header_text = "@SQ\tSN:chr1\tLN:248956422\tM5:6aef897c3d6ff0c78aff06ac189178dd\n";
let query = parse_header_text?;
// Find matches
let engine = new;
let matches = engine.find_matches;
for m in matches
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Adding New References
To add a new reference to the catalog:
- Obtain the sequence dictionary (run `samtools dict reference.fa`)
- Add the reference to `catalogs/human_references.json`
- Include MD5 checksums for all contigs
- Run tests to verify matching works correctly
License
MIT License. See LICENSE for details.
Acknowledgments
- Heng Li's ref-gen for reference genome recommendations (hs37, hs38, hs38DH)
- NCI GDC for GRCh38.d1.vd1 reference documentation
- Illumina DRAGEN for DRAGEN reference specifications
- noodles for SAM/BAM parsing
- GATK for reference genome documentation
- T2T Consortium for CHM13 resources
- 1000 Genomes Project for hs37d5 and analysis sets
Citation
If you use ref-solver in your research, please cite:
ref-solver: A tool for identifying human reference genomes from BAM files
https://github.com/fulcrumgenomics/ref-solver