convert_genome
convert_genome converts direct-to-consumer (DTC) genotype exports (23andMe, AncestryDNA, MyHeritage, etc.) into standard VCF, BCF, or PLINK binary formats. The converter supports remote references via HTTP(S) and handles compressed .gz and .zip archives.
Features
- Multiple output formats: VCF text, BCF binary, and PLINK 1.9 binary (.bed/.bim/.fam).
- Flexible input parsing: Supports 23andMe, AncestryDNA (5-column), MyHeritage (CSV), deCODEme (6-column with strand flipping), and standard VCF/BCF.
- Auto-detection:
  - Genome Build: Automatically detects GRCh37 vs GRCh38 vs hg19.
  - Biological Sex: Infers sex from X/Y chromosome heterozygosity and density.
- Automatic Liftover: Seamlessly converts data between genome builds (e.g., hg19 to GRCh38) by automatically downloading UCSC chain files when a build mismatch is detected.
- Reference Panel Support: Harmonize your data against a reference VCF (e.g., 1000 Genomes) to correct strand flips and align alleles.
- Structured Reporting: Automatically generates a <output>_report.json file with full run metadata, stats, and inference results.
- Remote reference support: Fetch http:// and https:// URLs with transparent decompression of .gz and .zip archives.
- Sex chromosome handling: Correct ploidy enforcement for X, Y, and MT chromosomes with PAR region awareness.
- Allele polarization: Optional --standardize mode to normalize alleles against the reference genome.
- Comprehensive CI/CD: Formatting, linting, testing, and coverage across Linux (with macOS/Windows support).
Installation
Automatic Install (Recommended)
Installs the latest binary for your platform (macOS/Linux/Windows):
# macOS / Linux / Windows (Git Bash)
|
Manual installation
The project targets Rust nightly (see rust-toolchain.toml). Install the converter directly from the repository:
Alternatively, build the binary without installing:
The resulting executable lives at target/release/convert_genome.
CLI Usage
Supported Inputs
- DTC genotype exports
- 23andMe
- AncestryDNA (5-column)
- MyHeritage (CSV)
- FTDNA (CSV / whitespace-delimited, depending on export)
- deCODEme (6-column with strand indicator)
- VCF / BCF (for format conversion or downstream harmonization)
Compressed inputs are supported directly:
- Gzip: .gz
- Zip: .zip
Workflow Examples
Simple Format Conversion
Convert a DTC genotype export to VCF without extra harmonization steps:
This will:
- Detect the genome build.
- Infer biological sex from the data.
- Convert to VCF.
- Produce genotypes.vcf and genotypes_report.json.
Preparing for Imputation
Imputation tools (Beagle, Shapeit, Eagle, Minimac, etc.) tend to be strict about:
- reference/alternate allele consistency
- strand/polarity
- expected ploidy on sex chromosomes
- compactness and indexing (BCF is usually preferred)
An “imputation-ready” run typically includes --standardize plus --panel:
Handling Ancient/Old Data
--assembly defines the target output build label and drives liftover decisions.
If your input is detected as hg19 / GRCh37, but you request GRCh38 output, the tool will automatically trigger liftover:
Bi-directional liftover is supported (up-lift to GRCh38 or down-lift to GRCh37/hg19), and the chain registry includes older conversions (e.g., NCBI36).
PLINK output
Command-line options
| Option | Description |
|---|---|
| --input <PATH> | Required. Input genotype file (DTC, VCF, or BCF) |
| --reference <PATH> | Required. Reference genome FASTA (local or URL) |
| --output <PATH> | Required. Output file path (or prefix for PLINK) |
| --format <vcf\|bcf\|plink> | Output format (default: vcf) |
| --sex <male\|female> | Explicitly set sex (disables auto-inference) |
| --sample <NAME> | Sample identifier for VCF header |
| --assembly <NAME> | Assembly label for metadata (default: GRCh38) |
| --panel <PATH> | VCF panel for harmonization (e.g., 1000G sites) |
| --standardize | Standardize alleles to reference forward strand |
| --variants-only | Omit reference-only sites from output |
| --input-format <dtc\|vcf\|bcf\|auto> | Input format (default: auto-detect) |
| --reference-fai <PATH> | Explicit FASTA index path |
| --log-level <LEVEL> | Logging verbosity (default: info) |
Flag Deep Dives
--standardize
Enforces reference-matching alleles against the provided reference FASTA.
- For SNPs, this can include allele polarization (swapping REF/ALT and remapping GT) when the reference base is found among the alleles.
- Applies ploidy rules for sex chromosomes (with PAR region awareness) when sex is known or inferred.
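The polarization step for a biallelic SNP can be sketched roughly as follows. This is only an illustration under stated assumptions, not convert_genome's implementation: the function name `polarize` and the slice-of-indices genotype are hypothetical, and only the REF/ALT swap with GT remapping described above is shown.

```rust
/// Hypothetical sketch: polarize a biallelic SNP against the reference base.
/// `gt` holds allele indices per haplotype and is assumed to contain only
/// 0 (REF) or 1 (ALT). Returns None when the reference base matches neither
/// allele, so the caller can decide how to handle the record.
fn polarize(
    ref_base: char,
    rec_ref: char,
    rec_alt: char,
    gt: &[u8],
) -> Option<(char, char, Vec<u8>)> {
    if rec_ref == ref_base {
        // Already polarized: keep alleles and genotype unchanged.
        Some((rec_ref, rec_alt, gt.to_vec()))
    } else if rec_alt == ref_base {
        // Swap REF and ALT, then remap genotype indices 0 <-> 1.
        let remapped = gt.iter().map(|&i| i ^ 1).collect();
        Some((rec_alt, rec_ref, remapped))
    } else {
        // Reference base not found among the alleles.
        None
    }
}
```

For example, a record written as REF=G, ALT=A with genotype 0/1 at a site whose reference base is A would come back as REF=A, ALT=G with genotype 1/0.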
--panel
Harmonizes alleles against a reference VCF panel (e.g., 1000 Genomes sites).
- This is especially important for strand-ambiguous SNPs (A/T and C/G), where naive complement checks cannot uniquely determine orientation.
- The panel provides an additional anchor for allele alignment and padding.
--sex
If omitted, convert_genome infers sex using X/Y heterozygosity and variant density.
You can override inference with:
- --sex male
- --sex female
This influences X/Y ploidy enforcement (e.g., male non-PAR X/Y is treated as haploid).
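To make the inference idea concrete, here is a rough sketch of a heterozygosity-and-density heuristic. Everything in it is an assumption for illustration: the function `infer_sex`, its inputs, and both thresholds are hypothetical and are not taken from convert_genome.

```rust
#[derive(Debug, PartialEq)]
enum Sex {
    Male,
    Female,
}

/// Hypothetical heuristic. The 0.02 and 0.5 thresholds are illustrative
/// assumptions, not values used by convert_genome.
/// - `x_het` / `x_called`: heterozygous and total called X genotypes
/// - `y_called` / `y_total`: called and total Y sites in the input
fn infer_sex(x_het: usize, x_called: usize, y_called: usize, y_total: usize) -> Option<Sex> {
    if x_called == 0 {
        return None; // no X data, nothing to infer from
    }
    let x_het_rate = x_het as f64 / x_called as f64;
    let y_call_rate = if y_total == 0 {
        0.0
    } else {
        y_called as f64 / y_total as f64
    };
    if x_het_rate < 0.02 && y_call_rate > 0.5 {
        // Almost no X heterozygosity and dense Y calls.
        Some(Sex::Male)
    } else if x_het_rate >= 0.02 && y_call_rate <= 0.5 {
        // Heterozygous X calls and sparse Y calls.
        Some(Sex::Female)
    } else {
        // Conflicting signals: leave the decision to the caller.
        None
    }
}
```

Returning `None` on conflicting signals mirrors the reason the --sex override exists: when the data are inconclusive, an explicit value is safer than a guess.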
--assembly
Defines the target output build label (default: GRCh38).
If the input build is detected as GRCh37/hg19 but you specify --assembly GRCh38, the tool will automatically trigger the liftover engine.
Using as a Rust Library
Add the Dependency
Add this to your Cargo.toml:
[dependencies]
convert_genome = "0.1.2"
Core Types and Entry Points
- ConversionConfig: configuration struct for controlling input/output formats, paths, inference toggles, and harmonization.
- convert_dtc_file: primary entry point for running a conversion and getting a ConversionSummary back.
The library uses anyhow::Result for top-level errors and also has internal error types for record-level failures.
Minimal Example
use ;
use Sex;
use InputFormat;
use PathBuf;
Genome Build Detection & Liftover
How does it know?
The tool samples variants from your input and checks concordance against expected reference alleles to distinguish common builds (e.g., GRCh37/hg19 vs GRCh38). This prevents accidentally emitting a file labeled as one build while containing coordinates from another.
Liftover Details
- Trigger: --assembly defines the target build. If the input is detected as hg19/GRCh37 but you request GRCh38, liftover is automatically enabled.
- Automatic downloads: required UCSC chain files (e.g., hg19ToHg38.over.chain.gz) are fetched on demand and cached locally. The first run may require an internet connection.
- Fail-safe behavior: variants that cannot be mapped (deleted regions, gaps, ambiguous mappings) are safely filtered and counted in the output report.
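The fail-safe behavior can be pictured with a simplified position lookup over chain alignment blocks. This is a sketch under assumptions, not the crate's liftover engine: `Block`, `Lift`, and `lift_position` are hypothetical names, and strand, contig names, and chain scores are ignored.

```rust
/// One ungapped alignment block (hypothetical, simplified).
/// The source interval is 0-based and half-open: [src_start, src_end).
struct Block {
    src_start: u64,
    src_end: u64,
    dst_start: u64,
}

#[derive(Debug, PartialEq)]
enum Lift {
    Mapped(u64),
    Unmapped,  // no block covers the position (gap or deleted region)
    Ambiguous, // more than one block covers the position
}

/// Map a single source position through the blocks, rejecting anything
/// that does not land in exactly one block.
fn lift_position(blocks: &[Block], pos: u64) -> Lift {
    let mut hits = blocks
        .iter()
        .filter(|b| pos >= b.src_start && pos < b.src_end);
    match (hits.next(), hits.next()) {
        (None, _) => Lift::Unmapped,
        (Some(b), None) => Lift::Mapped(b.dst_start + (pos - b.src_start)),
        (Some(_), Some(_)) => Lift::Ambiguous,
    }
}
```

A caller would count each `Unmapped` or `Ambiguous` result and drop the variant rather than guess a coordinate.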
How It Works (Data Processing Logic)
Strand Flipping Logic
At a trusted SNP site, one allele should match the reference base for the assumed build.
- If alleles do not match the reference, the tool checks the complement.
- If the complement matches, the allele representation is flipped.
- Strand-ambiguous SNPs (A/T and C/G) require additional context and are best handled with --panel.
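The rules above can be sketched as a small decision function. This is an illustration only: `complement`, `StrandCall`, and `resolve_strand` are hypothetical names, and the sketch simply reports strand-ambiguous pairs instead of resolving them, since that requires panel context.

```rust
/// Watson-Crick complement of a base; None for anything other than A/C/G/T.
fn complement(base: char) -> Option<char> {
    match base {
        'A' => Some('T'),
        'T' => Some('A'),
        'C' => Some('G'),
        'G' => Some('C'),
        _ => None,
    }
}

#[derive(Debug, PartialEq)]
enum StrandCall {
    Forward,             // an allele already matches the reference base
    Flipped(char, char), // complemented alleles match the reference base
    Ambiguous,           // A/T or C/G pair: the complement check cannot decide
    Mismatch,            // neither orientation matches
}

/// Hypothetical sketch of the strand check for one SNP genotype.
fn resolve_strand(ref_base: char, a1: char, a2: char) -> StrandCall {
    // A/T and C/G pairs read the same on both strands.
    if complement(a1) == Some(a2) {
        return StrandCall::Ambiguous;
    }
    if a1 == ref_base || a2 == ref_base {
        return StrandCall::Forward;
    }
    match (complement(a1), complement(a2)) {
        (Some(c1), Some(c2)) if c1 == ref_base || c2 == ref_base => StrandCall::Flipped(c1, c2),
        _ => StrandCall::Mismatch,
    }
}
```

With a reference base of A, the genotype T/C is reported as flipped to A/G, while A/T is reported as ambiguous.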
Ploidy Enforcement
The converter enforces expected ploidy on sex chromosomes:
- Male non-PAR X/Y: haploid
- Female X: diploid
- Female Y: dropped
PAR boundaries are assembly-specific.
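As a compact restatement of these rules, here is a sketch of a ploidy lookup. The types and the function `expected_ploidy` are hypothetical, the PAR test is reduced to a boolean supplied by the caller (real boundaries depend on the assembly), and treating male PAR sites as diploid is an assumption inferred from the "non-PAR" wording. MT is left out because this section does not state its rule.

```rust
enum Sex {
    Male,
    Female,
}

enum Chrom {
    Autosome,
    X,
    Y,
}

/// Hypothetical sketch: expected ploidy for a site, or None if the site
/// should be dropped. `in_par` is assumed to be computed by the caller
/// from assembly-specific PAR boundaries.
fn expected_ploidy(sex: Sex, chrom: Chrom, in_par: bool) -> Option<u8> {
    match (sex, chrom) {
        (_, Chrom::Autosome) => Some(2),
        (Sex::Female, Chrom::X) => Some(2),
        (Sex::Female, Chrom::Y) => None, // female Y calls are dropped
        // Male X/Y: haploid outside the PAR, assumed diploid inside it.
        (Sex::Male, Chrom::X) | (Sex::Male, Chrom::Y) => Some(if in_par { 2 } else { 1 }),
    }
}
```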
Build Detection
Build detection samples input sites and estimates concordance against expected reference alleles to infer GRCh37/hg19 vs GRCh38.
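A concordance-based choice between candidate builds might look like the sketch below. The names `concordance` and `detect_build`, the input layout, and the `min_concordance` cutoff are all assumptions made for illustration; the crate's actual sampling and scoring are not described here.

```rust
/// Fraction of sampled sites where at least one observed allele equals the
/// candidate build's reference base at that site. `observed[i]` and
/// `ref_bases[i]` are assumed to describe the same sampled site.
fn concordance(observed: &[(char, char)], ref_bases: &[char]) -> f64 {
    let n = observed.len().min(ref_bases.len());
    if n == 0 {
        return 0.0;
    }
    let mut hits = 0usize;
    for (&(a1, a2), &r) in observed.iter().zip(ref_bases.iter()) {
        if a1 == r || a2 == r {
            hits += 1;
        }
    }
    hits as f64 / n as f64
}

/// Pick the candidate with the highest concordance, or None if even the
/// best candidate falls below `min_concordance` (a hypothetical cutoff).
fn detect_build<'a>(
    observed: &[(char, char)],
    candidates: &[(&'a str, Vec<char>)],
    min_concordance: f64,
) -> Option<&'a str> {
    let mut best: Option<(&'a str, f64)> = None;
    for (name, refs) in candidates {
        let score = concordance(observed, refs);
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((*name, score));
        }
    }
    best.filter(|&(_, s)| s >= min_concordance).map(|(n, _)| n)
}
```

Because hg19 and GRCh37 share coordinates, a sketch like this only needs two candidates to separate the GRCh37/hg19 family from GRCh38.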
Output Report
Why the report matters
Every run produces a JSON report alongside the output file (e.g., sample_report.json). Treat this as an audit trail:
- what the tool inferred (sex, build)
- whether liftover was applied
- how many sites were standardized or harmonized
- how many variants were filtered or failed verification
Every run produces a JSON report alongside the output file (e.g., sample_report.json) containing:
Key field definitions
- total_records: number of input rows processed from the source (after basic parsing).
- emitted_records: number of records written to the output.
- variant_records: emitted records with at least one ALT allele.
- reference_records: emitted reference-matching records (ALT empty).
Liftover-specific counters are also included:
- liftover_unmapped: no chain interval found
- liftover_ambiguous: multiple chains/intervals eligible (rejected)
- liftover_incompatible: alleles incompatible with target reference checks
- liftover_straddled: endpoints do not lift consistently (e.g., indel spans blocks)
- liftover_contig_missing: lifted contig does not exist in target reference
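When auditing a report, a couple of sanity checks follow directly from the field definitions. The sketch below mirrors the documented field names in plain Rust structs; the struct names and helper methods are hypothetical, and the check assumes every emitted record is counted as either a variant record or a reference record.

```rust
/// Record counters, named after the documented report fields.
struct RecordCounts {
    total_records: u64,
    emitted_records: u64,
    variant_records: u64,
    reference_records: u64,
}

/// Liftover rejection counters, named after the documented report fields.
#[derive(Default)]
struct LiftoverCounts {
    liftover_unmapped: u64,
    liftover_ambiguous: u64,
    liftover_incompatible: u64,
    liftover_straddled: u64,
    liftover_contig_missing: u64,
}

impl RecordCounts {
    /// Assumption: emitted records split exactly into variant and reference records.
    fn emitted_is_consistent(&self) -> bool {
        self.emitted_records == self.variant_records + self.reference_records
    }

    /// Share of parsed input rows that were written out (0.0 for empty input).
    fn emitted_fraction(&self) -> f64 {
        if self.total_records == 0 {
            0.0
        } else {
            self.emitted_records as f64 / self.total_records as f64
        }
    }
}

impl LiftoverCounts {
    /// Total number of variants rejected by liftover, across all reasons.
    fn total_rejected(&self) -> u64 {
        self.liftover_unmapped
            + self.liftover_ambiguous
            + self.liftover_incompatible
            + self.liftover_straddled
            + self.liftover_contig_missing
    }
}
```

A sharp drop in the emitted fraction, or a large rejected total, is a hint to check whether the wrong build or reference was supplied.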