# convert_genome
[](https://crates.io/crates/convert_genome)
[](https://docs.rs/convert_genome)
[](https://github.com/SauersML/convert_genome/actions/workflows/ci.yml)
`convert_genome` converts direct-to-consumer (DTC) genotype exports (23andMe, AncestryDNA, MyHeritage, etc.) into standard [VCF](https://samtools.github.io/hts-specs/VCFv4.5.pdf), [BCF](https://samtools.github.io/hts-specs/BCFv2_qref.pdf), or [PLINK](https://www.cog-genomics.org/plink/1.9/formats) binary formats. The converter supports remote references via HTTP(S) and handles compressed `.gz` and `.zip` archives.
## Features
- **Multiple output formats**: VCF text, BCF binary, and PLINK 1.9 binary (.bed/.bim/.fam).
- **Flexible input parsing**: Supports 23andMe, AncestryDNA (5-column), MyHeritage (CSV), deCODEme (6-column with strand flipping), and standard VCF/BCF.
- **Auto-detection**:
- **Genome Build**: Automatically detects `GRCh37` vs `GRCh38` vs `hg19`.
- **Biological Sex**: Infers sex from X/Y chromosome heterozygosity and density.
- **Automatic Liftover**: Seamlessly converts data between genome builds (e.g., `hg19` to `GRCh38`) by automatically downloading UCSC chain files when a build mismatch is detected.
- **Reference Panel Support**: Harmonize your data against a reference VCF (e.g., 1000 Genomes) to correct strand flips and align alleles.
- **Structured Reporting**: Automatically generates a `<output>_report.json` file with full run metadata, stats, and inference results.
- **Remote reference support**: Fetch `http://` and `https://` URLs with transparent decompression of `.gz` and `.zip` archives.
- **Sex chromosome handling**: Correct ploidy enforcement for X, Y, and MT chromosomes with PAR region awareness.
- **Allele polarization**: Optional `--standardize` mode to normalize alleles against the reference genome.
- **Comprehensive CI/CD**: Formatting, linting, testing, and coverage across Linux (with macOS/Windows support).
## Installation
### Automatic Install (Recommended)
Installs the latest binary for your platform (macOS/Linux/Windows):
```bash
# macOS / Linux / Windows (Git Bash)
curl -fsSL https://raw.githubusercontent.com/SauersML/convert_genome/main/install.sh | bash
```
## Manual installation
The project targets Rust **nightly** (see `rust-toolchain.toml`). Install the converter directly from the repository:
```bash
cargo install --path .
```
Alternatively, build the binary without installing:
```bash
cargo build --release
```
The resulting executable lives at `target/release/convert_genome`.
## CLI Usage
### Supported Inputs
- **DTC genotype exports**
- **23andMe**
- **AncestryDNA** (5-column)
- **MyHeritage** (CSV)
- **FTDNA** (CSV / whitespace-delimited, depending on export)
- **deCODEme** (6-column with strand indicator)
- **VCF / BCF** (for format conversion or downstream harmonization)
Compressed inputs are supported directly:
- **Gzip**: `.gz`
- **Zip**: `.zip`
### Workflow Examples
#### Simple Format Conversion
Convert a DTC genotype export to VCF without extra harmonization steps:
```bash
convert_genome \
--input data/genotypes.txt \
--reference GRCh38.fa \
--output genotypes.vcf
```
This will:
1. Detect the genome build.
2. Infer biological sex from the data.
3. Convert to VCF.
4. Produce `genotypes.vcf` and `genotypes_report.json`.
#### Preparing for Imputation
Imputation tools (Beagle, Shapeit, Eagle, Minimac, etc.) tend to be strict about:
- reference/alternate allele consistency
- strand/polarity
- expected ploidy on sex chromosomes
- compactness and indexing (BCF is usually preferred)
An “imputation-ready” run typically includes `--standardize` plus `--panel`:
```bash
convert_genome \
--input sample.txt \
--reference hg38.fa \
--standardize \
--panel panel.vcf \
--output sample.bcf \
--format bcf
```
#### Handling Ancient/Old Data
`--assembly` defines the **target** output build label and drives liftover decisions.
If your input is detected as `hg19` / `GRCh37`, but you request `GRCh38` output, the tool will automatically trigger liftover:
```bash
convert_genome \
--input ancient.txt \
--reference GRCh38.fa \
--assembly GRCh38 \
--output ancient.vcf
```
Bi-directional liftover is supported (up-lift to `GRCh38` or down-lift to `GRCh37`/`hg19`), and the chain registry includes older conversions (e.g., `NCBI36`).
### PLINK output
```bash
convert_genome \
--input input.txt \
--reference reference.fa \
--output output \
--format plink \
--sex male \ # Override auto-detection
--variants-only # Drop reference-matching sites
```
### Command-line options
| `--input <PATH>` | **Required.** Input genotype file (DTC, VCF, or BCF) |
| `--reference <PATH>` | **Required.** Reference genome FASTA (local or URL) |
| `--output <PATH>` | **Required.** Output file path (or prefix for PLINK) |
| `--format <vcf\|bcf\|plink>` | Output format (default: vcf) |
| `--sex <male\|female>` | Explicitly set sex (disables auto-inference) |
| `--sample <NAME>` | Sample identifier for VCF header |
| `--assembly <NAME>` | Assembly label for metadata (default: GRCh38) |
| `--panel <PATH>` | VCF panel for harmonization (e.g., 1000G sites) |
| `--standardize` | Standardize alleles to reference forward strand |
| `--variants-only` | Omit reference-only sites from output |
| `--input-format <dtc\|vcf\|bcf\|auto>` | Input format (default: auto-detect) |
| `--reference-fai <PATH>` | Explicit FASTA index path |
| `--log-level <LEVEL>` | Logging verbosity (default: info) |
### Flag Deep Dives
#### `--standardize`
Enforces reference-matching alleles against the provided reference FASTA.
- For SNPs, this can include allele polarization (swapping `REF/ALT` and remapping `GT`) when the reference base is found among the alleles.
- Applies ploidy rules for sex chromosomes (with PAR region awareness) when sex is known or inferred.
#### `--panel`
Harmonizes alleles against a reference VCF panel (e.g., 1000 Genomes sites).
- This is especially important for strand-ambiguous SNPs (A/T and C/G), where naive complement checks cannot uniquely determine orientation.
- The panel provides an additional anchor for allele alignment and padding.
#### `--sex`
If omitted, `convert_genome` infers sex using X/Y heterozygosity and variant density.
You can override inference with:
- `--sex male`
- `--sex female`
This influences X/Y ploidy enforcement (e.g., male non-PAR X/Y is treated as haploid).
#### `--assembly`
Defines the **target** output build label (default: `GRCh38`).
If the input build is detected as `GRCh37`/`hg19` but you specify `--assembly GRCh38`, the tool will automatically trigger the liftover engine.
## Using as a Rust Library
### Add the Dependency
Add this to your `Cargo.toml`:
```toml
[dependencies]
convert_genome = "0.1.2"
```
### Core Types and Entry Points
- `ConversionConfig`: configuration struct for controlling input/output formats, paths, inference toggles, and harmonization.
- `convert_dtc_file`: primary entry point for running a conversion and getting a `ConversionSummary` back.
The library uses `anyhow::Result` for top-level errors and also has internal error types for record-level failures.
### Minimal Example
```rust
use convert_genome::{convert_dtc_file, ConversionConfig, OutputFormat};
use convert_genome::cli::Sex;
use convert_genome::input::InputFormat;
use std::path::PathBuf;
fn main() -> anyhow::Result<()> {
let summary = convert_dtc_file(ConversionConfig {
input: PathBuf::from("data/genotypes.txt"),
input_format: InputFormat::Auto,
input_origin: String::from("local"),
reference_fasta: Some(PathBuf::from("GRCh38.fa")),
reference_origin: Some(String::from("local")),
reference_fai: None,
reference_fai_origin: None,
output: PathBuf::from("out.vcf"),
output_dir: None,
output_format: OutputFormat::Vcf,
sample_id: String::from("SAMPLE"),
assembly: String::from("GRCh38"),
include_reference_sites: true,
sex: Some(Sex::Female),
par_boundaries: None,
standardize: true,
panel: None,
})?;
eprintln!("emitted_records={}", summary.emitted_records);
Ok(())
}
```
## Genome Build Detection & Liftover
### How does it know?
The tool samples variants from your input and checks concordance against expected reference alleles to distinguish common builds (e.g., `GRCh37`/`hg19` vs `GRCh38`). This prevents accidentally emitting a file labeled as one build while containing coordinates from another.
### Liftover Details
- **Trigger:** `--assembly` defines the **target** build.
If the input is detected as `hg19`/`GRCh37` but you request `GRCh38`, liftover is automatically enabled.
- **Automatic downloads:** required UCSC chain files (e.g., `hg19ToHg38.over.chain.gz`) are fetched on demand and cached locally. The first run may require an internet connection.
- **Fail-safe behavior:** variants that cannot be mapped (deleted regions, gaps, ambiguous mappings) are safely filtered and counted in the output report.
## How It Works (Data Processing Logic)
### Strand Flipping Logic
At a trusted SNP site, one allele should match the reference base for the assumed build.
- If alleles do not match the reference, the tool checks the complement.
- If the complement matches, the allele representation is flipped.
- Strand-ambiguous SNPs (A/T and C/G) require additional context and are best handled with `--panel`.
### Ploidy Enforcement
The converter enforces expected ploidy on sex chromosomes:
- Male non-PAR X/Y: haploid
- Female X: diploid
- Female Y: dropped
PAR boundaries are assembly-specific.
### Build Detection
Build detection samples input sites and estimates concordance against expected reference alleles to infer `GRCh37`/`hg19` vs `GRCh38`.
## Output Report
### Why the report matters
Every run produces a JSON report alongside the output file (e.g., `sample_report.json`). Treat this as an audit trail:
- what the tool inferred (sex, build)
- whether liftover was applied
- how many sites were standardized or harmonized
- how many variants were filtered or failed verification
Every run produces a JSON report alongside the output file (e.g., `sample_report.json`) containing:
```json
{
"version": "0.1.0",
"timestamp": "2025-12-16T22:44:00Z",
"input": { "format": "dtc", ... },
"output": { "format": "vcf", ... },
"sample": { "id": "SAMPLE", "sex": "male", "sex_inferred": true },
"build_detection": { "detected_build": "GRCh38" },
"standardize": true,
"panel": {
"total_sites": 120000,
"modified_sites": 150, // Sites flipped/swapped to match panel
"novel_sites": 25 // Sites not in panel
},
"statistics": {
"total_records": 638234,
"emitted_records": 612847,
...
}
}
```
### Key field definitions
- `total_records`: number of input rows processed from the source (after basic parsing).
- `emitted_records`: number of records written to the output.
- `variant_records`: emitted records with at least one ALT allele.
- `reference_records`: emitted reference-matching records (ALT empty).
Liftover-specific counters are also included:
- `liftover_unmapped`: no chain interval found
- `liftover_ambiguous`: multiple chains/intervals eligible (rejected)
- `liftover_incompatible`: alleles incompatible with target reference checks
- `liftover_straddled`: endpoints do not lift consistently (e.g., indel spans blocks)
- `liftover_contig_missing`: lifted contig does not exist in target reference