convert_genome 0.1.6

Convert DTC, VCF, or BCF genome files to VCF, BCF, or PLINK 1.9

convert_genome

convert_genome converts genome files from one format to another, for example direct-to-consumer (DTC) genotype exports (23andMe, AncestryDNA, MyHeritage, etc.) into standard VCF, BCF, or PLINK binary formats. The converter supports remote references via HTTP(S) and handles compressed .gz and .zip archives. It can also convert from one genome build to another: it automatically detects the genome build, infers biological sex, lifts coordinates over, and harmonizes alleles against reference panels.

Core Capabilities

  1. Format Conversion: Transforms generic CSV/TSV genotype tables into compliant VCF v4.5, BCF v2.2, or PLINK 1.9 binary formats.
  2. Smart Input Handling: Transparently handles plain text, GZIP-compressed files, and ZIP archives.
  3. Automatic Inference:
    • Genome Build: Detects whether the input is GRCh37/hg19 or GRCh38/hg38 based on coordinate/allele concordance.
    • Biological Sex: Infers sex based on X-chromosome heterozygosity and Y-chromosome variant density (needed to determine ploidy for VCF).
    • Strand Orientation: Detects if the input file is reported on the forward or reverse strand relative to the reference.
  4. Automatic Liftover: If the detected input build differs from the requested output assembly (e.g., input is hg19, output is GRCh38), the tool automatically downloads UCSC chain files and lifts coordinates over.
  5. Standardization & Harmonization:
    • Allele Polarization: Swaps REF/ALT alleles to match the provided reference genome.
    • Ploidy Enforcement: Enforces correct haploid/diploid states for sex chromosomes based on sex and Pseudoautosomal Regions (PAR).
    • Panel Alignment: Aligns alleles against an external reference VCF (e.g., 1000 Genomes) to resolve strand ambiguity (A/T and C/G SNPs).
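
The concordance idea behind build detection (item 3) can be illustrated with a minimal sketch. It assumes a small in-memory table of known reference bases per build. The names Call, RefBases, concordance, and detect_build are hypothetical and are not part of the crate's API; the tool's actual sampling and scoring may differ.

```rust
use std::collections::HashMap;

/// One sampled genotype call: chromosome, 1-based position, two observed alleles.
/// (Hypothetical type for illustration only.)
struct Call<'a> {
    chrom: &'a str,
    pos: u64,
    alleles: [char; 2],
}

/// Hypothetical lookup of known reference bases keyed by (chromosome, position).
type RefBases = HashMap<(String, u64), char>;

/// Fraction of calls at known sites where at least one observed allele
/// equals the reference base of the given build.
fn concordance(calls: &[Call<'_>], reference: &RefBases) -> f64 {
    let mut checked = 0usize;
    let mut matched = 0usize;
    for call in calls {
        if let Some(&base) = reference.get(&(call.chrom.to_string(), call.pos)) {
            checked += 1;
            if call.alleles.contains(&base) {
                matched += 1;
            }
        }
    }
    if checked == 0 {
        0.0
    } else {
        matched as f64 / checked as f64
    }
}

/// Pick the build with the higher concordance; return None on a tie.
fn detect_build(calls: &[Call<'_>], grch37: &RefBases, grch38: &RefBases) -> Option<&'static str> {
    let c37 = concordance(calls, grch37);
    let c38 = concordance(calls, grch38);
    if c37 == c38 {
        None
    } else if c37 > c38 {
        Some("GRCh37")
    } else {
        Some("GRCh38")
    }
}
```

A production detector would likely need many sampled sites and a minimum margin between the two scores before committing to a build.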

Usage Logic & Behavior

Input Parsing

The tool automatically detects the input format. It supports:

  • DTC Formats: 23andMe, AncestryDNA (5-column), MyHeritage (CSV), deCODEme (6-column with strand), and generic whitespace-delimited formats.
  • Standard Formats: VCF and BCF inputs are also supported for normalization/liftover tasks.

Compression: Inputs can be .gz, .zip, bgzipped, or some combination thereof.
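
One common way to handle compressed inputs transparently is to inspect the first bytes of the file rather than trust its extension. The sketch below shows that technique. Compression and sniff_compression are hypothetical names, and the crate may detect compression differently.

```rust
/// Container type guessed from the leading bytes of a file.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
enum Compression {
    Gzip,
    Zip,
    Plain,
}

/// Classify a buffer by its magic bytes.
/// - gzip streams (including bgzip/BGZF, which is gzip-compatible) start with 0x1f 0x8b.
/// - ZIP archives with at least one entry start with the local file header "PK\x03\x04".
fn sniff_compression(head: &[u8]) -> Compression {
    if head.starts_with(&[0x1f, 0x8b]) {
        Compression::Gzip
    } else if head.starts_with(b"PK\x03\x04") {
        Compression::Zip
    } else {
        Compression::Plain
    }
}
```

Nested inputs, such as a .gz file inside a .zip archive, can be handled by applying the same check again to the decompressed stream.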

Genome Build Detection & Liftover

The conversion pipeline follows these steps to keep coordinates consistent with the target assembly:

  1. Detection: The tool samples variants from the input and checks concordance against known reference alleles for both GRCh37 and GRCh38.
  2. Trigger: The user specifies a target --assembly (default: GRCh38).
  3. Action:
    • If Input Build == Target Assembly, conversion proceeds directly.
    • If Input Build != Target Assembly, the "Liftover Engine" is engaged.
  4. Liftover Execution:
    • Required chain files (e.g., hg19ToHg38.over.chain.gz) are automatically downloaded from UCSC to a local cache.
    • Coordinates are remapped.
    • Fail-Closed Behavior: Variants that map to multiple locations (ambiguous), do not map at all, or "straddle" chain boundaries (split indels) are discarded to preserve data integrity.
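
The fail-closed rule can be sketched with a deliberately simplified chain model: a flat list of aligned blocks on the same strand, using 0-based half-open coordinates. Block, Lift, and lift are hypothetical names. Real UCSC chain files also carry scores, strands, and gaps, which this sketch ignores.

```rust
/// Outcome of lifting one interval. Only Mapped results would be kept.
#[derive(Debug, PartialEq)]
enum Lift {
    Mapped(u64),
    Unmapped,
    Ambiguous,
    Straddles,
}

/// One aligned block: source interval [src_start, src_end) maps to dst_start onward.
struct Block {
    src_start: u64,
    src_end: u64,
    dst_start: u64,
}

/// Lift the source interval [start, end) through the blocks, failing closed.
fn lift(blocks: &[Block], start: u64, end: u64) -> Lift {
    // Blocks that contain the whole interval.
    let mut containing = blocks
        .iter()
        .filter(|b| b.src_start <= start && end <= b.src_end);
    match (containing.next(), containing.next()) {
        // Exactly one containing block: shift by the offset within the block.
        (Some(b), None) => Lift::Mapped(b.dst_start + (start - b.src_start)),
        // More than one containing block: the target location is ambiguous.
        (Some(_), Some(_)) => Lift::Ambiguous,
        (None, _) => {
            // Partial overlap means the interval crosses a block boundary.
            let overlaps = blocks.iter().any(|b| b.src_start < end && start < b.src_end);
            if overlaps {
                Lift::Straddles
            } else {
                Lift::Unmapped
            }
        }
    }
}
```

A caller would emit only records that come back as Lift::Mapped and count the other three outcomes separately for reporting.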

Sex Inference & Ploidy

Unless explicitly overridden via flags, the tool infers sex to ensure correct VCF representation:

  • Female: X is Diploid, Y is absent (or filtered).
  • Male: X and Y are Haploid, except in Pseudoautosomal Regions (PAR1/PAR2), where they remain Diploid.
  • Mitochondrial (MT): Always treated as Haploid.
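
The ploidy rules above can be written as a small lookup. This is a sketch under stated assumptions: SexStandIn, ParRegions, and expected_ploidy are local stand-ins, not the crate's Sex or ParBoundaries types, and the coordinates are the commonly cited GRCh38 PAR ranges (1-based, inclusive), which should be verified against your reference.

```rust
/// Stand-in for a sex value (not the crate's own enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum SexStandIn {
    Male,
    Female,
}

/// Stand-in for PAR boundaries: two inclusive ranges each on X and Y.
struct ParRegions {
    x: [(u64, u64); 2],
    y: [(u64, u64); 2],
}

impl ParRegions {
    /// Commonly cited GRCh38 PAR1/PAR2 coordinates (assumption: verify before use).
    fn grch38() -> Self {
        ParRegions {
            x: [(10_001, 2_781_479), (155_701_383, 156_030_895)],
            y: [(10_001, 2_781_479), (56_887_903, 57_217_415)],
        }
    }
}

fn in_any(regions: &[(u64, u64)], pos: u64) -> bool {
    regions.iter().any(|&(start, end)| start <= pos && pos <= end)
}

/// Expected number of alleles in GT, or None when the chromosome is absent.
fn expected_ploidy(sex: SexStandIn, chrom: &str, pos: u64, par: &ParRegions) -> Option<u8> {
    // Accept both "chrX" and "X" style names.
    let name = chrom.strip_prefix("chr").unwrap_or(chrom);
    match (name, sex) {
        ("MT" | "M", _) => Some(1),
        ("X", SexStandIn::Female) => Some(2),
        ("Y", SexStandIn::Female) => None,
        ("X", SexStandIn::Male) => Some(if in_any(&par.x, pos) { 2 } else { 1 }),
        ("Y", SexStandIn::Male) => Some(if in_any(&par.y, pos) { 2 } else { 1 }),
        // Autosomes are diploid for everyone.
        _ => Some(2),
    }
}
```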

Allele Standardization (Polarization)

When --standardize is enabled, the tool ensures the REF allele in the output VCF matches the FASTA reference provided.

  • If the input reports "A" but the Reference genome says "G":
    • The tool checks if the "A" is a valid ALT.
    • It swaps the alleles (REF becomes G, ALT becomes A) and updates the Genotype (GT) indices (e.g., 0/0 -> 1/1).
  • Synthetic IDs are generated for variants lacking identifiers (formatted as chrom:pos:ref:alt).
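
The swap described above can be sketched for a single biallelic site. Site, polarize, and synthetic_id are hypothetical names chosen for illustration. The sketch rejects sites where neither allele matches the reference, which may not be how the tool handles that case.

```rust
/// A biallelic site with genotype indices (0 = REF, 1 = ALT).
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct Site {
    ref_allele: String,
    alt_allele: String,
    gt: [u8; 2],
}

/// Make REF agree with the base found in the reference FASTA.
/// Returns None when neither allele matches the reference.
fn polarize(mut site: Site, fasta_base: &str) -> Option<Site> {
    if site.ref_allele == fasta_base {
        return Some(site); // already polarized
    }
    if site.alt_allele != fasta_base {
        return None; // reference base is not among the reported alleles
    }
    // Swap REF and ALT, then remap genotype indices 0 <-> 1.
    std::mem::swap(&mut site.ref_allele, &mut site.alt_allele);
    for idx in site.gt.iter_mut() {
        *idx = match *idx {
            0 => 1,
            1 => 0,
            other => other,
        };
    }
    Some(site)
}

/// Build an identifier in the chrom:pos:ref:alt form described above.
fn synthetic_id(chrom: &str, pos: u64, site: &Site) -> String {
    format!("{}:{}:{}:{}", chrom, pos, site.ref_allele, site.alt_allele)
}
```

With the example from the text, an input of REF A, ALT G, GT 0/0 against a reference base of G becomes REF G, ALT A, GT 1/1.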

Panel Harmonization

When preparing data for imputation (e.g., with Beagle), strand ambiguity is a major issue: an A/T SNP is indistinguishable from a T/A SNP if the strand is unknown, because each allele is the complement of the other.

  • By providing a --panel (VCF/BCF), the tool checks if the input alleles match the panel's alleles.
  • It attempts to align the input to the panel, flipping strands if necessary.
  • It outputs a "padded" panel containing any novel alleles found in the user data, ensuring the reference panel and target VCF are perfectly compatible for imputation tools.
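
The strand check can be illustrated with a sketch that compares an input allele pair to a panel allele pair. StrandCall, complement, and classify are hypothetical names. The sketch only flags A/T and C/G pairs as ambiguous; it does not reproduce whatever strategy the tool uses to resolve them.

```rust
/// Result of comparing input alleles to panel alleles at one SNP.
#[derive(Debug, PartialEq)]
enum StrandCall {
    Same,
    Flipped,
    Ambiguous,
    Mismatch,
}

/// Watson-Crick complement; bases other than A/C/G/T are returned unchanged.
fn complement(base: char) -> char {
    match base {
        'A' => 'T',
        'T' => 'A',
        'C' => 'G',
        'G' => 'C',
        other => other,
    }
}

/// Compare two allele pairs while ignoring order.
fn same_pair(a: (char, char), b: (char, char)) -> bool {
    a == b || (a.1, a.0) == b
}

/// Classify an input SNP (two distinct A/C/G/T alleles) against the panel.
fn classify(input: (char, char), panel: (char, char)) -> StrandCall {
    // A/T and C/G pairs look identical on both strands.
    if complement(input.0) == input.1 {
        return StrandCall::Ambiguous;
    }
    if same_pair(input, panel) {
        StrandCall::Same
    } else if same_pair((complement(input.0), complement(input.1)), panel) {
        StrandCall::Flipped
    } else {
        // Alleles not present in the panel, on either strand.
        StrandCall::Mismatch
    }
}
```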

Output Formats

VCF (Variant Call Format)

  • Version: 4.5.
  • Encoding: Standard GT (Genotype) field.
  • Metadata: Headers include assembly, conversion software version, and date.
  • Symbolic Alleles: Large deletions/insertions are normalized to <DEL> or <INS> symbolic alleles with SVTYPE info fields.

BCF (Binary Call Format)

  • Version: 2.2.
  • Behavior: Carries the same records as VCF in a compressed binary encoding that can be indexed. Recommended for large-scale pipelines.

PLINK 1.9 (Binary)

Produces a file trio using the output filename as a prefix:

  1. .bed: Primary binary genotype matrix.
  2. .bim: Variant information (Chromosome, SNP ID, cM, Position, Allele 1, Allele 2).
  3. .fam: Sample information (FID, IID, Paternal ID, Maternal ID, Sex, Phenotype).
    • Note: Sex is encoded as 1 (Male) or 2 (Female). Phenotype is set to -9 (Missing).
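
For readers unfamiliar with the .bed layout, the sketch below packs genotype calls the way the PLINK 1 binary format defines it: three magic bytes, then two bits per sample in variant-major order, with the first sample in the lowest-order bits. BedCall and pack_variant are hypothetical names, and this is not the crate's writer.

```rust
/// Magic bytes at the start of a variant-major PLINK 1 .bed file.
const BED_MAGIC: [u8; 3] = [0x6c, 0x1b, 0x01];

/// Genotype of one sample at one variant, relative to the .bim alleles.
/// (Hypothetical type for illustration only.)
#[derive(Clone, Copy)]
enum BedCall {
    HomA1,
    Missing,
    Het,
    HomA2,
}

/// Two-bit code used by the PLINK 1 .bed format.
fn code(call: BedCall) -> u8 {
    match call {
        BedCall::HomA1 => 0b00,
        BedCall::Missing => 0b01,
        BedCall::Het => 0b10,
        BedCall::HomA2 => 0b11,
    }
}

/// Pack the calls of all samples for one variant, four samples per byte.
/// Unused high bits of the final byte stay zero.
fn pack_variant(calls: &[BedCall]) -> Vec<u8> {
    calls
        .chunks(4)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u8, |byte, (i, &call)| byte | (code(call) << (2 * i)))
        })
        .collect()
}
```

A single-sample DTC conversion would therefore write one byte per variant after the magic bytes.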

Run Report (_report.json)

Every execution produces a sidecar JSON file containing a comprehensive audit trail:

  • Inference Results: What sex and build were detected.
  • Statistics: Total records, valid records, filtered variants.
  • Liftover Details: Specific counts for unmapped variants, ambiguous mappings, or reference mismatches.
  • Panel Stats: How many sites were harmonized vs. novel.

CLI Options Overview

The tool is controlled via a unified command-line interface.

Required:

  • --input: Path to the genotype file.
  • --reference: Path to the reference FASTA (or a URL).
  • --output (or --output-dir): Destination for the converted data.

Key Flags:

  • --format: vcf, bcf, or plink.
  • --assembly: The target assembly (e.g., GRCh38). Drives liftover logic.
  • --sex: Override automatic sex inference (male or female).
  • --standardize: Enable reference-based allele polarization and normalization.
  • --panel: Path to a VCF/BCF reference panel for harmonization.
  • --variants-only: Output only sites where the sample differs from the reference.
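
When the converter is driven from another Rust program, the flags above can be assembled with std::process::Command. This sketch rests on several assumptions: that the installed binary is named convert_genome, that --standardize is a switch taking no value, and that the file names passed in are placeholders chosen by the caller. conversion_command is a hypothetical helper.

```rust
use std::process::Command;

/// Build (but do not run) a conversion command using the flags listed above.
/// Assumptions: binary name "convert_genome"; --standardize takes no value.
fn conversion_command(input: &str, reference: &str, output: &str) -> Command {
    let mut cmd = Command::new("convert_genome");
    cmd.args(["--input", input])
        .args(["--reference", reference])
        .args(["--output", output])
        .args(["--format", "vcf"])
        .args(["--assembly", "GRCh38"])
        .arg("--standardize");
    cmd
}
```

Calling .status() on the returned command would run the conversion and report its exit status.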

Library Usage (API Concepts)

For Rust developers, the core logic is exposed via the conversion module. The primary entry point is the convert_dtc_file function, driven by a configuration struct.

ConversionConfig

The configuration object controls the entire pipeline state:

  • Paths: Input, output, reference FASTA, and optional panel paths.
  • Format Enums: InputFormat (DTC/VCF/BCF) and OutputFormat.
  • Biological Context: Sex enum, ParBoundaries (defined ranges for X/Y recombination), and assembly string.
  • Flags: standardize (bool), include_reference_sites (bool).

convert_dtc_file(config: ConversionConfig) -> Result<ConversionSummary>

This function executes the pipeline:

  1. Pre-scan: Reads a subset of the file to run the Inference Engine (Sex/Build).
  2. Resource Loading: Fetches/Loads reference genomes, chain files, and panels.
  3. Source Iterator: Wraps the input in a smart iterator that handles parsing, sorting, and initial validation.
  4. Transformation Stream:
    • Liftover Adapter: If chains are loaded, maps coordinates on-the-fly.
    • Standardizer: Polarizes alleles against the loaded ReferenceGenome.
    • Harmonizer: Aligns against the PaddedPanel.
  5. Writing: Streams processed records to the specific output writer (VCF/BCF/PLINK).

ConversionSummary

The return object provides precise metrics on the run, useful for quality control or integration testing:

  • total_records vs emitted_records.
  • liftover_unmapped: Count of variants lost due to missing chain mappings.
  • invalid_genotypes: Count of malformed input lines.
  • reference_failures: Count of sites where reference lookup failed.
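
A quality-control gate on these metrics might look like the sketch below. SummaryStandIn is a local stand-in that borrows the field names listed above with assumed usize types; it is not the crate's ConversionSummary. The retention and check_retention helpers and the threshold are hypothetical.

```rust
/// Local stand-in mirroring the metric names above (types are assumed).
struct SummaryStandIn {
    total_records: usize,
    emitted_records: usize,
    liftover_unmapped: usize,
    invalid_genotypes: usize,
    reference_failures: usize,
}

/// Fraction of input records that reached the output.
fn retention(summary: &SummaryStandIn) -> f64 {
    if summary.total_records == 0 {
        0.0
    } else {
        summary.emitted_records as f64 / summary.total_records as f64
    }
}

/// Fail when too few records survive, reporting where they were lost.
fn check_retention(summary: &SummaryStandIn, min_retention: f64) -> Result<(), String> {
    let kept = retention(summary);
    if kept >= min_retention {
        Ok(())
    } else {
        Err(format!(
            "retention {:.3} below {:.3} (unmapped: {}, invalid: {}, reference failures: {})",
            kept,
            min_retention,
            summary.liftover_unmapped,
            summary.invalid_genotypes,
            summary.reference_failures
        ))
    }
}
```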