check_build 0.4.0

A tool to verify a VCF file against hg19 and hg38 references using a streaming, low-memory approach.

check_build

A fast, memory-efficient tool to verify VCF files against hg19 and hg38 reference genomes. Also available as a library for general-purpose use beyond VCF.

Quick Start

What build is my file?

check_build --detect my_variants.vcf
# Output: Hg38 (100.0% match, high confidence)

Full verification:

check_build my_variants.vcf

Installation

cargo install check_build

Or from source:

git clone https://github.com/SauersML/check_build.git
cd check_build
cargo build --release

Usage

CLI

# Simple build detection
check_build --detect sample.vcf

# Full verification with summary
check_build sample.vcf

# Quiet mode (no progress bars)
check_build -q sample.vcf

# Summary only (no mismatch details)
check_build -s sample.vcf

# Single reference
check_build --hg38-only sample.vcf

# Custom reference paths
check_build --hg19-path /data/hg19.fa --hg38-path /data/hg38.fa sample.vcf

Library

Add to Cargo.toml:

[dependencies]
check_build = { git = "https://github.com/SauersML/check_build" }

Simple usage:

use check_build::detect_build;

let result = detect_build("sample.vcf")?;
println!("{}", result);  // "Hg38 (100.0% match, high confidence)"

Full control:

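// BuildDetection is matched on below; it is assumed to be exported at the crate root
// alongside Verifier and Reference.
use check_build::{BuildDetection, Reference, Verifier};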

let result = Verifier::new("sample.vcf")
    .quiet()
    .verify_both()?;

println!("hg19: {:.1}% match", result.match_rate(Reference::Hg19));
println!("hg38: {:.1}% match", result.match_rate(Reference::Hg38));

// Detailed detection with edge case handling
match result.detect() {
    BuildDetection::Detected { build, confidence, .. } => {
        println!("Build: {:?} ({} confidence)", build, confidence);
    }
    BuildDetection::Ambiguous { reason, .. } => {
        println!("Cannot determine: {}", reason);
    }
    BuildDetection::Unknown { reason, .. } => {
        println!("Problem with file: {}", reason);
    }
    BuildDetection::NoData => {
        println!("No valid variants found");
    }
}

Features

  • Fast: Parallel verification of hg19/hg38 using rayon
  • Memory-efficient: Streams references, processes one contig at a time
  • Auto-download: Fetches reference FASTAs if not present
  • Edge case handling: Detects ambiguous, unknown, or corrupt files
  • Dual interface: Both CLI and library
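
As a rough illustration of the parallel idea, the sketch below checks one set of variants against two references at the same time. It is not the crate's code: it uses `std::thread::scope` from the standard library as a stand-in for rayon, and `count_matches`, `verify_both`, and the toy sequences are hypothetical names and data made up for this example.

```rust
use std::thread;

/// Hypothetical stand-in for verifying one reference: counts how many
/// (0-based position, REF base) pairs agree with the given sequence.
fn count_matches(reference: &[u8], variants: &[(usize, u8)]) -> usize {
    variants
        .iter()
        .filter(|&&(pos, base)| {
            reference
                .get(pos)
                .map_or(false, |b| b.eq_ignore_ascii_case(&base))
        })
        .count()
}

/// Checks the same variants against both references, one scoped thread each.
fn verify_both(hg19: &[u8], hg38: &[u8], variants: &[(usize, u8)]) -> (usize, usize) {
    thread::scope(|s| {
        let h19 = s.spawn(|| count_matches(hg19, variants));
        let h38 = s.spawn(|| count_matches(hg38, variants));
        (
            h19.join().expect("hg19 worker panicked"),
            h38.join().expect("hg38 worker panicked"),
        )
    })
}

fn main() {
    // Toy sequences standing in for the two reference genomes.
    let hg19 = b"ACGTACGT";
    let hg38 = b"ACGAACGA";
    let variants = [(0usize, b'A'), (3usize, b'A'), (7usize, b'A')];
    let (m19, m38) = verify_both(hg19, hg38, &variants);
    println!("hg19: {m19}/3 match, hg38: {m38}/3 match");
}
```

Scoped threads let both workers borrow the variant list without cloning it, which is the same property a rayon-based version would rely on.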

How It Works

  1. Splits VCF by contig into temp files
  2. Streams each reference FASTA (never loads full genome)
  3. Verifies that each REF allele matches the reference bases at its position
  4. Reports match rates and infers build
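
The steps above can be sketched in miniature as follows. This is an assumption-laden toy, not the crate's implementation: records are grouped in memory rather than in temp files, the FASTA is read from any `BufRead`, and `Record`, `split_by_contig`, `check_contig`, `tally`, and `verify` are hypothetical names introduced for illustration.

```rust
use std::collections::HashMap;
use std::io::{BufRead, Cursor};

/// One VCF record reduced to what REF checking needs (POS is 1-based, as in VCF).
struct Record {
    pos: usize,
    ref_allele: String,
}

/// Step 1: group records by contig (in memory here, instead of temp files).
fn split_by_contig(vcf: &str) -> HashMap<String, Vec<Record>> {
    let mut by_contig: HashMap<String, Vec<Record>> = HashMap::new();
    for line in vcf.lines().filter(|l| !l.starts_with('#') && !l.trim().is_empty()) {
        let fields: Vec<&str> = line.split('\t').collect();
        if fields.len() < 4 {
            continue; // malformed line: skip it
        }
        if let Ok(pos) = fields[1].parse::<usize>() {
            let record = Record { pos, ref_allele: fields[3].to_string() };
            by_contig.entry(fields[0].to_string()).or_default().push(record);
        }
    }
    by_contig
}

/// Step 3: compare each REF allele with the bases at its position.
/// Returns (matched, total) for this contig.
fn check_contig(seq: &[u8], records: &[Record]) -> (usize, usize) {
    let mut matched = 0;
    for r in records {
        let start = r.pos.saturating_sub(1);
        let end = start + r.ref_allele.len();
        if r.pos >= 1
            && end <= seq.len()
            && seq[start..end].eq_ignore_ascii_case(r.ref_allele.as_bytes())
        {
            matched += 1;
        }
    }
    (matched, records.len())
}

/// Adds the finished contig's counts, if the VCF has records on it.
fn tally(
    name: &Option<String>,
    seq: &[u8],
    by_contig: &HashMap<String, Vec<Record>>,
    matched: &mut usize,
    total: &mut usize,
) {
    if let Some(records) = name.as_ref().and_then(|n| by_contig.get(n)) {
        let (m, t) = check_contig(seq, records);
        *matched += m;
        *total += t;
    }
}

/// Steps 2 and 4: stream the FASTA line by line, keeping only the current
/// contig's sequence in memory, and return overall (matched, total).
/// Records on contigs absent from the FASTA are not counted.
fn verify<R: BufRead>(
    fasta: R,
    by_contig: &HashMap<String, Vec<Record>>,
) -> std::io::Result<(usize, usize)> {
    let (mut matched, mut total) = (0, 0);
    let mut name: Option<String> = None;
    let mut seq: Vec<u8> = Vec::new();
    for line in fasta.lines() {
        let line = line?;
        if let Some(header) = line.strip_prefix('>') {
            tally(&name, &seq, by_contig, &mut matched, &mut total);
            name = header.split_whitespace().next().map(str::to_string);
            seq.clear();
        } else {
            seq.extend_from_slice(line.trim_end().as_bytes());
        }
    }
    tally(&name, &seq, by_contig, &mut matched, &mut total);
    Ok((matched, total))
}

fn main() -> std::io::Result<()> {
    let vcf = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t2\t.\tC\tT\nchr1\t5\t.\tAC\tA\nchr2\t1\t.\tG\tA\n";
    let fasta = ">chr1 toy\nACGT\nACGT\n>chr2\nTTTT\n";
    let (matched, total) = verify(Cursor::new(fasta), &split_by_contig(vcf))?;
    println!("{matched}/{total} REF alleles match");
    Ok(())
}
```

Running the same check against each candidate reference and comparing the two match rates is what allows a build to be inferred.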

Exit Codes

Code  Meaning
0     Success (build detected or verification passed)
1     Error (file not found, download failed, etc.)
2     Ambiguous (matches both builds similarly)
3     Unknown (low match on both, possibly corrupt)
4     No data (VCF had no valid variants)
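
A calling program can branch on these codes. The sketch below is one possible way to do that from Rust, assuming `check_build` is on the `PATH`; `describe_exit` is a hypothetical helper written for this example, and `sample.vcf` is a placeholder path.

```rust
use std::process::Command;

/// Maps an exit code to its documented meaning (wording follows the table above).
fn describe_exit(code: i32) -> &'static str {
    match code {
        0 => "success: build detected or verification passed",
        1 => "error: file not found, download failed, etc.",
        2 => "ambiguous: matches both builds similarly",
        3 => "unknown: low match on both builds, possibly corrupt",
        4 => "no data: VCF had no valid variants",
        _ => "undocumented exit code",
    }
}

fn main() {
    // Run the CLI in detection mode and interpret how it exited.
    match Command::new("check_build").args(["--detect", "sample.vcf"]).status() {
        Ok(status) => match status.code() {
            Some(code) => println!("check_build exited with {code}: {}", describe_exit(code)),
            None => println!("check_build was terminated by a signal"),
        },
        // Reached when the binary is not installed or cannot be started.
        Err(err) => eprintln!("could not launch check_build: {err}"),
    }
}
```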

License

MIT