rsomics-plink-missing 0.1.0

Per-sample and per-variant genotype missingness from a PLINK1 binary fileset (plink --missing)
rsomics-plink-missing

Per-sample and per-variant genotype missingness from a PLINK1 binary fileset — a Rust reimplementation of plink --missing.

It writes the same two reports PLINK does:

| file   | one row per | columns                                 |
|--------|-------------|-----------------------------------------|
| .imiss | sample      | FID IID MISS_PHENO N_MISS N_GENO F_MISS |
| .lmiss | variant     | CHR SNP N_MISS N_GENO F_MISS            |

A genotype is missing when its packed 2-bit code is 01. F_MISS is N_MISS / N_GENO; N_GENO is the variant count in .imiss and the sample count in .lmiss. MISS_PHENO is Y when the .fam phenotype is unset (-9 or 0), else N.
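The counting rule above can be sketched in a few lines. This is an illustration only, not the crate's actual code: the function names are hypothetical, and it assumes the standard PLINK1 variant-major .bed layout (four samples per byte, first sample in the low-order two bits).

```rust
/// Hypothetical helper: count missing genotypes (2-bit code 0b01) in one
/// variant's packed row. Assumes four samples per byte, with the first
/// sample stored in the low-order two bits. Padding bits beyond
/// `n_samples` in the last byte are never inspected.
fn count_missing(row: &[u8], n_samples: usize) -> usize {
    (0..n_samples)
        .filter(|&i| (row[i / 4] >> (2 * (i % 4))) & 0b11 == 0b01)
        .count()
}

/// F_MISS = N_MISS / N_GENO. The result is NaN when both counts are zero.
fn f_miss(n_miss: usize, n_geno: usize) -> f64 {
    n_miss as f64 / n_geno as f64
}

/// Hypothetical helper: MISS_PHENO is 'Y' when the .fam phenotype field
/// is unset (-9 or 0), else 'N'.
fn miss_pheno(pheno: &str) -> char {
    if pheno == "-9" || pheno == "0" { 'Y' } else { 'N' }
}
```

For a per-variant (.lmiss) row, `count_missing(row, n_samples)` would give N_MISS and `n_samples` would be N_GENO; the per-sample (.imiss) counts accumulate the same test across variants.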

Usage

# both reports to stdout (.imiss then .lmiss)
rsomics-plink-missing --bfile path/to/fileset

# write path/to/out.imiss and path/to/out.lmiss (matches plink --out)
rsomics-plink-missing --bfile path/to/fileset --out path/to/out

# explicit component paths instead of a shared prefix
rsomics-plink-missing --bed a.bed --bim a.bim --fam a.fam --out out

# parallelise the scan
rsomics-plink-missing --bfile path/to/fileset --threads 8 --out out

Compatibility

Every field value is identical to PLINK 1.9: N_MISS, N_GENO, the %g-formatted F_MISS, MISS_PHENO, and the numeric chromosome codes (X→23, Y→24, XY→25, MT→26). tests/compat.rs diffs both reports field-for-field against live PLINK output when the plink binary (v1.9) is on PATH, and against checked-in golden output otherwise.
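The chromosome mapping listed above can be written as a small lookup. This is a sketch with a hypothetical function name. Only the four named codes come from this README; passing numeric labels through unchanged and matching labels case-sensitively are assumptions made for the example.

```rust
/// Hypothetical helper: map a .bim chromosome label to its numeric code.
/// X, Y, XY and MT use the codes listed in the Compatibility section.
/// Assumption: purely numeric labels are returned as-is, and anything
/// else yields None.
fn chrom_code(label: &str) -> Option<u32> {
    match label {
        "X" => Some(23),
        "Y" => Some(24),
        "XY" => Some(25),
        "MT" => Some(26),
        other => other.parse().ok(),
    }
}
```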

The inter-column whitespace is not always byte-identical. PLINK sizes its FID, IID, and SNP columns from an internal id buffer whose width depends on the .fam/.bim load order rather than on the true maximum id length, so for some datasets its padding is one space narrower than the maximum id would imply. We right-justify to the true maximum, which matches PLINK byte-for-byte in the common case (the in-repo golden is byte-identical) but can differ by one space when PLINK's buffer quirk applies. Whitespace-splitting parsers see identical data either way.
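To make the padding rule and the whitespace-insensitive comparison concrete, here is a minimal sketch. Both function names are hypothetical, and neither is claimed to be what the crate or tests/compat.rs actually does.

```rust
/// Hypothetical helper: right-justify every id to the width of the
/// longest id in the column (the "true maximum").
fn right_justify(ids: &[&str]) -> Vec<String> {
    let width = ids.iter().map(|s| s.chars().count()).max().unwrap_or(0);
    ids.iter()
        .map(|s| format!("{:>width$}", s, width = width))
        .collect()
}

/// Hypothetical helper: compare two reports line by line and field by
/// field, ignoring how much whitespace separates the fields.
fn same_fields(a: &str, b: &str) -> bool {
    let mut la = a.lines();
    let mut lb = b.lines();
    loop {
        match (la.next(), lb.next()) {
            // Both reports ended together: every line matched.
            (None, None) => return true,
            // Same tokens on this line: keep going.
            (Some(x), Some(y)) if x.split_whitespace().eq(y.split_whitespace()) => {}
            // Token mismatch or different line counts.
            _ => return false,
        }
    }
}
```

Two reports that differ only by one space of column padding compare equal under `same_fields`, which is the sense in which whitespace-splitting parsers see the same data.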

PLINK's --missing --within <file> cluster breakdown (the N_CLST/per-cluster .imiss/.lmiss columns) is not implemented; the unclustered whole-sample report is the common case.

Origin

This crate is an independent Rust reimplementation of plink --missing based on:

No source code from the GPL upstream was used as reference during implementation. Test fixtures are independently generated.

License: MIT OR Apache-2.0. Upstream credit: PLINK 1.9 (Christopher Chang et al., GPLv3).