# rsomics-plink-missing
Per-sample and per-variant genotype **missingness** from a PLINK1 binary
fileset — a Rust reimplementation of `plink --missing`.
It writes the same two reports PLINK does:

| Report | One row per | Columns |
| --- | --- | --- |
| `.imiss` | sample | `FID IID MISS_PHENO N_MISS N_GENO F_MISS` |
| `.lmiss` | variant | `CHR SNP N_MISS N_GENO F_MISS` |

A genotype is missing when its packed 2-bit code is `01`. `F_MISS` is
`N_MISS / N_GENO`; `N_GENO` is the variant count in `.imiss` and the sample
count in `.lmiss`. `MISS_PHENO` is `Y` when the `.fam` phenotype is unset
(`-9` or `0`), else `N`.
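
As a rough illustration of these definitions, here is a minimal sketch of the per-variant count and the two derived fields. The function names `count_missing`, `f_miss`, and `miss_pheno` are hypothetical and are not this crate's API. The sketch assumes the layout given in the PLINK `.bed` format spec: four genotypes per byte, lowest-order bit pair first, with unused padding in the last byte of each variant.

```rust
/// Count missing genotypes (2-bit code `01`) in one variant's packed bytes.
/// Only the first `n_samples` codes are read, so padding bits in the final
/// byte are ignored. Panics if `packed` holds fewer than `n_samples` codes.
fn count_missing(packed: &[u8], n_samples: usize) -> usize {
    (0..n_samples)
        .filter(|&i| (packed[i / 4] >> (2 * (i % 4))) & 0b11 == 0b01)
        .count()
}

/// F_MISS = N_MISS / N_GENO. In this sketch a zero denominator yields NaN.
fn f_miss(n_miss: usize, n_geno: usize) -> f64 {
    n_miss as f64 / n_geno as f64
}

/// MISS_PHENO: 'Y' when the `.fam` phenotype field is `-9` or `0`, else 'N'.
fn miss_pheno(phenotype: &str) -> char {
    if matches!(phenotype, "-9" | "0") { 'Y' } else { 'N' }
}
```

For a `.lmiss` row, `count_missing` would be called once per variant with `n_samples` as `N_GENO`. The per-sample `.imiss` counts need the transposed tally, which this sketch does not show.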
## Usage
```sh
# both reports to stdout (.imiss then .lmiss)
rsomics-plink-missing --bfile path/to/fileset
# write path/to/out.imiss and path/to/out.lmiss (matches plink --out)
rsomics-plink-missing --bfile path/to/fileset --out path/to/out
# explicit component paths instead of a shared prefix
rsomics-plink-missing --bed a.bed --bim a.bim --fam a.fam --out out
# parallelise the scan
rsomics-plink-missing --bfile path/to/fileset --threads 8 --out out
```
## Compatibility
Every **field value** is identical to PLINK 1.9: `N_MISS`, `N_GENO`, the
`%g`-formatted `F_MISS`, `MISS_PHENO`, and the numeric chromosome codes (X→23,
Y→24, XY→25, MT→26). `tests/compat.rs` diffs both reports field-for-field
against live PLINK output when the `plink` binary (v1.9) is on `PATH`, and
against checked-in golden output otherwise.
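
The chromosome-code mapping can be sketched as a small lookup. `chrom_code` is a hypothetical helper, not necessarily how this crate implements it. Passing numeric labels through unchanged and rejecting everything else are assumptions of the sketch.

```rust
/// Hypothetical helper: map a `.bim` chromosome label to its numeric code.
/// The four named labels follow the mapping stated above. Labels that
/// already parse as unsigned integers pass through unchanged (an assumption
/// of this sketch), and any other label yields `None`.
fn chrom_code(label: &str) -> Option<u32> {
    match label {
        "X" => Some(23),
        "Y" => Some(24),
        "XY" => Some(25),
        "MT" => Some(26),
        other => other.parse().ok(),
    }
}
```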
The inter-column whitespace is not always byte-identical. PLINK sizes its `FID`,
`IID`, and `SNP` columns from an internal id buffer whose width depends on the
`.fam`/`.bim` load order rather than on the true maximum id length, so for some
datasets its padding is one space narrower than the maximum id would imply. We
right-justify to the true maximum, which matches PLINK byte-for-byte in the
common case (the in-repo golden output is byte-identical) but can differ by a
space when PLINK's buffer quirk applies. Whitespace-splitting parsers see
identical data either way.
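
A minimal sketch of right-justifying an id column to the true maximum length follows. The name `right_justify` is hypothetical, and the sketch covers a single column only, not the full report layout.

```rust
/// Right-justify every id to the length of the longest id in the column.
/// Width is measured in characters; an empty column gets width 0.
fn right_justify(ids: &[&str]) -> Vec<String> {
    let width = ids.iter().map(|id| id.chars().count()).max().unwrap_or(0);
    ids.iter().map(|id| format!("{id:>width$}")).collect()
}
```

Because only leading spaces are added, splitting each padded field on whitespace recovers the original id regardless of the width chosen.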
PLINK's `--missing --within <file>` cluster breakdown (the `N_CLST`/per-cluster
`.imiss`/`.lmiss` columns) is not implemented; the unclustered whole-sample
report is the common case.
## Origin
This crate is an independent Rust reimplementation of `plink --missing` based on:
- The published method: Chang et al. 2015 (PLINK 1.9,
doi:10.1186/s13742-015-0047-8) and Purcell et al. 2007 (PLINK 1,
doi:10.1086/519795).
- The public PLINK 1.9 `--missing` / basic-statistics documentation
(<https://www.cog-genomics.org/plink/1.9/basic_stats>) and binary-fileset
format spec (<https://www.cog-genomics.org/plink/1.9/formats>).
- Black-box behaviour testing against the `plink` 1.9 binary.
No source code from the GPL-licensed upstream was used as a reference during
implementation. Test fixtures are independently generated.
License: MIT OR Apache-2.0.
Upstream credit: [PLINK 1.9](https://www.cog-genomics.org/plink/1.9/)
(Christopher Chang et al., GPLv3).