infer_sex
A high-performance, zero-dependency Rust library for classifying samples based on sex chromosome characteristics from summarized variant data.
About
This library uses a multi-pronged algorithm based on chromosome biology. It operates on an iterator of VariantInfo structs in a single pass, ensuring minimal memory usage and high throughput.
The core of the library is the SexInferenceAccumulator, a state machine that you feed variant data into to receive a classification and a detailed evidence report.
Highlights
- Performant: Has no dependencies and processes variants in a single streaming pass.
- Memory Efficient: State machine design uses constant memory.
- Transparent: The final call is accompanied by a detailed EvidenceReport showing the result of each internal check.
- Simple API: A straightforward, builder-like pattern for processing data.
- Robust: Supports both GRCh37/hg19 and GRCh38/hg38 genome builds.
Usage
The primary workflow involves creating a SexInferenceAccumulator, processing VariantInfo structs from your data source (e.g., a VCF file), and finalizing the analysis to get a result.
- Create a configuration for your genome build.
- Initialize the accumulator state machine.
let mut accumulator = new;
- In your application, create VariantInfo structs from your data.
- Process all variants in a single pass.
- Finalize the analysis to get the result. This consumes the accumulator.
- Use the structured result.
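The shape of this workflow (create, feed one variant at a time, then finalize by consuming the accumulator) can be sketched with stand-in types. Everything below is hypothetical and written only to illustrate the pattern: `Chrom`, `Variant`, `Counts`, `Accumulator`, `process`, `finish`, and `summarize` are assumed names, not the crate's actual definitions or signatures.

```rust
// Stand-in types for illustration only; these are NOT the crate's definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Chrom {
    X,
    Y,
    Other,
}

#[derive(Debug, Clone, Copy)]
struct Variant {
    chrom: Chrom,
    is_het: bool,
}

/// Fixed-size running counters: memory use does not grow with input length.
#[derive(Debug, Default, PartialEq, Eq)]
struct Counts {
    x_total: u64,
    x_het: u64,
    y_total: u64,
}

#[derive(Debug, Default)]
struct Accumulator {
    counts: Counts,
}

impl Accumulator {
    fn new() -> Self {
        Self::default()
    }

    /// Update the counters from one variant; nothing is buffered.
    fn process(&mut self, v: &Variant) {
        match v.chrom {
            Chrom::X => {
                self.counts.x_total += 1;
                if v.is_het {
                    self.counts.x_het += 1;
                }
            }
            Chrom::Y => self.counts.y_total += 1,
            Chrom::Other => {}
        }
    }

    /// Takes `self` by value, so the accumulator cannot be reused afterwards.
    fn finish(self) -> Counts {
        self.counts
    }
}

/// Drive the accumulator over any iterator of variants in one pass.
fn summarize<I: IntoIterator<Item = Variant>>(variants: I) -> Counts {
    let mut acc = Accumulator::new();
    for v in variants {
        acc.process(&v);
    }
    acc.finish()
}
```

Because `finish` takes ownership, the compiler rejects any attempt to feed more variants after finalizing, which is one way to express "this consumes the accumulator" in the type system.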
Algorithm
The final classification is determined by a voting system based on four biological checks.
- X Chromosome Heterozygosity: Compares the ratio of heterozygous to total variants on chromosome X against a threshold. A high ratio is expected with two X chromosomes (female); a low ratio is expected with a single X (male).
- Y Chromosome Presence: Evaluates the ratio of variants in the non-pseudoautosomal region (non-PAR) to the pseudoautosomal region (PAR) of chromosome Y. A high proportion of non-PAR variants provides strong evidence that a Y chromosome is present (male).
- SRY Presence: Checks for any variants within the SRY gene region. The presence of such variants is a strong indicator of a Y chromosome (male).
- PAR vs. Non-PAR X Heterozygosity: Compares the heterozygosity rate within the X-PAR to the rate in the X-non-PAR. A significantly higher rate in the PAR is a key indicator of a single X chromosome (male), because the PAR has a homologous copy on Y while the non-PAR does not. Zero heterozygosity in the non-PAR is a special case that provides strong evidence for the same call.
This algorithm has been validated on real-world microarray data.
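As a rough illustration of how four such checks could feed a vote, here is a minimal sketch. It is not the library's implementation: the `Counts` fields, the `Vote` and `Call` enums, the three threshold constants, the direction assigned to each vote, and the tie-breaking rule are all assumptions made for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Vote { Male, Female, Abstain }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Call { Male, Female, Indeterminate }

/// Hypothetical summary counts gathered during the single pass.
#[derive(Debug, Clone, Copy, Default)]
struct Counts {
    x_nonpar_total: u64,
    x_nonpar_het: u64,
    x_par_total: u64,
    x_par_het: u64,
    y_nonpar: u64,
    y_par: u64,
    sry: u64,
}

// Placeholder thresholds chosen for illustration, not the library's values.
const X_HET_THRESHOLD: f64 = 0.05;
const Y_NONPAR_THRESHOLD: f64 = 1.0;
const PAR_FOLD_THRESHOLD: f64 = 2.0;

/// Returns `None` when the denominator is zero, i.e. there is no evidence.
fn ratio(num: u64, den: u64) -> Option<f64> {
    if den == 0 { None } else { Some(num as f64 / den as f64) }
}

/// Check 1: heterozygous fraction across all chromosome X variants.
fn x_het_vote(c: &Counts) -> Vote {
    let het = c.x_par_het + c.x_nonpar_het;
    let total = c.x_par_total + c.x_nonpar_total;
    match ratio(het, total) {
        None => Vote::Abstain,
        Some(r) if r >= X_HET_THRESHOLD => Vote::Female,
        Some(_) => Vote::Male,
    }
}

/// Check 2: non-PAR to PAR variant ratio on chromosome Y.
fn y_presence_vote(c: &Counts) -> Vote {
    if c.y_nonpar == 0 && c.y_par == 0 {
        return Vote::Abstain;
    }
    match ratio(c.y_nonpar, c.y_par) {
        None => Vote::Male, // only non-PAR variants were seen
        Some(r) if r >= Y_NONPAR_THRESHOLD => Vote::Male,
        Some(_) => Vote::Female,
    }
}

/// Check 3: any variant in the SRY region; absence is treated as no evidence.
fn sry_vote(c: &Counts) -> Vote {
    if c.sry > 0 { Vote::Male } else { Vote::Abstain }
}

/// Check 4: X-PAR heterozygosity rate versus X-non-PAR rate.
fn par_vs_nonpar_vote(c: &Counts) -> Vote {
    let nonpar = match ratio(c.x_nonpar_het, c.x_nonpar_total) {
        Some(r) => r,
        None => return Vote::Abstain,
    };
    if nonpar == 0.0 {
        return Vote::Male; // special case: zero non-PAR heterozygosity
    }
    match ratio(c.x_par_het, c.x_par_total) {
        None => Vote::Abstain,
        Some(par) if par > PAR_FOLD_THRESHOLD * nonpar => Vote::Male,
        Some(_) => Vote::Female,
    }
}

/// Simple majority over non-abstaining votes; ties and empty input are indeterminate.
fn tally(c: &Counts) -> Call {
    let votes = [x_het_vote(c), y_presence_vote(c), sry_vote(c), par_vs_nonpar_vote(c)];
    let male = votes.iter().filter(|v| **v == Vote::Male).count();
    let female = votes.iter().filter(|v| **v == Vote::Female).count();
    if male > female {
        Call::Male
    } else if female > male {
        Call::Female
    } else {
        Call::Indeterminate
    }
}
```

Letting each check abstain when its denominator is zero keeps the no-evidence case indeterminate instead of forcing a call from missing data.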
API Overview
- InferenceConfig: Specifies the genome build (Build37 or Build38).
- VariantInfo: The input struct representing a single variant's chromosome, position, and heterozygosity.
- SexInferenceAccumulator: The main state machine that consumes VariantInfo structs.
- InferenceResult: The final output, containing the final_call (InferredSex) and the report. The call may be Indeterminate when no sex-chromosome evidence is observed.
- EvidenceReport: A breakdown of the results and vote from each of the internal checks.