LDSC — Rust Rewrite
A compiled, statically-typed rewrite of Bulik-Sullivan et al.'s LDSC in Rust.
Implements six subcommands — munge-sumstats, ldscore, h2, rg, make-annot, cts-annot — with
identical numerical output and a 7× speedup on LD score computation.
Get Started
Fastest (no Rust required):
Standalone binaries are available in GitHub Releases (see “Prebuilt Binaries” below).
Native install (requires Rust):
Quick Start
The typical LDSC workflow — preprocess summary statistics, then estimate heritability or genetic correlation — mirrors the upstream wiki tutorial.
Step 1: Download pre-computed European LD scores (skip ldscore for European GWAS)
Step 2: Pre-process summary statistics
Step 3a: Estimate heritability
Step 3b: Estimate genetic correlation
Installation Details
Native builds require Rust ≥ 1.85 and a C toolchain with Fortran support to build the default statically-linked OpenBLAS. If you opt into the system BLAS feature, you only need the system OpenBLAS development package.
On Debian/Ubuntu (default static build):
Docker
Images are published to the GitHub Container Registry on every push to main and for each version tag.
# Run with local data mounted
Version tags (v1.2.3) produce :1.2.3, :1.2, and :latest. Pushes to main produce a :main
tag and a short-SHA tag (:sha-XXXXXXX).
Prebuilt Binaries
Releases include Linux and macOS tarballs that contain ldsc, LICENSE, and README.md.
# Linux (x86_64)
# macOS (Apple Silicon)
# macOS (Intel)
Release Process (Maintainers)
Releases are cut with cargo-release and tagged as vX.Y.Z. Tag pushes trigger the release
workflow, which builds and uploads platform tarballs to GitHub Releases.
Building the image locally
Requires Docker with BuildKit (default since Docker 23):
The multi-stage Dockerfile uses cargo-chef to
cache dependency compilation in a separate layer, so incremental rebuilds only recompile changed
source files. The runtime image is debian:bookworm-slim plus libgfortran5 for the
statically-linked OpenBLAS.
Building from source
Requires a Rust toolchain (≥ 1.85; edition 2024 features used). OpenBLAS is linked statically — no runtime library installation needed.
# binary: target/release/ldsc
The release profile sets opt-level = 3, lto = "thin", codegen-units = 1.
BLAS configuration
By default this crate builds OpenBLAS from source and links it statically
(blas-openblas-static). For CI or HPC environments that prefer a system BLAS,
use the system feature instead:
# Debian/Ubuntu system OpenBLAS
The system feature skips the OpenBLAS source build and links via pkg-config.
Keep the default static build if you want a self-contained binary.
Runtime tuning (optional)
The following global flags are available for performance tuning but are not heavily battle-tested. Use them only when needed:
--blas-threads N: OpenBLAS thread count (default 4; affects all subcommands).--rayon-threads N: Rayon thread count for jackknife inh2/rg.--polars-threads N: Polars thread count for CSV streaming inmunge-sumstats.
ldsc --version prints the compiled BLAS backend (e.g., openblas-static).
Usage
munge-sumstats
Pre-processes GWAS summary statistics into the .sumstats.gz format consumed by h2 and rg.
Input summary statistics may be plain, .gz, or .bz2, and can be tab- or whitespace-delimited.
Key flags: --signed-sumstats COLNAME,null_value tells the tool which column carries effect direction and what the
null value is (e.g. BETA,0, OR,1, Z,0). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns.
--a1-inc skips the signed column and treats all Z-scores as positive (A1 is always the risk allele).
--merge-alleles enforces allele concordance (mismatches are removed), matching Python behavior.
Use --daner or --daner-n for Ripke daner formats (infers N from FRQ_[A/U] headers or Nca/Nco columns).
ldscore
Computes LD scores from a PLINK binary file set (.bed/.bim/.fam).
Annotation inputs (.annot) may be plain, .gz, or .bz2.
Tip for European GWAS: Pre-computed 1000G phase 3 LD scores are available from the Broad LDSCORE page. Download
eur_w_ld_chr.tar.bz2; aftertar -jxvf, the inner.l2.ldscore.gzfiles are already gzip-compressed and work directly withldsc. Non-European populations require computing your own LD scores from an appropriate reference panel.
ldsc ldscore warns if the LD window spans an entire chromosome; use --yes-really to silence.
Window flags are mutually exclusive: --ld-wind-cm (genetic distance, default 1.0), --ld-wind-kb
(physical distance), or --ld-wind-snp (fixed flanking SNP count).
Partitioned LD scores with --annot prefix: expects {prefix}{chr}.annot[.gz] for each chromosome present in
the BIM, outputs one L2 column per annotation and corresponding .l2.M / .l2.M_5_50 files.
--per-allele is equivalent to --pq-exp 1 (weights each r² by p·(1−p)). Use --pq-exp S to
apply (p·(1−p))^S weighting; output columns and .M files receive a _S{S} suffix.
--no-print-annot is accepted for Python CLI parity but is a no-op (warns when used).
h2
Estimates SNP heritability.
LD score inputs may be plain, .gz, or .bz2.
--ref-ld-chr prefix appends the chromosome number then .l2.ldscore.gz. So
--ref-ld-chr eur_w_ld_chr/ reads eur_w_ld_chr/1.l2.ldscore.gz … eur_w_ld_chr/22.l2.ldscore.gz.
If the chromosome number falls in the middle of the filename, use @ as a placeholder:
--ref-ld-chr ld/chr@_scores → ld/chr1_scores.l2.ldscore.gz, etc.
The same convention applies to --w-ld-chr.
You may pass a comma-separated list to --ref-ld / --ref-ld-chr (Python behavior);
--w-ld / --w-ld-chr must point to a single fileset.
Common options: --no-intercept, --intercept-h2 VALUE, --two-step 30, --chisq-max 80,
--samp-prev 0.1 --pop-prev 0.01 (liability-scale conversion),
--print-coefficients (partitioned h2: per-annotation τ and enrichment).
Overlapping annotations: use --overlap-annot with --frqfile-chr prefix (or --frqfile for
single filesets) to match Python’s overlap-adjusted results. When enabled, LDSC writes
<out>.results with overlap-aware proportion/enrichment columns.
Cell-type-specific h2: use --h2-cts and --ref-ld-chr-cts (see the LDSC wiki for .ldcts
format). Output is written to <out>.cell_type_results.txt. Add --print-all-cts to report
coefficients for all CTS LD score prefixes in each line.
rg
Estimates genetic correlations across all pairs from a list of summary statistic files.
LD score inputs may be plain, .gz, or .bz2.
--ref-ld-chr / --w-ld-chr follow the same prefix convention as h2 (see above).
Common options: --no-intercept, --intercept-h2 1,1,1 (one per trait), --intercept-gencov 0.0,0.0 (per-pair), --two-step 30,
--samp-prev / --pop-prev (comma-separated, one value per input file).
make-annot
Generates 0/1 annotation files from a UCSC BED file or a gene set.
# From a BED file:
# From a gene set:
cts-annot
Bins one or more continuous annotations into categories and writes a .annot file
compatible with ldscore --annot (Python --cts-bin preprocessing).
Performance
Benchmarks against the original Python on a 16-core desktop (AMD Ryzen 9 5950X) using 1000 Genomes
Phase 3 (n = 2,504 individuals, --ld-wind-snp 100).
| Dataset | SNPs | Rust | Python | Speedup |
|---|---|---|---|---|
| chr22 | 24,624 | 1 s | 7 s | 7.0× |
| 20% genome | 333,000 | 12 s | 88 s | 7.3× |
Correctness: all 1,664,851 SNPs in the full 1000G genome verified to match Python within 0.001 (max diff 0.000508, median 0.000250) after fixing the four algorithmic bugs described below.
Differences from Python
Command structure
Python LDSC consists of three separate scripts; this crate consolidates them into subcommands of a
single ldsc binary:
| Python | Rust |
|---|---|
python munge_sumstats.py --sumstats … --out … |
ldsc munge-sumstats --sumstats … --out … |
python ldsc.py --l2 --bfile … --out … |
ldsc ldscore --bfile … --out … |
python ldsc.py --h2 … --ref-ld-chr … |
ldsc h2 --h2 … --ref-ld-chr … |
python ldsc.py --rg … --ref-ld-chr … |
ldsc rg --rg … --ref-ld-chr … |
python make_annot.py --bimfile … --bed-file … |
ldsc make-annot --bimfile … --bed-file … |
python ldsc.py --cts-bin … |
ldsc cts-annot … |
Python's --l2 flag (LD score estimation mode) becomes the ldscore subcommand. The --h2 and
--rg flags (regression modes) become h2 and rg subcommands.
Flag renames
| Python | Rust | Note |
|---|---|---|
--ld-wind-snps |
--ld-wind-snp |
trailing s removed |
--N / --N-col |
--n / --n-col |
lowercase |
--M |
--m-snps |
renamed |
--snp / --a1 / --a2 / --p / --frq / --info |
--snp-col / --a1-col / ... |
--*-col suffix |
--maf-min |
--maf |
renamed |
Behavioural differences
--mafin ldscore: default is a post-filter on output (faster). Use--maf-preto match Python’s pre-computation filtering.--n-mindefault: when--n-minis 0, Rust now matches Python (90th percentile / 1.5).--yes-really: Rust warns when the LD window spans a whole chromosome and--yes-reallyis not set (Python errors).--chunksize: Python requires explicit chunking for large files; Rust uses Polars LazyFrame streaming and ignores chunk size for munge.--return-silly-things/--invert-anyway: accepted flags for CLI parity; Rust never clips results and always uses a least-squares solver (warnings emitted).--no-print-annot: accepted for CLI parity but does not affect output (warning emitted).--cts-binworkflow: implemented as a separate preprocessor (ldsc cts-annot), then useldsc ldscore --annot.
No-op Flags (Warned)
The following Python flags are accepted for CLI parity but do not change behavior in Rust:
ldscore --no-print-annot(cts-annot always writes output)h2/rg --return-silly-thingsh2/rg --invert-anyway
Performance Deep-Dive
Why Python is slow
The original Python implementation is bottlenecked by three independent factors:
-
GIL-blocked jackknife.
jackknife.pyruns 200 leave-one-block-out IRWLS refits sequentially. Each refit is ascipy.linalg.lstsqcall that releases the GIL, but Python loop overhead and NumPy's per-call allocation dominate at this problem size. -
Per-SNP NumPy allocation in the LD score loop.
ldscore.pycallsnp.dotin a Python-level loop with fresh array views on each of the ~33,000 chunks for a 1M-SNP genome. Python's boxing overhead and NumPy's internal allocation path are not amortised. -
Sequential LD computation. The GIL prevents genuine thread-level parallelism in the correlation loop.
What the Rust implementation does differently
1. Ring-buffer genotype store (ldscore.rs)
Python allocates a new rfuncA matrix every chunk. Rust pre-allocates a single F-order
Array2<f64> of shape (n_indiv, ring_size) where ring_size = max_window + chunk_c. SNP columns
are written into successive ring slots modulo ring_size; evicted slots are reused. This eliminates
~33,000 heap allocations for a 1M-SNP genome and improves cache locality because each active column
occupies a contiguous 8-byte stride in memory.
2. Single DGEMM per chunk
For each chunk of B SNPs the computation is:
BB = Bᵀ · B (chunk × chunk, unbiased r²)
AB = Aᵀ · B (window × chunk, unbiased r²)
Both are single ndarray::dot calls dispatched to OpenBLAS DGEMM. The window matrix A is
assembled from ring slots into a pre-allocated F-order a_buf so columns are contiguous in memory
and the DGEMM kernel can stride through them without gather operations.
3. Tuned BLAS thread count
OpenBLAS defaults to using all available cores, which creates thread-spawning overhead that
outweighs the parallelism benefit for the small matrices in 1000G-scale LD computation
(n ≈ 2,500, window ≈ 200). The Rust binary calls openblas_set_num_threads(4) at startup
(overridable with --blas-threads). This is optimal for 1000G; biobank data (n > 10,000) may
benefit from higher values.
4. Global sequential pass — no cross-chromosome boundary artefact
Python processes all chromosomes as a single ordered dataset. With --ld-wind-snps, the last 100
SNPs of chromosome k and the first 100 of chromosome k+1 are within each other's windows. The 1000G
reference panel contains five continental populations, creating population-stratification-driven
Pearson r up to 0.38 across chromosome boundaries. Earlier versions of the Rust code ran per-chromosome
in parallel, which zeroed out these cross-boundary contributions and produced L2 values 1–2 units too
low for boundary SNPs. The current implementation mirrors Python: a single global pass over all SNPs
in BIM order, with per-chromosome files written from the global L2 array after the fact.
5. Parallel block jackknife (jackknife.rs)
The 200 leave-one-block-out IRWLS refits are independent. Rayon's into_par_iter distributes them
across all available cores. Each refit allocates two ndarray views and one LAPACK SVD call; the total
wall time for h2 and rg is dominated by the file I/O and merge join, not the jackknife.
6. Polars LazyFrame for munge (munge.rs)
munge_sumstats.py uses pandas, which loads the entire file into RAM before filtering. The Rust
implementation uses Polars LazyCsvReader, which pushes column selection, renaming, and filter
predicates into a query plan that streams the file in chunks. For large GWAS files (> 1 M SNPs) the
peak RAM is proportional to the output size, not the input size.
Source Map (for LLMs)
src/
├── main.rs Clap dispatch — parses CLI, calls into subcommand modules.
│
├── cli.rs All argument structs (MungeArgs, LdscoreArgs, H2Args, RgArgs,
│ MakeAnnotArgs). No logic — pure clap derive macros.
│
├── parse.rs File I/O helpers:
│ · scan_sumstats / scan_ldscore → Polars LazyFrame
│ · concat_chrs(prefix, suffix) → concat per-chr files
│ · read_m_total / read_m_vec → .l2.M files
│ · read_annot(prefix, thin) → Array2<f64> + col names
│
├── munge.rs munge-sumstats pipeline (Polars LazyFrame, no data loaded until
│ collect). Internal functions:
│ · apply_ignore, apply_col_overrides, normalize_columns
│ · apply_info_list, apply_n_override
│ · derive_z (BETA/SE → Z; P + sign → Z; --a1-inc)
│ · filter_snps, apply_nstudy_filter
│ · write_sumstats_gz (gzip TSV output)
│
├── ldscore.rs LD score computation. Key types and functions:
│ · BimRecord — CHR/SNP/CM/BP/bed_idx struct
│ · parse_bim, count_fam, parse_fam — PLINK file parsers
│ · load_individual_indices — --keep FID/IID file → isize indices
│ · WindowMode — Cm / Kb / Snp enum
│ · get_block_lefts_f64, get_block_lefts_snp — window boundaries
│ · normalize_col — impute NaN → mean, centre, unit-variance (f32)
│ · compute_ldscore_global — ring-buffer DGEMM loop (sequential,
│ scalar and partitioned paths share the same pre-alloc buffers)
│ · r2_unbiased — r² − (1−r²)/(n−2)
│ · write_ldscore_refs — gzip TSV output
│ · load_snp_set — HashSet<String> from --extract / --print-snps
│ · run — orchestrates BIM read, --extract / --annot / --keep,
│ calls compute_ldscore_global, writes per-chr .l2.ldscore.gz
│ and .l2.M / .l2.M_5_50 files
│
├── irwls.rs Iteratively Re-Weighted Least Squares.
│ · IrwlsResult — est + optional jackknife fields
│ · irwls(x, y, weights, n_iter) — pre-alloc xw/yw, SVD solve,
│ reweight on fitted values; zero-alloc inner loop
│
├── jackknife.rs Block jackknife variance estimation.
│ · jackknife(x, y, weights, n_blocks, n_iter) →
│ full-data IRWLS + n_blocks parallel leave-one-out refits
│ (rayon par_iter) → pseudo-values → SE + covariance matrix
│
└── regressions.rs h2 and rg regression drivers.
· run_h2 — loads sumstats + LD scores, inner-joins on SNP,
detects K annotation columns, dispatches to scalar (K=1)
or partitioned (K>1) path; supports --two-step, --no-intercept,
--intercept-h2, --print-coefficients, liability-scale output
· run_h2_partitioned — K-column design matrix, per-annotation
enrichment, resolves M vector from per-annotation M files
· run_h2_scalar — shared by standalone h2 and rg univariate sub-fits
· run_rg — iterates trait pairs; gencov regression + univariate h2
per trait; --two-step, --intercept-gencov, --no-intercept,
liability-scale rg; prints summary table
· load_ld_ref / load_ld — LazyFrame readers for ref and weight LD
· resolve_m / resolve_m_vec — reads .l2.M_5_50 or falls back to n_obs
· liability_scale_h2 — observed → liability scale conversion
· print_jackknife_diagnostics — --print-cov / --print-delete-vals
make_annot.rs BED → 0/1 annotation generator.
· annotate_from_bed — loads BED intervals per chromosome,
sorts and merges, binary-search annotation per SNP
· annotate_from_gene_set — gene symbols → coordinate lookup →
same interval merge/annotate pipeline
· write_annot_file — CHR BP SNP CM ANNOT TSV (optional .gz)
Key data-flow invariants
ldscore --annot prefixreads{prefix}{chr}.annot[.gz]for every chromosome found in the BIM (not a singleprefix.annot.gzfile).--extractfilters the BIM before window computation;--print-snpsfilters only the output.bed_idx(original BIM row index) differs frompos(index in the filteredall_snpsslice) when--extractis active;bed_idx_to_posinrun()maps between them.--keeppassesiid_indices: Option<&Array1<isize>>tobed-reader;n_indiv_actual(not the FAM total) is used for normalization and the r²-unbiased correction.- The ring buffer
ring_size = max_window + chunk_cguarantees no live window slot is overwritten before it has been consumed in the A×B product. rg --no-interceptpropagatesfixed_intercept = Some(1.0)to both the gencov regression and each univariaterun_h2_scalarcall, matching Python's behaviour.
Dependency rationale
| Crate | Version | Role |
|---|---|---|
bed-reader |
1 | mmap-based PLINK .bed reading; only touched pages loaded |
polars |
0.53 | lazy CSV streaming (munge + LD score file loading) |
ndarray + ndarray-linalg |
0.16 + 0.17 | dense matrix algebra; SVD for IRWLS |
blas-src + OpenBLAS |
0.10 | statically-linked BLAS for DGEMM |
rayon |
1 | data-parallel jackknife blocks |
statrs |
0.18 | Normal CDF/quantile for P→Z conversion |
clap |
4 | derive-macro CLI argument parsing |
anyhow / thiserror |
1 / 2 | error propagation |
flate2 |
1 | gzip output for .sumstats.gz and .ldscore.gz |