ldsc 0.3.1

LD Score Regression — fast Rust reimplementation of Bulik-Sullivan et al. LDSC
ldsc-0.3.1 is not a library.

LDSC — Rust Rewrite

CI crates.io License: GPL-3.0 MSRV: 1.85

A compiled, statically-typed rewrite of Bulik-Sullivan et al.'s LDSC in Rust. Implements six subcommands — munge-sumstats, l2, h2, rg, make-annot, cts-annot — with identical numerical output and a 7× speedup on LD score computation.


Install

Fastest (no Rust required):

docker run --rm ghcr.io/sharifhsn/ldsc:latest --help

Local install options:

  • Prebuilt binaries from GitHub Releases (see “Prebuilt Binaries” below).
  • Cargo install (requires Rust): cargo install ldsc

Quick Start

The typical LDSC workflow — preprocess summary statistics, then estimate heritability or genetic correlation — mirrors the upstream wiki tutorial.

Step 1: Download pre-computed European LD scores (skip l2 for European GWAS)

wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
tar -jxvf eur_w_ld_chr.tar.bz2   # inner .l2.ldscore.gz files are already gzip-compressed
bunzip2 w_hm3.snplist.bz2

Step 2: Pre-process summary statistics

ldsc munge-sumstats \
  --sumstats my_gwas.txt \
  --N 50000 \
  --merge-alleles w_hm3.snplist \
  --out my_trait

Step 3a: Estimate heritability

ldsc h2 \
  --h2 my_trait.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ \
  --w-ld-chr eur_w_ld_chr/ \
  --out my_trait_h2

Step 3b: Estimate genetic correlation

ldsc rg \
  --rg trait1.sumstats.gz,trait2.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ \
  --w-ld-chr eur_w_ld_chr/ \
  --out trait1_vs_trait2

Usage

munge-sumstats

Pre-processes GWAS summary statistics into the .sumstats.gz format consumed by h2 and rg. Input summary statistics may be plain, .gz, or .bz2, and can be tab- or whitespace-delimited.

ldsc munge-sumstats \
  --sumstats my_gwas.txt.gz \
  --out output_prefix \
  [--merge-alleles w_hm3.snplist] \
  [--signed-sumstats BETA,0] \
  [--N 50000] \
  [--info-min 0.9] \
  [--maf-min 0.01]

Key flags: --signed-sumstats COLNAME,null_value tells the tool which column carries effect direction and what the null value is (e.g. BETA,0, OR,1, Z,0). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns. --a1-inc skips the signed column and treats all Z-scores as positive (A1 is always the risk allele). --merge-alleles enforces allele concordance (mismatches are removed), matching Python behavior. Use --daner or --daner-n for Ripke daner formats (infers N from FRQ_[A/U] headers or Nca/Nco columns).

l2

Computes LD scores from a PLINK binary file set (.bed/.bim/.fam). Annotation inputs (.annot) may be plain, .gz, or .bz2.

Tip for European GWAS: Pre-computed 1000G phase 3 LD scores are available from the Broad LDSCORE page. Download eur_w_ld_chr.tar.bz2; after tar -jxvf, the inner .l2.ldscore.gz files are already gzip-compressed and work directly with ldsc. Non-European populations require computing your own LD scores from an appropriate reference panel.

ldsc l2 \
  --bfile /path/to/1000G_EUR \
  --out out/eur \
  --ld-wind-cm 1.0 \
  [--annot annotations/BaselineLD.] \
  [--extract snplist.txt] \
  [--maf 0.01] \
  [--keep keep_individuals.txt] \
  [--per-allele] \
  [--pq-exp 1.0]

ldsc l2 warns if the LD window spans an entire chromosome; use --yes-really to silence.

Window flags are mutually exclusive: --ld-wind-cm (genetic distance, default 1.0), --ld-wind-kb (physical distance), or --ld-wind-snps (fixed flanking SNP count).

Partitioned LD scores with --annot prefix: accepts either a single {prefix}.annot[.gz|.bz2] file or per-chromosome {prefix}{chr}.annot[.gz|.bz2] files for each chromosome present in the BIM. Outputs one L2 column per annotation and corresponding .l2.M / .l2.M_5_50 files.

--per-allele is equivalent to --pq-exp 1 (weights each r² by p·(1−p)). Use --pq-exp S to apply (p·(1−p))^S weighting; output columns and .M files receive a _S{S} suffix. --no-print-annot suppresses the .annot.gz output produced by --cts-bin.

h2

Estimates SNP heritability. LD score inputs may be plain, .gz, or .bz2.

ldsc h2 \
  --h2 trait.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ \
  --w-ld-chr eur_w_ld_chr/ \
  --out results

--ref-ld-chr prefix appends the chromosome number then .l2.ldscore.gz. So --ref-ld-chr eur_w_ld_chr/ reads eur_w_ld_chr/1.l2.ldscore.gzeur_w_ld_chr/22.l2.ldscore.gz. If the chromosome number falls in the middle of the filename, use @ as a placeholder: --ref-ld-chr ld/chr@_scoresld/chr1_scores.l2.ldscore.gz, etc. The same convention applies to --w-ld-chr. You may pass a comma-separated list to --ref-ld / --ref-ld-chr (Python behavior); --w-ld / --w-ld-chr must point to a single fileset.

Common options: --no-intercept, --intercept-h2 VALUE, --two-step 30, --chisq-max 80, --samp-prev 0.1 --pop-prev 0.01 (liability-scale conversion), --print-coefficients (partitioned h2: per-annotation τ and enrichment).

Overlapping annotations: use --overlap-annot with --frqfile-chr prefix (or --frqfile for single filesets) to match Python’s overlap-adjusted results. When enabled, LDSC writes <out>.results with overlap-aware proportion/enrichment columns.

Cell-type-specific h2: use --h2-cts and --ref-ld-chr-cts (see the LDSC wiki for .ldcts format). Output is written to <out>.cell_type_results.txt. Add --print-all-cts to report coefficients for all CTS LD score prefixes in each line.

rg

Estimates genetic correlations across all pairs from a list of summary statistic files. LD score inputs may be plain, .gz, or .bz2.

ldsc rg \
  --rg trait1.sumstats.gz,trait2.sumstats.gz,trait3.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ \
  --w-ld-chr eur_w_ld_chr/ \
  --out results

--ref-ld-chr / --w-ld-chr follow the same prefix convention as h2 (see above).

Common options: --no-intercept, --intercept-h2 1,1,1 (one per trait), --intercept-gencov 0.0,0.0 (per-pair), --two-step 30, --samp-prev / --pop-prev (comma-separated, one value per input file).

make-annot

Generates 0/1 annotation files from a UCSC BED file or a gene set.

# From a BED file:
ldsc make-annot \
  --bimfile my_data.bim \
  --bed-file regions.bed \
  --annot-file output.annot.gz \
  --windowsize 100000

# From a gene set:
ldsc make-annot \
  --bimfile my_data.bim \
  --gene-set-file immune_genes.txt \
  --gene-coord-file ENSG_coord.txt \
  --annot-file output.annot.gz \
  --windowsize 100000

cts-annot

Bins one or more continuous annotations into categories and writes a .annot file compatible with l2 --annot (Python --cts-bin preprocessing).

ldsc cts-annot \
  --bimfile my_data.bim \
  --cts-bin DAF.txt,DIST.txt \
  --cts-breaks 0.1,0.25,0.4x10,100,1000 \
  --cts-names DAF,DIST_TO_GENE \
  --annot-file cts.annot.gz

Installation Details

Native builds require Rust ≥ 1.85. The Rust implementation uses faer for dense linear algebra.

Fast f32 mode (experimental)

Pass --fast-f32 to ldsc l2 to run core matmuls in f32 while accumulating in f64. This roughly halves memory for GEMM buffers and can be ~1.5x faster on CPUs with 256-bit SIMD. It is not parity-safe compared to the default f64 path.

Observed differences on full 1000G (Phase 3, --ld-wind-kb 1000, --chunk-size 200):

  • Speedup: ~1.44x (f32 vs f64)
  • Output deltas: max_abs_diff=0.001 (relative to f64)
# f32 matmul (runtime flag, no rebuild needed)
ldsc l2 --bfile --out --fast-f32

# default f64 (parity-safe)
ldsc l2 --bfile --out

Optional GPU acceleration (experimental)

Build with --features gpu to enable CUDA-accelerated matrix multiplication via CubeCL. Requires a CUDA toolkit at build time.

cargo build --release --features gpu
ldsc l2 --bfile --out --gpu

At 1000G scale (n=2,490), GPU is transfer-bound and slower than CPU. GPU acceleration targets biobank-scale cohorts (n >= 50k) where each chunk's GEMM is large enough for compute to dominate PCIe transfer. Use --gpu-tile-cols N to split large window matrices into VRAM-fitting tiles.

Default build (CPU only):

cargo build --release

Prebuilt Binaries

Releases include Linux, macOS, and Windows archives that contain ldsc, LICENSE, and README.md.

# Linux (x86_64)
curl -L -o ldsc_linux-x86_64.tar.gz \
  https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_linux-x86_64.tar.gz
tar -xzf ldsc_linux-x86_64.tar.gz
./ldsc --help

# macOS (Apple Silicon)
curl -L -o ldsc_macos-aarch64.tar.gz \
  https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_macos-aarch64.tar.gz
tar -xzf ldsc_macos-aarch64.tar.gz
./ldsc --help

# Windows (x86_64)
# Download the zip from the release page and extract:
# https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_windows-x86_64.zip

Docker

Images are published to the GitHub Container Registry on every push to main and for each version tag.

docker pull ghcr.io/sharifhsn/ldsc:latest

# Run with local data mounted
docker run --rm \
  -v /path/to/data:/data \
  ghcr.io/sharifhsn/ldsc:latest \
  h2 --h2         /data/trait.sumstats.gz \
     --ref-ld-chr /data/eur_w_ld_chr/ \
     --w-ld-chr   /data/eur_w_ld_chr/ \
     --out        /data/results

Version tags (v1.2.3) produce :1.2.3, :1.2, and :latest. Pushes to main produce a :main tag and a short-SHA tag (:sha-XXXXXXX).

Building from source

Requires a Rust toolchain (≥ 1.85; edition 2024 features used).

cargo build --release
# binary: target/release/ldsc

The release profile sets opt-level = 3, lto = "thin", codegen-units = 1.

Runtime tuning (optional)

The following global flags are available for performance tuning but are not heavily battle-tested. Use them only when needed:

  • --rayon-threads N: Rayon thread count for jackknife in h2/rg.
  • --polars-threads N: Polars thread count for CSV streaming in munge-sumstats.

Performance

Benchmarks against the original Python on a 16-core desktop (AMD Ryzen 9 5950X) using 1000 Genomes Phase 3 (n = 2,504 individuals, --ld-wind-snps 100).

Dataset SNPs Rust Python Speedup
chr22 24,624 1 s 7 s 7.0×
20% genome 333,000 12 s 88 s 7.3×

Correctness: all 1,664,851 SNPs in the full 1000G genome verified to match Python within 0.001 (max diff 0.000508, median 0.000250) after fixing the four algorithmic bugs described below.

Additional UKBB I/O benchmarks on this machine (Apple M4, 10 CPU cores, 24 GB RAM, macOS 26.3 build 25D125). These highlight I/O-heavy workflows and the impact of the Rust pipeline’s faster parsing and joins.

Dataset: /Users/sharif/Code/ldsc/data/biomarkers-30600-both_sexes-irnt.sample8m.tsv (~497 MB) for munge-sumstats; and /Users/sharif/Code/ldsc/data/UKBB.ALL.ldscore/UKBB.EUR.l2.ldscore.gz for h2/rg (381,831 SNPs after merge). Quick local checks for l2 default to a 50k SNP extract via scripts/bench_l2_py3_vs_rust.sh.

Workflow Rust Python Speedup
munge-sumstats 3.74 s 62.65 s 16.75×
h2 0.90 s 7.81 s 8.68×
rg (two traits) 2.93 s 28.09 s 9.59×

Reference HPC hardware (Penn PMACS) for scaling tests:

  • 19 Dell C6420 quad node systems (76 compute nodes, 6,080 cores total, CentOS 7.8, 80 CPU cores per node with hyper-threading, 256 GB or 512 GB RAM per node, 56 GB/s EDR or 100 GB/s FDR InfiniBand to the filesystem, 10 Gb/s Ethernet).
  • 1 Dell R940 big memory system (1.5 TB RAM, 96 CPU cores, 10 Gb/s Ethernet, 100 GB/s FDR InfiniBand).
  • 2 GPU nodes (1× Nvidia Tesla P100, 512 GB RAM, 88 CPU cores, 10 Gb/s Ethernet, 100 GB/s FDR InfiniBand).
  • 4.2 PB IBM Spectrum Scale (GPFS) disk storage (2 tiers, no backup).
  • 1.3 PB mirrored archive tape storage.
  • LSF job scheduling system.

Differences from Python

Command structure

Python LDSC consists of three separate scripts; this crate consolidates them into subcommands of a single ldsc binary:

Python Rust
python munge_sumstats.py --sumstats … --out … ldsc munge-sumstats --sumstats … --out …
python ldsc.py --l2 --bfile … --out … ldsc l2 --bfile … --out …
python ldsc.py --h2 … --ref-ld-chr … ldsc h2 --h2 … --ref-ld-chr …
python ldsc.py --rg … --ref-ld-chr … ldsc rg --rg … --ref-ld-chr …
python make_annot.py --bimfile … --bed-file … ldsc make-annot --bimfile … --bed-file …
python ldsc.py --cts-bin … ldsc cts-annot …

Python's --l2 flag (LD score estimation mode) becomes the l2 subcommand. The --h2 and --rg flags (regression modes) become h2 and rg subcommands.

Flag compatibility

Python flag names are supported directly.

Behavioural differences

  • --maf in l2: default now matches Python (MAF prefilter before LD computation).
  • --n-min default: when --n-min is 0, Rust now matches Python (90th percentile / 1.5).
  • --yes-really: Rust warns when the LD window spans a whole chromosome and --yes-really is not set (Python errors).
  • --chunksize: Python requires explicit chunking for large files; Rust uses Polars LazyFrame streaming and ignores chunk size for munge.
  • --return-silly-things / --invert-anyway: accepted flags for CLI parity; Rust never clips results and always uses a least-squares solver (warnings emitted).
  • --no-print-annot: only affects --cts-bin output; suppresses the .annot.gz file.
  • --cts-bin workflow: supported directly by ldsc l2 (also available as a separate preprocessor via ldsc cts-annot).

No-op Flags (Warned)

The following Python flags are accepted for CLI parity but do not change behavior in Rust:

  • h2/rg --return-silly-things
  • h2/rg --invert-anyway

Performance Deep-Dive

Why Python is slow

The original Python implementation is bottlenecked by three independent factors:

  1. GIL-blocked jackknife. jackknife.py runs 200 leave-one-block-out IRWLS refits sequentially. Each refit is a scipy.linalg.lstsq call that releases the GIL, but Python loop overhead and NumPy's per-call allocation dominate at this problem size.

  2. Per-SNP NumPy allocation in the LD score loop. ldscore.py calls np.dot in a Python-level loop with fresh array views on each of the ~33,000 chunks for a 1M-SNP genome. Python's boxing overhead and NumPy's internal allocation path are not amortised.

  3. Sequential LD computation. The GIL prevents genuine thread-level parallelism in the correlation loop.

What the Rust implementation does differently

1. Ring-buffer genotype store (l2.rs)

Python allocates a new rfuncA matrix every chunk. Rust pre-allocates a single F-order MatF of shape (n_indiv, ring_size) where ring_size = max_window + chunk_c. SNP columns are written into successive ring slots modulo ring_size; evicted slots are reused. This eliminates ~33,000 heap allocations for a 1M-SNP genome and improves cache locality because each active column occupies a contiguous 8-byte stride in memory.

2. Single matmul per chunk

For each chunk of B SNPs the computation is:

BB = Bᵀ · B          (chunk × chunk, unbiased r²)
AB = Aᵀ · B          (window × chunk, unbiased r²)

Both are single faer matmul calls. The window matrix A is assembled from ring slots into a pre-allocated column-major a_buf so columns are contiguous in memory and the matmul kernel can stride through them without gather operations.

3. Threading control

Small matrix multiplications benefit from fewer threads; the --rayon-threads flag controls the global Rayon thread count used by jackknife and matrix ops.

4. Global sequential pass — no cross-chromosome boundary artefact

Python processes all chromosomes as a single ordered dataset. With --ld-wind-snps, the last 100 SNPs of chromosome k and the first 100 of chromosome k+1 are within each other's windows. The 1000G reference panel contains five continental populations, creating population-stratification-driven Pearson r up to 0.38 across chromosome boundaries. Earlier versions of the Rust code ran per-chromosome in parallel, which zeroed out these cross-boundary contributions and produced L2 values 1–2 units too low for boundary SNPs. The current implementation mirrors Python: a single global pass over all SNPs in BIM order, with per-chromosome files written from the global L2 array after the fact.

5. Parallel block jackknife (jackknife.rs)

The 200 leave-one-block-out IRWLS refits are independent. Rayon's into_par_iter distributes them across all available cores. Each refit allocates two faer matrices and one SVD call; the total wall time for h2 and rg is dominated by the file I/O and merge join, not the jackknife.

6. Polars LazyFrame for munge (munge.rs)

munge_sumstats.py uses pandas, which loads the entire file into RAM before filtering. The Rust implementation uses Polars LazyCsvReader, which pushes column selection, renaming, and filter predicates into a query plan that streams the file in chunks. For large GWAS files (> 1 M SNPs) the peak RAM is proportional to the output size, not the input size.


Source Map (for LLMs)

src/
├── main.rs          Clap dispatch — parses CLI, calls into subcommand modules.
│
├── cli.rs           All argument structs (MungeArgs, L2Args, H2Args, RgArgs,
│                    MakeAnnotArgs, CtsAnnotArgs). No logic — pure clap derive macros.
│
├── gpu.rs           [cfg(feature = "gpu")] CUDA matmul via CubeCL/cubek-matmul.
│                    · GpuContext — holds ComputeClient + Strategy
│                    · matmul_tn — A^T × B single-shot GPU GEMM
│                    · matmul_tn_tiled — tiled variant for VRAM-limited windows
│
├── parse.rs         File I/O helpers:
│                    · scan_sumstats / scan_ldscore  → Polars LazyFrame
│                    · concat_chrs(prefix, suffix)   → concat per-chr files
│                    · read_m_total / read_m_vec      → .l2.M files
│                    · read_annot(prefix, thin)       → MatF + col names
│
├── munge.rs         munge-sumstats pipeline (Polars LazyFrame, no data loaded until
│                    collect). Internal functions:
│                    · apply_ignore, apply_col_overrides, normalize_columns
│                    · apply_info_list, apply_n_override
│                    · derive_z (BETA/SE → Z; P + sign → Z; --a1-inc)
│                    · filter_snps, apply_nstudy_filter
│                    · write_sumstats_gz (gzip TSV output)
│
├── l2.rs            LD score computation. Key types and functions:
│                    · BimRecord — CHR/SNP/CM/BP/bed_idx struct
│                    · parse_bim, count_fam, parse_fam — PLINK file parsers
│                    · load_individual_indices — --keep FID/IID file → isize indices
│                    · WindowMode — Cm / Kb / Snp enum
│                    · get_block_lefts_f64, get_block_lefts_snp — window boundaries
│                    · GemmBufs — enum holding f32 or f64 scratch buffers (--fast-f32)
│                    · normalize_col — impute NaN → mean, centre, unit-variance
│                    · compute_ldscore_global — ring-buffer GEMM loop (sequential,
│                      scalar and partitioned paths share the same pre-alloc buffers)
│                    · r2_unbiased — r² − (1−r²)/(n−2)
│                    · write_ldscore_refs — gzip TSV output
│                    · load_snp_set — HashSet<String> from --extract / --print-snps
│                    · run — orchestrates BIM read, --extract / --annot / --keep,
│                      calls compute_ldscore_global, writes per-chr .l2.ldscore.gz
│                      and .l2.M / .l2.M_5_50 files
│
├── irwls.rs         Iteratively Re-Weighted Least Squares.
│                    · IrwlsResult — est + optional jackknife fields
│                    · irwls(x, y, weights, n_iter) — pre-alloc xw/yw, SVD solve,
│                      reweight on fitted values; zero-alloc inner loop
│
├── jackknife.rs     Block jackknife variance estimation.
│                    · jackknife(x, y, weights, n_blocks, n_iter) →
│                      full-data IRWLS + n_blocks parallel leave-one-out refits
│                      (rayon par_iter) → pseudo-values → SE + covariance matrix
│
└── regressions.rs   h2 and rg regression drivers.
                     · run_h2 — loads sumstats + LD scores, inner-joins on SNP,
                       detects K annotation columns, dispatches to scalar (K=1)
                       or partitioned (K>1) path; supports --two-step, --no-intercept,
                       --intercept-h2, --print-coefficients, liability-scale output
                     · run_h2_partitioned — K-column design matrix, per-annotation
                       enrichment, resolves M vector from per-annotation M files
                     · run_h2_scalar — shared by standalone h2 and rg univariate sub-fits
                     · run_rg — iterates trait pairs; gencov regression + univariate h2
                       per trait; --two-step, --intercept-gencov, --no-intercept,
                       liability-scale rg; prints summary table
                     · load_ld_ref / load_ld — LazyFrame readers for ref and weight LD
                     · resolve_m / resolve_m_vec — reads .l2.M_5_50 or falls back to n_obs
                     · liability_scale_h2 — observed → liability scale conversion
                     · print_jackknife_diagnostics — --print-cov / --print-delete-vals

make_annot.rs        BED → 0/1 annotation generator.
                     · annotate_from_bed — loads BED intervals per chromosome,
                       sorts and merges, binary-search annotation per SNP
                     · annotate_from_gene_set — gene symbols → coordinate lookup →
                       same interval merge/annotate pipeline
                     · write_annot_file — CHR BP SNP CM ANNOT TSV (optional .gz)

Key data-flow invariants

  • l2 --annot prefix reads {prefix}{chr}.annot[.gz] for every chromosome found in the BIM (not a single prefix.annot.gz file).
  • --extract filters the BIM before window computation; --print-snps filters only the output.
  • bed_idx (original BIM row index) differs from pos (index in the filtered all_snps slice) when --extract is active; bed_idx_to_pos in run() maps between them.
  • --keep passes iid_indices: Option<&[isize]> to the internal BED reader; n_indiv_actual (not the FAM total) is used for normalization and the r²-unbiased correction.
  • The ring buffer ring_size = max_window + chunk_c guarantees no live window slot is overwritten before it has been consumed in the A×B product.
  • rg --no-intercept propagates fixed_intercept = Some(1.0) to both the gencov regression and each univariate run_h2_scalar call, matching Python's behaviour.

Dependency rationale

Crate Version Role
internal bed module - minimal PLINK .bed reader tailored to LDSC
polars 0.53 lazy CSV streaming (munge + LD score file loading)
faer 0.24 dense matrix algebra; SVD for IRWLS
rayon 1 data-parallel jackknife blocks
statrs 0.18 Normal CDF/quantile for P→Z conversion
clap 4 derive-macro CLI argument parsing
anyhow 1 error propagation
flate2 1 gzip output for .sumstats.gz and .ldscore.gz
cubecl 0.10 (optional, gpu feature) multi-backend GPU compute
cubek-matmul 0.2 (optional, gpu feature) autotuned GPU matmul

Maintainers

Release process

Releases are cut with cargo-release and tagged as vX.Y.Z. Tag pushes trigger the release workflow, which builds and uploads platform tarballs to GitHub Releases.

cargo release patch
cargo release patch --execute

Building the image locally

Requires Docker with BuildKit (default since Docker 23):

docker build -t ldsc .

The multi-stage Dockerfile uses cargo-chef to cache dependency compilation in a separate layer, so incremental rebuilds only recompile changed source files.