LDSC — Rust Rewrite
A compiled, statically-typed rewrite of Bulik-Sullivan et al.'s LDSC in Rust.
Implements six subcommands — munge-sumstats, l2, h2, rg, make-annot, cts-annot — with
output matching Python to within 0.001 and a ~7× speedup on LD score computation.
Install
Fastest (no Rust required): download a prebuilt binary from GitHub Releases (see “Prebuilt Binaries” below).
With a Rust toolchain installed:
cargo install ldsc
Quick Start
The typical LDSC workflow — preprocess summary statistics, then estimate heritability or genetic correlation — mirrors the upstream wiki tutorial.
Step 1: Download pre-computed European LD scores (European analyses can skip l2 entirely)
Step 2: Pre-process summary statistics
Step 3a: Estimate heritability
Step 3b: Estimate genetic correlation
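A sketch of that workflow; file names (trait1.txt.gz, w_hm3.snplist, …) are placeholders, and every flag is documented under Usage below:

```bash
# Step 1: pre-computed European LD scores (see the l2 section for the Broad LDSCORE page)
tar -jxvf eur_w_ld_chr.tar.bz2

# Step 2: pre-process summary statistics
ldsc munge-sumstats --sumstats trait1.txt.gz --merge-alleles w_hm3.snplist --out trait1
ldsc munge-sumstats --sumstats trait2.txt.gz --merge-alleles w_hm3.snplist --out trait2

# Step 3a: SNP heritability
ldsc h2 --h2 trait1.sumstats.gz --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ --out trait1_h2

# Step 3b: genetic correlation
ldsc rg --rg trait1.sumstats.gz,trait2.sumstats.gz --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ --out trait1_trait2
```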
Usage
munge-sumstats
Pre-processes GWAS summary statistics into the .sumstats.gz format consumed by h2 and rg.
Input summary statistics may be plain, .gz, or .bz2, and can be tab- or whitespace-delimited.
Key flags: --signed-sumstats COLNAME,null_value tells the tool which column carries effect direction and what the
null value is (e.g. BETA,0, OR,1, Z,0). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns.
--a1-inc skips the signed column and treats all Z-scores as positive (A1 is always the risk allele).
--merge-alleles enforces allele concordance (mismatches are removed), matching Python behavior.
Use --daner or --daner-n for Ripke daner formats (infers N from FRQ_[A/U] headers or Nca/Nco columns).
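A sketch pulling these flags together (input file and column names are illustrative):

```bash
# Case/control GWAS reporting odds ratios: the OR column carries the sign, null value 1
ldsc munge-sumstats \
  --sumstats scz_gwas.txt.gz \
  --signed-sumstats OR,1 \
  --merge-alleles w_hm3.snplist \
  --out scz
```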
l2
Computes LD scores from a PLINK binary file set (.bed/.bim/.fam).
Annotation inputs (.annot) may be plain, .gz, or .bz2.
Tip for European GWAS: Pre-computed 1000G Phase 3 LD scores are available from the Broad LDSCORE page. Download
eur_w_ld_chr.tar.bz2; after tar -jxvf, the inner .l2.ldscore.gz files are already gzip-compressed and work directly with ldsc. Non-European populations require computing your own LD scores from an appropriate reference panel.
ldsc l2 warns if the LD window spans an entire chromosome; use --yes-really to silence.
Window flags are mutually exclusive: --ld-wind-cm (genetic distance, default 1.0), --ld-wind-kb
(physical distance), or --ld-wind-snps (fixed flanking SNP count).
Partitioned LD scores with --annot prefix: accepts either a single {prefix}.annot[.gz|.bz2] file or
per-chromosome {prefix}{chr}.annot[.gz|.bz2] files for each chromosome present in the BIM. Outputs one L2 column
per annotation and corresponding .l2.M / .l2.M_5_50 files.
--per-allele is equivalent to --pq-exp 1 (weights each r² by p·(1−p)). Use --pq-exp S to
apply (p·(1−p))^S weighting; output columns and .M files receive a _S{S} suffix.
--no-print-annot suppresses the .annot.gz output produced by --cts-bin.
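Three sketches (fileset and prefix names are placeholders):

```bash
# Univariate LD scores with a 1 cM window (the default)
ldsc l2 --bfile 1kg_eur.22 --ld-wind-cm 1.0 --out 1kg_eur.22

# Partitioned LD scores from per-chromosome annotations baseline.{chr}.annot.gz
ldsc l2 --bfile 1kg_eur.22 --annot baseline. --out baseline.22

# (p·(1−p))^0.75 weighting; output columns and .M files get a _S0.75 suffix
ldsc l2 --bfile 1kg_eur.22 --pq-exp 0.75 --out 1kg_eur.22_pq
```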
h2
Estimates SNP heritability.
LD score inputs may be plain, .gz, or .bz2.
--ref-ld-chr prefix appends the chromosome number then .l2.ldscore.gz. So
--ref-ld-chr eur_w_ld_chr/ reads eur_w_ld_chr/1.l2.ldscore.gz … eur_w_ld_chr/22.l2.ldscore.gz.
If the chromosome number falls in the middle of the filename, use @ as a placeholder:
--ref-ld-chr ld/chr@_scores → ld/chr1_scores.l2.ldscore.gz, etc.
The same convention applies to --w-ld-chr.
You may pass a comma-separated list to --ref-ld / --ref-ld-chr (Python behavior);
--w-ld / --w-ld-chr must point to a single fileset.
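For example (prefixes are placeholders):

```bash
# Reads eur_w_ld_chr/1.l2.ldscore.gz … eur_w_ld_chr/22.l2.ldscore.gz
ldsc h2 --h2 trait.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ --out trait

# @ placeholder (ld/chr1_scores.l2.ldscore.gz, …); the comma list concatenates filesets
ldsc h2 --h2 trait.sumstats.gz \
  --ref-ld-chr ld/chr@_scores,extra/chr@_scores \
  --w-ld-chr eur_w_ld_chr/ --out trait
```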
Common options: --no-intercept, --intercept-h2 VALUE, --two-step 30, --chisq-max 80,
--samp-prev 0.1 --pop-prev 0.01 (liability-scale conversion),
--print-coefficients (partitioned h2: per-annotation τ and enrichment).
Overlapping annotations: use --overlap-annot with --frqfile-chr prefix (or --frqfile for
single filesets) to match Python’s overlap-adjusted results. When enabled, LDSC writes
<out>.results with overlap-aware proportion/enrichment columns.
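A partitioned-h2 sketch under those flags (the baseline and weight prefixes are placeholders):

```bash
ldsc h2 --h2 trait.sumstats.gz \
  --ref-ld-chr baselineLD. \
  --w-ld-chr weights. \
  --overlap-annot --frqfile-chr 1000G.EUR.QC. \
  --print-coefficients \
  --out trait_partitioned   # writes trait_partitioned.results
```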
Cell-type-specific h2: use --h2-cts and --ref-ld-chr-cts (see the LDSC wiki for .ldcts
format). Output is written to <out>.cell_type_results.txt. Add --print-all-cts to report
coefficients for all CTS LD score prefixes in each line.
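A sketch of the CTS invocation; the flag arity is assumed to match Python's, where --h2-cts takes the sumstats path:

```bash
ldsc h2 --h2-cts trait.sumstats.gz \
  --ref-ld-chr 1000G.EUR.QC. \
  --ref-ld-chr-cts tissues.ldcts \
  --w-ld-chr weights. \
  --out trait_cts   # writes trait_cts.cell_type_results.txt
```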
rg
Estimates genetic correlations across all pairs from a list of summary statistic files.
LD score inputs may be plain, .gz, or .bz2.
--ref-ld-chr / --w-ld-chr follow the same prefix convention as h2 (see above).
Common options: --no-intercept, --intercept-h2 1,1,1 (one per trait), --intercept-gencov 0.0,0.0 (per-pair), --two-step 30,
--samp-prev / --pop-prev (comma-separated, one value per input file).
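For example, two binary traits with liability-scale conversion (prevalence values are placeholders; comma lists line up with the input files):

```bash
ldsc rg --rg cad.sumstats.gz,t2d.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ \
  --samp-prev 0.4,0.5 --pop-prev 0.05,0.08 \
  --out cad_t2d
```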
make-annot
Generates 0/1 annotation files from a UCSC BED file or a gene set.
Both input modes are shown below; paths are illustrative, and the gene-set flags are assumed to mirror Python's make_annot.py (Python flag names are accepted, per Flag compatibility).
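```bash
# From a BED file:
ldsc make-annot \
  --bimfile 1kg_eur.22.bim \
  --bed-file enhancers.bed \
  --annot-file enhancers.22.annot.gz

# From a gene set (flags assumed from Python's make_annot.py):
ldsc make-annot \
  --bimfile 1kg_eur.22.bim \
  --gene-set-file genes.GeneSet \
  --gene-coord-file gene_coords.txt \
  --windowsize 100000 \
  --annot-file genes.22.annot.gz
```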
cts-annot
Bins one or more continuous annotations into categories and writes a .annot file
compatible with l2 --annot (Python --cts-bin preprocessing).
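A sketch using Python's --cts-bin flag names, which this crate accepts (per Flag compatibility); file names and break points are placeholders:

```bash
ldsc cts-annot --bfile 1kg_eur.22 \
  --cts-bin allele_age.txt --cts-breaks 0,0.1,0.5,1.0 \
  --out binned.22   # feed the resulting .annot to l2 --annot
```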
Installation Details
Native builds require Rust ≥ 1.85. The Rust implementation uses faer for dense linear algebra.
Fast f32 mode (experimental)
Pass --fast-f32 to ldsc l2 to run core matmuls in f32 while accumulating
in f64. This roughly halves memory for GEMM buffers and can be ~1.5x faster on CPUs
with 256-bit SIMD. It is not parity-safe compared to the default f64 path.
Observed differences on full 1000G (Phase 3, --ld-wind-kb 1000, --chunk-size 200):
- Speedup: ~1.44x (f32 vs f64)
- Output deltas: max_abs_diff = 0.001 (relative to f64)
Example invocations (fileset names are placeholders):
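```bash
# f32 matmul (runtime flag, no rebuild needed)
ldsc l2 --bfile 1kg_eur --ld-wind-kb 1000 --fast-f32 --out 1kg_eur_f32

# default f64 (parity-safe)
ldsc l2 --bfile 1kg_eur --ld-wind-kb 1000 --out 1kg_eur_f64
```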
Optional GPU acceleration (experimental)
Build with --features gpu to enable CUDA-accelerated matrix multiplication via
CubeCL. Requires a CUDA toolkit at build time.
At 1000G scale (n = 2,490), GPU is transfer-bound and slower than CPU. GPU acceleration
targets biobank-scale cohorts (n ≥ 50k) where each chunk's GEMM is large enough for
compute to dominate PCIe transfer. Use --gpu-tile-cols N to split large window
matrices into VRAM-fitting tiles.
Build commands for both configurations (standard Cargo invocations; the gpu feature needs the CUDA toolkit noted above):
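```bash
# Default build (CPU only)
cargo build --release

# CUDA-enabled build (requires a CUDA toolkit at build time)
cargo build --release --features gpu
```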
Prebuilt Binaries
Releases include Linux, macOS, and Windows archives that contain ldsc, LICENSE, and README.md.
Example download commands; the Linux and macOS archive names are inferred from the Windows zip naming, so verify the exact asset names on the release page.
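```bash
# Linux (x86_64) — archive name inferred; check the release page
curl -LO https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_linux-x86_64.tar.gz
tar -xzf ldsc_linux-x86_64.tar.gz

# macOS (Apple Silicon) — same caveat
curl -LO https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_macos-aarch64.tar.gz
tar -xzf ldsc_macos-aarch64.tar.gz

# Windows (x86_64): download the zip from the release page and extract:
# https://github.com/sharifhsn/ldsc/releases/latest/download/ldsc_windows-x86_64.zip
```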
Docker
Images are published to the GitHub Container Registry on every push to main and for each version tag.
A sketch of a typical run; the GHCR image path is assumed from this repository's naming, and mounts should point at your data:
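```bash
# Run with local data mounted
docker run --rm -v "$PWD:/data" ghcr.io/sharifhsn/ldsc:latest \
  h2 --h2 /data/trait.sumstats.gz \
  --ref-ld-chr /data/eur_w_ld_chr/ --w-ld-chr /data/eur_w_ld_chr/ \
  --out /data/trait_h2
```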
Version tags (v1.2.3) produce :1.2.3, :1.2, and :latest. Pushes to main produce a :main
tag and a short-SHA tag (:sha-XXXXXXX).
Building from source
Requires a Rust toolchain (≥ 1.85; edition 2024 features used).
A standard clone-and-build (repo URL as in the Releases section):
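```bash
git clone https://github.com/sharifhsn/ldsc
cd ldsc
cargo build --release
# binary: target/release/ldsc
```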
The release profile sets opt-level = 3, lto = "thin", codegen-units = 1.
Runtime tuning (optional)
The following global flags are available for performance tuning but are not heavily battle-tested. Use them only when needed:
- --rayon-threads N: Rayon thread count for the jackknife in h2/rg.
- --polars-threads N: Polars thread count for CSV streaming in munge-sumstats.
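For example, to cap the jackknife at 8 threads (shown before the subcommand, since these are global flags):

```bash
ldsc --rayon-threads 8 h2 --h2 trait.sumstats.gz \
  --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ --out trait
```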
Performance
Benchmarks against the original Python on a 16-core desktop (AMD Ryzen 9 5950X) using 1000 Genomes
Phase 3 (n = 2,504 individuals, --ld-wind-snps 100).
| Dataset | SNPs | Rust | Python | Speedup |
|---|---|---|---|---|
| chr22 | 24,624 | 1 s | 7 s | 7.0× |
| 20% genome | 333,000 | 12 s | 88 s | 7.3× |
Correctness: all 1,664,851 SNPs in the full 1000G genome verified to match Python within 0.001 (max diff 0.000508, median 0.000250) after fixing the four algorithmic bugs described below.
Additional UKBB I/O benchmarks on a second machine (Apple M4, 10 CPU cores, 24 GB RAM, macOS 26.3 build 25D125). These highlight I/O-heavy workflows and the impact of the Rust pipeline’s faster parsing and joins.
Dataset: biomarkers-30600-both_sexes-irnt.sample8m.tsv (~497 MB) for munge-sumstats, and UKBB.ALL.ldscore/UKBB.EUR.l2.ldscore.gz
for h2/rg (381,831 SNPs after merge).
Quick local checks for l2 default to a 50k SNP extract via scripts/bench_l2_py3_vs_rust.sh.
| Workflow | Rust | Python | Speedup |
|---|---|---|---|
| munge-sumstats | 3.74 s | 62.65 s | 16.75× |
| h2 | 0.90 s | 7.81 s | 8.68× |
| rg (two traits) | 2.93 s | 28.09 s | 9.59× |
Reference HPC hardware (Penn PMACS) for scaling tests:
- 19 Dell C6420 quad-node systems (76 compute nodes, 6,080 cores total, CentOS 7.8, 80 CPU cores per node with hyper-threading, 256 GB or 512 GB RAM per node, 56 Gb/s FDR or 100 Gb/s EDR InfiniBand to the filesystem, 10 Gb/s Ethernet).
- 1 Dell R940 big-memory system (1.5 TB RAM, 96 CPU cores, 10 Gb/s Ethernet, 100 Gb/s EDR InfiniBand).
- 2 GPU nodes (1× Nvidia Tesla P100, 512 GB RAM, 88 CPU cores, 10 Gb/s Ethernet, 100 Gb/s EDR InfiniBand).
- 4.2 PB IBM Spectrum Scale (GPFS) disk storage (2 tiers, no backup).
- 1.3 PB mirrored archive tape storage.
- LSF job scheduling system.
Differences from Python
Command structure
Python LDSC consists of three separate scripts; this crate consolidates them into subcommands of a
single ldsc binary:
| Python | Rust |
|---|---|
| python munge_sumstats.py --sumstats … --out … | ldsc munge-sumstats --sumstats … --out … |
| python ldsc.py --l2 --bfile … --out … | ldsc l2 --bfile … --out … |
| python ldsc.py --h2 … --ref-ld-chr … | ldsc h2 --h2 … --ref-ld-chr … |
| python ldsc.py --rg … --ref-ld-chr … | ldsc rg --rg … --ref-ld-chr … |
| python make_annot.py --bimfile … --bed-file … | ldsc make-annot --bimfile … --bed-file … |
| python ldsc.py --cts-bin … | ldsc cts-annot … |
Python's --l2 flag (LD score estimation mode) becomes the l2 subcommand. The --h2 and
--rg flags (regression modes) become h2 and rg subcommands.
Flag compatibility
Python flag names are supported directly.
Behavioural differences
- --maf in l2: default now matches Python (MAF prefilter before LD computation).
- --n-min default: when --n-min is 0, Rust now matches Python (90th percentile / 1.5).
- --yes-really: Rust warns when the LD window spans a whole chromosome and --yes-really is not set (Python errors).
- --chunksize: Python requires explicit chunking for large files; Rust uses Polars LazyFrame streaming and ignores chunk size for munge.
- --return-silly-things / --invert-anyway: accepted flags for CLI parity; Rust never clips results and always uses a least-squares solver (warnings emitted).
- --no-print-annot: only affects --cts-bin output; suppresses the .annot.gz file.
- --cts-bin workflow: supported directly by ldsc l2 (also available as a separate preprocessor via ldsc cts-annot).
No-op Flags (Warned)
The following Python flags are accepted for CLI parity but do not change behavior in Rust:
- h2/rg --return-silly-things
- h2/rg --invert-anyway
Performance Deep-Dive
Why Python is slow
The original Python implementation is bottlenecked by three independent factors:
1. GIL-blocked jackknife. jackknife.py runs 200 leave-one-block-out IRWLS refits sequentially. Each refit is a scipy.linalg.lstsq call that releases the GIL, but Python loop overhead and NumPy's per-call allocation dominate at this problem size.
2. Per-SNP NumPy allocation in the LD score loop. ldscore.py calls np.dot in a Python-level loop with fresh array views on each of the ~33,000 chunks for a 1M-SNP genome. Python's boxing overhead and NumPy's internal allocation path are not amortised.
3. Sequential LD computation. The GIL prevents genuine thread-level parallelism in the correlation loop.
What the Rust implementation does differently
1. Ring-buffer genotype store (l2.rs)
Python allocates a new rfuncA matrix every chunk. Rust pre-allocates a single F-order
MatF of shape (n_indiv, ring_size) where ring_size = max_window + chunk_c. SNP columns
are written into successive ring slots modulo ring_size; evicted slots are reused. This eliminates
~33,000 heap allocations for a 1M-SNP genome and improves cache locality because each active column
occupies a contiguous 8-byte stride in memory.
2. Single matmul per chunk
For each chunk of B SNPs the computation is:
BB = Bᵀ · B (chunk × chunk, unbiased r²)
AB = Aᵀ · B (window × chunk, unbiased r²)
Both are single faer matmul calls. The window matrix A is assembled from ring slots into a
pre-allocated column-major a_buf so columns are contiguous in memory and the matmul kernel can
stride through them without gather operations.
3. Threading control
Small matrix multiplications benefit from fewer threads; the --rayon-threads flag controls the
global Rayon thread count used by jackknife and matrix ops.
4. Global sequential pass — no cross-chromosome boundary artefact
Python processes all chromosomes as a single ordered dataset. With --ld-wind-snps, the last 100
SNPs of chromosome k and the first 100 of chromosome k+1 are within each other's windows. The 1000G
reference panel contains five continental populations, creating population-stratification-driven
Pearson r up to 0.38 across chromosome boundaries. Earlier versions of the Rust code ran per-chromosome
in parallel, which zeroed out these cross-boundary contributions and produced L2 values 1–2 units too
low for boundary SNPs. The current implementation mirrors Python: a single global pass over all SNPs
in BIM order, with per-chromosome files written from the global L2 array after the fact.
5. Parallel block jackknife (jackknife.rs)
The 200 leave-one-block-out IRWLS refits are independent. Rayon's into_par_iter distributes them
across all available cores. Each refit allocates two faer matrices and performs one SVD; the total
wall time for h2 and rg is dominated by file I/O and the merge join, not the jackknife.
6. Polars LazyFrame for munge (munge.rs)
munge_sumstats.py uses pandas, which loads the entire file into RAM before filtering. The Rust
implementation uses Polars LazyCsvReader, which pushes column selection, renaming, and filter
predicates into a query plan that streams the file in chunks. For large GWAS files (> 1 M SNPs) the
peak RAM is proportional to the output size, not the input size.
Source Map (for LLMs)
src/
├── main.rs Clap dispatch — parses CLI, calls into subcommand modules.
│
├── cli.rs All argument structs (MungeArgs, L2Args, H2Args, RgArgs,
│ MakeAnnotArgs, CtsAnnotArgs). No logic — pure clap derive macros.
│
├── gpu.rs [cfg(feature = "gpu")] CUDA matmul via CubeCL/cubek-matmul.
│ · GpuContext — holds ComputeClient + Strategy
│ · matmul_tn — A^T × B single-shot GPU GEMM
│ · matmul_tn_tiled — tiled variant for VRAM-limited windows
│
├── parse.rs File I/O helpers:
│ · scan_sumstats / scan_ldscore → Polars LazyFrame
│ · concat_chrs(prefix, suffix) → concat per-chr files
│ · read_m_total / read_m_vec → .l2.M files
│ · read_annot(prefix, thin) → MatF + col names
│
├── munge.rs munge-sumstats pipeline (Polars LazyFrame, no data loaded until
│ collect). Internal functions:
│ · apply_ignore, apply_col_overrides, normalize_columns
│ · apply_info_list, apply_n_override
│ · derive_z (BETA/SE → Z; P + sign → Z; --a1-inc)
│ · filter_snps, apply_nstudy_filter
│ · write_sumstats_gz (gzip TSV output)
│
├── l2.rs LD score computation. Key types and functions:
│ · BimRecord — CHR/SNP/CM/BP/bed_idx struct
│ · parse_bim, count_fam, parse_fam — PLINK file parsers
│ · load_individual_indices — --keep FID/IID file → isize indices
│ · WindowMode — Cm / Kb / Snp enum
│ · get_block_lefts_f64, get_block_lefts_snp — window boundaries
│ · GemmBufs — enum holding f32 or f64 scratch buffers (--fast-f32)
│ · normalize_col — impute NaN → mean, centre, unit-variance
│ · compute_ldscore_global — ring-buffer GEMM loop (sequential,
│ scalar and partitioned paths share the same pre-alloc buffers)
│ · r2_unbiased — r² − (1−r²)/(n−2)
│ · write_ldscore_refs — gzip TSV output
│ · load_snp_set — HashSet<String> from --extract / --print-snps
│ · run — orchestrates BIM read, --extract / --annot / --keep,
│ calls compute_ldscore_global, writes per-chr .l2.ldscore.gz
│ and .l2.M / .l2.M_5_50 files
│
├── irwls.rs Iteratively Re-Weighted Least Squares.
│ · IrwlsResult — est + optional jackknife fields
│ · irwls(x, y, weights, n_iter) — pre-alloc xw/yw, SVD solve,
│ reweight on fitted values; zero-alloc inner loop
│
├── jackknife.rs Block jackknife variance estimation.
│ · jackknife(x, y, weights, n_blocks, n_iter) →
│ full-data IRWLS + n_blocks parallel leave-one-out refits
│ (rayon par_iter) → pseudo-values → SE + covariance matrix
│
├── regressions.rs   h2 and rg regression drivers.
│        · run_h2 — loads sumstats + LD scores, inner-joins on SNP,
│          detects K annotation columns, dispatches to scalar (K=1)
│          or partitioned (K>1) path; supports --two-step, --no-intercept,
│          --intercept-h2, --print-coefficients, liability-scale output
│        · run_h2_partitioned — K-column design matrix, per-annotation
│          enrichment, resolves M vector from per-annotation M files
│        · run_h2_scalar — shared by standalone h2 and rg univariate sub-fits
│        · run_rg — iterates trait pairs; gencov regression + univariate h2
│          per trait; --two-step, --intercept-gencov, --no-intercept,
│          liability-scale rg; prints summary table
│        · load_ld_ref / load_ld — LazyFrame readers for ref and weight LD
│        · resolve_m / resolve_m_vec — reads .l2.M_5_50 or falls back to n_obs
│        · liability_scale_h2 — observed → liability scale conversion
│        · print_jackknife_diagnostics — --print-cov / --print-delete-vals
│
└── make_annot.rs    BED → 0/1 annotation generator.
         · annotate_from_bed — loads BED intervals per chromosome,
           sorts and merges, binary-search annotation per SNP
         · annotate_from_gene_set — gene symbols → coordinate lookup →
           same interval merge/annotate pipeline
         · write_annot_file — CHR BP SNP CM ANNOT TSV (optional .gz)
Key data-flow invariants
- l2 --annot prefix reads {prefix}{chr}.annot[.gz] for every chromosome found in the BIM (not a single prefix.annot.gz file).
- --extract filters the BIM before window computation; --print-snps filters only the output.
- bed_idx (original BIM row index) differs from pos (index in the filtered all_snps slice) when --extract is active; bed_idx_to_pos in run() maps between them.
- --keep passes iid_indices: Option<&[isize]> to the internal BED reader; n_indiv_actual (not the FAM total) is used for normalization and the r²-unbiased correction.
- The ring buffer ring_size = max_window + chunk_c guarantees no live window slot is overwritten before it has been consumed in the A×B product.
- rg --no-intercept propagates fixed_intercept = Some(1.0) to both the gencov regression and each univariate run_h2_scalar call, matching Python's behaviour.
Dependency rationale
| Crate | Version | Role |
|---|---|---|
| internal bed module | - | minimal PLINK .bed reader tailored to LDSC |
| polars | 0.53 | lazy CSV streaming (munge + LD score file loading) |
| faer | 0.24 | dense matrix algebra; SVD for IRWLS |
| rayon | 1 | data-parallel jackknife blocks |
| statrs | 0.18 | Normal CDF/quantile for P→Z conversion |
| clap | 4 | derive-macro CLI argument parsing |
| anyhow | 1 | error propagation |
| flate2 | 1 | gzip output for .sumstats.gz and .ldscore.gz |
| cubecl | 0.10 | (optional, gpu feature) multi-backend GPU compute |
| cubek-matmul | 0.2 | (optional, gpu feature) autotuned GPU matmul |
Maintainers
Release process
Releases are cut with cargo-release and tagged as vX.Y.Z. Tag pushes trigger the release
workflow, which builds and uploads platform tarballs to GitHub Releases.
Building the image locally
Requires Docker with BuildKit (default since Docker 23):
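```bash
# Assumes the Dockerfile at the repository root
docker build -t ldsc .
```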
The multi-stage Dockerfile uses cargo-chef to
cache dependency compilation in a separate layer, so incremental rebuilds only recompile changed
source files.