LDSC — Rust Rewrite

License: GPL-3.0 · MSRV: 1.85

A compiled, statically-typed rewrite of Bulik-Sullivan et al.'s LDSC in Rust. Implements the same five subcommands — munge-sumstats, ldscore, h2, rg, make-annot — with numerically equivalent output (within 10⁻³; see Performance) and a 7× speedup on LD score computation.


Installation

cargo install ldsc

Requires Rust ≥ 1.85 and a C toolchain with Fortran support to build the statically-linked OpenBLAS. On Debian/Ubuntu:

sudo apt-get install cmake gfortran libgfortran-dev

Building from source

Requires a Rust toolchain (≥ 1.85; edition 2024 features used). OpenBLAS is linked statically — no runtime library installation needed.

cargo build --release
# binary: target/release/ldsc

The release profile sets opt-level = 3, lto = "thin", codegen-units = 1.


Usage

munge-sumstats

Pre-processes GWAS summary statistics into the .sumstats.gz format consumed by h2 and rg.

ldsc munge-sumstats \
  --sumstats my_gwas.txt.gz \
  --out output_prefix \
  [--merge-alleles w_hm3.snplist] \
  [--signed-sumstats BETA,0] \
  [--n 50000] \
  [--info-min 0.9] \
  [--maf 0.01]

Key flags: --signed-sumstats COLNAME,null_value tells the tool which column carries effect direction and what the null value is (e.g. BETA,0, OR,1, Z,0). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns. --a1-inc skips the signed column and treats all Z-scores as positive (A1 is always the risk allele).
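
The sign handling reduces to a two-sided P-to-Z conversion plus a sign flip. A minimal sketch of that step (mirroring the derive_z function named in the Source Map below; the function and argument names here are illustrative, not the crate's API):

use statrs::distribution::{ContinuousCDF, Normal};

// Illustrative sketch of the P + sign → Z derivation behind --signed-sumstats.
// `signed_val` is the value from the signed column; `null_val` is the declared
// null (0 for BETA, 1 for OR).
fn z_from_p(p: f64, signed_val: f64, null_val: f64) -> f64 {
    let std_normal = Normal::new(0.0, 1.0).unwrap();
    let z_abs = std_normal.inverse_cdf(1.0 - p / 2.0); // two-sided P → |Z|
    if signed_val >= null_val { z_abs } else { -z_abs }
}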

ldscore

Computes LD scores from a PLINK binary file set.

ldsc ldscore \
  --bfile /path/to/1000G_EUR \
  --out out/eur \
  --ld-wind-cm 1.0 \
  [--annot annotations/BaselineLD.] \
  [--extract snplist.txt] \
  [--keep keep_individuals.txt] \
  [--per-allele] \
  [--blas-threads 4]

Window flags are mutually exclusive: --ld-wind-cm (genetic distance, default 1.0), --ld-wind-kb (physical distance), or --ld-wind-snp (fixed flanking SNP count).
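
All three modes reduce to computing, for each SNP, the index of the leftmost SNP still inside its window. A minimal sketch of the distance-based case (mirroring get_block_lefts_f64 from the Source Map below; the signature is illustrative):

// coords must be sorted ascending — CM for --ld-wind-cm, BP for --ld-wind-kb.
fn block_lefts(coords: &[f64], wind: f64) -> Vec<usize> {
    let mut lefts = vec![0usize; coords.len()];
    let mut j = 0; // left edge never moves backwards: O(n) two-pointer scan
    for i in 0..coords.len() {
        while coords[i] - coords[j] > wind {
            j += 1;
        }
        lefts[i] = j; // window of SNP i is [j, i]
    }
    lefts
}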

Partitioned LD scores with --annot prefix: expects {prefix}{chr}.annot[.gz] for each chromosome present in the BIM, and outputs one L2 column per annotation plus the corresponding .l2.M / .l2.M_5_50 files.

h2

Estimates SNP heritability.

ldsc h2 \
  --h2 trait.sumstats.gz \
  --ref-ld-chr eur_ldscores/chr \
  --w-ld-chr eur_ldscores/chr \
  --out results

Common options: --no-intercept, --intercept-h2 VALUE, --two-step 30, --chisq-max 80, --samp-prev 0.1 --pop-prev 0.01 (liability-scale conversion), --print-coefficients (partitioned h2: per-annotation τ and enrichment).
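
The --samp-prev / --pop-prev conversion follows the standard Lee et al. (2011) observed-to-liability transformation. A sketch using the statrs dependency — assuming that formula; the crate's liability_scale_h2 may differ in detail:

use statrs::distribution::{Continuous, ContinuousCDF, Normal};

// h2_liab = h2_obs · K²(1−K)² / (P(1−P)·φ(t)²), with t = Φ⁻¹(1−K).
// K = population prevalence (--pop-prev), P = sample case fraction (--samp-prev).
fn h2_obs_to_liab(h2_obs: f64, samp_prev: f64, pop_prev: f64) -> f64 {
    let std_normal = Normal::new(0.0, 1.0).unwrap();
    let thresh = std_normal.inverse_cdf(1.0 - pop_prev); // liability threshold t
    let z = std_normal.pdf(thresh);                      // φ(t): density at the threshold
    let (k, p) = (pop_prev, samp_prev);
    h2_obs * k * k * (1.0 - k) * (1.0 - k) / (p * (1.0 - p) * z * z)
}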

rg

Estimates genetic correlations across all pairs from a list of summary statistic files.

ldsc rg \
  --rg trait1.sumstats.gz,trait2.sumstats.gz,trait3.sumstats.gz \
  --ref-ld-chr eur_ldscores/chr \
  --w-ld-chr eur_ldscores/chr \
  --out results

Common options: --no-intercept, --intercept-gencov 0.0,0.0 (per-pair), --two-step 30, --samp-prev / --pop-prev (comma-separated, one value per input file).

make-annot

Generates 0/1 annotation files from a UCSC BED file or a gene set.

# From a BED file:
ldsc make-annot \
  --bimfile my_data.bim \
  --bed-file regions.bed \
  --annot-file output.annot.gz \
  --windowsize 100000

# From a gene set:
ldsc make-annot \
  --bimfile my_data.bim \
  --gene-set-file immune_genes.txt \
  --gene-coord-file ENSG_coord.txt \
  --annot-file output.annot.gz \
  --windowsize 100000
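
Internally (per the Source Map below), annotation is an interval merge followed by a per-SNP binary search. A self-contained sketch, with illustrative names and half-open BED [start, end) semantics assumed; intervals are taken as already extended by --windowsize:

fn merge_intervals(mut iv: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    iv.sort_unstable();
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (start, end) in iv {
        match merged.last_mut() {
            // Overlapping or touching: extend the previous interval.
            Some(last) if start <= last.1 => last.1 = last.1.max(end),
            _ => merged.push((start, end)),
        }
    }
    merged
}

fn annot_value(bp: u64, merged: &[(u64, u64)]) -> u8 {
    // First interval whose start lies past bp; the only candidate
    // containing bp is the one just before it.
    let idx = merged.partition_point(|&(start, _)| start <= bp);
    (idx > 0 && bp < merged[idx - 1].1) as u8
}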

Performance

Benchmarks against the original Python implementation on a 16-core desktop (AMD Ryzen 9 5950X), using 1000 Genomes Phase 3 (n = 2,504 individuals, --ld-wind-snp 100):

Dataset       SNPs        Rust    Python   Speedup
chr22         24,624      1 s     7 s      7.0×
20% genome    333,000     12 s    88 s     7.3×

Correctness: all 1,664,851 SNPs in the full 1000G genome verified to match Python within 0.001 (max diff 0.000508, median 0.000250) after fixing four algorithmic bugs, including the cross-chromosome boundary artefact described in the Performance Deep-Dive below.


Differences from Python

Flag renames

Python                                       Rust                          Note
--ld-wind-snps                               --ld-wind-snp                 trailing s removed
--N / --N-col                                --n / --n-col                 lowercase
--M                                          --m-snps                      renamed
--snp / --a1 / --a2 / --p / --frq / --info   --snp-col / --a1-col / ...    --*-col suffix
--maf-min                                    --maf                         renamed

Behavioural differences

  • --maf in ldscore: applied as a post-filter on output (Python applies it pre-computation). For typical thresholds (≥ 0.01) this makes no numerical difference.
  • --n-min default: Python defaults to 90th-percentile / 2; Rust defaults to 0 (disabled).
  • --yes-really: Python refuses --ld-wind-cm spanning an entire chromosome without this flag; Rust has no such guard.
  • --chunksize: Python requires explicit chunking for large files; Rust streams munge through a Polars LazyFrame and accepts but ignores the flag.
  • --return-silly-things / --invert-anyway: accepted flags for CLI parity; Rust never clips results and always uses a least-squares solver.

Unimplemented Features

The following Python flags have no Rust equivalent and are not planned:

Subcommand       Flag                                            Reason
munge-sumstats   --daner / --daner-n                             Ripke daner format; niche use case
ldscore          --pq-exp                                        Generalised per-allele weighting; rare
ldscore          --cts-bin / --cts-breaks / --cts-names          Continuous variable partitioning
ldscore          --no-print-annot                                Print-suppression flag; trivial
h2               --h2-cts / --ref-ld-chr-cts / --print-all-cts   Cell-type-specific h2
h2 / rg          --overlap-annot / --frqfile-chr                 Overlapping annotation categories
rg               --intercept-h2                                  Fix per-trait h2 intercepts in rg
all              .bz2 input files                                Only .gz supported

Performance Deep-Dive

Why Python is slow

The original Python implementation is bottlenecked by three independent factors:

  1. GIL-blocked jackknife. jackknife.py runs 200 leave-one-block-out IRWLS refits sequentially. Each refit is a scipy.linalg.lstsq call that releases the GIL, but Python loop overhead and NumPy's per-call allocation dominate at this problem size.

  2. Per-SNP NumPy allocation in the LD score loop. ldscore.py calls np.dot in a Python-level loop with fresh array views on each of the ~33,000 chunks for a 1M-SNP genome. Python's boxing overhead and NumPy's internal allocation path are not amortised.

  3. Sequential LD computation. The GIL prevents genuine thread-level parallelism in the correlation loop.

What the Rust implementation does differently

1. Ring-buffer genotype store (ldscore.rs)

Python allocates a new rfuncA matrix every chunk. Rust pre-allocates a single F-order Array2<f64> of shape (n_indiv, ring_size), where ring_size = max_window + chunk_c. SNP columns are written into successive ring slots modulo ring_size; evicted slots are reused. This eliminates ~33,000 heap allocations for a 1M-SNP genome and improves cache locality, since each active column is a contiguous run of 8-byte doubles.
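
A minimal sketch of the idea (struct and method names are illustrative, not the crate's API):

use ndarray::{Array2, ShapeBuilder};

struct GenotypeRing {
    buf: Array2<f64>, // F-order, shape (n_indiv, ring_size)
    ring_size: usize,
}

impl GenotypeRing {
    fn new(n_indiv: usize, max_window: usize, chunk_c: usize) -> Self {
        let ring_size = max_window + chunk_c;
        Self {
            // `.f()` requests column-major (Fortran) layout, so each SNP
            // column is a contiguous run of doubles.
            buf: Array2::zeros((n_indiv, ring_size).f()),
            ring_size,
        }
    }

    /// Write one normalized SNP column into its ring slot, overwriting
    /// whatever evicted SNP previously occupied it.
    fn write_col(&mut self, snp_idx: usize, col: &[f64]) {
        let slot = snp_idx % self.ring_size;
        for (dst, &src) in self.buf.column_mut(slot).iter_mut().zip(col) {
            *dst = src;
        }
    }
}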

2. Single DGEMM per chunk

For each chunk of B SNPs the computation is:

BB = Bᵀ · B          (chunk × chunk, unbiased r²)
AB = Aᵀ · B          (window × chunk, unbiased r²)

Both are single ndarray::dot calls dispatched to OpenBLAS DGEMM. The window matrix A is assembled from ring slots into a pre-allocated F-order a_buf so columns are contiguous in memory and the DGEMM kernel can stride through them without gather operations.
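
A sketch of the per-chunk algebra, assuming columns normalized so each column's sum of squares equals n (names illustrative; the real loop fuses this with the ring buffer):

use ndarray::Array2;

// `a` is the assembled window matrix (n_indiv × window), `b` the current
// chunk (n_indiv × chunk). With zero-mean, Σx² = n columns, a dot product
// divided by n is a Pearson r. Each .dot() call is one OpenBLAS DGEMM.
fn chunk_products(a: &Array2<f64>, b: &Array2<f64>, n_indiv: usize) -> (Array2<f64>, Array2<f64>) {
    let n = n_indiv as f64;
    let bb = b.t().dot(b).mapv(|d| r2_unbiased(d / n, n)); // chunk × chunk
    let ab = a.t().dot(b).mapv(|d| r2_unbiased(d / n, n)); // window × chunk
    (bb, ab)
}

// The bias correction from the Source Map: r² − (1 − r²)/(n − 2).
fn r2_unbiased(r: f64, n: f64) -> f64 {
    let r2 = r * r;
    r2 - (1.0 - r2) / (n - 2.0)
}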

3. Tuned BLAS thread count

OpenBLAS defaults to using all available cores, which creates thread-spawning overhead that outweighs the parallelism benefit for the small matrices in 1000G-scale LD computation (n ≈ 2,500, window ≈ 200). The Rust binary calls openblas_set_num_threads(4) at startup (overridable with --blas-threads). This is optimal for 1000G; biobank data (n > 10,000) may benefit from higher values.
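
A sketch of that startup call (the extern declaration here is illustrative; the binary links the symbol through blas-src/openblas-src):

// Rust ≥ 1.82 syntax for foreign blocks; openblas_set_num_threads(int)
// is OpenBLAS's own thread-count setter.
unsafe extern "C" {
    fn openblas_set_num_threads(n: i32);
}

fn configure_blas(threads: Option<usize>) {
    let n = threads.unwrap_or(4) as i32; // 4 by default; --blas-threads overrides
    unsafe { openblas_set_num_threads(n) };
}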

4. Global sequential pass — no cross-chromosome boundary artefact

Python processes all chromosomes as a single ordered dataset. With --ld-wind-snps, the last 100 SNPs of chromosome k and the first 100 of chromosome k+1 are within each other's windows. The 1000G reference panel contains five continental populations, creating population-stratification-driven Pearson r up to 0.38 across chromosome boundaries. Earlier versions of the Rust code ran per-chromosome in parallel, which zeroed out these cross-boundary contributions and produced L2 values 1–2 units too low for boundary SNPs. The current implementation mirrors Python: a single global pass over all SNPs in BIM order, with per-chromosome files written from the global L2 array after the fact.

5. Parallel block jackknife (jackknife.rs)

The 200 leave-one-block-out IRWLS refits are independent. Rayon's into_par_iter distributes them across all available cores. Each refit amounts to two ndarray views and one LAPACK SVD solve; the total wall time for h2 and rg is dominated by file I/O and the merge join, not the jackknife.
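
A sketch of the pattern, reduced to a scalar estimate (`refit` stands in for the leave-one-block-out IRWLS solve; names illustrative):

use rayon::prelude::*;

fn jackknife_se(est_full: f64, n_blocks: usize, refit: impl Fn(usize) -> f64 + Sync) -> f64 {
    let n = n_blocks as f64;
    // Delete values: the estimate with block j left out, computed in parallel.
    let delete_vals: Vec<f64> = (0..n_blocks).into_par_iter().map(|j| refit(j)).collect();
    // Pseudo-values: n·θ̂ − (n − 1)·θ̂₍₋ⱼ₎.
    let pseudo: Vec<f64> = delete_vals.iter().map(|d| n * est_full - (n - 1.0) * d).collect();
    let mean = pseudo.iter().sum::<f64>() / n;
    // Jackknife SE = sqrt( Σ (pⱼ − mean)² / (n(n − 1)) ).
    (pseudo.iter().map(|p| (p - mean).powi(2)).sum::<f64>() / (n * (n - 1.0))).sqrt()
}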

6. Polars LazyFrame for munge (munge.rs)

munge_sumstats.py uses pandas, which loads the entire file into RAM before filtering. The Rust implementation uses Polars LazyCsvReader, which pushes column selection, renaming, and filter predicates into a query plan that streams the file in chunks. For large GWAS files (> 1 M SNPs) the peak RAM is proportional to the output size, not the input size.
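
A sketch of the lazy shape of that pipeline (column names and the filter are illustrative, and the exact reader calls are assumed from the polars lazy API):

use polars::prelude::*;

// Nothing is materialised until collect(); Polars pushes the projection
// and filter down into a streaming scan, so peak RAM tracks surviving rows.
fn scan_and_filter(path: &str) -> PolarsResult<DataFrame> {
    LazyCsvReader::new(path)
        .with_separator(b'\t')
        .finish()?
        .select([col("SNP"), col("A1"), col("A2"), col("Z"), col("N")])
        .filter(col("N").gt(lit(0.0)))
        .collect()
}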


Source Map (for LLMs)

src/
├── main.rs          Clap dispatch — parses CLI, calls into subcommand modules.
│
├── cli.rs           All argument structs (MungeArgs, LdscoreArgs, H2Args, RgArgs,
│                    MakeAnnotArgs). No logic — pure clap derive macros.
│
├── parse.rs         File I/O helpers:
│                    · scan_sumstats / scan_ldscore  → Polars LazyFrame
│                    · concat_chrs(prefix, suffix)   → concat per-chr files
│                    · read_m_total / read_m_vec      → .l2.M files
│                    · read_annot(prefix, thin)       → Array2<f64> + col names
│
├── munge.rs         munge-sumstats pipeline (Polars LazyFrame, no data loaded until
│                    collect). Internal functions:
│                    · apply_ignore, apply_col_overrides, normalize_columns
│                    · apply_info_list, apply_n_override
│                    · derive_z (BETA/SE → Z; P + sign → Z; --a1-inc)
│                    · filter_snps, apply_nstudy_filter
│                    · write_sumstats_gz (gzip TSV output)
│
├── ldscore.rs       LD score computation. Key types and functions:
│                    · BimRecord — CHR/SNP/CM/BP/bed_idx struct
│                    · parse_bim, count_fam, parse_fam — PLINK file parsers
│                    · load_individual_indices — --keep FID/IID file → isize indices
│                    · WindowMode — Cm / Kb / Snp enum
│                    · get_block_lefts_f64, get_block_lefts_snp — window boundaries
│                    · normalize_col — impute NaN → mean, centre, unit-variance (f32)
│                    · compute_ldscore_global — ring-buffer DGEMM loop (sequential,
│                      scalar and partitioned paths share the same pre-alloc buffers)
│                    · r2_unbiased — r² − (1−r²)/(n−2)
│                    · write_ldscore_refs — gzip TSV output
│                    · load_snp_set — HashSet<String> from --extract / --print-snps
│                    · run — orchestrates BIM read, --extract / --annot / --keep,
│                      calls compute_ldscore_global, writes per-chr .l2.ldscore.gz
│                      and .l2.M / .l2.M_5_50 files
│
├── irwls.rs         Iteratively Re-Weighted Least Squares.
│                    · IrwlsResult — est + optional jackknife fields
│                    · irwls(x, y, weights, n_iter) — pre-alloc xw/yw, SVD solve,
│                      reweight on fitted values; zero-alloc inner loop
│
├── jackknife.rs     Block jackknife variance estimation.
│                    · jackknife(x, y, weights, n_blocks, n_iter) →
│                      full-data IRWLS + n_blocks parallel leave-one-out refits
│                      (rayon par_iter) → pseudo-values → SE + covariance matrix
│
├── regressions.rs   h2 and rg regression drivers.
│                    · run_h2 — loads sumstats + LD scores, inner-joins on SNP,
│                      detects K annotation columns, dispatches to scalar (K=1)
│                      or partitioned (K>1) path; supports --two-step, --no-intercept,
│                      --intercept-h2, --print-coefficients, liability-scale output
│                    · run_h2_partitioned — K-column design matrix, per-annotation
│                      enrichment, resolves M vector from per-annotation M files
│                    · run_h2_scalar — shared by standalone h2 and rg univariate sub-fits
│                    · run_rg — iterates trait pairs; gencov regression + univariate h2
│                      per trait; --two-step, --intercept-gencov, --no-intercept,
│                      liability-scale rg; prints summary table
│                    · load_ld_ref / load_ld — LazyFrame readers for ref and weight LD
│                    · resolve_m / resolve_m_vec — reads .l2.M_5_50 or falls back to n_obs
│                    · liability_scale_h2 — observed → liability scale conversion
│                    · print_jackknife_diagnostics — --print-cov / --print-delete-vals
│
└── make_annot.rs    BED → 0/1 annotation generator.
                     · annotate_from_bed — loads BED intervals per chromosome,
                       sorts and merges, binary-search annotation per SNP
                     · annotate_from_gene_set — gene symbols → coordinate lookup →
                       same interval merge/annotate pipeline
                     · write_annot_file — CHR BP SNP CM ANNOT TSV (optional .gz)

Key data-flow invariants

  • ldscore --annot prefix reads {prefix}{chr}.annot[.gz] for every chromosome found in the BIM (not a single prefix.annot.gz file).
  • --extract filters the BIM before window computation; --print-snps filters only the output.
  • bed_idx (original BIM row index) differs from pos (index in the filtered all_snps slice) when --extract is active; bed_idx_to_pos in run() maps between them.
  • --keep passes iid_indices: Option<&Array1<isize>> to bed-reader; n_indiv_actual (not the FAM total) is used for normalization and the r²-unbiased correction.
  • The ring buffer ring_size = max_window + chunk_c guarantees no live window slot is overwritten before it has been consumed in the A×B product.
  • rg --no-intercept propagates fixed_intercept = Some(1.0) to both the gencov regression and each univariate run_h2_scalar call, matching Python's behaviour.

Dependency rationale

Crate                      Version       Role
bed-reader                 1             mmap-based PLINK .bed reading; only touched pages loaded
polars                     0.53          lazy CSV streaming (munge + LD score file loading)
ndarray + ndarray-linalg   0.16 + 0.17   dense matrix algebra; SVD for IRWLS
blas-src + OpenBLAS        0.10          statically-linked BLAS for DGEMM
rayon                      1             data-parallel jackknife blocks
statrs                     0.18          Normal CDF/quantile for P→Z conversion
clap                       4             derive-macro CLI argument parsing
anyhow / thiserror         1 / 2         error propagation
flate2                     1             gzip output for .sumstats.gz and .ldscore.gz