ldsc 0.1.0

LD Score Regression — fast Rust reimplementation of Bulik-Sullivan et al. LDSC
# LDSC — Rust Rewrite

[![CI](https://github.com/sharifhsn/ldsc/actions/workflows/ci.yml/badge.svg)](https://github.com/sharifhsn/ldsc/actions)
[![crates.io](https://img.shields.io/crates/v/ldsc.svg)](https://crates.io/crates/ldsc)
[![License: GPL-3.0](https://img.shields.io/badge/license-GPL--3.0-blue.svg)](LICENSE)
[![MSRV: 1.85](https://img.shields.io/badge/rustc-1.85%2B-orange.svg)](https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html)

A compiled, statically-typed rewrite of [Bulik-Sullivan et al.'s LDSC](https://github.com/bulik/ldsc) in Rust.
Implements the same five subcommands — `munge-sumstats`, `ldscore`, `h2`, `rg`, `make-annot` — with
identical numerical output and a 7× speedup on LD score computation.

---

## Installation

```bash
cargo install ldsc
```

Requires Rust ≥ 1.85 and a C toolchain with Fortran support to build the statically-linked OpenBLAS.
On Debian/Ubuntu:

```bash
sudo apt-get install cmake gfortran libgfortran-dev
```

---

## Building from source

Requires a Rust toolchain (≥ 1.85; edition 2024 features used). OpenBLAS is linked statically — no
runtime library installation needed.

```bash
cargo build --release
# binary: target/release/ldsc
```

The release profile sets `opt-level = 3`, `lto = "thin"`, `codegen-units = 1`.

---

## Usage

### munge-sumstats

Pre-processes GWAS summary statistics into the `.sumstats.gz` format consumed by `h2` and `rg`.

```bash
ldsc munge-sumstats \
  --sumstats my_gwas.txt.gz \
  --out output_prefix \
  [--merge-alleles w_hm3.snplist] \
  [--signed-sumstats BETA,0] \
  [--n 50000] \
  [--info-min 0.9] \
  [--maf 0.01]
```

Key flags: `--signed-sumstats COLNAME,null_value` tells the tool which column carries effect direction and what the
null value is (e.g. `BETA,0`, `OR,1`, `Z,0`). Without this flag the tool auto-detects from BETA/LOG_ODDS/OR/Z columns.
`--a1-inc` skips the signed column and treats all Z-scores as positive (A1 is always the risk allele).

### ldscore

Computes LD scores from a PLINK binary file set.

```bash
ldsc ldscore \
  --bfile /path/to/1000G_EUR \
  --out out/eur \
  --ld-wind-cm 1.0 \
  [--annot annotations/BaselineLD.] \
  [--extract snplist.txt] \
  [--keep keep_individuals.txt] \
  [--per-allele] \
  [--blas-threads 4]
```

Window flags are mutually exclusive: `--ld-wind-cm` (genetic distance, default 1.0), `--ld-wind-kb`
(physical distance), or `--ld-wind-snp` (fixed flanking SNP count).

Partitioned LD scores with `--annot prefix`: expects `{prefix}{chr}.annot[.gz]` for each chromosome present in
the BIM, outputs one L2 column per annotation and corresponding `.l2.M` / `.l2.M_5_50` files.

### h2

Estimates SNP heritability.

```bash
ldsc h2 \
  --h2 trait.sumstats.gz \
  --ref-ld-chr eur_ldscores/chr \
  --w-ld-chr eur_ldscores/chr \
  --out results
```

Common options: `--no-intercept`, `--intercept-h2 VALUE`, `--two-step 30`, `--chisq-max 80`,
`--samp-prev 0.1 --pop-prev 0.01` (liability-scale conversion),
`--print-coefficients` (partitioned h2: per-annotation τ and enrichment).

### rg

Estimates genetic correlations across all pairs from a list of summary statistic files.

```bash
ldsc rg \
  --rg trait1.sumstats.gz,trait2.sumstats.gz,trait3.sumstats.gz \
  --ref-ld-chr eur_ldscores/chr \
  --w-ld-chr eur_ldscores/chr \
  --out results
```

Common options: `--no-intercept`, `--intercept-gencov 0.0,0.0` (per-pair), `--two-step 30`,
`--samp-prev` / `--pop-prev` (comma-separated, one value per input file).

### make-annot

Generates 0/1 annotation files from a UCSC BED file or a gene set.

```bash
# From a BED file:
ldsc make-annot \
  --bimfile my_data.bim \
  --bed-file regions.bed \
  --annot-file output.annot.gz \
  --windowsize 100000

# From a gene set:
ldsc make-annot \
  --bimfile my_data.bim \
  --gene-set-file immune_genes.txt \
  --gene-coord-file ENSG_coord.txt \
  --annot-file output.annot.gz \
  --windowsize 100000
```

---

## Performance

Benchmarks against the original Python on a 16-core desktop (AMD Ryzen 9 5950X) using 1000 Genomes
Phase 3 (n = 2,504 individuals, `--ld-wind-snp 100`).

| Dataset | SNPs | Rust | Python | Speedup |
|---------|------|------|--------|---------|
| chr22 | 24,624 | 1 s | 7 s | **7.0×** |
| 20% genome | 333,000 | 12 s | 88 s | **7.3×** |

Correctness: all 1,664,851 SNPs in the full 1000G genome verified to match Python within 0.001
(max diff 0.000508, median 0.000250) after fixing the four algorithmic bugs described below.

---

## Differences from Python

### Flag renames

| Python | Rust | Note |
|--------|------|------|
| `--ld-wind-snps` | `--ld-wind-snp` | trailing `s` removed |
| `--N` / `--N-col` | `--n` / `--n-col` | lowercase |
| `--M` | `--m-snps` | renamed |
| `--snp` / `--a1` / `--a2` / `--p` / `--frq` / `--info` | `--snp-col` / `--a1-col` / ... | `--*-col` suffix |
| `--maf-min` | `--maf` | renamed |

### Behavioural differences

- **`--maf` in ldscore**: applied as a post-filter on output (Python applies it pre-computation).
  For typical thresholds (≥ 0.01) this makes no numerical difference.
- **`--n-min` default**: Python defaults to 90th-percentile / 2; Rust defaults to 0 (disabled).
- **`--yes-really`**: Python refuses `--ld-wind-cm` spanning an entire chromosome without this flag;
  Rust has no such guard.
- **`--chunksize`**: Python requires explicit chunking for large files; Rust uses Polars LazyFrame
  streaming and ignores chunk size for munge.
- **`--return-silly-things` / `--invert-anyway`**: accepted flags for CLI parity; Rust never clips
  results and always uses a least-squares solver.

---

## Unimplemented Features

The following Python flags have no Rust equivalent and are not planned:

| Subcommand | Flag | Reason |
|------------|------|--------|
| `munge-sumstats` | `--daner` / `--daner-n` | Ripke daner format; niche use case |
| `ldscore` | `--pq-exp` | Generalised per-allele weighting; rare |
| `ldscore` | `--cts-bin` / `--cts-breaks` / `--cts-names` | Continuous variable partitioning |
| `ldscore` | `--no-print-annot` | Print-suppression flag; trivial |
| `h2` | `--h2-cts` / `--ref-ld-chr-cts` / `--print-all-cts` | Cell-type-specific h2 |
| `h2` / `rg` | `--overlap-annot` / `--frqfile-chr` | Overlapping annotation categories |
| `rg` | `--intercept-h2` | Fix per-trait h2 intercepts in rg |
| All | `.bz2` input files | Only `.gz` supported |

---

## Performance Deep-Dive

### Why Python is slow

The original Python implementation is bottlenecked by three independent factors:

1. **GIL-blocked jackknife.** `jackknife.py` runs 200 leave-one-block-out IRWLS refits sequentially.
   Each refit is a `scipy.linalg.lstsq` call that releases the GIL, but Python loop overhead and
   NumPy's per-call allocation dominate at this problem size.

2. **Per-SNP NumPy allocation in the LD score loop.** `ldscore.py` calls `np.dot` in a Python-level
   loop with fresh array views on each of the ~33,000 chunks for a 1M-SNP genome. Python's boxing
   overhead and NumPy's internal allocation path are not amortised.

3. **Sequential LD computation.** The GIL prevents genuine thread-level parallelism in the
   correlation loop.

### What the Rust implementation does differently

#### 1. Ring-buffer genotype store (`ldscore.rs`)

Python allocates a new `rfuncA` matrix every chunk. Rust pre-allocates a single F-order
`Array2<f64>` of shape `(n_indiv, ring_size)` where `ring_size = max_window + chunk_c`. SNP columns
are written into successive ring slots modulo `ring_size`; evicted slots are reused. This eliminates
~33,000 heap allocations for a 1M-SNP genome and improves cache locality because each active column
occupies a contiguous 8-byte stride in memory.

#### 2. Single DGEMM per chunk

For each chunk of B SNPs the computation is:

```
BB = Bᵀ · B          (chunk × chunk, unbiased r²)
AB = Aᵀ · B          (window × chunk, unbiased r²)
```

Both are single `ndarray::dot` calls dispatched to OpenBLAS DGEMM. The window matrix `A` is
assembled from ring slots into a pre-allocated F-order `a_buf` so columns are contiguous in memory
and the DGEMM kernel can stride through them without gather operations.

#### 3. Tuned BLAS thread count

OpenBLAS defaults to using all available cores, which creates thread-spawning overhead that
outweighs the parallelism benefit for the small matrices in 1000G-scale LD computation
(n ≈ 2,500, window ≈ 200). The Rust binary calls `openblas_set_num_threads(4)` at startup
(overridable with `--blas-threads`). This is optimal for 1000G; biobank data (n > 10,000) may
benefit from higher values.

#### 4. Global sequential pass — no cross-chromosome boundary artefact

Python processes all chromosomes as a single ordered dataset. With `--ld-wind-snps`, the last 100
SNPs of chromosome k and the first 100 of chromosome k+1 are within each other's windows. The 1000G
reference panel contains five continental populations, creating population-stratification-driven
Pearson r up to 0.38 across chromosome boundaries. Earlier versions of the Rust code ran per-chromosome
in parallel, which zeroed out these cross-boundary contributions and produced L2 values 1–2 units too
low for boundary SNPs. The current implementation mirrors Python: a single global pass over all SNPs
in BIM order, with per-chromosome files written from the global L2 array after the fact.

#### 5. Parallel block jackknife (`jackknife.rs`)

The 200 leave-one-block-out IRWLS refits are independent. Rayon's `into_par_iter` distributes them
across all available cores. Each refit allocates two ndarray views and one LAPACK SVD call; the total
wall time for h2 and rg is dominated by the file I/O and merge join, not the jackknife.

#### 6. Polars LazyFrame for munge (`munge.rs`)

`munge_sumstats.py` uses pandas, which loads the entire file into RAM before filtering. The Rust
implementation uses Polars `LazyCsvReader`, which pushes column selection, renaming, and filter
predicates into a query plan that streams the file in chunks. For large GWAS files (> 1 M SNPs) the
peak RAM is proportional to the output size, not the input size.

---

## Source Map (for LLMs)

```
src/
├── main.rs          Clap dispatch — parses CLI, calls into subcommand modules.
│
├── cli.rs           All argument structs (MungeArgs, LdscoreArgs, H2Args, RgArgs,
│                    MakeAnnotArgs). No logic — pure clap derive macros.
│
├── parse.rs         File I/O helpers:
│                    · scan_sumstats / scan_ldscore  → Polars LazyFrame
│                    · concat_chrs(prefix, suffix)   → concat per-chr files
│                    · read_m_total / read_m_vec      → .l2.M files
│                    · read_annot(prefix, thin)       → Array2<f64> + col names
│
├── munge.rs         munge-sumstats pipeline (Polars LazyFrame, no data loaded until
│                    collect). Internal functions:
│                    · apply_ignore, apply_col_overrides, normalize_columns
│                    · apply_info_list, apply_n_override
│                    · derive_z (BETA/SE → Z; P + sign → Z; --a1-inc)
│                    · filter_snps, apply_nstudy_filter
│                    · write_sumstats_gz (gzip TSV output)
│
├── ldscore.rs       LD score computation. Key types and functions:
│                    · BimRecord — CHR/SNP/CM/BP/bed_idx struct
│                    · parse_bim, count_fam, parse_fam — PLINK file parsers
│                    · load_individual_indices — --keep FID/IID file → isize indices
│                    · WindowMode — Cm / Kb / Snp enum
│                    · get_block_lefts_f64, get_block_lefts_snp — window boundaries
│                    · normalize_col — impute NaN → mean, centre, unit-variance (f32)
│                    · compute_ldscore_global — ring-buffer DGEMM loop (sequential,
│                      scalar and partitioned paths share the same pre-alloc buffers)
│                    · r2_unbiased — r² − (1−r²)/(n−2)
│                    · write_ldscore_refs — gzip TSV output
│                    · load_snp_set — HashSet<String> from --extract / --print-snps
│                    · run — orchestrates BIM read, --extract / --annot / --keep,
│                      calls compute_ldscore_global, writes per-chr .l2.ldscore.gz
│                      and .l2.M / .l2.M_5_50 files
│
├── irwls.rs         Iteratively Re-Weighted Least Squares.
│                    · IrwlsResult — est + optional jackknife fields
│                    · irwls(x, y, weights, n_iter) — pre-alloc xw/yw, SVD solve,
│                      reweight on fitted values; zero-alloc inner loop
│
├── jackknife.rs     Block jackknife variance estimation.
│                    · jackknife(x, y, weights, n_blocks, n_iter) →
│                      full-data IRWLS + n_blocks parallel leave-one-out refits
│                      (rayon par_iter) → pseudo-values → SE + covariance matrix
│
└── regressions.rs   h2 and rg regression drivers.
                     · run_h2 — loads sumstats + LD scores, inner-joins on SNP,
                       detects K annotation columns, dispatches to scalar (K=1)
                       or partitioned (K>1) path; supports --two-step, --no-intercept,
                       --intercept-h2, --print-coefficients, liability-scale output
                     · run_h2_partitioned — K-column design matrix, per-annotation
                       enrichment, resolves M vector from per-annotation M files
                     · run_h2_scalar — shared by standalone h2 and rg univariate sub-fits
                     · run_rg — iterates trait pairs; gencov regression + univariate h2
                       per trait; --two-step, --intercept-gencov, --no-intercept,
                       liability-scale rg; prints summary table
                     · load_ld_ref / load_ld — LazyFrame readers for ref and weight LD
                     · resolve_m / resolve_m_vec — reads .l2.M_5_50 or falls back to n_obs
                     · liability_scale_h2 — observed → liability scale conversion
                     · print_jackknife_diagnostics — --print-cov / --print-delete-vals

make_annot.rs        BED → 0/1 annotation generator.
                     · annotate_from_bed — loads BED intervals per chromosome,
                       sorts and merges, binary-search annotation per SNP
                     · annotate_from_gene_set — gene symbols → coordinate lookup →
                       same interval merge/annotate pipeline
                     · write_annot_file — CHR BP SNP CM ANNOT TSV (optional .gz)
```

### Key data-flow invariants

- `ldscore --annot prefix` reads `{prefix}{chr}.annot[.gz]` for every chromosome found in
  the BIM (not a single `prefix.annot.gz` file).
- `--extract` filters the BIM *before* window computation; `--print-snps` filters only the output.
- `bed_idx` (original BIM row index) differs from `pos` (index in the filtered `all_snps` slice)
  when `--extract` is active; `bed_idx_to_pos` in `run()` maps between them.
- `--keep` passes `iid_indices: Option<&Array1<isize>>` to `bed-reader`; `n_indiv_actual` (not
  the FAM total) is used for normalization and the r²-unbiased correction.
- The ring buffer `ring_size = max_window + chunk_c` guarantees no live window slot is overwritten
  before it has been consumed in the A×B product.
- `rg --no-intercept` propagates `fixed_intercept = Some(1.0)` to both the gencov regression and
  each univariate `run_h2_scalar` call, matching Python's behaviour.

### Dependency rationale

| Crate | Version | Role |
|-------|---------|------|
| `bed-reader` | 1 | mmap-based PLINK .bed reading; only touched pages loaded |
| `polars` | 0.53 | lazy CSV streaming (munge + LD score file loading) |
| `ndarray` + `ndarray-linalg` | 0.16 + 0.17 | dense matrix algebra; SVD for IRWLS |
| `blas-src` + OpenBLAS | 0.10 | statically-linked BLAS for DGEMM |
| `rayon` | 1 | data-parallel jackknife blocks |
| `statrs` | 0.18 | Normal CDF/quantile for P→Z conversion |
| `clap` | 4 | derive-macro CLI argument parsing |
| `anyhow` / `thiserror` | 1 / 2 | error propagation |
| `flate2` | 1 | gzip output for .sumstats.gz and .ldscore.gz |