limma-rust 0.1.0

# limma-rust


[![Crates.io](https://img.shields.io/crates/v/limma-rust.svg)](https://crates.io/crates/limma-rust)
[![Docs.rs](https://img.shields.io/docsrs/limma-rust)](https://docs.rs/limma-rust)
[![CI](https://github.com/sinmojito/limma-rust/actions/workflows/ci.yml/badge.svg)](https://github.com/sinmojito/limma-rust/actions/workflows/ci.yml)
[![License: GPL-3.0-or-later](https://img.shields.io/badge/license-GPL--3.0--or--later-blue.svg)](LICENSE)

**limma-rust** is a pure-Rust port of the Bioconductor [**limma**](https://bioconductor.org/packages/limma/)
package: linear models and empirical-Bayes moderation for differential
expression in microarray, RNA-seq, proteomics, and metabolomics data. It
reproduces limma 3.68.3's output to **8+ significant figures** (checked function
by function against R 4.6.0), makes **identical differential-expression calls**
(zero discordant at FDR 5% across six public accessions and four organisms), and
runs **2-15× faster at 4.8-11.8× lower peak memory**, with **no BLAS, LAPACK,
Python, or R at runtime**.

> **Disclaimer.** limma-rust is an independent, unofficial Rust port and is **not**
> affiliated with, endorsed by, or supported by the authors of limma or the
> Walter and Eliza Hall Institute (WEHI). For the canonical, authoritative
> implementation, use the [Bioconductor limma package](https://bioconductor.org/packages/limma/).

## Highlights


- **Function-by-function parity.** 229 tests assert that each ported function
  reproduces Bioconductor limma 3.68.3 (on R 4.6.0) to 8+ significant figures;
  nothing is accepted unless it matches.
- **Validated on real data.** End-to-end agreement on six public accessions
  across three omics modalities (RNA-seq, proteomics, metabolomics) and four
  organisms (human, mouse, fruit fly, fission yeast), with **100% `decideTests`
  concordance** and zero discordant significant calls.
- **Bit-exact where it counts.** The rotation gene-set tests
  (`roast`/`mroast`/`romer`) reproduce R's Mersenne-Twister draws exactly, so
  their Monte-Carlo p-values match *to the last digit*.
- **Fast, and parallel by default.** Pure-Rust SIMD linear algebra
  ([`faer`](https://crates.io/crates/faer)) plus across-gene `rayon` parallelism:
  2.0× on microarrays, 7.5× on a 204-sample RNA-seq voom set, 15.4× on synthetic
  voom, and up to 243× on per-gene workloads like `duplicateCorrelation`.
- **No native dependencies.** All linear algebra and special functions are pure
  Rust (distributions from [`statrs`](https://crates.io/crates/statrs)): a single
  static binary, nothing to link, no R or Python to install.
- **Library or CLI.** Use it as a crate (`use limma::...`) or drive the bundled
  `limma` command-line tool over CSV. `--no-default-features` builds a minimal,
  dependency-light pure-`ndarray` library.

## Installation


The crate is published as **`limma-rust`**; the library is imported as **`limma`**
(so `use limma::...` mirrors R's `limma::`).

```bash
# Library

cargo add limma-rust

# CLI binary (installs a `limma` executable)

cargo install limma-rust
```

The default build requires Rust **1.85+**. Building `--no-default-features` (no
`faer`/`rayon`/`clap`) drops that requirement for older toolchains.

## Quick start: library


`lmFit -> eBayes -> topTable`, the canonical limma pipeline, exactly as in R:

```rust
use limma::{ebayes, lmfit, top_table, SortBy};
use ndarray::Array2;

// Expression matrix: 6 genes (rows) × 6 samples (columns), on the log2 scale.
let exprs = Array2::from_shape_vec((6, 6), vec![
    5.1, 4.9, 5.0, 7.0, 7.2, 6.9, // g1: up in group B
    3.0, 3.1, 2.9, 3.0, 2.8, 3.1, // g2: flat
    8.0, 8.2, 7.9, 6.0, 5.9, 6.1, // g3: down in group B
    1.0, 1.2, 0.9, 1.1, 1.0, 1.3, // g4: flat
    4.0, 4.1, 3.9, 4.2, 4.0, 4.1, // g5: flat
    2.0, 2.1, 1.9, 5.0, 5.1, 4.9, // g6: up in group B
]).unwrap();

// Design: 6 samples (rows) × 2 coefficients: intercept + a group-B indicator.
let design = Array2::from_shape_vec((6, 2), vec![
    1.0, 0.0,  1.0, 0.0,  1.0, 0.0,
    1.0, 1.0,  1.0, 1.0,  1.0, 1.0,
]).unwrap();

let genes: Vec<String> = (1..=6).map(|i| format!("g{i}")).collect();
let coefs = vec!["Intercept".to_string(), "grpB".to_string()];

let mut fit = lmfit(&exprs, &design, genes, coefs).unwrap();
ebayes(&mut fit, 0.01, (0.1, 4.0), /* trend */ false, /* robust */ false).unwrap();

// Tabulate the "grpB" coefficient (index 1), ranked by the B-statistic.
for row in top_table(&fit, 1, SortBy::B, None).unwrap() {
    println!(
        "{:<3} log2FC={:+.2}  t={:+.2}  adj.P={:.1e}  B={:+.2}",
        row.id, row.log2_fold_change, row.t, row.adj_p_value, row.b,
    );
}
// g6 (the +3 log2FC gene) tops the table; g3 (down) and g1 (up) tie next;
// the three flat genes fall to the bottom with strongly negative B.
```

## Quick start: CLI


```bash
# lmFit -> eBayes -> topTable for one coefficient

limma --exprs exprs.csv --design design.csv --coef grpB --out results.csv

# Overall moderated F-statistic across all coefficients

limma --exprs exprs.csv --design design.csv --f-test --out ftable.csv

# eBayes with an intensity-dependent variance trend + robust hyperparameters

limma --exprs exprs.csv --design design.csv --coef grpB --trend --robust

# TREAT: moderated t against an absolute log2 fold-change threshold of 1.0

limma --exprs exprs.csv --design design.csv --coef grpB --treat 1.0

# Apply a contrast matrix, then write decideTests calls (BH, p < 0.05)

limma --exprs exprs.csv --design design.csv \
      --contrasts contrasts.csv --coef AS_vs_AR --decide calls.csv
```

### Input format


Both inputs are delimited text whose first column is an identifier and whose
header names the remaining columns. The delimiter is detected from the file
extension (tab for `.tsv` / `.tab`, comma otherwise) and can be forced with
`--delimiter` (a single character, or one of `comma`, `tab`, `semicolon`,
`space`):

```text
# exprs.csv: genes × samples (first column = gene id, header = sample names)

gene,s1,s2,s3,s4,s5,s6
g1,5.1,4.9,5.0,7.0,7.2,6.9
g2,3.0,3.1,2.9,3.0,2.8,3.1

# design.csv: samples × coefficients (first column = sample id, header = coef names)

sample,Intercept,grpB
s1,1,0
s4,1,1
```

### Common options


| Option | Description | Default |
|---|---|---|
| `--exprs <FILE>` | Expression matrix CSV/TSV (genes × samples) | *required* |
| `--design <FILE>` | Design matrix CSV/TSV (samples × coefficients) | *required* |
| `--contrasts <FILE>` | Contrast matrix CSV/TSV (coefficients × contrasts) | none |
| `--delimiter <CH>` | Force the input field delimiter | auto by extension |
| `--coef <NAME\|IDX>` | Coefficient/contrast to tabulate (moderated-t table) | last coef |
| `--f-test` | Tabulate the overall moderated F instead | `false` |
| `--treat <LFC>` | Run TREAT against an absolute log2-FC threshold | none |
| `--trend` / `--robust` | Intensity trend / robust eBayes hyperparameters | `false` |
| `--proportion <P>` | Assumed proportion of DE genes (eBayes) | `0.01` |
| `--sort <KEY>` | `logFC` \| `AveExpr` \| `P` \| `t` \| `B` \| `none` | `B` |
| `--number <N>` | Maximum rows to emit | all |
| `--out <FILE>` | Write the ranked top table (CSV) | stdout |
| `--write-fit <FILE>` | `write.fit`-style table, all coefficients (TSV) | none |
| `--decide <FILE>` | `decideTests` matrix (-1/0/1 per gene per contrast) | none |

Run `limma --help` for the full list.

## Validation & benchmarks


Every ported function is checked against the installed R limma package: file-based
R scripts under `reference/` dump per-(gene, coefficient) reference values from the
original package, and the Rust test suite asserts the port reproduces them. The
end-to-end pipeline is additionally validated and benchmarked on real datasets.
Full methodology, datasets, and honest caveats are in
[`benchmarks/REPORT.md`](https://github.com/sinmojito/limma-rust/blob/main/benchmarks/REPORT.md).

**Numerical parity.** The worst-case relative error vs Bioconductor limma is
**4.5e-08** (on `lods`, the cancellation-prone log-odds B-statistic); most
statistics agree to ~1e-10, and p-values to ~1e-11. Across every dataset,
**`decideTests` agreement is 100%** with zero discordant significant calls,
cross-checked exactly against limma's own `decideTests`.

**Speed & memory** (median, vs R 4.6.0 / limma 3.68.3 on a 16-core Windows 11
machine; compute only, IO excluded):

| Workload | Shape (features × samples) | Pipeline | limma-rust vs R |
|---|---:|---|---:|
| Microarray | 12,048 × 37 | `lmFit -> eBayes` | **2.0×** |
| Microarray | 11,623 × 80 | `lmFit -> eBayes` | **2.6×** |
| Real RNA-seq (human lung) | 16,413 × 204 | `voom -> lmFit -> eBayes` | **7.5×** |
| Synthetic RNA-seq | 11,996 × 12 | `voom -> lmFit -> eBayes` | **15.4×** |
| `camera` (gene-set test) | 7,919 × 7 | competitive set test | **3.0×** |
| `roast`/`mroast`/`romer` | 7,919 × 7 | rotation set tests | **7-25×** |
| `duplicateCorrelation` | 5,861 × 36 | per-gene REML | **243×** |

Peak resident memory is **4.8-11.8× below R** across a 2k-32k-gene /
12-384-sample sweep (whole-process RSS; R's ~110 MB interpreter-plus-limma
baseline inflates the ratio on small inputs, settling to ~5× once the data
dominates).

**Where R still wins.** On a *wide, unweighted* design (181 genes against 1,335
samples), R is ~3× faster: it factorises the shared design once and applies it to
every gene via batched LAPACK, whereas the port solves per gene. That is the one
documented regime where the reference's batched linear algebra wins; it is a
trade-off, not a correctness gap.

## Function coverage


limma-rust ports limma's statistical core and most of the analyses built on top of
it. The table tracks per-area status against the upstream R source; see
[`reference/inventory.tsv`](https://github.com/sinmojito/limma-rust/blob/main/reference/inventory.tsv)
for the full list.

| Area | limma functions | Status |
|------|-----------------|--------|
| Linear model fit | `lmFit`, `lm.series`, `mrlm`, `gls.series`, `nonEstimable`, `is.fullrank` | ported |
| Contrasts | `contrasts.fit`, `makeContrasts` | ported |
| Empirical Bayes | `eBayes` (incl. `trend`/`robust`), `squeezeVar`, `fitFDist`, `fitFDistRobustly`, `fitFDistUnequalDF1`, `tmixture.matrix` | ported |
| Lowess / loess | `loessFit`, `weightedLowess`, `tricubeMovingAverage` | ported |
| Result tables | `topTable`, `topTableF`, `toptable`, BH/BY adjust, `write.fit` | ported |
| Decision tests | `decideTests` (separate/global/hierarchical/nestedF), `classifyTestsF`, `p.adjust` (none/bonferroni/holm/BH/BY) | ported |
| TREAT | `treat`, `topTreat` | ported |
| RNA-seq weights | `voom`, `voomWithQualityWeights`, `vooma`, `voomaByGroup`, `voomaLmFit`, `arrayWeights`, `printtipWeights`, `beadCountWeights` | ported |
| Correlation / weights | `duplicateCorrelation`, `interGeneCorrelation`, `removeBatchEffect` | ported |
| Normalization | `normalizeBetweenArrays`, `normalizeQuantiles`, `normalizeCyclicLoess`, `normalizeWithinArrays`, `MA.RG`/`RG.MA` | ported |
| Background correction | `backgroundCorrect`, `normexp.fit` (saddle-point MLE), `nec`, `neqc`, `ma3x3` | ported |
| Gene-set tests | `camera`, `cameraPR`, `fry`, `geneSetTest`, `wilcoxGST`, `ids2indices` | ported |
| Gene-set tests (rotation) | `roast`, `mroast`, `romer`, `topRomer` (bit-exact Mersenne-Twister RNG) | ported |
| Exon / splicing | `diffSplice`, `topSplice`, `genas`, `wsva` | ported |
| Two-color helpers | `exprs.MA`, `designI2A`, `designI2M`, `lmscFit` | ported |
| External-numerics norm | `normalizeVSN`, `normalizeRobustSpline`, `normalizeForPrintorder`, `kooperberg`, `intraspotCorrelation` | out of scope |
| Microarray IO readers | `read.maimages`, `read.ilmn`, ... | out of scope |
| Containers / S4 methods | `RGList`/`MAList`/`EList` classes, `cbind`/`rbind`/`merge`/`dim` | out of scope |
| Plotting | `plotMD`, `plotMDS`, `volcanoplot`, `venn`, ... | out of scope |
| Annotation | `goana`, `kegga` | out of scope |

## R to limma-rust name map


Names follow Rust's `snake_case`; the mapping is otherwise one-to-one. A few
common entry points:

| R (limma) | limma-rust |
|---|---|
| `lmFit` | `lmfit`, `lmfit_weighted` |
| `contrasts.fit` / `makeContrasts` | `contrasts_fit` / `make_contrasts` |
| `eBayes` / `treat` | `ebayes` / `treat` |
| `topTable` / `topTableF` | `top_table` / `top_table_f` |
| `decideTests` | `decide_tests` |
| `voom` / `voomWithQualityWeights` | `voom` / `voom_with_quality_weights` |
| `duplicateCorrelation` | `duplicate_correlation` |
| `removeBatchEffect` | `remove_batch_effect` |
| `camera` / `cameraPR` | `camera` / `camera_pr` |
| `roast` / `mroast` / `romer` | `roast` / `mroast` / `romer` |
| `normalizeBetweenArrays` | `normalize_between_arrays` |

## Building from source


```bash
cargo build --release              # default: parallel + faer + CLI
cargo test                         # 229 parity tests + CLI integration test
cargo build --no-default-features  # minimal pure-ndarray library
```

## How this port was produced


limma-rust was written with AI assistance (Anthropic's Claude, via Claude Code);
the `Co-Authored-By` trailers in the git history record this. Its outputs are
checked numerically against the upstream package, as described under
[Validation & benchmarks](#validation--benchmarks). The limma source is fetched
locally to `reference/limma-src/` for study (via `reference/get_source.R`) and is
not vendored into this repository.

## License


limma is licensed GPL (>= 2). As a derivative port, limma-rust is distributed under
**GPL-3.0-or-later**; the full license text is in [`LICENSE`](LICENSE).

## Credits & citation


limma is the work of Gordon K. Smyth and colleagues. This port reimplements their
algorithms; it does not originate them. If you use limma-rust in research, please
cite the original limma paper:

> Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth,
> G.K. (2015). limma powers differential expression analyses for RNA-sequencing
> and microarray studies. *Nucleic Acids Research* 43(7), e47.
> doi:[10.1093/nar/gkv007](https://doi.org/10.1093/nar/gkv007)