limma-rust 0.1.0

Pure-Rust port of the Bioconductor limma differential-expression package
Documentation

limma-rust

Crates.io Docs.rs CI License: GPL-3.0-or-later

limma-rust is a pure-Rust port of the Bioconductor limma package: linear models and empirical-Bayes moderation for differential expression in microarray, RNA-seq, proteomics, and metabolomics data. It reproduces limma 3.68.3's output to 8+ significant figures (checked function by function against R 4.6.0), makes identical differential-expression calls (zero discordant at FDR 5% across six public accessions and four organisms), and runs 2-15× faster at 4.8-11.8× lower peak memory, with no BLAS, LAPACK, Python, or R at runtime.

Disclaimer. limma-rust is an independent, unofficial Rust port and is not affiliated with, endorsed by, or supported by the authors of limma or the Walter and Eliza Hall Institute (WEHI). For the canonical, authoritative implementation, use the Bioconductor limma package.

Highlights

  • Function-by-function parity. 229 tests assert that each ported function reproduces Bioconductor limma 3.68.3 (on R 4.6.0) to 8+ significant figures; nothing is accepted unless it matches.
  • Validated on real data. End-to-end agreement on six public accessions across three omics modalities (RNA-seq, proteomics, metabolomics) and four organisms (human, mouse, fruit fly, fission yeast), with 100% decideTests concordance and zero discordant significant calls.
  • Bit-exact where it counts. The rotation gene-set tests (roast/mroast/romer) reproduce R's Mersenne-Twister draws exactly, so their Monte-Carlo p-values match to the last digit.
  • Fast, and parallel by default. Pure-Rust SIMD linear algebra (faer) plus across-gene rayon parallelism: 2.0× on microarrays, 7.5× on a 204-sample RNA-seq voom set, 15.4× on synthetic voom, and up to 243× on per-gene workloads like duplicateCorrelation.
  • No native dependencies. All linear algebra and special functions are pure Rust (distributions from statrs): a single static binary, nothing to link, no R or Python to install.
  • Library or CLI. Use it as a crate (use limma::...) or drive the bundled limma command-line tool over CSV. --no-default-features builds a minimal, dependency-light pure-ndarray library.

Installation

The crate is published as limma-rust; the library is imported as limma (so use limma::... mirrors R's limma::).

# Library

cargo add limma-rust


# CLI binary (installs a `limma` executable)

cargo install limma-rust

The default build requires Rust 1.85+. Building --no-default-features (no faer/rayon/clap) drops that requirement for older toolchains.

Quick start: library

lmFit -> eBayes -> topTable, the canonical limma pipeline, exactly as in R:

use limma::{ebayes, lmfit, top_table, SortBy};
use ndarray::Array2;

// Expression matrix: 6 genes (rows) × 6 samples (columns), on the log2 scale.
let exprs = Array2::from_shape_vec((6, 6), vec![
    5.1, 4.9, 5.0, 7.0, 7.2, 6.9, // g1: up in group B
    3.0, 3.1, 2.9, 3.0, 2.8, 3.1, // g2: flat
    8.0, 8.2, 7.9, 6.0, 5.9, 6.1, // g3: down in group B
    1.0, 1.2, 0.9, 1.1, 1.0, 1.3, // g4: flat
    4.0, 4.1, 3.9, 4.2, 4.0, 4.1, // g5: flat
    2.0, 2.1, 1.9, 5.0, 5.1, 4.9, // g6: up in group B
]).unwrap();

// Design: 6 samples (rows) × 2 coefficients: intercept + a group-B indicator.
let design = Array2::from_shape_vec((6, 2), vec![
    1.0, 0.0,  1.0, 0.0,  1.0, 0.0,
    1.0, 1.0,  1.0, 1.0,  1.0, 1.0,
]).unwrap();

let genes: Vec<String> = (1..=6).map(|i| format!("g{i}")).collect();
let coefs = vec!["Intercept".to_string(), "grpB".to_string()];

let mut fit = lmfit(&exprs, &design, genes, coefs).unwrap();
ebayes(&mut fit, 0.01, (0.1, 4.0), /* trend */ false, /* robust */ false).unwrap();

// Tabulate the "grpB" coefficient (index 1), ranked by the B-statistic.
for row in top_table(&fit, 1, SortBy::B, None).unwrap() {
    println!(
        "{:<3} log2FC={:+.2}  t={:+.2}  adj.P={:.1e}  B={:+.2}",
        row.id, row.log2_fold_change, row.t, row.adj_p_value, row.b,
    );
}
// g6 (the +3 log2FC gene) tops the table; g3 (down) and g1 (up) tie next;
// the three flat genes fall to the bottom with strongly negative B.

Quick start: CLI

# lmFit -> eBayes -> topTable for one coefficient

limma --exprs exprs.csv --design design.csv --coef grpB --out results.csv


# Overall moderated F-statistic across all coefficients

limma --exprs exprs.csv --design design.csv --f-test --out ftable.csv


# eBayes with an intensity-dependent variance trend + robust hyperparameters

limma --exprs exprs.csv --design design.csv --coef grpB --trend --robust


# TREAT: moderated t against an absolute log2 fold-change threshold of 1.0

limma --exprs exprs.csv --design design.csv --coef grpB --treat 1.0


# Apply a contrast matrix, then write decideTests calls (BH, p < 0.05)

limma --exprs exprs.csv --design design.csv \

      --contrasts contrasts.csv --coef AS_vs_AR --decide calls.csv

Input format

Both inputs are delimited text whose first column is an identifier and whose header names the remaining columns. The delimiter is detected from the file extension (tab for .tsv / .tab, comma otherwise) and can be forced with --delimiter (a single character, or one of comma, tab, semicolon, space):

# exprs.csv: genes × samples (first column = gene id, header = sample names)
gene,s1,s2,s3,s4,s5,s6
g1,5.1,4.9,5.0,7.0,7.2,6.9
g2,3.0,3.1,2.9,3.0,2.8,3.1

# design.csv: samples × coefficients (first column = sample id, header = coef names)
sample,Intercept,grpB
s1,1,0
s4,1,1

Common options

Option Description Default
--exprs <FILE> Expression matrix CSV/TSV (genes × samples) required
--design <FILE> Design matrix CSV/TSV (samples × coefficients) required
--contrasts <FILE> Contrast matrix CSV/TSV (coefficients × contrasts) none
--delimiter <CH> Force the input field delimiter auto by extension
--coef <NAME|IDX> Coefficient/contrast to tabulate (moderated-t table) last coef
--f-test Tabulate the overall moderated F instead false
--treat <LFC> Run TREAT against an absolute log2-FC threshold none
--trend / --robust Intensity trend / robust eBayes hyperparameters false
--proportion <P> Assumed proportion of DE genes (eBayes) 0.01
--sort <KEY> logFC | AveExpr | P | t | B | none B
--number <N> Maximum rows to emit all
--out <FILE> Write the ranked top table (CSV) stdout
--write-fit <FILE> write.fit-style table, all coefficients (TSV) none
--decide <FILE> decideTests matrix (-1/0/1 per gene per contrast) none

Run limma --help for the full list.

Validation & benchmarks

Every ported function is checked against the installed R limma package: file-based R scripts under reference/ dump per-(gene, coefficient) reference values from the original package, and the Rust test suite asserts the port reproduces them. The end-to-end pipeline is additionally validated and benchmarked on real datasets. Full methodology, datasets, and honest caveats are in benchmarks/REPORT.md.

Numerical parity. The worst-case relative error vs Bioconductor limma is 4.5e-08 (on lods, the cancellation-prone log-odds B-statistic); most statistics agree to ~1e-10, and p-values to ~1e-11. Across every dataset, decideTests agreement is 100% with zero discordant significant calls, cross-checked exactly against limma's own decideTests.

Speed & memory (median, vs R 4.6.0 / limma 3.68.3 on a 16-core Windows 11 machine; compute only, IO excluded):

Workload Shape (features × samples) Pipeline limma-rust vs R
Microarray 12,048 × 37 lmFit -> eBayes 2.0×
Microarray 11,623 × 80 lmFit -> eBayes 2.6×
Real RNA-seq (human lung) 16,413 × 204 voom -> lmFit -> eBayes 7.5×
Synthetic RNA-seq 11,996 × 12 voom -> lmFit -> eBayes 15.4×
camera (gene-set test) 7,919 × 7 competitive set test 3.0×
roast/mroast/romer 7,919 × 7 rotation set tests 7-25×
duplicateCorrelation 5,861 × 36 per-gene REML 243×

Peak resident memory is 4.8-11.8× below R across a 2k-32k-gene / 12-384-sample sweep (whole-process RSS; R's ~110 MB interpreter-plus-limma baseline inflates the ratio on small inputs, settling to ~5× once the data dominates).

Where R still wins. On a wide, unweighted design (181 genes against 1,335 samples), R is ~3× faster: it factorises the shared design once and applies it to every gene via batched LAPACK, whereas the port solves per gene. That is the one documented regime where the reference's batched linear algebra wins; it is a trade-off, not a correctness gap.

Function coverage

limma-rust ports limma's statistical core and most of the analyses built on top of it. The table tracks per-area status against the upstream R source; see reference/inventory.tsv for the full list.

Area limma functions Status
Linear model fit lmFit, lm.series, mrlm, gls.series, nonEstimable, is.fullrank ported
Contrasts contrasts.fit, makeContrasts ported
Empirical Bayes eBayes (incl. trend/robust), squeezeVar, fitFDist, fitFDistRobustly, fitFDistUnequalDF1, tmixture.matrix ported
Lowess / loess loessFit, weightedLowess, tricubeMovingAverage ported
Result tables topTable, topTableF, toptable, BH/BY adjust, write.fit ported
Decision tests decideTests (separate/global/hierarchical/nestedF), classifyTestsF, p.adjust (none/bonferroni/holm/BH/BY) ported
TREAT treat, topTreat ported
RNA-seq weights voom, voomWithQualityWeights, vooma, voomaByGroup, voomaLmFit, arrayWeights, printtipWeights, beadCountWeights ported
Correlation / weights duplicateCorrelation, interGeneCorrelation, removeBatchEffect ported
Normalization normalizeBetweenArrays, normalizeQuantiles, normalizeCyclicLoess, normalizeWithinArrays, MA.RG/RG.MA ported
Background correction backgroundCorrect, normexp.fit (saddle-point MLE), nec, neqc, ma3x3 ported
Gene-set tests camera, cameraPR, fry, geneSetTest, wilcoxGST, ids2indices ported
Gene-set tests (rotation) roast, mroast, romer, topRomer (bit-exact Mersenne-Twister RNG) ported
Exon / splicing diffSplice, topSplice, genas, wsva ported
Two-color helpers exprs.MA, designI2A, designI2M, lmscFit ported
External-numerics norm normalizeVSN, normalizeRobustSpline, normalizeForPrintorder, kooperberg, intraspotCorrelation out of scope
Microarray IO readers read.maimages, read.ilmn, ... out of scope
Containers / S4 methods RGList/MAList/EList classes, cbind/rbind/merge/dim out of scope
Plotting plotMD, plotMDS, volcanoplot, venn, ... out of scope
Annotation goana, kegga out of scope

R to limma-rust name map

Names follow Rust's snake_case; the mapping is otherwise one-to-one. A few common entry points:

R (limma) limma-rust
lmFit lmfit, lmfit_weighted
contrasts.fit / makeContrasts contrasts_fit / make_contrasts
eBayes / treat ebayes / treat
topTable / topTableF top_table / top_table_f
decideTests decide_tests
voom / voomWithQualityWeights voom / voom_with_quality_weights
duplicateCorrelation duplicate_correlation
removeBatchEffect remove_batch_effect
camera / cameraPR camera / camera_pr
roast / mroast / romer roast / mroast / romer
normalizeBetweenArrays normalize_between_arrays

Building from source

cargo build --release              # default: parallel + faer + CLI

cargo test                         # 229 parity tests + CLI integration test

cargo build --no-default-features  # minimal pure-ndarray library

How this port was produced

limma-rust was written with AI assistance (Anthropic's Claude, via Claude Code); the Co-Authored-By trailers in the git history record this. Its outputs are checked numerically against the upstream package, as described under Validation & benchmarks. The limma source is fetched locally to reference/limma-src/ for study (via reference/get_source.R) and is not vendored into this repository.

License

limma is licensed GPL (>= 2). As a derivative port, limma-rust is distributed under GPL-3.0-or-later; the full license text is in LICENSE.

Credits & citation

limma is the work of Gordon K. Smyth and colleagues. This port reimplements their algorithms; it does not originate them. If you use limma-rust in research, please cite the original limma paper:

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47. doi:10.1093/nar/gkv007