gsva-rust 0.1.0

# gsva-rust

[![Crates.io](https://img.shields.io/crates/v/gsva-rust.svg)](https://crates.io/crates/gsva-rust)
[![Docs.rs](https://img.shields.io/docsrs/gsva-rust)](https://docs.rs/gsva-rust)
[![CI](https://github.com/sinmojito/gsva-rust/actions/workflows/ci.yml/badge.svg)](https://github.com/sinmojito/gsva-rust/actions/workflows/ci.yml)
[![License: Artistic-2.0](https://img.shields.io/badge/license-Artistic--2.0-blue.svg)](LICENSE)

**gsva-rust** is a pure-Rust port of the Bioconductor [**GSVA**](https://bioconductor.org/packages/GSVA/)
family of gene-set enrichment methods — GSVA, ssGSEA, the combined z-score, and
PLAGE. Checked against R 4.6.0 / GSVA 2.6.2: `gsva` (Gaussian kcdf) and `ssgsea`
reproduce the R package **bit-for-bit** on real data, while `zscore` and `plage`
match to **~1 ULP**. The default build runs every method **5–50× faster** than
Bioconductor GSVA, with **no BLAS, LAPACK, Python, or R at runtime**.

> **Disclaimer.** gsva-rust is an independent, unofficial Rust port and is **not**
> affiliated with, endorsed by, or supported by the authors of GSVA or
> Bioconductor. For the canonical, authoritative implementation, use the
> [Bioconductor GSVA package](https://bioconductor.org/packages/GSVA/).

Pre-1.0: the public API may still shift before the first crates.io release. The
default build's only dependencies are two pure-Rust crates —
[`rayon`](https://crates.io/crates/rayon) for parallelism and
[`faer`](https://crates.io/crates/faer) for `plage`'s SVD; `--no-default-features`
drops both for a zero-dependency, pure-serial crate.

## Highlights

- **Parity with the R package, not with hand-picked values.** Every reference is
  dumped from Bioconductor `GSVA` at full `%.17g` precision. `gsva` (Gaussian
  kcdf) and `ssgsea` reproduce it bit-for-bit on real data; `zscore` and `plage`
  match to ~1 ULP.
- **Validated at two levels.** Seeded synthetic fixtures run offline under
  `cargo test`, plus a real-data gate on a public glioblastoma cohort scored
  against brain-tissue gene sets.
- **Four methods, one call shape.** `gsva`, `ssgsea`, `zscore`, and `plage` each
  take `(&ExprMatrix, &GeneSets, &Params)` and return an `EnrichmentResult`.
- **Faster than R, parity intact.** `rayon` parallelizes each method's
  independent axis bit-identically; `plage`'s SVD uses pure-Rust SIMD
  [`faer`](https://crates.io/crates/faer). 5.3–49.9× over Bioconductor GSVA across
  the five method/normalization variants.
- **Dependency-light.** Two pure-Rust deps by default, zero with
  `--no-default-features`. MSRV **1.85**, set by faer 0.24's dependency tree; CI
  builds all three feature configs on it.

## Installation

```toml
[dependencies]
gsva-rust = "0.1"
```

The library is imported as `gsva`:

```rust
use gsva::ExprMatrix;
```

## Quick start

```rust
use gsva::{ExprMatrix, GeneSet, GeneSets, GsvaParams};

// Expression as genes × samples, stored row-major.
let expr = ExprMatrix::new(
    vec!["TP53".into(), "EGFR".into(), "MYC".into(), "PTEN".into()],
    vec!["sample1".into(), "sample2".into(), "sample3".into()],
    vec![
        5.1, 6.2, 4.8, // TP53
        2.0, 9.1, 3.3, // EGFR
        7.7, 1.2, 8.0, // MYC
        4.4, 4.6, 4.5, // PTEN
    ],
);
let sets = GeneSets::new(vec![
    GeneSet::new("proliferation", vec!["MYC".into(), "EGFR".into()]),
    GeneSet::new("suppressors", vec!["TP53".into(), "PTEN".into()]),
]);

// The free function and the module share the name `gsva`, so call it qualified.
let res = gsva::gsva(&expr, &sets, &GsvaParams::default());
assert_eq!(res.gene_sets, ["proliferation", "suppressors"]);
let proliferation_s2 = res.score(0, 1); // (gene-set index, sample index)
println!("proliferation in sample2 = {proliferation_s2}");
```

Swap `gsva::gsva`/`GsvaParams` for `ssgsea`/`SsgseaParams`,
`zscore`/`ZscoreParams`, or `plage`/`PlageParams` to run the other methods — they
share the `(&ExprMatrix, &GeneSets, &Params) -> EnrichmentResult` shape. Real
inputs are usually loaded with `gsva::io::read_tsv_matrix` (a genes × samples
TSV) and `GeneSets::from_gmt` (an MSigDB-style `.gmt`).

## Methods

| Method | R entry point | Statistic |
|---|---|---|
| GSVA | `gsva(gsvaParam(...))` | Kolmogorov–Smirnov-style max-deviation random walk over a per-gene kernel CDF |
| ssGSEA | `gsva(ssgseaParam(...))` | rank-weighted running-sum enrichment score |
| z-score | `gsva(zscoreParam(...))` | mean of per-gene standardized scores |
| PLAGE | `gsva(plageParam(...))` | first right-singular vector of the standardized set submatrix |

## Validation

Every method is validated against the real Bioconductor `GSVA` package (R 4.6.0,
GSVA 2.6.2) at two levels.

**Real-data gate** — a public glioblastoma cohort (Verhaak *et al.*; 11 861
genes × 173 samples) scored against four brain-tissue gene sets, i.e. 692 scores
per method:

| Method | worst abs err | worst rel err | notes |
|---|---|---|---|
| `gsva` (Gaussian kcdf) | `0.0` | `0.0` | bit-identical |
| `ssgsea` (`normalize = true`) | `0.0` | `0.0` | bit-identical |
| `ssgsea` (`normalize = false`) | `0.0` | `0.0` | bit-identical |
| `zscore` | `3.6e-15` | `3.5e-14` | ~1 ULP (80-bit R reductions) |
| `plage` | `1.6e-15` | `2.3e-12` | faer SVD; magnitudes ~1 ULP, sign-aligned (Jacobi build: 8.8e-15) |

**Synthetic fixtures** (committed; run offline by `cargo test`) match R to
< `1e-9` for every method, covering `ssgsea` with `normalize` on and off and
`gsva` with both the Gaussian and Poisson kernels.

References are dumped from R at full `%.17g` precision by
`benchmarks/gen_parity_synthetic.R` (synthetic, committed) and
`benchmarks/parity_realdata.R` (real data, gitignored); the real-data example is
reproducible with `cargo run --release --example parity_realdata`, and aggregate
metrics live in [`benchmarks/parity_realdata_results.json`](https://github.com/sinmojito/gsva-rust/blob/main/benchmarks/parity_realdata_results.json).

## Fidelity notes

- R accumulates `mean`/`sum` in 80-bit `long double`; Rust uses `f64`. Where a
  method reduces over hundreds of genes, parity may be to ~1 ULP rather than
  bit-identical — the same pattern documented for the MCP-counter port.
- An empty gene-set/expression intersection is a hard error (R `stop()`s); the
  port fails loudly rather than emitting a silent empty result.

## R → Rust name map

| R (Bioconductor GSVA) | `gsva-rust` |
|---|---|
| `gsva(gsvaParam(expr, sets, kcdf, tau, maxDiff, absRanking))` | `gsva(&expr, &sets, &GsvaParams { .. })` |
| `gsva(ssgseaParam(expr, sets, alpha, normalize))` | `ssgsea(&expr, &sets, &SsgseaParams { .. })` |
| `gsva(zscoreParam(expr, sets))` | `zscore(&expr, &sets, &ZscoreParams::default())` |
| `gsva(plageParam(expr, sets))` | `plage(&expr, &sets, &PlageParams::default())` |
| `kcdf = "Gaussian"` / `"Poisson"` | `Kcdf::Gaussian` / `Kcdf::Poisson` |
| `maxDiff`, `absRanking`, `tau` | `max_diff`, `abs_ranking`, `tau` |
| `alpha`, `ssgsea.norm` | `alpha`, `normalize` |
| `minSize`, `maxSize` | `min_size`, `max_size` |
| expression matrix / `ExpressionSet` | `ExprMatrix` |
| `GeneSetCollection` / `.gmt` | `GeneSets` (`GeneSets::from_gmt`) |
| returned score matrix | `EnrichmentResult { gene_sets, samples, scores }` |

## Performance

The default-on `parallel` feature parallelizes each method over its independent
axis — kcdf over genes, `gsva`/`ssgsea` over samples, `zscore`/`plage` over gene
sets — so each unit of work does byte-identical arithmetic and scores are written
back in index order. `gsva`, `ssgsea`, and `zscore` are therefore bit-identical
between the serial and parallel builds, and against R. The exception is `plage`:
the default build factorizes its SVD with `faer` (pure-Rust SIMD, no BLAS/LAPACK)
while `--no-default-features` uses an in-crate one-sided Jacobi SVD, so the two
builds agree to ~1e-12 rather than bit-for-bit; both match R.

Single-point timing on a seeded synthetic matrix (10 000 genes × 200 samples ×
100 gene sets; 16 logical cores; R 4.6.0 / GSVA 2.6.2, rustc 1.95.0; median of 5,
warm-up excluded). "serial" is `--no-default-features` (Jacobi SVD for `plage`);
"parallel" is the default build (rayon + faer SVD):

| method | R (s) | serial (s) | parallel (s) | parallel vs R |
|---|---:|---:|---:|---:|
| `gsva` (Gaussian) | 2.933 | 2.928 | 0.281 | 10.4× |
| `ssgsea` (`normalize = true`) | 1.216 | 0.194 | 0.024 | 49.9× |
| `ssgsea` (`normalize = false`) | 1.204 | 0.188 | 0.025 | 48.8× |
| `zscore` | 0.070 | 0.016 | 0.013 | 5.3× |
| `plage` (faer) | 1.106 | 39.23 | 0.062 | 17.8× |

The default build beats Bioconductor GSVA on all five variants (5.3–49.9×).
`plage` is the widest swing: faer's SVD turns R's LAPACK `dgesdd` step into a
17.8× win. The zero-dependency Jacobi path is instead ~35× slower than R in serial
(rayon recovers it to ~7×), trading speed for bit-faithfulness — Jacobi converges
on the rank-deficient submatrices real gene sets routinely produce, where cheaper
power iteration would stall. Because gene sets stay bounded (tens–hundreds of
genes) as the genome grows, `plage` is nearly flat in gene count; see the scaling
sweep.

Full methodology, the gene/sample scaling sweep, and peak-memory curves are in
[`benchmarks/REPORT.md`](https://github.com/sinmojito/gsva-rust/blob/main/benchmarks/REPORT.md). Reproduce with:

```sh
Rscript benchmarks/gen_bench_synth.R    # seeded synthetic matrix + gene sets
Rscript benchmarks/bench_gsva.R         # single-point timing  -> results_gsva.json
Rscript benchmarks/bench_scaling.R      # size sweep + memory   -> results_scaling_gsva.json
```

## Building from source

```sh
cargo build
cargo test
```

Gate (matches CI):

```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test
```

## How this port was produced

gsva-rust was written with AI assistance (Anthropic's Claude, via
[Claude Code](https://www.anthropic.com/claude-code)); the `Co-Authored-By`
trailers in the git history record this. Every method is checked for numeric
parity against the real Bioconductor `GSVA` package — on both synthetic fixtures
and real public RNA-seq — before it is considered done. The upstream GSVA C and R
sources ([github.com/rcastelo/GSVA](https://github.com/rcastelo/GSVA)) were studied
locally for the port and are not vendored into this repository.

## License

Licensed under the **Artistic License 2.0**, the same license as the upstream
GSVA package. See [`LICENSE`](LICENSE).

## Credits & citation

The GSVA methods and their reference implementation are the work of Sonja
Hänzelmann, Robert Castelo, Justin Guinney, and colleagues, maintained as the
Bioconductor **GSVA** package. This port reimplements their algorithms against
that package's behavior; it does not originate them.

- Original software: [Bioconductor GSVA](https://bioconductor.org/packages/GSVA/)
  · source [github.com/rcastelo/GSVA](https://github.com/rcastelo/GSVA)

If you use gsva-rust in research, please cite the GSVA package and the primary
paper for the method you used:

- **GSVA / package** — Hänzelmann S, Castelo R, Guinney J. *GSVA: gene set
  variation analysis for microarray and RNA-Seq data.* BMC Bioinformatics.
  2013;14:7. doi:[10.1186/1471-2105-14-7](https://doi.org/10.1186/1471-2105-14-7)
- **ssGSEA** — Barbie DA, *et al.* *Systematic RNA interference reveals that
  oncogenic KRAS-driven cancers require TBK1.* Nature. 2009;462:108–112.
  doi:[10.1038/nature08460](https://doi.org/10.1038/nature08460)
- **z-score** — Lee E, Chuang HY, Kim JW, Ideker T, Lee D. *Inferring pathway
  activity toward precise disease classification.* PLoS Comput Biol. 2008;4:e1000217.
  doi:[10.1371/journal.pcbi.1000217](https://doi.org/10.1371/journal.pcbi.1000217)
- **PLAGE** — Tomfohr J, Lu J, Kepler TB. *Pathway level analysis of gene
  expression using singular value decomposition.* BMC Bioinformatics. 2005;6:225.
  doi:[10.1186/1471-2105-6-225](https://doi.org/10.1186/1471-2105-6-225)