# gsva-rust
[](https://crates.io/crates/gsva-rust)
[](https://docs.rs/gsva-rust)
[](https://github.com/sinmojito/gsva-rust/actions/workflows/ci.yml)
[](LICENSE)
**gsva-rust** is a pure-Rust port of the Bioconductor [**GSVA**](https://bioconductor.org/packages/GSVA/)
family of gene-set enrichment methods — GSVA, ssGSEA, the combined z-score, and
PLAGE. Checked against R 4.6.0 / GSVA 2.6.2: `gsva` (Gaussian kcdf) and `ssgsea`
reproduce the R package **bit-for-bit** on real data, while `zscore` and `plage`
match to **~1 ULP**. The default build runs every method **5–50× faster** than
Bioconductor GSVA, with **no BLAS, LAPACK, Python, or R at runtime**.
> **Disclaimer.** gsva-rust is an independent, unofficial Rust port and is **not**
> affiliated with, endorsed by, or supported by the authors of GSVA or
> Bioconductor. For the canonical, authoritative implementation, use the
> [Bioconductor GSVA package](https://bioconductor.org/packages/GSVA/).
Pre-1.0: the public API may still shift before the first crates.io release. The
default build's only dependencies are two pure-Rust crates —
[`rayon`](https://crates.io/crates/rayon) for parallelism and
[`faer`](https://crates.io/crates/faer) for `plage`'s SVD; `--no-default-features`
drops both for a zero-dependency, pure-serial crate.
## Highlights
- **Parity with the R package, not with hand-picked values.** Every reference is
dumped from Bioconductor `GSVA` at full `%.17g` precision. `gsva` (Gaussian
kcdf) and `ssgsea` reproduce it bit-for-bit on real data; `zscore` and `plage`
match to ~1 ULP.
- **Validated at two levels.** Seeded synthetic fixtures run offline under
`cargo test`, plus a real-data gate on a public glioblastoma cohort scored
against brain-tissue gene sets.
- **Four methods, one call shape.** `gsva`, `ssgsea`, `zscore`, and `plage` each
take `(&ExprMatrix, &GeneSets, &Params)` and return an `EnrichmentResult`.
- **Faster than R, parity intact.** `rayon` parallelizes each method's
independent axis bit-identically; `plage`'s SVD uses pure-Rust SIMD
[`faer`](https://crates.io/crates/faer). 5.3–49.9× over Bioconductor GSVA across
the five method/normalization variants.
- **Dependency-light.** Two pure-Rust deps by default, zero with
`--no-default-features`. MSRV **1.85**, set by faer 0.24's dependency tree; CI
builds all three feature configs on it.
## Installation
```toml
[dependencies]
gsva-rust = "0.1"
```
The library is imported as `gsva`:
```rust
use gsva::ExprMatrix;
```
## Quick start
```rust
use gsva::{ExprMatrix, GeneSet, GeneSets, GsvaParams};
// Expression as genes × samples, stored row-major.
let expr = ExprMatrix::new(
vec!["TP53".into(), "EGFR".into(), "MYC".into(), "PTEN".into()],
vec!["sample1".into(), "sample2".into(), "sample3".into()],
vec![
5.1, 6.2, 4.8, // TP53
2.0, 9.1, 3.3, // EGFR
7.7, 1.2, 8.0, // MYC
4.4, 4.6, 4.5, // PTEN
],
);
let sets = GeneSets::new(vec![
GeneSet::new("proliferation", vec!["MYC".into(), "EGFR".into()]),
GeneSet::new("suppressors", vec!["TP53".into(), "PTEN".into()]),
]);
// The free function and the module share the name `gsva`, so call it qualified.
let res = gsva::gsva(&expr, &sets, &GsvaParams::default());
assert_eq!(res.gene_sets, ["proliferation", "suppressors"]);
let proliferation_s2 = res.score(0, 1); // (gene-set index, sample index)
println!("proliferation in sample2 = {proliferation_s2}");
```
Swap `gsva::gsva`/`GsvaParams` for `ssgsea`/`SsgseaParams`,
`zscore`/`ZscoreParams`, or `plage`/`PlageParams` to run the other methods — they
share the `(&ExprMatrix, &GeneSets, &Params) -> EnrichmentResult` shape. Real
inputs are usually loaded with `gsva::io::read_tsv_matrix` (a genes × samples
TSV) and `GeneSets::from_gmt` (an MSigDB-style `.gmt`).
## Methods
| GSVA | `gsva(gsvaParam(...))` | Kolmogorov–Smirnov-style max-deviation random walk over a per-gene kernel CDF |
| ssGSEA | `gsva(ssgseaParam(...))` | rank-weighted running-sum enrichment score |
| z-score | `gsva(zscoreParam(...))` | mean of per-gene standardized scores |
| PLAGE | `gsva(plageParam(...))` | first right-singular vector of the standardized set submatrix |
## Validation
Every method is validated against the real Bioconductor `GSVA` package (R 4.6.0,
GSVA 2.6.2) at two levels.
**Real-data gate** — a public glioblastoma cohort (Verhaak *et al.*; 11 861
genes × 173 samples) scored against four brain-tissue gene sets, i.e. 692 scores
per method:
| `gsva` (Gaussian kcdf) | `0.0` | `0.0` | bit-identical |
| `ssgsea` (`normalize = true`) | `0.0` | `0.0` | bit-identical |
| `ssgsea` (`normalize = false`) | `0.0` | `0.0` | bit-identical |
| `zscore` | `3.6e-15` | `3.5e-14` | ~1 ULP (80-bit R reductions) |
| `plage` | `1.6e-15` | `2.3e-12` | faer SVD; magnitudes ~1 ULP, sign-aligned (Jacobi build: 8.8e-15) |
**Synthetic fixtures** (committed; run offline by `cargo test`) match R to
< `1e-9` for every method, covering `ssgsea` with `normalize` on and off and
`gsva` with both the Gaussian and Poisson kernels.
References are dumped from R at full `%.17g` precision by
`benchmarks/gen_parity_synthetic.R` (synthetic, committed) and
`benchmarks/parity_realdata.R` (real data, gitignored); the real-data example is
reproducible with `cargo run --release --example parity_realdata`, and aggregate
metrics live in [`benchmarks/parity_realdata_results.json`](https://github.com/sinmojito/gsva-rust/blob/main/benchmarks/parity_realdata_results.json).
## Fidelity notes
- R accumulates `mean`/`sum` in 80-bit `long double`; Rust uses `f64`. Where a
method reduces over hundreds of genes, parity may be to ~1 ULP rather than
bit-identical — the same pattern documented for the MCP-counter port.
- An empty gene-set/expression intersection is a hard error (R `stop()`s); the
port fails loudly rather than emitting a silent empty result.
## R → Rust name map
| `gsva(gsvaParam(expr, sets, kcdf, tau, maxDiff, absRanking))` | `gsva(&expr, &sets, &GsvaParams { .. })` |
| `gsva(ssgseaParam(expr, sets, alpha, normalize))` | `ssgsea(&expr, &sets, &SsgseaParams { .. })` |
| `gsva(zscoreParam(expr, sets))` | `zscore(&expr, &sets, &ZscoreParams::default())` |
| `gsva(plageParam(expr, sets))` | `plage(&expr, &sets, &PlageParams::default())` |
| `kcdf = "Gaussian"` / `"Poisson"` | `Kcdf::Gaussian` / `Kcdf::Poisson` |
| `maxDiff`, `absRanking`, `tau` | `max_diff`, `abs_ranking`, `tau` |
| `alpha`, `ssgsea.norm` | `alpha`, `normalize` |
| `minSize`, `maxSize` | `min_size`, `max_size` |
| expression matrix / `ExpressionSet` | `ExprMatrix` |
| `GeneSetCollection` / `.gmt` | `GeneSets` (`GeneSets::from_gmt`) |
| returned score matrix | `EnrichmentResult { gene_sets, samples, scores }` |
## Performance
The default-on `parallel` feature parallelizes each method over its independent
axis — kcdf over genes, `gsva`/`ssgsea` over samples, `zscore`/`plage` over gene
sets — so each unit of work does byte-identical arithmetic and scores are written
back in index order. `gsva`, `ssgsea`, and `zscore` are therefore bit-identical
between the serial and parallel builds, and against R. The exception is `plage`:
the default build factorizes its SVD with `faer` (pure-Rust SIMD, no BLAS/LAPACK)
while `--no-default-features` uses an in-crate one-sided Jacobi SVD, so the two
builds agree to ~1e-12 rather than bit-for-bit; both match R.
Single-point timing on a seeded synthetic matrix (10 000 genes × 200 samples ×
100 gene sets; 16 logical cores; R 4.6.0 / GSVA 2.6.2, rustc 1.95.0; median of 5,
warm-up excluded). "serial" is `--no-default-features` (Jacobi SVD for `plage`);
"parallel" is the default build (rayon + faer SVD):
| `gsva` (Gaussian) | 2.933 | 2.928 | 0.281 | 10.4× |
| `ssgsea` (`normalize = true`) | 1.216 | 0.194 | 0.024 | 49.9× |
| `ssgsea` (`normalize = false`) | 1.204 | 0.188 | 0.025 | 48.8× |
| `zscore` | 0.070 | 0.016 | 0.013 | 5.3× |
| `plage` (faer) | 1.106 | 39.23 | 0.062 | 17.8× |
The default build beats Bioconductor GSVA on all five variants (5.3–49.9×).
`plage` is the widest swing: faer's SVD turns R's LAPACK `dgesdd` step into a
17.8× win. The zero-dependency Jacobi path is instead ~35× slower than R in serial
(rayon recovers it to ~7×), trading speed for bit-faithfulness — Jacobi converges
on the rank-deficient submatrices real gene sets routinely produce, where cheaper
power iteration would stall. Because gene sets stay bounded (tens–hundreds of
genes) as the genome grows, `plage` is nearly flat in gene count; see the scaling
sweep.
Full methodology, the gene/sample scaling sweep, and peak-memory curves are in
[`benchmarks/REPORT.md`](https://github.com/sinmojito/gsva-rust/blob/main/benchmarks/REPORT.md). Reproduce with:
```sh
Rscript benchmarks/gen_bench_synth.R # seeded synthetic matrix + gene sets
Rscript benchmarks/bench_gsva.R # single-point timing -> results_gsva.json
Rscript benchmarks/bench_scaling.R # size sweep + memory -> results_scaling_gsva.json
```
## Building from source
```sh
cargo build
cargo test
```
Gate (matches CI):
```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test
```
## How this port was produced
gsva-rust was written with AI assistance (Anthropic's Claude, via
[Claude Code](https://www.anthropic.com/claude-code)); the `Co-Authored-By`
trailers in the git history record this. Every method is checked for numeric
parity against the real Bioconductor `GSVA` package — on both synthetic fixtures
and real public RNA-seq — before it is considered done. The upstream GSVA C and R
sources ([github.com/rcastelo/GSVA](https://github.com/rcastelo/GSVA)) were studied
locally for the port and are not vendored into this repository.
## License
Licensed under the **Artistic License 2.0**, the same license as the upstream
GSVA package. See [`LICENSE`](LICENSE).
## Credits & citation
The GSVA methods and their reference implementation are the work of Sonja
Hänzelmann, Robert Castelo, Justin Guinney, and colleagues, maintained as the
Bioconductor **GSVA** package. This port reimplements their algorithms against
that package's behavior; it does not originate them.
- Original software: [Bioconductor GSVA](https://bioconductor.org/packages/GSVA/)
· source [github.com/rcastelo/GSVA](https://github.com/rcastelo/GSVA)
If you use gsva-rust in research, please cite the GSVA package and the primary
paper for the method you used:
- **GSVA / package** — Hänzelmann S, Castelo R, Guinney J. *GSVA: gene set
variation analysis for microarray and RNA-Seq data.* BMC Bioinformatics.
2013;14:7. doi:[10.1186/1471-2105-14-7](https://doi.org/10.1186/1471-2105-14-7)
- **ssGSEA** — Barbie DA, *et al.* *Systematic RNA interference reveals that
oncogenic KRAS-driven cancers require TBK1.* Nature. 2009;462:108–112.
doi:[10.1038/nature08460](https://doi.org/10.1038/nature08460)
- **z-score** — Lee E, Chuang HY, Kim JW, Ideker T, Lee D. *Inferring pathway
activity toward precise disease classification.* PLoS Comput Biol. 2008;4:e1000217.
doi:[10.1371/journal.pcbi.1000217](https://doi.org/10.1371/journal.pcbi.1000217)
- **PLAGE** — Tomfohr J, Lu J, Kepler TB. *Pathway level analysis of gene
expression using singular value decomposition.* BMC Bioinformatics. 2005;6:225.
doi:[10.1186/1471-2105-6-225](https://doi.org/10.1186/1471-2105-6-225)