gsva-rust
gsva-rust is a pure-Rust port of the Bioconductor GSVA
family of gene-set enrichment methods — GSVA, ssGSEA, the combined z-score, and
PLAGE. Checked against R 4.6.0 / GSVA 2.6.2: gsva (Gaussian kcdf) and ssgsea
reproduce the R package bit-for-bit on real data, while zscore and plage
match to ~1 ULP. The default build runs every method 5–50× faster than
Bioconductor GSVA, with no BLAS, LAPACK, Python, or R at runtime.
Disclaimer. gsva-rust is an independent, unofficial Rust port and is not affiliated with, endorsed by, or supported by the authors of GSVA or Bioconductor. For the canonical, authoritative implementation, use the Bioconductor GSVA package.
Pre-1.0: the public API may still shift before the first crates.io release. The
default build's only dependencies are two pure-Rust crates —
rayon for parallelism and
faer for plage's SVD; --no-default-features
drops both for a zero-dependency, pure-serial crate.
Highlights
- Parity with the R package, not with hand-picked values. Every reference is
dumped from Bioconductor
GSVAat full%.17gprecision.gsva(Gaussian kcdf) andssgseareproduce it bit-for-bit on real data;zscoreandplagematch to ~1 ULP. - Validated at two levels. Seeded synthetic fixtures run offline under
cargo test, plus a real-data gate on a public glioblastoma cohort scored against brain-tissue gene sets. - Four methods, one call shape.
gsva,ssgsea,zscore, andplageeach take(&ExprMatrix, &GeneSets, &Params)and return anEnrichmentResult. - Faster than R, parity intact.
rayonparallelizes each method's independent axis bit-identically;plage's SVD uses pure-Rust SIMDfaer. 5.3–49.9× over Bioconductor GSVA across the five method/normalization variants. - Dependency-light. Two pure-Rust deps by default, zero with
--no-default-features. MSRV 1.85, set by faer 0.24's dependency tree; CI builds all three feature configs on it.
Installation
[]
= "0.1"
The library is imported as gsva:
use ExprMatrix;
Quick start
use ;
// Expression as genes × samples, stored row-major.
let expr = new;
let sets = new;
// The free function and the module share the name `gsva`, so call it qualified.
let res = gsva;
assert_eq!;
let proliferation_s2 = res.score; // (gene-set index, sample index)
println!;
Swap gsva::gsva/GsvaParams for ssgsea/SsgseaParams,
zscore/ZscoreParams, or plage/PlageParams to run the other methods — they
share the (&ExprMatrix, &GeneSets, &Params) -> EnrichmentResult shape. Real
inputs are usually loaded with gsva::io::read_tsv_matrix (a genes × samples
TSV) and GeneSets::from_gmt (an MSigDB-style .gmt).
Methods
| Method | R entry point | Statistic |
|---|---|---|
| GSVA | gsva(gsvaParam(...)) |
Kolmogorov–Smirnov-style max-deviation random walk over a per-gene kernel CDF |
| ssGSEA | gsva(ssgseaParam(...)) |
rank-weighted running-sum enrichment score |
| z-score | gsva(zscoreParam(...)) |
mean of per-gene standardized scores |
| PLAGE | gsva(plageParam(...)) |
first right-singular vector of the standardized set submatrix |
Validation
Every method is validated against the real Bioconductor GSVA package (R 4.6.0,
GSVA 2.6.2) at two levels.
Real-data gate — a public glioblastoma cohort (Verhaak et al.; 11 861 genes × 173 samples) scored against four brain-tissue gene sets, i.e. 692 scores per method:
| Method | worst abs err | worst rel err | notes |
|---|---|---|---|
gsva (Gaussian kcdf) |
0.0 |
0.0 |
bit-identical |
ssgsea (normalize = true) |
0.0 |
0.0 |
bit-identical |
ssgsea (normalize = false) |
0.0 |
0.0 |
bit-identical |
zscore |
3.6e-15 |
3.5e-14 |
~1 ULP (80-bit R reductions) |
plage |
1.6e-15 |
2.3e-12 |
faer SVD; magnitudes ~1 ULP, sign-aligned (Jacobi build: 8.8e-15) |
Synthetic fixtures (committed; run offline by cargo test) match R to
< 1e-9 for every method, covering ssgsea with normalize on and off and
gsva with both the Gaussian and Poisson kernels.
References are dumped from R at full %.17g precision by
benchmarks/gen_parity_synthetic.R (synthetic, committed) and
benchmarks/parity_realdata.R (real data, gitignored); the real-data example is
reproducible with cargo run --release --example parity_realdata, and aggregate
metrics live in benchmarks/parity_realdata_results.json.
Fidelity notes
- R accumulates
mean/sumin 80-bitlong double; Rust usesf64. Where a method reduces over hundreds of genes, parity may be to ~1 ULP rather than bit-identical — the same pattern documented for the MCP-counter port. - An empty gene-set/expression intersection is a hard error (R
stop()s); the port fails loudly rather than emitting a silent empty result.
R → Rust name map
| R (Bioconductor GSVA) | gsva-rust |
|---|---|
gsva(gsvaParam(expr, sets, kcdf, tau, maxDiff, absRanking)) |
gsva(&expr, &sets, &GsvaParams { .. }) |
gsva(ssgseaParam(expr, sets, alpha, normalize)) |
ssgsea(&expr, &sets, &SsgseaParams { .. }) |
gsva(zscoreParam(expr, sets)) |
zscore(&expr, &sets, &ZscoreParams::default()) |
gsva(plageParam(expr, sets)) |
plage(&expr, &sets, &PlageParams::default()) |
kcdf = "Gaussian" / "Poisson" |
Kcdf::Gaussian / Kcdf::Poisson |
maxDiff, absRanking, tau |
max_diff, abs_ranking, tau |
alpha, ssgsea.norm |
alpha, normalize |
minSize, maxSize |
min_size, max_size |
expression matrix / ExpressionSet |
ExprMatrix |
GeneSetCollection / .gmt |
GeneSets (GeneSets::from_gmt) |
| returned score matrix | EnrichmentResult { gene_sets, samples, scores } |
Performance
The default-on parallel feature parallelizes each method over its independent
axis — kcdf over genes, gsva/ssgsea over samples, zscore/plage over gene
sets — so each unit of work does byte-identical arithmetic and scores are written
back in index order. gsva, ssgsea, and zscore are therefore bit-identical
between the serial and parallel builds, and against R. The exception is plage:
the default build factorizes its SVD with faer (pure-Rust SIMD, no BLAS/LAPACK)
while --no-default-features uses an in-crate one-sided Jacobi SVD, so the two
builds agree to ~1e-12 rather than bit-for-bit; both match R.
Single-point timing on a seeded synthetic matrix (10 000 genes × 200 samples ×
100 gene sets; 16 logical cores; R 4.6.0 / GSVA 2.6.2, rustc 1.95.0; median of 5,
warm-up excluded). "serial" is --no-default-features (Jacobi SVD for plage);
"parallel" is the default build (rayon + faer SVD):
| method | R (s) | serial (s) | parallel (s) | parallel vs R |
|---|---|---|---|---|
gsva (Gaussian) |
2.933 | 2.928 | 0.281 | 10.4× |
ssgsea (normalize = true) |
1.216 | 0.194 | 0.024 | 49.9× |
ssgsea (normalize = false) |
1.204 | 0.188 | 0.025 | 48.8× |
zscore |
0.070 | 0.016 | 0.013 | 5.3× |
plage (faer) |
1.106 | 39.23 | 0.062 | 17.8× |
The default build beats Bioconductor GSVA on all five variants (5.3–49.9×).
plage is the widest swing: faer's SVD turns R's LAPACK dgesdd step into a
17.8× win. The zero-dependency Jacobi path is instead ~35× slower than R in serial
(rayon recovers it to ~7×), trading speed for bit-faithfulness — Jacobi converges
on the rank-deficient submatrices real gene sets routinely produce, where cheaper
power iteration would stall. Because gene sets stay bounded (tens–hundreds of
genes) as the genome grows, plage is nearly flat in gene count; see the scaling
sweep.
Full methodology, the gene/sample scaling sweep, and peak-memory curves are in
benchmarks/REPORT.md. Reproduce with:
Building from source
Gate (matches CI):
How this port was produced
gsva-rust was written with AI assistance (Anthropic's Claude, via
Claude Code); the Co-Authored-By
trailers in the git history record this. Every method is checked for numeric
parity against the real Bioconductor GSVA package — on both synthetic fixtures
and real public RNA-seq — before it is considered done. The upstream GSVA C and R
sources (github.com/rcastelo/GSVA) were studied
locally for the port and are not vendored into this repository.
License
Licensed under the Artistic License 2.0, the same license as the upstream
GSVA package. See LICENSE.
Credits & citation
The GSVA methods and their reference implementation are the work of Sonja Hänzelmann, Robert Castelo, Justin Guinney, and colleagues, maintained as the Bioconductor GSVA package. This port reimplements their algorithms against that package's behavior; it does not originate them.
- Original software: Bioconductor GSVA · source github.com/rcastelo/GSVA
If you use gsva-rust in research, please cite the GSVA package and the primary paper for the method you used:
- GSVA / package — Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013;14:7. doi:10.1186/1471-2105-14-7
- ssGSEA — Barbie DA, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462:108–112. doi:10.1038/nature08460
- z-score — Lee E, Chuang HY, Kim JW, Ideker T, Lee D. Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008;4:e1000217. doi:10.1371/journal.pcbi.1000217
- PLAGE — Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005;6:225. doi:10.1186/1471-2105-6-225