gsva-rust 0.1.0

Pure-Rust port of the GSVA family of gene-set enrichment methods (GSVA, ssGSEA, z-score, PLAGE), validated for numeric parity against the Bioconductor GSVA package.
Documentation
  • Coverage
  • 100%
    70 out of 70 items documented1 out of 38 items with examples
  • Size
  • Source code size: 132.02 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 1.35 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 38s Average build duration of successful builds.
  • all releases: 38s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • sinmojito/gsva-rust
    0 0 0
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • sinmojito

gsva-rust

Crates.io Docs.rs CI License: Artistic-2.0

gsva-rust is a pure-Rust port of the Bioconductor GSVA family of gene-set enrichment methods — GSVA, ssGSEA, the combined z-score, and PLAGE. Checked against R 4.6.0 / GSVA 2.6.2: gsva (Gaussian kcdf) and ssgsea reproduce the R package bit-for-bit on real data, while zscore and plage match to ~1 ULP. The default build runs every method 5–50× faster than Bioconductor GSVA, with no BLAS, LAPACK, Python, or R at runtime.

Disclaimer. gsva-rust is an independent, unofficial Rust port and is not affiliated with, endorsed by, or supported by the authors of GSVA or Bioconductor. For the canonical, authoritative implementation, use the Bioconductor GSVA package.

Pre-1.0: the public API may still shift before the first crates.io release. The default build's only dependencies are two pure-Rust crates — rayon for parallelism and faer for plage's SVD; --no-default-features drops both for a zero-dependency, pure-serial crate.

Highlights

  • Parity with the R package, not with hand-picked values. Every reference is dumped from Bioconductor GSVA at full %.17g precision. gsva (Gaussian kcdf) and ssgsea reproduce it bit-for-bit on real data; zscore and plage match to ~1 ULP.
  • Validated at two levels. Seeded synthetic fixtures run offline under cargo test, plus a real-data gate on a public glioblastoma cohort scored against brain-tissue gene sets.
  • Four methods, one call shape. gsva, ssgsea, zscore, and plage each take (&ExprMatrix, &GeneSets, &Params) and return an EnrichmentResult.
  • Faster than R, parity intact. rayon parallelizes each method's independent axis bit-identically; plage's SVD uses pure-Rust SIMD faer. 5.3–49.9× over Bioconductor GSVA across the five method/normalization variants.
  • Dependency-light. Two pure-Rust deps by default, zero with --no-default-features. MSRV 1.85, set by faer 0.24's dependency tree; CI builds all three feature configs on it.

Installation

[dependencies]
gsva-rust = "0.1"

The library is imported as gsva:

use gsva::ExprMatrix;

Quick start

use gsva::{ExprMatrix, GeneSet, GeneSets, GsvaParams};

// Expression as genes × samples, stored row-major.
let expr = ExprMatrix::new(
    vec!["TP53".into(), "EGFR".into(), "MYC".into(), "PTEN".into()],
    vec!["sample1".into(), "sample2".into(), "sample3".into()],
    vec![
        5.1, 6.2, 4.8, // TP53
        2.0, 9.1, 3.3, // EGFR
        7.7, 1.2, 8.0, // MYC
        4.4, 4.6, 4.5, // PTEN
    ],
);
let sets = GeneSets::new(vec![
    GeneSet::new("proliferation", vec!["MYC".into(), "EGFR".into()]),
    GeneSet::new("suppressors", vec!["TP53".into(), "PTEN".into()]),
]);

// The free function and the module share the name `gsva`, so call it qualified.
let res = gsva::gsva(&expr, &sets, &GsvaParams::default());
assert_eq!(res.gene_sets, ["proliferation", "suppressors"]);
let proliferation_s2 = res.score(0, 1); // (gene-set index, sample index)
println!("proliferation in sample2 = {proliferation_s2}");

Swap gsva::gsva/GsvaParams for ssgsea/SsgseaParams, zscore/ZscoreParams, or plage/PlageParams to run the other methods — they share the (&ExprMatrix, &GeneSets, &Params) -> EnrichmentResult shape. Real inputs are usually loaded with gsva::io::read_tsv_matrix (a genes × samples TSV) and GeneSets::from_gmt (an MSigDB-style .gmt).

Methods

Method R entry point Statistic
GSVA gsva(gsvaParam(...)) Kolmogorov–Smirnov-style max-deviation random walk over a per-gene kernel CDF
ssGSEA gsva(ssgseaParam(...)) rank-weighted running-sum enrichment score
z-score gsva(zscoreParam(...)) mean of per-gene standardized scores
PLAGE gsva(plageParam(...)) first right-singular vector of the standardized set submatrix

Validation

Every method is validated against the real Bioconductor GSVA package (R 4.6.0, GSVA 2.6.2) at two levels.

Real-data gate — a public glioblastoma cohort (Verhaak et al.; 11 861 genes × 173 samples) scored against four brain-tissue gene sets, i.e. 692 scores per method:

Method worst abs err worst rel err notes
gsva (Gaussian kcdf) 0.0 0.0 bit-identical
ssgsea (normalize = true) 0.0 0.0 bit-identical
ssgsea (normalize = false) 0.0 0.0 bit-identical
zscore 3.6e-15 3.5e-14 ~1 ULP (80-bit R reductions)
plage 1.6e-15 2.3e-12 faer SVD; magnitudes ~1 ULP, sign-aligned (Jacobi build: 8.8e-15)

Synthetic fixtures (committed; run offline by cargo test) match R to < 1e-9 for every method, covering ssgsea with normalize on and off and gsva with both the Gaussian and Poisson kernels.

References are dumped from R at full %.17g precision by benchmarks/gen_parity_synthetic.R (synthetic, committed) and benchmarks/parity_realdata.R (real data, gitignored); the real-data example is reproducible with cargo run --release --example parity_realdata, and aggregate metrics live in benchmarks/parity_realdata_results.json.

Fidelity notes

  • R accumulates mean/sum in 80-bit long double; Rust uses f64. Where a method reduces over hundreds of genes, parity may be to ~1 ULP rather than bit-identical — the same pattern documented for the MCP-counter port.
  • An empty gene-set/expression intersection is a hard error (R stop()s); the port fails loudly rather than emitting a silent empty result.

R → Rust name map

R (Bioconductor GSVA) gsva-rust
gsva(gsvaParam(expr, sets, kcdf, tau, maxDiff, absRanking)) gsva(&expr, &sets, &GsvaParams { .. })
gsva(ssgseaParam(expr, sets, alpha, normalize)) ssgsea(&expr, &sets, &SsgseaParams { .. })
gsva(zscoreParam(expr, sets)) zscore(&expr, &sets, &ZscoreParams::default())
gsva(plageParam(expr, sets)) plage(&expr, &sets, &PlageParams::default())
kcdf = "Gaussian" / "Poisson" Kcdf::Gaussian / Kcdf::Poisson
maxDiff, absRanking, tau max_diff, abs_ranking, tau
alpha, ssgsea.norm alpha, normalize
minSize, maxSize min_size, max_size
expression matrix / ExpressionSet ExprMatrix
GeneSetCollection / .gmt GeneSets (GeneSets::from_gmt)
returned score matrix EnrichmentResult { gene_sets, samples, scores }

Performance

The default-on parallel feature parallelizes each method over its independent axis — kcdf over genes, gsva/ssgsea over samples, zscore/plage over gene sets — so each unit of work does byte-identical arithmetic and scores are written back in index order. gsva, ssgsea, and zscore are therefore bit-identical between the serial and parallel builds, and against R. The exception is plage: the default build factorizes its SVD with faer (pure-Rust SIMD, no BLAS/LAPACK) while --no-default-features uses an in-crate one-sided Jacobi SVD, so the two builds agree to ~1e-12 rather than bit-for-bit; both match R.

Single-point timing on a seeded synthetic matrix (10 000 genes × 200 samples × 100 gene sets; 16 logical cores; R 4.6.0 / GSVA 2.6.2, rustc 1.95.0; median of 5, warm-up excluded). "serial" is --no-default-features (Jacobi SVD for plage); "parallel" is the default build (rayon + faer SVD):

method R (s) serial (s) parallel (s) parallel vs R
gsva (Gaussian) 2.933 2.928 0.281 10.4×
ssgsea (normalize = true) 1.216 0.194 0.024 49.9×
ssgsea (normalize = false) 1.204 0.188 0.025 48.8×
zscore 0.070 0.016 0.013 5.3×
plage (faer) 1.106 39.23 0.062 17.8×

The default build beats Bioconductor GSVA on all five variants (5.3–49.9×). plage is the widest swing: faer's SVD turns R's LAPACK dgesdd step into a 17.8× win. The zero-dependency Jacobi path is instead ~35× slower than R in serial (rayon recovers it to ~7×), trading speed for bit-faithfulness — Jacobi converges on the rank-deficient submatrices real gene sets routinely produce, where cheaper power iteration would stall. Because gene sets stay bounded (tens–hundreds of genes) as the genome grows, plage is nearly flat in gene count; see the scaling sweep.

Full methodology, the gene/sample scaling sweep, and peak-memory curves are in benchmarks/REPORT.md. Reproduce with:

Rscript benchmarks/gen_bench_synth.R    # seeded synthetic matrix + gene sets
Rscript benchmarks/bench_gsva.R         # single-point timing  -> results_gsva.json
Rscript benchmarks/bench_scaling.R      # size sweep + memory   -> results_scaling_gsva.json

Building from source

cargo build
cargo test

Gate (matches CI):

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test

How this port was produced

gsva-rust was written with AI assistance (Anthropic's Claude, via Claude Code); the Co-Authored-By trailers in the git history record this. Every method is checked for numeric parity against the real Bioconductor GSVA package — on both synthetic fixtures and real public RNA-seq — before it is considered done. The upstream GSVA C and R sources (github.com/rcastelo/GSVA) were studied locally for the port and are not vendored into this repository.

License

Licensed under the Artistic License 2.0, the same license as the upstream GSVA package. See LICENSE.

Credits & citation

The GSVA methods and their reference implementation are the work of Sonja Hänzelmann, Robert Castelo, Justin Guinney, and colleagues, maintained as the Bioconductor GSVA package. This port reimplements their algorithms against that package's behavior; it does not originate them.

If you use gsva-rust in research, please cite the GSVA package and the primary paper for the method you used:

  • GSVA / package — Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013;14:7. doi:10.1186/1471-2105-14-7
  • ssGSEA — Barbie DA, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462:108–112. doi:10.1038/nature08460
  • z-score — Lee E, Chuang HY, Kim JW, Ideker T, Lee D. Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008;4:e1000217. doi:10.1371/journal.pcbi.1000217
  • PLAGE — Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005;6:225. doi:10.1186/1471-2105-6-225