xcell-rust 0.1.0

Pure-Rust port of xCell (Aran et al. 2017) cell-type enrichment — ssGSEA, spillover-corrected — validated for numeric parity against the R xCell package. Built on gsva-rust.
Documentation
  • Coverage
  • 100%
    30 out of 30 items documented1 out of 13 items with examples
  • Size
  • Source code size: 407.81 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 824.35 kB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 6s Average build duration of successful builds.
  • all releases: 6s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • Repository
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • sinmojito

xcell-rust

Crates.io Docs.rs CI License: GPL-3.0-or-later

A pure-Rust port of xCell 1.1.0 (Aran, Hu & Butte, Genome Biology 2017), the cell-type enrichment method that scores 64 immune and stromal cell types from bulk expression. It runs the full xCell pipeline — rawEnrichmentAnalysis → transformScores → spillOver → microenvironmentScores — on top of the gsva ssGSEA engine, and is validated for numeric parity against the original R package (R 4.6.0, xCell 1.1.0, GSVA 2.6.2).

Disclaimer. xcell-rust is an independent, unofficial Rust port and is not affiliated with, endorsed by, or supported by the authors of xCell. For the canonical, authoritative implementation, use the original xCell package (Aran et al.).

Status: the full pipeline is landed and validated. Every stage is gated on real-data parity against R xCell 1.1.0 (see Validation). Pre-1.0: the public API may still shift before the first crates.io release.

The default build depends on two pure-Rust crates: gsva-rust (the ssGSEA engine, taken as its zero-dependency core) and rayon, for parity-preserving parallelism. Building --no-default-features drops rayon for a pure-serial crate whose only dependency is gsva-rust's core. Either way, no BLAS/LAPACK, quadprog/pracma, Python, or R is required at runtime — even the spillover non-negative least squares is hand-rolled.

Highlights

  • The whole xCell 1.1.0 method, not a slice: raw ssGSEA enrichment over the 489 bundled signatures, per-cell-type calibration, spillover compensation, and the immune / stroma / microenvironment aggregate scores.
  • Validated against R, not against hand-picked expected values: per-stage references are dumped from the real xCell package at full %.17g precision.
  • Two-level validation: seeded synthetic fixtures (cargo test, no R needed) plus a real-data parity gate against a public glioblastoma cohort.
  • Hand-rolled NNLS: the spillOver step solves each sample's non-negative least-squares problem with a Lawson–Hanson active-set solver, matching R's pracma::lsqlincon to ~1e-11 with no linear-algebra dependency.
  • Parallel by default, bit-identical: the parallel feature (rayon) solves the per-sample spillover NNLS across cores and forwards to gsva-rust/parallel so ssGSEA parallelizes too; both axes only reorder independent per-sample work, so results match the serial build (and R) bit-for-bit.
  • MSRV 1.85, inherited from the gsva-rust dependency's declared minimum. xCell.data is embedded, so analysis needs no network or data files.

Installation

[dependencies]
xcell-rust = "0.1"

The library is imported as xcell:

use xcell::{xcell_analysis, XCellModel, XCellParams};

Quick start

use xcell::{xcell_analysis, ExprMatrix, XCellModel, XCellParams};

// Expression as genes (rows, labelled by HGNC symbol) × samples. Real inputs are
// usually a TSV loaded with the bundled reader:
let expr: ExprMatrix =
    xcell::io::read_tsv_matrix(&std::fs::read_to_string("expr.tsv").unwrap());

// The embedded xCell 1.1.0 model: signatures, gene universe, spillover matrix,
// and calibration table. Cheap to load.
let model = XCellModel::load();

// Defaults match R xCellAnalysis: rnaseq = true, scale = true, alpha = 0.5.
let scores = xcell_analysis(&expr, &model, &XCellParams::default());

// 64 cell types + ImmuneScore / StromaScore / MicroenvironmentScore, × samples.
let macrophages_s0 = scores.score(
    scores.gene_sets.iter().position(|g| g == "Macrophages").unwrap(),
    0,
);
println!("Macrophages in sample 0 = {macrophages_s0}");

xCell requires at least 5000 of its 10 808 universe genes to be present; with fewer the R package aborts, and so does this port (loudly, rather than returning a degenerate result).

The pipeline

Stage R function What it does
1 rawEnrichmentAnalysis keep genes shared with xCell's universe, per-sample average-tie rank, unnormalized ssGSEA over 489 signatures, subtract each signature's minimum, average signatures per cell type
2 transformScores per-cell-type power-curve calibration t = ((raw − min)/5000)^V2 / (V3·2)
3 spillOver per-sample non-negative least squares against the compensation matrix K (αK with unit diagonal) to remove cross-cell-type spillover
4 microenvironmentScores append ImmuneScore, StromaScore, MicroenvironmentScore

Validation

Every stage is validated against the real R xCell 1.1.0 package (R 4.6.0, GSVA 2.6.2) — never against hand-written expected values.

Real-data gate — a public glioblastoma cohort (Verhaak et al., via GSVAdata; 11 861 genes × 173 samples). Each Rust stage is fed R's output from the previous stage, so its error is measured in isolation; worst_rel ignores structural zeros (entries < 1e-10, e.g. cell types absent from a sample that spillover NNLS drives to exactly 0). Full metrics in benchmarks/parity_realdata_results.json:

Stage worst abs err worst rel err notes
rawEnrichmentAnalysis 1.8e-12 4.2e-16 ssGSEA + per-cell-type mean; ~1 ULP
transformScores 1.1e-16 2.8e-16 power curve; ~1 ULP
spillOver (NNLS) 2.4e-15 9.6e-12 vs pracma::lsqlincon; unique optimum
microenvironmentScores 4.4e-16 3.3e-16 ~1 ULP
xCellAnalysis (end-to-end) 2.9e-15 1.5e-11 whole pipeline

Synthetic fixtures (committed; run offline by cargo test) pin stages 2–4 on a seeded 64-cell-type input to < 1e-9 (abs) / < 1e-6 (the NNLS relative bound) against R. Stage 1 needs ≥ 5000 genes, too large to commit, so it is covered by the real-data gate above and rests on gsva-rust's own bit-identical ssGSEA fixtures.

References are dumped from R by benchmarks/gen_parity_synthetic.R (synthetic, committed) and benchmarks/parity_realdata.R (real data, gitignored). Reproduce the real-data gate with:

Rscript -e "source('benchmarks/parity_realdata.R')"   # dumps R references
cargo run --release --example parity_realdata          # checks parity, writes JSON

Performance

Timing against the original R xCell package on the same real GBM cohort the parity gate uses (Verhaak et al., 11 861 genes × 173 samples), across three engines: R (single-threaded xCell 1.1.0), the serial Rust build (--no-default-features, rayon-free), and the parallel Rust build (the default, rayon over all 16 cores). Median of 5 runs, one warm-up excluded:

workload R (s) serial (s) parallel (s) parallel vs serial parallel vs R serial ≡ parallel
full (end-to-end xCellAnalysis) 2.245 0.294 0.097 3.0× 23.2× faster bit-identical
spillover (per-sample NNLS only) 0.085 0.076 0.011 7.0× 7.9× faster bit-identical

full is the user-facing call; its cost is dominated by stage-1 ssGSEA, which parallelizes inside gsva-rust. spillover isolates the one kernel xcell-rust itself parallelizes — the per-sample Lawson–Hanson NNLS — where rayon alone buys over the serial build. Both builds produce bit-identical scores (rayon only reorders independent per-sample work; there is no matrix decomposition whose backend could differ), and both match R within the parity bar. Peak resident set is ~57 MB for either Rust build.

Full methodology, a seeded-synthetic scaling sweep (gene and sample axes), and the peak-memory curves are in benchmarks/REPORT.md. Reproduce:

Rscript -e "source('benchmarks/bench_xcell.R')"          # real GBM, single point -> results_xcell.json
Rscript -e "source('benchmarks/bench_scaling_xcell.R')"  # synthetic size sweep   -> results_scaling_xcell.json

Fidelity notes

  • Spillover NNLS. R calls pracma::lsqlincon(K, x, lb = 0), which reduces the constrained least-squares problem to a quadratic program over the Gram matrix KᵀK and solves it with quadprog. Because αK has a unit diagonal and full rank, KᵀK is symmetric positive-definite and the objective is strictly convex: the optimum is unique, so the hand-rolled Lawson–Hanson solver and quadprog reach the same point (to ~1e-11 relative here), not bit-for-bit.
  • 80-bit reductions. R accumulates sum/mean in long double; Rust uses f64. The per-cell-type mean in stage 1 is the only such new reduction, giving ~1 ULP (1.8e-12 abs).
  • Canonical cell-type order is taken from xCell's own data (exported as a constant) rather than reproducing R's locale-dependent sort() collation.
  • Fail loud. Below the 5000-gene floor, xCell aborts; the port asserts rather than emitting a silent degenerate result.

R → Rust name map

R (xCell 1.1.0) xcell-rust
xCellAnalysis(expr, rnaseq, scale, alpha) xcell_analysis(&expr, &model, &XCellParams { rnaseq, scale, alpha })
rawEnrichmentAnalysis(expr, signatures, genes) raw_enrichment(&expr, &model)
transformScores(scores, fit.vals, scale) transform_scores(&raw, spill, scale)
spillOver(transformed, K, alpha) spill_over(&transformed, spill, alpha)
microenvironmentScores(adjusted) microenvironment_scores(&adjusted)
pracma::lsqlincon(K, x, lb = 0) hand-rolled Lawson–Hanson NNLS
xCell.data$spill / $spill.array model.spill_for(rnaseq)
expression matrix / ExpressionSet ExprMatrix (xcell::io::read_tsv_matrix)
returned score matrix EnrichmentResult { gene_sets, samples, scores }

Building from source

xcell-rust depends on gsva-rust as a sibling path dependency; check both out side by side (../gsva-rust). Then:

cargo build
cargo test

Gate (matches CI) — both feature configurations:

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings        # default: parallel
cargo clippy --no-default-features --all-targets -- -D warnings # serial
cargo test
cargo test --no-default-features

How this port was produced

xcell-rust was written with AI assistance (Anthropic's Claude, via Claude Code); the Co-Authored-By trailers in the git history record this. It follows a port-and-validate discipline: every stage is checked for numeric parity against the real R xCell package on both synthetic fixtures and real public RNA-seq before it is considered done.

License

Licensed under GPL-3.0-or-later, matching the upstream xCell package (License: GPL-3). The bundled model data is exported from xCell. See LICENSE. xcell-rust depends on gsva-rust (Artistic-2.0), which is license-compatible.

Credits & citation

xCell is the work of Dvir Aran, Zicheng Hu, and Atul J. Butte, distributed as the xCell R package. This port reimplements their method against that package's behavior; it does not originate it.

If you use xcell-rust in research, please cite the original xCell paper, and the GSVA paper for the ssGSEA engine it builds on:

  • xCell — Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biology. 2017;18:220. doi:10.1186/s13059-017-1349-1
  • ssGSEA engine — Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013;14:7. doi:10.1186/1471-2105-14-7