xcell-rust
A pure-Rust port of xCell 1.1.0 (Aran, Hu & Butte, Genome Biology 2017),
the cell-type enrichment method that scores 64 immune and stromal cell types from
bulk expression. It runs the full xCell pipeline —
rawEnrichmentAnalysis → transformScores → spillOver → microenvironmentScores —
on top of the gsva ssGSEA engine, and
is validated for numeric parity against the original R package (R 4.6.0,
xCell 1.1.0, GSVA 2.6.2).
Disclaimer. xcell-rust is an independent, unofficial Rust port and is not affiliated with, endorsed by, or supported by the authors of xCell. For the canonical, authoritative implementation, use the original xCell package (Aran et al.).
Status: the full pipeline is landed and validated. Every stage is gated on real-data parity against R xCell 1.1.0 (see Validation). Pre-1.0: the public API may still shift before the first crates.io release.
The default build depends on two pure-Rust crates: gsva-rust (the ssGSEA
engine, taken as its zero-dependency core) and
rayon, for parity-preserving parallelism.
Building --no-default-features drops rayon for a pure-serial crate whose only
dependency is gsva-rust's core. Either way, no BLAS/LAPACK, quadprog/pracma,
Python, or R is required at runtime — even the spillover non-negative least
squares is hand-rolled.
Highlights
- The whole xCell 1.1.0 method, not a slice: raw ssGSEA enrichment over the 489 bundled signatures, per-cell-type calibration, spillover compensation, and the immune / stroma / microenvironment aggregate scores.
- Validated against R, not against hand-picked expected values: per-stage
references are dumped from the real xCell package at full
%.17gprecision. - Two-level validation: seeded synthetic fixtures (
cargo test, no R needed) plus a real-data parity gate against a public glioblastoma cohort. - Hand-rolled NNLS: the
spillOverstep solves each sample's non-negative least-squares problem with a Lawson–Hanson active-set solver, matching R'spracma::lsqlinconto ~1e-11 with no linear-algebra dependency. - Parallel by default, bit-identical: the
parallelfeature (rayon) solves the per-sample spillover NNLS across cores and forwards togsva-rust/parallelso ssGSEA parallelizes too; both axes only reorder independent per-sample work, so results match the serial build (and R) bit-for-bit. - MSRV 1.85, inherited from the
gsva-rustdependency's declared minimum.xCell.datais embedded, so analysis needs no network or data files.
Installation
[]
= "0.1"
The library is imported as xcell:
use ;
Quick start
use ;
// Expression as genes (rows, labelled by HGNC symbol) × samples. Real inputs are
// usually a TSV loaded with the bundled reader:
let expr: ExprMatrix =
read_tsv_matrix;
// The embedded xCell 1.1.0 model: signatures, gene universe, spillover matrix,
// and calibration table. Cheap to load.
let model = load;
// Defaults match R xCellAnalysis: rnaseq = true, scale = true, alpha = 0.5.
let scores = xcell_analysis;
// 64 cell types + ImmuneScore / StromaScore / MicroenvironmentScore, × samples.
let macrophages_s0 = scores.score;
println!;
xCell requires at least 5000 of its 10 808 universe genes to be present; with fewer the R package aborts, and so does this port (loudly, rather than returning a degenerate result).
The pipeline
| Stage | R function | What it does |
|---|---|---|
| 1 | rawEnrichmentAnalysis |
keep genes shared with xCell's universe, per-sample average-tie rank, unnormalized ssGSEA over 489 signatures, subtract each signature's minimum, average signatures per cell type |
| 2 | transformScores |
per-cell-type power-curve calibration t = ((raw − min)/5000)^V2 / (V3·2) |
| 3 | spillOver |
per-sample non-negative least squares against the compensation matrix K (αK with unit diagonal) to remove cross-cell-type spillover |
| 4 | microenvironmentScores |
append ImmuneScore, StromaScore, MicroenvironmentScore |
Validation
Every stage is validated against the real R xCell 1.1.0 package (R 4.6.0, GSVA 2.6.2) — never against hand-written expected values.
Real-data gate — a public glioblastoma cohort (Verhaak et al., via
GSVAdata; 11 861 genes × 173 samples). Each Rust stage is fed R's output from
the previous stage, so its error is measured in isolation; worst_rel ignores
structural zeros (entries < 1e-10, e.g. cell types absent from a sample that
spillover NNLS drives to exactly 0). Full metrics in
benchmarks/parity_realdata_results.json:
| Stage | worst abs err | worst rel err | notes |
|---|---|---|---|
rawEnrichmentAnalysis |
1.8e-12 |
4.2e-16 |
ssGSEA + per-cell-type mean; ~1 ULP |
transformScores |
1.1e-16 |
2.8e-16 |
power curve; ~1 ULP |
spillOver (NNLS) |
2.4e-15 |
9.6e-12 |
vs pracma::lsqlincon; unique optimum |
microenvironmentScores |
4.4e-16 |
3.3e-16 |
~1 ULP |
xCellAnalysis (end-to-end) |
2.9e-15 |
1.5e-11 |
whole pipeline |
Synthetic fixtures (committed; run offline by cargo test) pin stages 2–4 on
a seeded 64-cell-type input to < 1e-9 (abs) / < 1e-6 (the NNLS relative bound)
against R. Stage 1 needs ≥ 5000 genes, too large to commit, so it is covered by
the real-data gate above and rests on gsva-rust's own bit-identical ssGSEA
fixtures.
References are dumped from R by benchmarks/gen_parity_synthetic.R (synthetic,
committed) and benchmarks/parity_realdata.R (real data, gitignored). Reproduce
the real-data gate with:
Performance
Timing against the original R xCell package on the same real GBM cohort the
parity gate uses (Verhaak et al., 11 861 genes × 173 samples), across three
engines: R (single-threaded xCell 1.1.0), the serial Rust build
(--no-default-features, rayon-free), and the parallel Rust build (the default,
rayon over all 16 cores). Median of 5 runs, one warm-up excluded:
| workload | R (s) | serial (s) | parallel (s) | parallel vs serial | parallel vs R | serial ≡ parallel |
|---|---|---|---|---|---|---|
full (end-to-end xCellAnalysis) |
2.245 | 0.294 | 0.097 | 3.0× | 23.2× faster | bit-identical |
spillover (per-sample NNLS only) |
0.085 | 0.076 | 0.011 | 7.0× | 7.9× faster | bit-identical |
full is the user-facing call; its cost is dominated by stage-1 ssGSEA, which
parallelizes inside gsva-rust. spillover isolates the one kernel xcell-rust
itself parallelizes — the per-sample Lawson–Hanson NNLS — where rayon alone buys
7× over the serial build. Both builds produce bit-identical scores (rayon
only reorders independent per-sample work; there is no matrix decomposition whose
backend could differ), and both match R within the parity bar. Peak resident set
is ~57 MB for either Rust build.
Full methodology, a seeded-synthetic scaling sweep (gene and sample axes), and the
peak-memory curves are in benchmarks/REPORT.md. Reproduce:
Fidelity notes
- Spillover NNLS. R calls
pracma::lsqlincon(K, x, lb = 0), which reduces the constrained least-squares problem to a quadratic program over the Gram matrixKᵀKand solves it withquadprog. BecauseαKhas a unit diagonal and full rank,KᵀKis symmetric positive-definite and the objective is strictly convex: the optimum is unique, so the hand-rolled Lawson–Hanson solver andquadprogreach the same point (to ~1e-11 relative here), not bit-for-bit. - 80-bit reductions. R accumulates
sum/meaninlong double; Rust usesf64. The per-cell-type mean in stage 1 is the only such new reduction, giving ~1 ULP (1.8e-12 abs). - Canonical cell-type order is taken from xCell's own data (exported as a
constant) rather than reproducing R's locale-dependent
sort()collation. - Fail loud. Below the 5000-gene floor, xCell aborts; the port asserts rather than emitting a silent degenerate result.
R → Rust name map
| R (xCell 1.1.0) | xcell-rust |
|---|---|
xCellAnalysis(expr, rnaseq, scale, alpha) |
xcell_analysis(&expr, &model, &XCellParams { rnaseq, scale, alpha }) |
rawEnrichmentAnalysis(expr, signatures, genes) |
raw_enrichment(&expr, &model) |
transformScores(scores, fit.vals, scale) |
transform_scores(&raw, spill, scale) |
spillOver(transformed, K, alpha) |
spill_over(&transformed, spill, alpha) |
microenvironmentScores(adjusted) |
microenvironment_scores(&adjusted) |
pracma::lsqlincon(K, x, lb = 0) |
hand-rolled Lawson–Hanson NNLS |
xCell.data$spill / $spill.array |
model.spill_for(rnaseq) |
expression matrix / ExpressionSet |
ExprMatrix (xcell::io::read_tsv_matrix) |
| returned score matrix | EnrichmentResult { gene_sets, samples, scores } |
Building from source
xcell-rust depends on gsva-rust as a sibling path dependency; check both out
side by side (../gsva-rust). Then:
Gate (matches CI) — both feature configurations:
How this port was produced
xcell-rust was written with AI assistance (Anthropic's Claude, via
Claude Code); the Co-Authored-By
trailers in the git history record this. It follows a port-and-validate
discipline: every stage is checked for numeric parity against the real R xCell
package on both synthetic fixtures and real public RNA-seq before it is
considered done.
License
Licensed under GPL-3.0-or-later, matching the upstream xCell package
(License: GPL-3). The bundled model data is exported from xCell. See
LICENSE. xcell-rust depends on gsva-rust (Artistic-2.0), which is
license-compatible.
Credits & citation
xCell is the work of Dvir Aran, Zicheng Hu, and Atul J. Butte, distributed as the
xCell R package. This port reimplements their method against that package's
behavior; it does not originate it.
- Original software: github.com/dviraran/xCell
If you use xcell-rust in research, please cite the original xCell paper, and the GSVA paper for the ssGSEA engine it builds on:
- xCell — Aran D, Hu Z, Butte AJ. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biology. 2017;18:220. doi:10.1186/s13059-017-1349-1
- ssGSEA engine — Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics. 2013;14:7. doi:10.1186/1471-2105-14-7