# xcell-rust
[](https://crates.io/crates/xcell-rust)
[](https://docs.rs/xcell-rust)
[](https://github.com/sinmojito/xcell-rust/actions/workflows/ci.yml)
[](LICENSE)
A pure-Rust port of **xCell 1.1.0** (Aran, Hu & Butte, *Genome Biology* 2017),
the cell-type enrichment method that scores 64 immune and stromal cell types from
bulk expression. It runs the full xCell pipeline —
`rawEnrichmentAnalysis → transformScores → spillOver → microenvironmentScores` —
on top of the [`gsva`](https://github.com/sinmojito/gsva-rust) ssGSEA engine, and
is validated for numeric parity against the original R package (R 4.6.0,
xCell 1.1.0, GSVA 2.6.2).
> **Disclaimer.** xcell-rust is an independent, unofficial Rust port and is **not**
> affiliated with, endorsed by, or supported by the authors of xCell. For the
> canonical, authoritative implementation, use the original
> [xCell package](https://github.com/dviraran/xCell) (Aran *et al.*).
> **Status: the full pipeline is landed and validated.** Every stage is gated on
> real-data parity against R xCell 1.1.0 (see [Validation](#validation)). Pre-1.0:
> the public API may still shift before the first crates.io release.
The default build depends on two pure-Rust crates: `gsva-rust` (the ssGSEA
engine, taken as its zero-dependency core) and
[`rayon`](https://crates.io/crates/rayon), for parity-preserving parallelism.
Building `--no-default-features` drops rayon for a pure-serial crate whose only
dependency is gsva-rust's core. Either way, no BLAS/LAPACK, `quadprog`/`pracma`,
Python, or R is required at runtime — even the spillover non-negative least
squares is hand-rolled.
## Highlights
- **The whole xCell 1.1.0 method**, not a slice: raw ssGSEA enrichment over the
489 bundled signatures, per-cell-type calibration, spillover compensation, and
the immune / stroma / microenvironment aggregate scores.
- **Validated against R**, not against hand-picked expected values: per-stage
references are dumped from the real xCell package at full `%.17g` precision.
- **Two-level validation:** seeded synthetic fixtures (`cargo test`, no R needed)
plus a real-data parity gate against a public glioblastoma cohort.
- **Hand-rolled NNLS:** the `spillOver` step solves each sample's non-negative
least-squares problem with a Lawson–Hanson active-set solver, matching R's
`pracma::lsqlincon` to ~1e-11 with no linear-algebra dependency.
- **Parallel by default, bit-identical:** the `parallel` feature (rayon) solves
the per-sample spillover NNLS across cores and forwards to `gsva-rust/parallel`
so ssGSEA parallelizes too; both axes only reorder independent per-sample work,
so results match the serial build (and R) bit-for-bit.
- MSRV 1.85, inherited from the `gsva-rust` dependency's declared minimum.
`xCell.data` is embedded, so analysis needs no network or data files.
## Installation
```toml
[dependencies]
xcell-rust = "0.1"
```
The library is imported as `xcell`:
```rust
use xcell::{xcell_analysis, XCellModel, XCellParams};
```
## Quick start
```rust
use xcell::{xcell_analysis, ExprMatrix, XCellModel, XCellParams};
// Expression as genes (rows, labelled by HGNC symbol) × samples. Real inputs are
// usually a TSV loaded with the bundled reader:
let expr: ExprMatrix =
xcell::io::read_tsv_matrix(&std::fs::read_to_string("expr.tsv").unwrap());
// The embedded xCell 1.1.0 model: signatures, gene universe, spillover matrix,
// and calibration table. Cheap to load.
let model = XCellModel::load();
// Defaults match R xCellAnalysis: rnaseq = true, scale = true, alpha = 0.5.
let scores = xcell_analysis(&expr, &model, &XCellParams::default());
// 64 cell types + ImmuneScore / StromaScore / MicroenvironmentScore, × samples.
let macrophages_s0 = scores.score(
scores.gene_sets.iter().position(|g| g == "Macrophages").unwrap(),
0,
);
println!("Macrophages in sample 0 = {macrophages_s0}");
```
xCell requires at least 5000 of its 10 808 universe genes to be present; with
fewer the R package aborts, and so does this port (loudly, rather than returning a
degenerate result).
## The pipeline
| 1 | `rawEnrichmentAnalysis` | keep genes shared with xCell's universe, per-sample average-tie rank, unnormalized ssGSEA over 489 signatures, subtract each signature's minimum, average signatures per cell type |
| 2 | `transformScores` | per-cell-type power-curve calibration `t = ((raw − min)/5000)^V2 / (V3·2)` |
| 3 | `spillOver` | per-sample non-negative least squares against the compensation matrix `K` (`αK` with unit diagonal) to remove cross-cell-type spillover |
| 4 | `microenvironmentScores` | append `ImmuneScore`, `StromaScore`, `MicroenvironmentScore` |
## Validation
Every stage is validated against the **real R xCell 1.1.0 package** (R 4.6.0,
GSVA 2.6.2) — never against hand-written expected values.
**Real-data gate** — a public glioblastoma cohort (Verhaak *et al.*, via
`GSVAdata`; 11 861 genes × 173 samples). Each Rust stage is fed R's output from
the previous stage, so its error is measured in isolation; `worst_rel` ignores
structural zeros (entries < 1e-10, e.g. cell types absent from a sample that
spillover NNLS drives to exactly 0). Full metrics in
[`benchmarks/parity_realdata_results.json`](https://github.com/sinmojito/xcell-rust/blob/main/benchmarks/parity_realdata_results.json):
| `rawEnrichmentAnalysis` | `1.8e-12` | `4.2e-16` | ssGSEA + per-cell-type mean; ~1 ULP |
| `transformScores` | `1.1e-16` | `2.8e-16` | power curve; ~1 ULP |
| `spillOver` (NNLS) | `2.4e-15` | `9.6e-12` | vs `pracma::lsqlincon`; unique optimum |
| `microenvironmentScores` | `4.4e-16` | `3.3e-16` | ~1 ULP |
| `xCellAnalysis` (end-to-end) | `2.9e-15` | `1.5e-11` | whole pipeline |
**Synthetic fixtures** (committed; run offline by `cargo test`) pin stages 2–4 on
a seeded 64-cell-type input to < 1e-9 (abs) / < 1e-6 (the NNLS relative bound)
against R. Stage 1 needs ≥ 5000 genes, too large to commit, so it is covered by
the real-data gate above and rests on `gsva-rust`'s own bit-identical ssGSEA
fixtures.
References are dumped from R by `benchmarks/gen_parity_synthetic.R` (synthetic,
committed) and `benchmarks/parity_realdata.R` (real data, gitignored). Reproduce
the real-data gate with:
```sh
Rscript -e "source('benchmarks/parity_realdata.R')" # dumps R references
cargo run --release --example parity_realdata # checks parity, writes JSON
```
## Performance
Timing against the original R xCell package on the **same real GBM cohort** the
parity gate uses (Verhaak *et al.*, 11 861 genes × 173 samples), across three
engines: R (single-threaded xCell 1.1.0), the serial Rust build
(`--no-default-features`, rayon-free), and the parallel Rust build (the default,
rayon over all 16 cores). Median of 5 runs, one warm-up excluded:
| `full` (end-to-end `xCellAnalysis`) | 2.245 | 0.294 | 0.097 | 3.0× | **23.2× faster** | bit-identical |
| `spillover` (per-sample NNLS only) | 0.085 | 0.076 | 0.011 | 7.0× | **7.9× faster** | bit-identical |
`full` is the user-facing call; its cost is dominated by stage-1 ssGSEA, which
parallelizes inside `gsva-rust`. `spillover` isolates the one kernel xcell-rust
itself parallelizes — the per-sample Lawson–Hanson NNLS — where rayon alone buys
**7×** over the serial build. Both builds produce **bit-identical** scores (rayon
only reorders independent per-sample work; there is no matrix decomposition whose
backend could differ), and both match R within the parity bar. Peak resident set
is ~57 MB for either Rust build.
Full methodology, a seeded-synthetic scaling sweep (gene and sample axes), and the
peak-memory curves are in [`benchmarks/REPORT.md`](https://github.com/sinmojito/xcell-rust/blob/main/benchmarks/REPORT.md). Reproduce:
```sh
Rscript -e "source('benchmarks/bench_xcell.R')" # real GBM, single point -> results_xcell.json
Rscript -e "source('benchmarks/bench_scaling_xcell.R')" # synthetic size sweep -> results_scaling_xcell.json
```
## Fidelity notes
- **Spillover NNLS.** R calls `pracma::lsqlincon(K, x, lb = 0)`, which reduces the
constrained least-squares problem to a quadratic program over the Gram matrix
`KᵀK` and solves it with `quadprog`. Because `αK` has a unit diagonal and full
rank, `KᵀK` is symmetric positive-definite and the objective is strictly convex:
the optimum is unique, so the hand-rolled Lawson–Hanson solver and `quadprog`
reach the same point (to ~1e-11 relative here), not bit-for-bit.
- **80-bit reductions.** R accumulates `sum`/`mean` in `long double`; Rust uses
`f64`. The per-cell-type mean in stage 1 is the only such new reduction, giving
~1 ULP (1.8e-12 abs).
- **Canonical cell-type order** is taken from xCell's own data (exported as a
constant) rather than reproducing R's locale-dependent `sort()` collation.
- **Fail loud.** Below the 5000-gene floor, xCell aborts; the port asserts rather
than emitting a silent degenerate result.
## R → Rust name map
| `xCellAnalysis(expr, rnaseq, scale, alpha)` | `xcell_analysis(&expr, &model, &XCellParams { rnaseq, scale, alpha })` |
| `rawEnrichmentAnalysis(expr, signatures, genes)` | `raw_enrichment(&expr, &model)` |
| `transformScores(scores, fit.vals, scale)` | `transform_scores(&raw, spill, scale)` |
| `spillOver(transformed, K, alpha)` | `spill_over(&transformed, spill, alpha)` |
| `microenvironmentScores(adjusted)` | `microenvironment_scores(&adjusted)` |
| `pracma::lsqlincon(K, x, lb = 0)` | hand-rolled Lawson–Hanson NNLS |
| `xCell.data$spill` / `$spill.array` | `model.spill_for(rnaseq)` |
| expression matrix / `ExpressionSet` | `ExprMatrix` (`xcell::io::read_tsv_matrix`) |
| returned score matrix | `EnrichmentResult { gene_sets, samples, scores }` |
## Building from source
`xcell-rust` depends on `gsva-rust` as a sibling path dependency; check both out
side by side (`../gsva-rust`). Then:
```sh
cargo build
cargo test
```
Gate (matches CI) — both feature configurations:
```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings # default: parallel
cargo clippy --no-default-features --all-targets -- -D warnings # serial
cargo test
cargo test --no-default-features
```
## How this port was produced
xcell-rust was written with AI assistance (Anthropic's Claude, via
[Claude Code](https://www.anthropic.com/claude-code)); the `Co-Authored-By`
trailers in the git history record this. It follows a port-and-validate
discipline: every stage is checked for numeric parity against the real R xCell
package on both synthetic fixtures and real public RNA-seq before it is
considered done.
## License
Licensed under **GPL-3.0-or-later**, matching the upstream xCell package
(`License: GPL-3`). The bundled model data is exported from xCell. See
[`LICENSE`](LICENSE). `xcell-rust` depends on `gsva-rust` (Artistic-2.0), which is
license-compatible.
## Credits & citation
xCell is the work of Dvir Aran, Zicheng Hu, and Atul J. Butte, distributed as the
`xCell` R package. This port reimplements their method against that package's
behavior; it does not originate it.
- Original software: [github.com/dviraran/xCell](https://github.com/dviraran/xCell)
If you use xcell-rust in research, please cite the original xCell paper, and the
GSVA paper for the ssGSEA engine it builds on:
- **xCell** — Aran D, Hu Z, Butte AJ. *xCell: digitally portraying the tissue
cellular heterogeneity landscape.* Genome Biology. 2017;18:220.
doi:[10.1186/s13059-017-1349-1](https://doi.org/10.1186/s13059-017-1349-1)
- **ssGSEA engine** — Hänzelmann S, Castelo R, Guinney J. *GSVA: gene set
variation analysis for microarray and RNA-Seq data.* BMC Bioinformatics.
2013;14:7. doi:[10.1186/1471-2105-14-7](https://doi.org/10.1186/1471-2105-14-7)