xcell-rust 0.1.0

Pure-Rust port of xCell (Aran et al. 2017) cell-type enrichment — ssGSEA, spillover-corrected — validated for numeric parity against the R xCell package. Built on gsva-rust.
Documentation
# xcell-rust

[![Crates.io](https://img.shields.io/crates/v/xcell-rust.svg)](https://crates.io/crates/xcell-rust)
[![Docs.rs](https://img.shields.io/docsrs/xcell-rust)](https://docs.rs/xcell-rust)
[![CI](https://github.com/sinmojito/xcell-rust/actions/workflows/ci.yml/badge.svg)](https://github.com/sinmojito/xcell-rust/actions/workflows/ci.yml)
[![License: GPL-3.0-or-later](https://img.shields.io/badge/license-GPL--3.0--or--later-blue.svg)](LICENSE)

A pure-Rust port of **xCell 1.1.0** (Aran, Hu & Butte, *Genome Biology* 2017),
the cell-type enrichment method that scores 64 immune and stromal cell types from
bulk expression. It runs the full xCell pipeline —
`rawEnrichmentAnalysis → transformScores → spillOver → microenvironmentScores` —
on top of the [`gsva`](https://github.com/sinmojito/gsva-rust) ssGSEA engine, and
is validated for numeric parity against the original R package (R 4.6.0,
xCell 1.1.0, GSVA 2.6.2).

> **Disclaimer.** xcell-rust is an independent, unofficial Rust port and is **not**
> affiliated with, endorsed by, or supported by the authors of xCell. For the
> canonical, authoritative implementation, use the original
> [xCell package]https://github.com/dviraran/xCell (Aran *et al.*).

> **Status: the full pipeline is landed and validated.** Every stage is gated on
> real-data parity against R xCell 1.1.0 (see [Validation]#validation). Pre-1.0:
> the public API may still shift before the first crates.io release.

The default build depends on two pure-Rust crates: `gsva-rust` (the ssGSEA
engine, taken as its zero-dependency core) and
[`rayon`](https://crates.io/crates/rayon), for parity-preserving parallelism.
Building `--no-default-features` drops rayon for a pure-serial crate whose only
dependency is gsva-rust's core. Either way, no BLAS/LAPACK, `quadprog`/`pracma`,
Python, or R is required at runtime — even the spillover non-negative least
squares is hand-rolled.

## Highlights

- **The whole xCell 1.1.0 method**, not a slice: raw ssGSEA enrichment over the
  489 bundled signatures, per-cell-type calibration, spillover compensation, and
  the immune / stroma / microenvironment aggregate scores.
- **Validated against R**, not against hand-picked expected values: per-stage
  references are dumped from the real xCell package at full `%.17g` precision.
- **Two-level validation:** seeded synthetic fixtures (`cargo test`, no R needed)
  plus a real-data parity gate against a public glioblastoma cohort.
- **Hand-rolled NNLS:** the `spillOver` step solves each sample's non-negative
  least-squares problem with a Lawson–Hanson active-set solver, matching R's
  `pracma::lsqlincon` to ~1e-11 with no linear-algebra dependency.
- **Parallel by default, bit-identical:** the `parallel` feature (rayon) solves
  the per-sample spillover NNLS across cores and forwards to `gsva-rust/parallel`
  so ssGSEA parallelizes too; both axes only reorder independent per-sample work,
  so results match the serial build (and R) bit-for-bit.
- MSRV 1.85, inherited from the `gsva-rust` dependency's declared minimum.
  `xCell.data` is embedded, so analysis needs no network or data files.

## Installation

```toml
[dependencies]
xcell-rust = "0.1"
```

The library is imported as `xcell`:

```rust
use xcell::{xcell_analysis, XCellModel, XCellParams};
```

## Quick start

```rust
use xcell::{xcell_analysis, ExprMatrix, XCellModel, XCellParams};

// Expression as genes (rows, labelled by HGNC symbol) × samples. Real inputs are
// usually a TSV loaded with the bundled reader:
let expr: ExprMatrix =
    xcell::io::read_tsv_matrix(&std::fs::read_to_string("expr.tsv").unwrap());

// The embedded xCell 1.1.0 model: signatures, gene universe, spillover matrix,
// and calibration table. Cheap to load.
let model = XCellModel::load();

// Defaults match R xCellAnalysis: rnaseq = true, scale = true, alpha = 0.5.
let scores = xcell_analysis(&expr, &model, &XCellParams::default());

// 64 cell types + ImmuneScore / StromaScore / MicroenvironmentScore, × samples.
let macrophages_s0 = scores.score(
    scores.gene_sets.iter().position(|g| g == "Macrophages").unwrap(),
    0,
);
println!("Macrophages in sample 0 = {macrophages_s0}");
```

xCell requires at least 5000 of its 10 808 universe genes to be present; with
fewer the R package aborts, and so does this port (loudly, rather than returning a
degenerate result).

## The pipeline

| Stage | R function | What it does |
|---|---|---|
| 1 | `rawEnrichmentAnalysis` | keep genes shared with xCell's universe, per-sample average-tie rank, unnormalized ssGSEA over 489 signatures, subtract each signature's minimum, average signatures per cell type |
| 2 | `transformScores` | per-cell-type power-curve calibration `t = ((raw − min)/5000)^V2 / (V3·2)` |
| 3 | `spillOver` | per-sample non-negative least squares against the compensation matrix `K` (`αK` with unit diagonal) to remove cross-cell-type spillover |
| 4 | `microenvironmentScores` | append `ImmuneScore`, `StromaScore`, `MicroenvironmentScore` |

## Validation

Every stage is validated against the **real R xCell 1.1.0 package** (R 4.6.0,
GSVA 2.6.2) — never against hand-written expected values.

**Real-data gate** — a public glioblastoma cohort (Verhaak *et al.*, via
`GSVAdata`; 11 861 genes × 173 samples). Each Rust stage is fed R's output from
the previous stage, so its error is measured in isolation; `worst_rel` ignores
structural zeros (entries < 1e-10, e.g. cell types absent from a sample that
spillover NNLS drives to exactly 0). Full metrics in
[`benchmarks/parity_realdata_results.json`](https://github.com/sinmojito/xcell-rust/blob/main/benchmarks/parity_realdata_results.json):

| Stage | worst abs err | worst rel err | notes |
|---|---|---|---|
| `rawEnrichmentAnalysis` | `1.8e-12` | `4.2e-16` | ssGSEA + per-cell-type mean; ~1 ULP |
| `transformScores` | `1.1e-16` | `2.8e-16` | power curve; ~1 ULP |
| `spillOver` (NNLS) | `2.4e-15` | `9.6e-12` | vs `pracma::lsqlincon`; unique optimum |
| `microenvironmentScores` | `4.4e-16` | `3.3e-16` | ~1 ULP |
| `xCellAnalysis` (end-to-end) | `2.9e-15` | `1.5e-11` | whole pipeline |

**Synthetic fixtures** (committed; run offline by `cargo test`) pin stages 2–4 on
a seeded 64-cell-type input to < 1e-9 (abs) / < 1e-6 (the NNLS relative bound)
against R. Stage 1 needs ≥ 5000 genes, too large to commit, so it is covered by
the real-data gate above and rests on `gsva-rust`'s own bit-identical ssGSEA
fixtures.

References are dumped from R by `benchmarks/gen_parity_synthetic.R` (synthetic,
committed) and `benchmarks/parity_realdata.R` (real data, gitignored). Reproduce
the real-data gate with:

```sh
Rscript -e "source('benchmarks/parity_realdata.R')"   # dumps R references
cargo run --release --example parity_realdata          # checks parity, writes JSON
```

## Performance

Timing against the original R xCell package on the **same real GBM cohort** the
parity gate uses (Verhaak *et al.*, 11 861 genes × 173 samples), across three
engines: R (single-threaded xCell 1.1.0), the serial Rust build
(`--no-default-features`, rayon-free), and the parallel Rust build (the default,
rayon over all 16 cores). Median of 5 runs, one warm-up excluded:

| workload | R (s) | serial (s) | parallel (s) | parallel vs serial | parallel vs R | serial ≡ parallel |
|---|---:|---:|---:|---:|---:|---|
| `full` (end-to-end `xCellAnalysis`) | 2.245 | 0.294 | 0.097 | 3.0× | **23.2× faster** | bit-identical |
| `spillover` (per-sample NNLS only) | 0.085 | 0.076 | 0.011 | 7.0× | **7.9× faster** | bit-identical |

`full` is the user-facing call; its cost is dominated by stage-1 ssGSEA, which
parallelizes inside `gsva-rust`. `spillover` isolates the one kernel xcell-rust
itself parallelizes — the per-sample Lawson–Hanson NNLS — where rayon alone buys
**7×** over the serial build. Both builds produce **bit-identical** scores (rayon
only reorders independent per-sample work; there is no matrix decomposition whose
backend could differ), and both match R within the parity bar. Peak resident set
is ~57 MB for either Rust build.

Full methodology, a seeded-synthetic scaling sweep (gene and sample axes), and the
peak-memory curves are in [`benchmarks/REPORT.md`](https://github.com/sinmojito/xcell-rust/blob/main/benchmarks/REPORT.md). Reproduce:

```sh
Rscript -e "source('benchmarks/bench_xcell.R')"          # real GBM, single point -> results_xcell.json
Rscript -e "source('benchmarks/bench_scaling_xcell.R')"  # synthetic size sweep   -> results_scaling_xcell.json
```

## Fidelity notes

- **Spillover NNLS.** R calls `pracma::lsqlincon(K, x, lb = 0)`, which reduces the
  constrained least-squares problem to a quadratic program over the Gram matrix
  `KᵀK` and solves it with `quadprog`. Because `αK` has a unit diagonal and full
  rank, `KᵀK` is symmetric positive-definite and the objective is strictly convex:
  the optimum is unique, so the hand-rolled Lawson–Hanson solver and `quadprog`
  reach the same point (to ~1e-11 relative here), not bit-for-bit.
- **80-bit reductions.** R accumulates `sum`/`mean` in `long double`; Rust uses
  `f64`. The per-cell-type mean in stage 1 is the only such new reduction, giving
  ~1 ULP (1.8e-12 abs).
- **Canonical cell-type order** is taken from xCell's own data (exported as a
  constant) rather than reproducing R's locale-dependent `sort()` collation.
- **Fail loud.** Below the 5000-gene floor, xCell aborts; the port asserts rather
  than emitting a silent degenerate result.

## R → Rust name map

| R (xCell 1.1.0) | `xcell-rust` |
|---|---|
| `xCellAnalysis(expr, rnaseq, scale, alpha)` | `xcell_analysis(&expr, &model, &XCellParams { rnaseq, scale, alpha })` |
| `rawEnrichmentAnalysis(expr, signatures, genes)` | `raw_enrichment(&expr, &model)` |
| `transformScores(scores, fit.vals, scale)` | `transform_scores(&raw, spill, scale)` |
| `spillOver(transformed, K, alpha)` | `spill_over(&transformed, spill, alpha)` |
| `microenvironmentScores(adjusted)` | `microenvironment_scores(&adjusted)` |
| `pracma::lsqlincon(K, x, lb = 0)` | hand-rolled Lawson–Hanson NNLS |
| `xCell.data$spill` / `$spill.array` | `model.spill_for(rnaseq)` |
| expression matrix / `ExpressionSet` | `ExprMatrix` (`xcell::io::read_tsv_matrix`) |
| returned score matrix | `EnrichmentResult { gene_sets, samples, scores }` |

## Building from source

`xcell-rust` depends on `gsva-rust` as a sibling path dependency; check both out
side by side (`../gsva-rust`). Then:

```sh
cargo build
cargo test
```

Gate (matches CI) — both feature configurations:

```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings        # default: parallel
cargo clippy --no-default-features --all-targets -- -D warnings # serial
cargo test
cargo test --no-default-features
```

## How this port was produced

xcell-rust was written with AI assistance (Anthropic's Claude, via
[Claude Code](https://www.anthropic.com/claude-code)); the `Co-Authored-By`
trailers in the git history record this. It follows a port-and-validate
discipline: every stage is checked for numeric parity against the real R xCell
package on both synthetic fixtures and real public RNA-seq before it is
considered done.

## License

Licensed under **GPL-3.0-or-later**, matching the upstream xCell package
(`License: GPL-3`). The bundled model data is exported from xCell. See
[`LICENSE`](LICENSE). `xcell-rust` depends on `gsva-rust` (Artistic-2.0), which is
license-compatible.

## Credits & citation

xCell is the work of Dvir Aran, Zicheng Hu, and Atul J. Butte, distributed as the
`xCell` R package. This port reimplements their method against that package's
behavior; it does not originate it.

- Original software: [github.com/dviraran/xCell]https://github.com/dviraran/xCell

If you use xcell-rust in research, please cite the original xCell paper, and the
GSVA paper for the ssGSEA engine it builds on:

- **xCell** — Aran D, Hu Z, Butte AJ. *xCell: digitally portraying the tissue
  cellular heterogeneity landscape.* Genome Biology. 2017;18:220.
  doi:[10.1186/s13059-017-1349-1]https://doi.org/10.1186/s13059-017-1349-1
- **ssGSEA engine** — Hänzelmann S, Castelo R, Guinney J. *GSVA: gene set
  variation analysis for microarray and RNA-Seq data.* BMC Bioinformatics.
  2013;14:7. doi:[10.1186/1471-2105-14-7]https://doi.org/10.1186/1471-2105-14-7