mcpcounter-rust 0.1.0

Pure-Rust port of the MCPcounter cell-population quantification methods: MCP-counter (human) and mMCP-counter (mouse), validated for numeric parity against the original R implementations.
Documentation
# mcpcounter-rust

[![Crates.io](https://img.shields.io/crates/v/mcpcounter-rust.svg)](https://crates.io/crates/mcpcounter-rust)
[![Docs.rs](https://img.shields.io/docsrs/mcpcounter-rust)](https://docs.rs/mcpcounter-rust)
[![CI](https://github.com/sinmojito/MCPcounter-rust/actions/workflows/ci.yml/badge.svg)](https://github.com/sinmojito/MCPcounter-rust/actions/workflows/ci.yml)
[![License: GPL-3.0-or-later](https://img.shields.io/badge/license-GPL--3.0--or--later-blue.svg)](https://github.com/sinmojito/MCPcounter-rust/blob/main/LICENSE)

**mcpcounter-rust** is a pure-Rust, **zero-dependency** port of the **MCPcounter**
family of cell-population quantification methods — **MCP-counter** for human
tissue (Becht et al. 2016) and **mMCP-counter** for mouse (Petitprez et al.
2020). It reproduces the original R packages **bit-for-bit on synthetic inputs**
and to **the last bit of an IEEE-754 double** (worst relative error **2.2e-16**,
~1 ULP) on real public RNA-seq, with **no BLAS, LAPACK, Python, or R at
runtime** — and no third-party crates at all.

> **Disclaimer.** mcpcounter-rust is an independent, unofficial Rust port and is
> **not** affiliated with, endorsed by, or supported by the authors of
> MCP-counter or mMCP-counter. For the canonical, authoritative implementations,
> use the upstream R packages [ebecht/MCPcounter]https://github.com/ebecht/MCPcounter
> and [cit-bioinfo/mMCP-counter]https://github.com/cit-bioinfo/mMCP-counter.

## Highlights

- **Numeric parity with R, validated at two levels.** Seeded synthetic fixtures
  match the R packages to a **relative error of `0` (bit-identical)**; real
  public RNA-seq matches to **≤ 1 ULP** (2.18e-16 human, 1.69e-16 mouse).
  Nothing is accepted unless it reproduces R 4.6.0.
- **Both family methods, one engine.** Human **MCP-counter** (10 immune &
  stromal populations, arithmetic-mean scores) and mouse **mMCP-counter** (16
  populations, median scores) share a single R-faithful mean/median core.
- **Faithful to R's quirks, on purpose.** Reproduces R's order-sensitive
  two-pass mean, even-`n` median (`mean(c(a, b))`, *not* `(a + b) / 2`),
  MCP-counter's `intersect` de-duplication, and mMCP-counter's marker
  *multiplicity* (a marker listed *k* times counts *k* times) — the details that
  make last-bit parity hold. See [Fidelity notes]#fidelity-notes.
- **Zero dependencies.** No native libraries, no build script, no transitive
  crates — a single, audit-tiny library. Marker signatures are embedded at
  compile time.
- **Library-first.** `use mcpcounter_rust::...`, plus a dependency-free TSV
  reader for matrix IO.

## Installation

```bash
cargo add mcpcounter-rust
```

The crate is `mcpcounter-rust`; the library imports as `mcpcounter_rust` (Cargo
maps the dash to an underscore). It builds on Rust **1.74+** and pulls in **no
dependencies**.

## Quick start

`mcp_counter` on a tiny human matrix — the canonical call, exactly as
`MCPcounter.estimate` in R. Markers are matched by name against the bundled
signature; genes that aren't markers are ignored, and populations with no
present markers are dropped:

```rust
use mcpcounter_rust::{mcp_counter, ExprMatrix, FeaturesType};

fn main() {
    // 6 genes (rows) x 3 samples (cols), log2-scale expression, stored row-major.
    // Five are MCP-counter markers; GAPDH is a non-marker housekeeping gene and
    // is ignored, exactly as in R.
    let genes = vec![
        "CD3D".to_string(),  // T cells
        "CD28".to_string(),  // T cells
        "CD8B".to_string(),  // CD8 T cells
        "CD19".to_string(),  // B lineage
        "CD79A".to_string(), // B lineage
        "GAPDH".to_string(), // not a marker -> ignored
    ];
    let samples = vec![
        "tumor_A".to_string(),
        "tumor_B".to_string(),
        "normal".to_string(),
    ];
    #[rustfmt::skip]
    let data = vec![
        //  tumor_A  tumor_B  normal
            8.1,     7.4,     2.0, // CD3D
            7.6,     7.1,     1.8, // CD28
            6.9,     6.2,     1.5, // CD8B
            4.2,     8.3,     2.1, // CD19
            4.0,     8.0,     1.9, // CD79A
           11.0,    11.1,    11.0, // GAPDH (flat housekeeping)
    ];

    let expr = ExprMatrix::new(genes, samples, data);
    let result = mcp_counter(&expr, FeaturesType::HugoSymbols);

    // Each score is the mean of a population's present markers; populations
    // appear in MCP-counter's signature order.
    for (p, pop) in result.populations.iter().enumerate() {
        print!("{pop:<14}");
        for s in 0..result.samples.len() {
            print!("{:>10.3}", result.score(p, s));
        }
        println!();
    }
}
```

```text
T cells            7.850     7.250     1.900
CD8 T cells        6.900     6.200     1.500
B lineage          4.100     8.150     2.000
```

The mouse estimator has the same shape —
`mmcp_counter(&expr, MurineFeaturesType::GeneSymbol, GenomeVersion::GCRm39)` —
returning median scores over 16 murine populations. The runnable version of the
above is [`examples/quickstart.rs`](https://github.com/sinmojito/MCPcounter-rust/blob/main/examples/quickstart.rs);
the real-data parity harness is
[`examples/parity_realdata.rs`](https://github.com/sinmojito/MCPcounter-rust/blob/main/examples/parity_realdata.rs).

## Methods, inputs & licensing

Each method re-implements a published algorithm and inherits its upstream
license; the crate is therefore distributed under **GPL-3.0-or-later**.

| Method | Function | Populations | Score | Feature namespaces | Upstream / license |
|---|---|---:|---|---|---|
| MCP-counter (human) | `mcp_counter` | 10 | mean | `HugoSymbols`, `Affy133P2Probesets`, `EntrezId`, `EnsemblId` | [ebecht/MCPcounter]https://github.com/ebecht/MCPcounter — code GPL-3; signatures CC0 / public domain |
| mMCP-counter (mouse) | `mmcp_counter` | 16 | median | `GeneSymbol`, `EnsemblId`, `Probes``GCRm38` / `GCRm39`) | [cit-bioinfo/mMCP-counter]https://github.com/cit-bioinfo/mMCP-counter — code + signatures GPL-3 |

Input is an `ExprMatrix` (genes × samples; build with `ExprMatrix::new`, or parse
a TSV via `mcpcounter_rust::io::read_tsv_matrix`). No internal transform or
normalization is applied, matching R; `NA` maps to `f64::NAN` and is dropped per
population (`na.rm = TRUE`).

> **Scope.** This crate is the **MCPcounter family only**. Other deconvolution
> methods (ssGSEA-based, solver-based, …) are intentionally *not* bundled here —
> they belong in their own focused crates.

## Validation

Every method is checked against the unmodified upstream R package (R 4.6.0),
never against hand- or AI-computed values. Parity is verified at two levels, and
**both are required** — the real-data level catches what synthetic data hides:

| Method | Dataset | Organism | Shape (genes × samples) | Worst relative error |
|---|---|---|---:|---:|
| MCP-counter | seeded synthetic || committed fixtures | **0** (bit-identical) |
| mMCP-counter | seeded synthetic || committed fixtures | **0** (bit-identical) |
| MCP-counter | `dataset_racle` (Racle 2017, eLife) | human | 32,467 × 4 | **2.18e-16** |
| mMCP-counter | `dataset_petitprez` (Petitprez 2020, Genome Med) | mouse | 53,819 × 6 | **1.69e-16** |

Synthetic inputs report a perfect `0`; only real expression — with its hundreds
of present markers per population — surfaces the sub-ULP rounding, because R
accumulates means and medians in 80-bit `long double` that an `f64` cannot match
to the final bit. A flawless synthetic score would have *hidden* this, so
real-data validation is treated as a required gate, not a bonus.

Reproduce it from public data only (the large input matrices are never
committed):

```bash
Rscript benchmarks/parity_realdata.R   # dump R reference outputs to scratch/
cargo run --example parity_realdata    # revalidate; writes the metrics JSON
```

Synthetic fixtures run as plain unit tests (`cargo test`); aggregate real-data
metrics live in
[`benchmarks/parity_realdata_results.json`](https://github.com/sinmojito/MCPcounter-rust/blob/main/benchmarks/parity_realdata_results.json).

## Fidelity notes

The behaviours below are reproduced deliberately — they are why the numbers land
on the last bit rather than merely "close":

- **Order-sensitive two-pass mean.** R's `mean` accumulates in 80-bit long
  double over two passes; marker rows are summed in ascending matrix-row order
  (R's `intersect(rownames, markers)` order) so the rounding matches.
- **Even-`n` median = `mean(c(a, b))`.** R averages the two central order
  statistics with that same two-pass mean, not the naive `(a + b) / 2` — they
  can differ in the last ULP.
- **MCP-counter de-duplicates; mMCP-counter does not.** MCP-counter selects rows
  via `intersect` (one row per marker, first occurrence); mMCP-counter selects
  via `exp[sig[, features], ]`, so a marker listed *k* times contributes *k*
  copies to the median.
- **Fixed 16-population mMCP output order**, with all-`NA`/absent populations
  dropped but all-zero populations kept.
- **Empty-result divergence (documented).** When no markers match, R `stop()`s;
  this port returns an empty `DeconvResult` instead.

## R → mcpcounter-rust name map

| R | mcpcounter-rust |
|---|---|
| `MCPcounter.estimate(expr, featuresType = …)` | `mcp_counter(&expr, FeaturesType::…)` |
| `mMCPcounter.estimate(exp, features = …, genomeVersion = …)` | `mmcp_counter(&expr, MurineFeaturesType::…, GenomeVersion::…)` |
| `"HUGO_symbols"` / `"affy133P2_probesets"` | `FeaturesType::HugoSymbols` / `Affy133P2Probesets` |
| result matrix `[population, sample]` | `DeconvResult::score(pop, sample)` |

## Performance

The crate is dependency-free and resolves each population's markers by direct
row lookup, so scoring is **O(present markers)** per population rather than a
per-population rescan of the whole gene panel — microsecond-scale per call on a
32k-gene human matrix (compute only). Note the shipped benchmark validates
**numeric parity, not wall-clock speed**: there is no head-to-head timing
harness against R, so no speedup multiple is claimed here.

## Building from source

```bash
cargo build --release
cargo test                      # unit + synthetic-parity tests
cargo run --example quickstart
```

The crate has a single build configuration (no feature flags, no dependencies),
so `cargo build` and `cargo build --all-features` are identical. The local CI
gate is `cargo fmt --all -- --check`, then
`cargo clippy --all-targets --all-features -- -D warnings`, then `cargo test`.

## How this port was produced

mcpcounter-rust was written with AI assistance (Anthropic's Claude, via Claude
Code); the `Co-Authored-By` trailers in the git history record this. Its outputs
are checked numerically against the upstream R packages, as described under
[Validation](#validation) — never against AI-generated expected values. Upstream
sources and signature tables are fetched/exported locally for study; nothing is
vendored beyond the embedded marker signatures.

## License

MCP-counter and mMCP-counter are licensed GPL-3. As a derivative port,
mcpcounter-rust is distributed under **GPL-3.0-or-later**; the full text is in
[`LICENSE`](https://github.com/sinmojito/MCPcounter-rust/blob/main/LICENSE). The
bundled MCP-counter marker signatures are upstream CC0 / public-domain data; the
mMCP-counter signatures are GPL-3 (exported from the package's data tables).

## Credits & citation

MCP-counter and mMCP-counter are the work of Etienne Becht, Florent Petitprez,
Aurélien de Reyniès, and colleagues. This port reimplements their algorithms; it
does not originate them. If you use mcpcounter-rust in research, please cite the
original papers:

> Becht, E., Giraldo, N.A., Lacroix, L., et al. (2016). Estimating the
> population abundance of tissue-infiltrating immune and stromal cell
> populations using gene expression. *Genome Biology* 17, 218.
> doi:[10.1186/s13059-016-1070-5]https://doi.org/10.1186/s13059-016-1070-5

> Petitprez, F., Levy, S., Sun, C.-M., et al. (2020). The murine Microenvironment
> Cell Population counter method to estimate abundance of tissue-infiltrating
> immune and stromal cell populations in murine samples using gene expression.
> *Genome Medicine* 12, 86.
> doi:[10.1186/s13073-020-00783-w]https://doi.org/10.1186/s13073-020-00783-w