inferCNAsc
Copy number alteration (CNA) inference from single-cell RNA-seq data. A Rust core with optional Python bindings via PyO3.
The pipeline is a chromosome-aware sliding-window smoother, per-gene z-score
thresholding, and a parallel run-length merge that assembles per-cell CNA
regions. The Rust core is parallelized over genes (smoothing, z-scoring) and
over cells (region assembly) with rayon; the Python layer handles AnnData
adaptation, Ensembl annotation lookup, evaluation, and plotting.
Installation
Wheels are built for Linux x86_64/aarch64, macOS universal2, and Windows x86_64 against Python's abi3-py310 stable ABI, so a single wheel serves Python 3.10+.
For the Rust API:
[]
= "0.2"
No feature flags are needed for the native Rust API.
Python
=
=
=
gene_df is a DataFrame with columns gene, chrom, start, end.
infercnasc.io.annotate_genes(gene_ids) fetches these from Ensembl with a
local requests-cache backing store.
Sparse AnnData.X is supported natively. Coordinate-annotation filtering
runs on the sparse matrix first, so the eventual dense materialization is
limited to genes that survive annotation; this avoids the standard scRNA
memory blow-up of an unconditional .toarray().
Rust
use ;
let smoothed = smooth_expression?;
let = find_cnas;
let cnas = assign_cnas_to_cells;
smooth_expression returns Result<Array2<f64>, InferError>. find_cnas and
assign_cnas_to_cells are infallible.
Pipeline
- Gene annotation. Gene identifiers are resolved to genomic coordinates via the Ensembl REST API, with responses cached locally under the platform cache directory.
- Smoothing. A sliding-window mean is applied along each chromosome. The window resets at chromosome boundaries. Columns are processed in parallel.
- CNA calling. Per-gene z-scores are computed across cells. Entries
above
+z_thresholdare flagged as gains, entries below-z_thresholdas losses. Zero-variance genes are skipped. - Region assembly. Consecutive flagged genes on the same chromosome are
merged into
CnaRecordregions by a parallel per-cell run-length scan. Runs shorter thanmin_region_sizeare dropped.
Benchmarks
End-to-end pipeline (smoothing + calling + region assembly) on real public tumor scRNA-seq data: the Tirosh 2016 oligodendroglioma dataset shipped with the inferCNV R package (184 cells x ~10,000 annotated genes). A planted-chr1-loss synthetic matrix is also run to show scaling at larger sizes. Single local run on a Ryzen laptop; your numbers will differ.
| implementation | Tirosh (184 x 10,338) | synth (2000 x 10,000) |
|---|---|---|
_core direct FFI call |
0.032 s | 0.62 s |
CNAInferrer.fit (wrapper) |
0.111 s | 0.84 s |
infercnvpy.tl.infercnv |
1.134 s | 1.95 s |
| pure-numpy reference | 0.547 s | (skip) |
_core direct is the straight FFI call; CNAInferrer.fit adds the
coordinate filter, DataFrame sort, and DataFrame assembly around it.
infercnvpy.tl.infercnv uses a different algorithmic core (log
fold-change on sparse windows) and is the nearest published
Python-ecosystem comparator. The pure-numpy reference is a faithful
per-chromosome cumulative-sum smoothing + z-score reimplementation used
as an apples-to-apples control for the algorithm itself.
Reproduce:
Evaluation
=
# {"true_positives": ..., "precision": ..., "recall": ..., "f1": ...}
Matching is any-overlap on genomic coordinates within the same chromosome and label. The implementation is O(n_inferred + n_truth) via a chromosome- and-label-indexed bucket sweep.
Acknowledgements
A pre-release Python prototype predating this crate was developed with Raeann Kalinowski and Amy Liu as a 2025 course project at Johns Hopkins; this repository is a full independent rewrite and is not affiliated with that coursework.
License
MIT