# Proseg

Proseg (**pro**babilistic **seg**mentation) is a cell segmentation method for
spatial transcriptomics. Xenium, CosMx, MERSCOPE, and Visium HD platforms are
currently supported, but it can be easily adapted to others.

![](https://raw.githubusercontent.com/dcjones/proseg/main/figure.png)

![Crates.io Version](https://img.shields.io/crates/v/proseg)
[![Conda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square&logo=anaconda)](http://bioconda.github.io/recipes/rust-proseg/README.html)
[![Docker](https://img.shields.io/badge/install%20with-docker-important.svg?style=flat-square&logo=docker)](https://quay.io/repository/biocontainers/rust-proseg)
<a href="https://nf-co.re/modules/proseg_proseg/">
  <img src="https://raw.githubusercontent.com/nf-core/logos/master/nf-core-logos/nf-core-logo-lightbg.png" alt="nf-core logo" height="24">
</a>

Read the paper:

🗎 [Jones, D.C., Elz, A.E., Hadadianpour, A. et al. **Cell simulation as cell segmentation.** Nat Methods (2025). https://doi.org/10.1038/s41592-025-02697-0](https://doi.org/10.1038/s41592-025-02697-0)

And the Research Brief:

🗎 [**Confronting the challenge of cell segmentation in spatial transcriptomics.** Nat Methods (2025). https://doi.org/10.1038/s41592-025-02717-z](https://doi.org/10.1038/s41592-025-02717-z)


# Table of Contents

  * [Installing](#installing)
  * [Migrating to Proseg 3](#migrating-to-proseg-3)
  * [Usage](#usage)
    * [Spatialdata output format](#spatialdata-output-format)
    * [General arguments](#general-arguments)
    * [Output arguments](#output-arguments)
    * [Model arguments](#model-arguments)
    * [General advice](#general-advice)
    * [Xenium](#xenium)
      * [Importing into Xenium Explorer](#importing-into-xenium-explorer)
    * [CosMx](#cosmx)
    * [MERSCOPE](#merscope)
    * [VisiumHD](#visiumhd)
    * [Initializing using Cellpose masks](#initializing-using-cellpose-masks)
  * [Getting help](#getting-help)


# Installing

Proseg can be built and installed with cargo by running:

```sh
cargo install proseg
```

The easiest way to install cargo for most users is [rustup](https://rustup.rs/).

## From source

Proseg can also be built manually from source, which is useful mainly if you want to try a specific revision or make changes:

```sh
git clone https://github.com/dcjones/proseg.git
cd proseg
cargo build --release
```

Proseg can then be run with:
```sh
target/release/proseg
```


# Migrating to Proseg 3

Proseg 3 has a few changes that users of earlier versions should be aware of.

  1. By default proseg will output a
     [spatialdata](https://spatialdata.scverse.org/en/stable/) zarr directory
     that can be read by the spatialdata package in Python. Many other output
     options are still available, but they are disabled by default.
     * This is admittedly a little less convenient for R users. My recommendation would
       be to convert the AnnData part to h5ad in Python like:
       ```python
       import spatialdata
       sdata = spatialdata.read_zarr("proseg-output.zarr")
       sdata.tables["table"].write_h5ad("proseg-anndata.h5ad")
       ```
       then read this h5ad file into R with [zellkonverter](https://github.com/theislab/zellkonverter).
     * The `proseg-to-baysor` command now operates on these zarr directories.
  2. By default, count matrices generated by Proseg 3 are integer point estimates,
     not continuous expected counts, which should make some downstream analysis simpler.
  3. Some simplifications were made to the model and sampling procedure. Now the
     sampling schedule is controlled with these four arguments:
     `--burnin-samples`, `--samples` giving the number of iterations, and
     `--burnin-voxel-size` and `--voxel-size` giving the x/y size of the voxels
     in microns. The burn in voxel size must be an integer multiple of the final
     voxel size.
  4. The `--nbglayers` argument has been removed. There is now just one
     `--voxel-layers` argument controlling how many voxels are stacked on the z-axis.
  5. The voxel morphology prior has been changed. Instead of `--perimeter-bound`
     and `--perimeter-eta`, there is one `--cell-compactness` argument, where
     smaller numbers lead to more compact (equivalently, more circular) cells.
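
The schedule constraint in point 3 is easy to check ahead of time. A minimal sketch, using hypothetical sizes rather than Proseg defaults:

```python
# Illustrative check of the sampling-schedule constraint: the burn-in voxel
# size must be an integer multiple of the final voxel size.
# The values here are hypothetical, not Proseg defaults.
voxel_size = 0.5          # --voxel-size, microns (x/y)
burnin_voxel_size = 2.0   # --burnin-voxel-size, microns (x/y)

ratio = burnin_voxel_size / voxel_size
assert ratio == round(ratio), "burn-in voxel size must be an integer multiple of voxel size"
print(int(ratio))  # each burn-in voxel spans this many final voxels per axis
```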


# Usage

Proseg is run on a table of transcript positions, which must include in some
form preliminary assignments of transcripts to nuclei or cells. Xenium, CosMx,
and MERSCOPE all provide this out of the box.

Proseg is invoked from the command line like:

```sh
proseg [arguments...] /path/to/transcripts.csv.gz
```

The method is general purpose. There are command line arguments to tell it
which columns in the csv file to use and how they should be interpreted, but
typically one of the presets `--xenium`, `--cosmx`, `--merscope`, or
`--visiumhd` is used for common platforms.

Proseg is a sampling method, and in its current form is non-deterministic. From
run to run, results will vary slightly, but not in a way that should seriously
affect the interpretation of the data.

To see a list of command line arguments, run
```sh
proseg --help
```
Most of these can be disregarded. The most relevant ones are described below.

## Spatialdata output format

The spatialdata zarr output that proseg generates can be read with
```python
import spatialdata
sdata = spatialdata.read_zarr("proseg-output.zarr")
```

This object contains:
  * Transcript positions and metadata in `sdata.points["transcripts"]`. (This
    can use significant space, so it can be excluded if not needed with
    `--exclude-spatialdata-transcripts`.)
  * Cell polygons in `sdata.shapes["cell_boundaries"]`
  * Cell level information in AnnData format in `sdata.tables["table"]`, which contains:
    * Cell metadata in `obs`
    * Gene metadata in `var`
    * Sparse cell-by-gene count matrix in `X`
    * Cell centroids in `obsm["spatial"]`
    * Some information about the proseg run in `uns["proseg_run"]`.

## General arguments

  * `--nthreads N` sets the number of threads to parallelize across. By default
    proseg will use all available CPU cores, which may be a bad idea on a shared
    machine.
  * `--output-spatialdata output.zarr`: Proseg will output a [spatialdata](https://spatialdata.scverse.org/en/stable/) zarr directory
    which can be read by the spatialdata Python package and contains all metadata, the count matrix, and cell geometry.
  * `--overwrite`: An existing zarr directory will not be overwritten unless this argument is passed.
  * `--voxel-layers N`: Number of layers on the z-axis to model 3D cells.
  * `--samples N`: Run the sampler for N iterations.
  * `--burnin-samples N`: Run the sampler for a preliminary N iterations at a lower resolution.
  * `--voxel-size S`: Voxel size in microns on the x/y axis.
  * `--burnin-voxel-size S`: Larger voxel size to use for the burn-in phase. (This must be an integer multiple of the final voxel size).

## Output arguments

In addition to the spatialdata zarr output, results can be written to a number
of separate tables, which can be either gzipped csv files or parquet files, and
[GeoJSON](https://geojson.org/) files giving cell boundaries.

  * `--output-counts counts.mtx.gz`: Output a cell-by-gene count matrix in gzipped [matrix market format](https://math.nist.gov/MatrixMarket/formats.html). (This can be read with e.g. `mmread` in scipy.)
  * `--output-expected-counts expected-counts.mtx.gz`: Output an **expected**
    count matrix, where the counts are non-integer estimates from taking the
    mean over multiple samples.
  * `--output-cell-metadata cell-metadata.csv.gz`: Cell centroids, volume, and other information.
  * `--output-transcript-metadata transcript-metadata.csv.gz`: Transcript ids, genes, revised positions, assignment probability, etc.
  * `--output-gene-metadata gene-metadata.csv.gz`: Per-gene summary statistics
  * `--output-rates rates.csv.gz`: Cell-by-gene Poisson rate parameters. These are essentially expected relative expression values, but may be overly smoothed for use in downstream analysis.
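
As a sketch of reading the Matrix Market count matrix in Python: here a small synthetic matrix stands in for real Proseg output; with a real run you would call `mmread` directly on whatever path was passed to `--output-counts` (`mmread` transparently handles `.gz` files).

```python
# Sketch: round-tripping a cell-by-gene count matrix in Matrix Market format
# with scipy. The matrix is synthetic; real output comes from --output-counts.
import os
import tempfile
import numpy as np
from scipy.io import mmread, mmwrite
from scipy.sparse import csr_matrix

counts = csr_matrix(np.array([[0, 2, 1], [3, 0, 0]]))  # 2 cells x 3 genes
path = os.path.join(tempfile.mkdtemp(), "counts.mtx")
mmwrite(path, counts)

m = mmread(path).tocsr()  # sparse cell-by-gene matrix
print(m.shape, int(m.sum()))
```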

Cell boundaries can be output in a number of ways:

  * `--output-cell-polygons cell-polygons.geojson.gz`: 2D consensus polygons for each cell in GeoJSON format. These are flattened from 3D, with each xy position assigned to the dominant cell.
  * `--output-cell-polygon-layers cell-polygons-layers.geojson.gz`: Output a separate, non-overlapping cell polygon for each z-layer, preserving 3D segmentation.
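
These gzipped GeoJSON files can be inspected with just the standard library. A minimal sketch; the FeatureCollection below is synthetic, and the exact properties Proseg writes may differ:

```python
# Sketch: writing and reading back a gzipped GeoJSON FeatureCollection of
# cell polygons using only the stdlib. Feature contents are illustrative.
import gzip
import json
import os
import tempfile

fc = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"cell": 0},  # hypothetical property name
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]],
        },
    }],
}

path = os.path.join(tempfile.mkdtemp(), "cell-polygons.geojson.gz")
with gzip.open(path, "wt") as f:
    json.dump(fc, f)

with gzip.open(path, "rt") as f:
    data = json.load(f)
print(len(data["features"]), data["features"][0]["geometry"]["type"])
```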

## Model arguments

A number of options can alter assumptions made by the model. These generally
should not need to be changed.

  * `--ncomponents 10`: Cell gene expression is a modeled as a mixture of negative binomial distributions. This parameter controls the number of mixture components. More components will tend to nudge the cells into more distinct types, but setting it too high risks manifesting cell types that are not real.
  * `--no-diffusion`: By default Proseg models cells as leaky, under the assumption that some amount of RNA leaks from cells and diffuses elsewhere. This seems to be the case in much of the Xenium data we've seen, but could be a harmfully incorrect assumption in some data. This argument disables that part of the model.
  * `--diffusion-probability 0.2`: Prior probability that a transcript is diffused and should be repositioned.
  * `--diffusion-sigma-far 4`: Prior standard deviation on repositioning distance of diffused transcripts.
  * `--diffusion-sigma-near 1`: Prior standard deviation on repositioning distance of non-diffused transcripts.
  * `--nuclear-reassignment-prob 0.2`: Prior probability that the initial nuclear assignment (if any) is incorrect.
  * `--cell-compactness 0.03`: Larger numbers allow less spherical cells.
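
To build intuition for the three diffusion arguments, the prior they describe can be sketched as a two-component Gaussian mixture over repositioning displacements. This is only an illustration of the prior, not Proseg's actual sampling code:

```python
# Sketch of the repositioning prior implied by the diffusion arguments:
# with probability --diffusion-probability a transcript is "diffused" and
# repositioned under the wider sigma-far prior; otherwise sigma-near applies.
# Illustrative only; not Proseg's sampling code.
import numpy as np

rng = np.random.default_rng(0)
p_diffused = 0.2   # --diffusion-probability
sigma_near = 1.0   # --diffusion-sigma-near, microns
sigma_far = 4.0    # --diffusion-sigma-far, microns

n = 100_000
diffused = rng.random(n) < p_diffused
# x-displacement in microns, drawn from the mixture
dx = rng.normal(0.0, np.where(diffused, sigma_far, sigma_near))
print(round(float(diffused.mean()), 2), round(float(dx.std()), 1))
```

The mixture standard deviation works out to about 2 µm here (variance 0.8·1² + 0.2·4² = 4), showing how a small diffusion probability still widens the prior considerably.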

## General advice

  * If the dataset consists of multiple unrelated tissue sections, it's
    generally safer to split these and segment them separately. This avoids
    potential sub-optimal modeling due to different z-coordinate distributions.
  * If proseg crashes without any kind of error message, or if it suddenly
    starts running extremely slowly, that's usually a sign that it ran out of memory. The best option
    would be to run on a system with more memory, but increasing the voxel size,
    lowering the number of voxel layers, and disabling diffusion will all reduce the memory usage,
    typically at the cost of lower transcript assignment accuracy.
  * Proseg is aided by locally repositioning transcripts (I call this
    "transcript repositioning" or "transcript diffusion"). If you're skeptical that
    transcripts leak or diffuse, it's still useful to allow some amount of
    repositioning to overcome limitations of the voxel resolution. Setting
    `--diffusion-probability 0.0` will still let transcripts make small-scale adjustments,
    unlike `--no-diffusion`, which completely disables that part of the model.


## Xenium

Proseg will work on Xenium transcript tables in either `csv.gz` or `parquet` format. The latter will be slightly more efficient to read.

```sh
proseg --xenium transcripts.csv.gz
# or
proseg --xenium transcripts.parquet
```

### Importing into Xenium Explorer

After segmenting with Proseg, the data can be converted back to a format readable by Xenium Explorer.

  1. First convert the proseg output to a format matching Baysor's using the included `proseg-to-baysor` command:
    ```sh
    proseg-to-baysor proseg-output.zarr \
        --output-transcript-metadata proseg-to-baysor-transcript-metadata.csv \
        --output-cell-polygons proseg-to-baysor-cell-polygons.geojson
    ```
  2. This can then be converted to a xenium bundle with [Xenium Ranger](https://www.10xgenomics.com/support/software/xenium-ranger/latest).
    ```sh
    xeniumranger import-segmentation \
        --id sample_id \
        --xenium-bundle /path/to/xenium/bundle \
        --transcript-assignment proseg-to-baysor-transcript-metadata.csv \
        --viz-polygons proseg-to-baysor-cell-polygons.geojson \
        --units=microns
    ```
    where `/path/to/xenium/bundle` is the original xenium bundle.
  3. Xenium Explorer should be able to read the resulting xenium bundle, which
     will be written to a directory named whatever was passed to `--id`.

Known issues:
  * Problems importing can arise if transcripts with qv scores below 20 are not
    filtered out. This is done by default in Proseg, but lowering this cutoff could cause issues.
  * Earlier versions of Xenium Ranger/Explorer tend to mangle polygons generated by Proseg. Since
    Proseg is voxel based, any cell boundary that isn't axis aligned (i.e. composed of vertical
    and horizontal line segments) is due to Xenium software mis-rendering it.

## CosMx

Proseg works on CosMx transcript tables with
```sh
proseg --cosmx sample_tx_file.csv
```

### Legacy CosMx data

Earlier versions of CosMx did not automatically provide a single table of global
transcript positions. To work around this, we provide a Julia program in
`extra/stitch-cosmx.jl` to construct a table from the flat files downloaded from
AtoMx.

To run this, some dependencies are required, which can be installed with
```sh
julia -e 'import Pkg; Pkg.add(["Glob", "CSV", "DataFrames", "CodecZlib", "ArgParse"])'
```

Then the program can be run like
```sh
julia stitch-cosmx.jl /path/to/cosmx-flatfiles transcripts.csv.gz
```
to output a complete transcripts table to `transcripts.csv.gz`.

From here proseg can be run with (note `--cosmx-micron`, not `--cosmx`, since this format differs slightly)
```sh
proseg --cosmx-micron transcripts.csv.gz
```

## MERSCOPE

Proseg should work on the provided transcripts table with `--merscope`:
```sh
proseg --merscope detected_transcripts.csv.gz
```

## VisiumHD

Running Proseg on Visium HD data is slightly more complicated than on in situ
platforms. First, Proseg needs to read a sparse count matrix across bins/squares:
instead of passing a transcripts table, point it at the `binned_outputs` directory.

Proseg still requires some form of registered image segmentation. The easiest way is
to use the [tools provided by Space Ranger to handle image segmentation in
Visium HD](https://www.10xgenomics.com/support/software/space-ranger/latest/analysis/outputs/segmented-outputs),
which provide a `barcode_mappings.parquet` file that Proseg can read.

```sh
proseg --visiumhd \
    --spaceranger-barcode-mappings /path/to/barcode_mappings.parquet \
    /path/to/binned_outputs
```

Alternatively, it's possible, though trickier, to use a Cellpose mask, assuming it's correctly registered.
```sh
proseg --visiumhd \
    --cellpose-masks /path/to/cellpose_masks.npy \
    /path/to/binned_outputs
```
There isn't a standardized cellpose output format. See the `extras/cellpose-xenium.py` file for code
to run cellpose and output to a format proseg can read.
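
As a sketch of the file layout itself, following the description under "Initializing using Cellpose masks" (a 2D uint32 array, 0 for unassigned, 1..ncells for cell ids, optionally gzipped); the mask contents here are synthetic:

```python
# Sketch: writing a Cellpose-style mask as a gzipped .npy in the layout
# Proseg expects: 2D uint32, 0 = unassigned, 1..ncells = cell ids.
# The mask contents are synthetic, for illustration only.
import gzip
import os
import tempfile
import numpy as np

masks = np.zeros((64, 64), dtype=np.uint32)
masks[10:20, 10:20] = 1  # cell 1
masks[30:45, 30:40] = 2  # cell 2

path = os.path.join(tempfile.mkdtemp(), "cellpose_masks.npy.gz")
with gzip.open(path, "wb") as f:
    np.save(f, masks)

with gzip.open(path, "rb") as f:  # round-trip check
    loaded = np.load(f)
print(loaded.dtype, int(loaded.max()))
```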

## Initializing using Cellpose masks

Typically Proseg is run with a prior segmentation in the form of a transcript table with prior cell assignments. It's also possible to initialize directly from Cellpose (or similar) masks,
and there is potentially some benefit to doing this, because it can provide a more detailed pixel-level prior to Proseg.

This does require more effort and care, since it's critical that coordinates are properly registered and scaled from mask pixels to transcript micron positions. These are the relevant arguments:

  * `--cellpose-masks masks.npy.gz`: An (optionally gzipped) npy file with a 2d masks array of uint32 type, where 0 represents the unassigned state and values in `[1, ncells]` give each pixel's cell assignment.
  * `--cellpose-cellprobs cellprobs.npy.gz`: An (optionally gzipped) npy file giving segmentation uncertainty. This should be a float32 array matching the dimensionality of the masks array.
  * `--cellpose-scale 0.324`: Microns per pixel of the cellpose mask.
  * `--cellpose-x-transform a b c`: Affine transformation from pixels to microns for the x-coordinate (where `x_micron = a*x_pixel + b*y_pixel + c`)
  * `--cellpose-y-transform a b c`: Affine transformation from pixels to microns for the y-coordinate (where `y_micron = a*x_pixel + b*y_pixel + c`)
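
A minimal sketch of deriving these transform coefficients for the simple case of a uniform scale plus a translation. The scale and offsets below are hypothetical; real values come from your image registration:

```python
# Sketch: deriving --cellpose-x/y-transform coefficients for a uniform
# scale-plus-translation registration. All values are hypothetical.
s = 0.2125            # microns per pixel (hypothetical)
tx, ty = 10.0, -5.0   # translation in microns (hypothetical)

# x_micron = a*x_pixel + b*y_pixel + c (and likewise for y), so:
x_transform = (s, 0.0, tx)   # --cellpose-x-transform 0.2125 0 10
y_transform = (0.0, s, ty)   # --cellpose-y-transform 0 0.2125 -5

x_pixel, y_pixel = 100.0, 200.0
x_micron = x_transform[0] * x_pixel + x_transform[1] * y_pixel + x_transform[2]
y_micron = y_transform[0] * x_pixel + y_transform[1] * y_pixel + y_transform[2]
print(x_micron, y_micron)  # 31.25 37.5
```

For a pure uniform scale with no offset or rotation, `--cellpose-scale` alone appears to cover the same case.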

# Getting help

Don't hesitate to open an issue or to email me (email is in my bio). This data
is complex and varied. Weird cases do arise which may not be correctly
considered by Proseg, so I appreciate being made aware of them.