redicat 0.3.10

REDICAT - RNA Editing Cellular Assessment Toolkit: A highly parallelized utility for analyzing RNA editing events in single-cell RNA-seq data
# REDICAT

**RNA Editing Cellular Assessment Toolkit**

REDICAT is a high-performance Rust toolkit for RNA editing quantification in single-cell RNA-seq, with an emphasis on strand-aware mismatch logic, sparse-matrix efficiency, and reproducible end-to-end processing from indexed BAM/CRAM to analysis-ready AnnData (`.h5ad`).

![](./imgs/redicat.png)

## Citation
If REDICAT supports your research, please cite:
> Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. *Cell Reports, 44*(7). https://doi.org/10.1016/j.celrep.2025.116009

![](./imgs/redicat-1.png)

## Installation
```bash
# Clone and build
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

# Optional: install into Cargo's bin directory
cargo install --path .
```


---

## 1) Scientific overview

RNA editing introduces biologically meaningful RNA–DNA mismatches (for example A-to-I observed as A>G, and C-to-U observed as C>T). In single-cell data, these events are sparse and confounded by sequencing noise, alignment artifacts, and SNPs. REDICAT addresses this by:

- extracting per-cell, per-site strand-aware base counts from BAM/CRAM,
- applying biologically informed mismatch filters,
- generating reference/alternate/other sparse matrices,
- computing cell-level editing metrics (including CEI), and
- preserving metadata in standard AnnData format for downstream statistical analysis.

**Innovation highlights**
- Parallel genomic scheduling with bounded-memory streaming.
- UMI-level consensus deduplication to suppress conflicting molecules.
- Strand-aware editing inference (`AG`, `CT`, and related mismatch classes).
- Sparse linear-algebra-first implementation for scale.

---

## 2) Code architecture (from source-level review)

### 2.1 Top-level modules

- `src/main.rs`: CLI entry point with subcommands: `bulk`, `bam2mtx`, `call`.
- `src/commands/*`: user-facing command orchestration.
- `src/lib/core/*`: shared infrastructure (errors, IO, sparse ops, filtering).
- `src/lib/engine/*`: parallel genomic scheduler (`par_granges`) and position models.
- `src/lib/pipeline/bam2mtx/*`: BAM/CRAM → sparse base-count AnnData conversion.
- `src/lib/pipeline/call/*`: editing annotation/filtering/ref-alt matrix/CEI pipeline.

### 2.2 Runtime interaction logic

1. **`bulk`** (`src/commands/bulk`)
   - Uses `ParGranges` scheduler to iterate genomic chunks in parallel.
   - `BaseProcessor` performs pileup counting and emits per-position records.
   - Output is gzipped TSV (or bgzip-compatible path handling).

2. **`bam2mtx`** (`src/commands/bam2mtx`)
   - Reads position manifest TSV (or runs `bulk` first pass with `--two-pass`).
   - Splits positions into depth-aware chunks (special handling for near-max-depth loci).
   - `OptimizedChunkProcessor` performs per-chunk pileup traversal with:
     - mapping/base quality filtering,
     - barcode whitelist matching,
     - UMI consensus conflict suppression,
     - optional stranded counting.
   - `AnnDataConverter` assembles sparse layers (`A1/T1/G1/C1` and optional `A0/T0/G0/C0`) and writes `.h5ad`.

3. **`call`** (`src/commands/call` + `src/lib/pipeline/call`)
   - Loads input AnnData and known editing-site whitelist.
   - Annotates candidate sites, computes coverage/base summaries, adds reference base from FASTA.
   - Applies mismatch classification thresholds.
   - Constructs strand-aware `ref/alt/others` matrices.
   - Computes cell-level `CEI = alt / (ref + alt)`.
   - Computes site-level mismatch summaries and writes output `.h5ad`.

### 2.3 Biological mismatch logic implemented

`EditingType` supports: `ag`, `ac`, `at`, `ca`, `cg`, `ct`.

For each site, REDICAT uses strand-aware rules:
- expected reference bases on either strand,
- expected alternate base conditional on strand-consistent reference,
- rejection of sites with excess non-reference/non-alt support.

This operationalizes your stated biology:
- A-to-I: forward A>G with reverse-complement signal,
- C-to-U: forward C>T with reverse-complement signal,
while retaining generalized mismatch classes for exploratory analyses.

---

## 3) Installation

## 3.1 Requirements

- Linux/macOS (validated workflow shown for Linux).
- Rust toolchain (recommended stable):
  - `rustup`, `cargo`, `rustc`.
- Typical native build dependencies for `rust-htslib`/compression stack:
  - `pkg-config`, `clang`, `zlib`, `bzip2`, `xz`, `curl`, OpenSSL dev headers.
- For CRAM input: reference FASTA should be available and indexed.

## 3.2 Build

```bash
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release
```

Binary path:

```bash
./target/release/redicat
```

Quick check:

```bash
./target/release/redicat --help
./target/release/redicat bam2mtx --help
./target/release/redicat call --help
./target/release/redicat bulk --help
```

---

## 4) Input/output data contracts

### 4.1 BAM/CRAM prerequisites

- Coordinate-sorted and indexed alignment file.
- Cell barcode tag (default `CB`) and UMI tag (default `UB`) present if single-cell mode is expected.

### 4.2 `bulk` output schema (TSV.gz)

Serialized `PileupPosition` includes columns:

- `CHR`, `POS`, `DEPTH`, `A`, `C`, `G`, `T`, `N`, `INS`, `DEL`, `REF_SKIP`, `FAIL`, `NEAR_MAX_DEPTH`
- Optional `REF_BASE` when available.

`POS` is 1-based by default (0-based with `--zero-base`).

### 4.3 Position manifest required by `bam2mtx`

`bam2mtx` expects at least these columns in TSV:

- `CHR`, `POS`, `DEPTH`, `INS`, `DEL`, `REF_SKIP`, `FAIL`, `NEAR_MAX_DEPTH`

The easiest way to produce a valid manifest is `--two-pass`, which internally runs `bulk`.

### 4.4 Site whitelist required by `call`

- Tab-separated file, plain or gzipped.
- First two columns must be chromosome and position (`CHR`, `POS` convention).
- Key format used internally: `CHR:POS` matching `AnnData.var_names`.

### 4.5 `call` output

- `.h5ad` with layers (priority written):
  - `ref`, `alt`, `others`, `coverage` (when present)
- `obs` additions:
  - `ref`, `alt`, `others` (row sums over filtered editing sites)
  - `CEI`
- `var` additions:
  - `is_editing_site`, `ref`, `Mismatch`, `filter_pass`
  - site-level columns: `<XY>_ref`, `<XY>_alt`, `<XY>_others` for selected editing type `X>Y`

`bam2mtx` also writes a sidecar file for capped-depth loci:
- `<output_stem>_skipped_sites.txt`.

---

## 5) Command-line usage

## 5.1 `bulk` (first-pass site discovery / pileup profiling)

```bash
redicat bulk input.bam \
  --output first_pass.tsv.gz \
  --threads 16 \
  --chunksize 100000 \
  --mapquality 255 \
  --min-depth 5
```

Use `--all` to emit all covered positions (not only editing-enriched candidates).

## 5.2 `bam2mtx` (BAM to sparse single-cell base matrices)

### Two-pass (recommended)

```bash
redicat bam2mtx \
  --bam input.bam \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --two-pass \
  --threads 16 \
  --stranded
```

### Single-pass with precomputed manifest

```bash
redicat bam2mtx \
  --bam input.bam \
  --tsv first_pass.tsv.gz \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --threads 16
```

## 5.3 `call` (editing annotation and quantification)

```bash
redicat call \
  --input base_counts.h5ad \
  --output editing_calls.h5ad \
  --fa reference.fa \
  --site-white-list rediportal_sites.tsv.gz \
  --editingtype ag \
  --max-other-threshold 0.1 \
  --min-edited-threshold 0.001 \
  --min-ref-threshold 0.001 \
  --min-coverage 2 \
  --threads 16
```

Dry-run validation:

```bash
redicat call --input base_counts.h5ad --output out.h5ad --fa reference.fa --site-white-list sites.tsv.gz --dry-run
```

---

## 6) Parameter reference and tuning guidance

## 6.1 `bulk` parameters

| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---:|---:|---|---|---|
| `reads` | path | required | indexed BAM/CRAM | Input alignments for per-position pileup | Use coordinate-sorted, indexed files |
| `--output/-o` | path | required | writable path | Output TSV(.gz) of pileup features | Keep compressed for large cohorts |
| `--threads/-t` | usize | `10` | `>=1` | Worker parallelism | Use physical cores; avoid oversubscription |
| `--chunksize/-c` | u32 | engine constant (`500000`) | `>=1` | Genomic tile size per task | Larger = less scheduling overhead, higher memory burst |
| `--min-baseq/-Q` | u8? | `30` effective | `0..255` | Low-quality bases become `N` | Raise in noisy libraries |
| `--mapquality/-q` | u8 | `255` | `0..255` | Read-level alignment confidence filter | Relax for aligners not using 255 for unique mapping |
| `--zero-base/-z` | bool | `false` | n/a | Output coordinate convention | Keep default 1-based for interoperability |
| `--max-depth/-D` | u32 | `10000` | `>=1` | Pileup cap; near-cap flagged | Increase for ultra-deep loci; monitor runtime |
| `--min-depth/-d` | u32 | `5` | `>=1` | Minimum effective depth | Increase for conservative site discovery |
| `--max-n-fraction/-n` | u32 | `10` | `>=1` | Ambiguous base tolerance (`N <= max(depth/n,2)`) | Larger value = stricter ambiguity control |
| `--all/-a` | bool | `false` | n/a | Emit all sites vs editing-enriched subset | Use `true` for QC/debug; `false` for candidate enrichment |
| `--editing-threshold/-et` | u32 | `10000` | `>=1` | Multi-base support criterion in candidate mode | Larger to increase sensitivity |
| `--allcontigs/-A` | bool | `false` | n/a | Include non-canonical contigs | Enable for decoys/spike-ins/custom assemblies |

## 6.2 `bam2mtx` parameters

| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---:|---:|---|---|---|
| `--bam/-b` | path | required | indexed BAM/CRAM | Input alignment file | Required |
| `--tsv` | path | optional | required unless `--two-pass` | Position manifest to quantify | Prefer `--two-pass` for schema safety; in two-pass mode output is normalized to `.gz` |
| `--barcodes` | path | required | whitelist file | Restricts cells to known barcodes | Use filtered cell whitelist from upstream pipeline |
| `--two-pass` | bool | `false` | n/a | Auto-generate site manifest via internal `bulk` run | Recommended default |
| `--output/-o` | path | required | `.h5ad` destination | Output sparse AnnData | Required |
| `--threads/-t` | usize | `20` | `>=1` | Parallel processing threads | Balance with storage throughput |
| `--min-mapq/-q` | u8 | `255` | `0..255` | Minimum read mapping quality | Match aligner conventions |
| `--min-baseq/-Q` | u8 | `30` | `0..255` | Minimum base quality | Raise for stringent mismatch calling |
| `--stranded/-S` | bool | `false` | n/a | Preserve strand layers (`A0..C0`, `A1..C1`) | Strongly recommended for strand-specific libraries |
| `--max-depth/-D` | u32 | `65536` | `>=1` | Pileup depth cap in matrix extraction | Set near expected extreme depth |
| `--umi-tag` | string | `UB` | SAM tag name | UMI field for molecule deduplication | Adjust to chemistry (
`RX`, etc.) |
| `--cb-tag` | string | `CB` | SAM tag name | Cell barcode field | Adjust if pipeline uses alternate tags |
| `--reference/-r` | path | optional | required for CRAM | FASTA for CRAM decoding | Mandatory for CRAM workflows |
| `--chunksize/-c` | u32 | `100000` | `>=1` | Weight budget for normal loci chunks | Increase for throughput, decrease for memory control |
| `--chunk-size-max-depth` | u32 | `2` | `>=1` | Small chunk cap for near-max-depth loci | Keep low to avoid hotspot stalls |
| `--matrix-density` | f64 | `0.005` | `>0` practical | Sparse pre-allocation hint | Raise for dense targeted panels |
| `--allcontigs/-A` | bool | `false` | n/a | Include non-canonical contigs | Enable for non-standard references |

## 6.3 `call` parameters

| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---:|---:|---|---|---|
| `--input` | string path | required | `.h5ad` | Base-count AnnData input | Output of `bam2mtx` |
| `--output` | string path | required | `.h5ad` | Annotated editing output | Existing output is overwritten |
| `--fa` | string path | required | FASTA + `.fai` required | Reference base retrieval per site | Must match alignment reference build |
| `--site-white-list` | string path | required | TSV/TSV.GZ, CHR/POS leading columns | Known candidate editing sites | Use curated catalogs + project-specific blacklist logic |
| `--editingtype` | enum | `ag` | `ag/ac/at/ca/cg/ct` | Defines expected ref→alt and strand-aware complements | Use `ag` for A-to-I focused studies; `ct` for C-to-U hypotheses |
| `--max-other-threshold` | f64 | `0.1` | validated `0..1` | Max tolerated non-ref/non-alt fraction | Lower for specificity in noisy cohorts |
| `--min-edited-threshold` | f64 | `0.001` | validated `0..1` | Minimum edited-base fraction | Raise for stronger effect-size confidence |
| `--min-ref-threshold` | f64 | `0.001` | validated `0..1` | Minimum reference-base fraction | Raise to avoid low-reference unstable sites |
| `--chunksize/-c` | usize | `100000` | `>=1` | Pipeline chunking granularity | Decrease if memory constrained |
| `--threads/-t` | Option<usize> | internal fallback `4` | `>=1` recommended | Rayon pool size for pipeline | Use core count for full throughput |
| `--min-coverage` | u16 | `2` | `>=1` | Site-level minimum coverage | Increase for stringent single-cell confidence |
| `--verbose/-v` | bool | `false` | n/a | Verbose logging flag | Enable for debugging |
| `--dry-run` | bool | `false` | n/a | Validate inputs only | Always run before large cohorts |

### 6.4 Mismatch decision rule used in `call`

For each site with coverage $C$:

- $other\_max = \max(\lceil C \cdot max\_other\_threshold \rceil, 2)$
- $edited\_min = \max(\lceil C \cdot min\_edited\_threshold \rceil, 1)$
- $ref\_min = \max(\lceil C \cdot min\_ref\_threshold \rceil, 1)$

A site passes if:
1. reference base is valid for selected editing type,
2. reference count $\ge ref\_min$,
3. edited count $\ge edited\_min$,
4. sum of other bases $\le other\_max$,
5. site coverage $\ge min\_coverage$.

---

## 7) Reproducible workflow (recommended)

1. **Generate candidate loci**
   - `redicat bulk` with stringent quality filters.
2. **Build base-count matrix**
   - `redicat bam2mtx --two-pass` (or provide validated TSV).
3. **Call editing signal**
   - `redicat call` with matched reference build and curated whitelist.
4. **Post-analysis**
   - Use `obs['CEI']`, `var['filter_pass']`, and `<XY>_alt/<XY>_ref` for downstream statistics.

For publication-grade reproducibility, record:
- full command lines,
- software version (`redicat --version`),
- reference genome build and checksum,
- whitelist provenance and version,
- all threshold parameters.

---

## 8) Result interpretation

### 8.1 Cell-level metrics

- `ref`, `alt`, `others` in `obs`: aggregated counts across retained sites.
- `CEI`: cell editing index, interpreted as editing burden proxy:

$$
CEI = \frac{alt}{ref + alt}
$$

Practical guidance:
- Compare CEI across cell states/conditions with coverage-aware covariates.
- Exclude cells with very low total informative counts (`ref + alt`) before differential analyses.

### 8.2 Site-level metrics

In `var`, fields such as `AG_ref`, `AG_alt`, `AG_others` (for `editingtype=ag`) summarize cohort-level support at each site.

Recommended biological interpretation:
- prioritize sites with high `*_alt`, low `*_others`, and `filter_pass=true`;
- remove known SNP loci (dbSNP/gnomAD) in downstream curation;
- inspect strand consistency for expected editing class behavior.

---

## 9) Performance and complexity

Let:
- $R$ = number of aligned reads inspected,
- $P$ = number of queried positions,
- $U$ = number of unique (cell, UMI, site) molecules after filtering,
- $nnz$ = non-zeros in sparse matrices,
- $T$ = worker threads.

### 9.1 `bulk`
- Pileup traversal is approximately $O(R_{region})$ over processed chunks.
- Parallel scheduling reduces wall-time to roughly $\sim O(R/T)$ in balanced workloads.

### 9.2 `bam2mtx`
- Site processing: $O(R_{sites})$ for pileup reads intersecting selected loci.
- UMI consensus hash aggregation: expected $O(U)$.
- Sparse assembly: near $O(nnz)$ plus sorting/merge costs for triplet consolidation.

### 9.3 `call`
- Coverage/base summaries and sparse reductions: near $O(nnz)$.
- Site annotation/filtering/reference lookup: $O(P)$ (with caching effects for FASTA access).
- Overall dominated by sparse matrix operations and site count.

In practice, REDICAT is throughput-optimized for sparse single-cell matrices with multi-core scaling and bounded-memory spill strategies in matrix assembly.

---

## 10) Notes and current implementation caveats

- Canonical contig filtering defaults to `chr1..chr22, chrX, chrY, chrM`; use `--allcontigs` for non-canonical references.
- `call` writes priority layers (`ref`, `alt`, `others`, `coverage`) and annotations; preserve intermediate matrices if additional layers are required for custom methods.