redicat 0.3.7

REDICAT - RNA Editing Cellular Assessment Toolkit: A highly parallelized utility for analyzing RNA editing events in single-cell RNA-seq data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
# REDICAT

**RNA Editing Cellular Assessment Toolkit**

REDICAT is a high-performance Rust toolkit for RNA editing quantification in single-cell RNA-seq, with an emphasis on strand-aware mismatch logic, sparse-matrix efficiency, and reproducible end-to-end processing from indexed BAM/CRAM to analysis-ready AnnData (`.h5ad`).

![](./imgs/redicat.png)

## Citation
If REDICAT supports your research, please cite:
> Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. *Cell Reports, 44*(7). https://doi.org/10.1016/j.celrep.2025.116009

![](./imgs/redicat-1.png)

## Installation
```bash
# Clone and build
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

# Optional: install into Cargo's bin directory
cargo install --path .
```


---

## 1) Scientific overview

RNA editing introduces biologically meaningful RNA–DNA mismatches (for example A-to-I observed as A>G, and C-to-U observed as C>T). In single-cell data, these events are sparse and confounded by sequencing noise, alignment artifacts, and SNPs. REDICAT addresses this by:

- extracting per-cell, per-site strand-aware base counts from BAM/CRAM,
- applying biologically informed mismatch filters,
- generating reference/alternate/other sparse matrices,
- computing cell-level editing metrics (including CEI), and
- preserving metadata in standard AnnData format for downstream statistical analysis.

**Innovation highlights**
- Parallel genomic scheduling with bounded-memory streaming.
- UMI-level consensus deduplication to suppress conflicting molecules.
- Strand-aware editing inference (`AG`, `CT`, and related mismatch classes).
- Sparse linear-algebra-first implementation for scale.

---

## 2) Code architecture (from source-level review)

### 2.1 Top-level modules

- `src/main.rs`: CLI entry point with subcommands: `bulk`, `bam2mtx`, `call`.
- `src/commands/*`: user-facing command orchestration.
- `src/lib/core/*`: shared infrastructure (errors, IO, sparse ops, filtering).
- `src/lib/engine/*`: parallel genomic scheduler (`par_granges`) and position models.
- `src/lib/pipeline/bam2mtx/*`: BAM/CRAM → sparse base-count AnnData conversion.
- `src/lib/pipeline/call/*`: editing annotation/filtering/ref-alt matrix/CEI pipeline.

### 2.2 Runtime interaction logic

1. **`bulk`** (`src/commands/bulk`)
   - Uses `ParGranges` scheduler to iterate genomic chunks in parallel.
   - `BaseProcessor` performs pileup counting and emits per-position records.
   - Output is gzipped TSV (or bgzip-compatible path handling).

2. **`bam2mtx`** (`src/commands/bam2mtx`)
   - Reads position manifest TSV (or runs `bulk` first pass with `--two-pass`).
   - Splits positions into depth-aware chunks (special handling for near-max-depth loci).
   - `OptimizedChunkProcessor` performs per-chunk pileup traversal with:
     - mapping/base quality filtering,
     - barcode whitelist matching,
     - UMI consensus conflict suppression,
     - optional stranded counting.
   - `AnnDataConverter` assembles sparse layers (`A1/T1/G1/C1` and optional `A0/T0/G0/C0`) and writes `.h5ad`.

3. **`call`** (`src/commands/call` + `src/lib/pipeline/call`)
   - Loads input AnnData and known editing-site whitelist.
   - Annotates candidate sites, computes coverage/base summaries, adds reference base from FASTA.
   - Applies mismatch classification thresholds.
   - Constructs strand-aware `ref/alt/others` matrices.
   - Computes cell-level `CEI = alt / (ref + alt)`.
   - Computes site-level mismatch summaries and writes output `.h5ad`.

### 2.3 Biological mismatch logic implemented

`EditingType` supports: `ag`, `ac`, `at`, `ca`, `cg`, `ct`.

For each site, REDICAT uses strand-aware rules:
- expected reference bases on either strand,
- expected alternate base conditional on strand-consistent reference,
- rejection of sites with excess non-reference/non-alt support.

This operationalizes your stated biology:
- A-to-I: forward A>G with reverse-complement signal,
- C-to-U: forward C>T with reverse-complement signal,
while retaining generalized mismatch classes for exploratory analyses.

---

## 3) Installation

## 3.1 Requirements

- Linux/macOS (validated workflow shown for Linux).
- Rust toolchain (recommended stable):
  - `rustup`, `cargo`, `rustc`.
- Typical native build dependencies for `rust-htslib`/compression stack:
  - `pkg-config`, `clang`, `zlib`, `bzip2`, `xz`, `curl`, OpenSSL dev headers.
- For CRAM input: reference FASTA should be available and indexed.

## 3.2 Build

```bash
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release
```

Binary path:

```bash
./target/release/redicat
```

Quick check:

```bash
./target/release/redicat --help
./target/release/redicat bam2mtx --help
./target/release/redicat call --help
./target/release/redicat bulk --help
```

---

## 4) Input/output data contracts

### 4.1 BAM/CRAM prerequisites

- Coordinate-sorted and indexed alignment file.
- Cell barcode tag (default `CB`) and UMI tag (default `UB`) present if single-cell mode is expected.

### 4.2 `bulk` output schema (TSV.gz)

Serialized `PileupPosition` includes columns:

- `CHR`, `POS`, `DEPTH`, `A`, `C`, `G`, `T`, `N`, `INS`, `DEL`, `REF_SKIP`, `FAIL`, `NEAR_MAX_DEPTH`
- Optional `REF_BASE` when available.

`POS` is 1-based by default (0-based with `--zero-base`).

### 4.3 Position manifest required by `bam2mtx`

`bam2mtx` expects at least these columns in TSV:

- `CHR`, `POS`, `DEPTH`, `INS`, `DEL`, `REF_SKIP`, `FAIL`, `NEAR_MAX_DEPTH`

The easiest way to produce a valid manifest is `--two-pass`, which internally runs `bulk`.

### 4.4 Site whitelist required by `call`

- Tab-separated file, plain or gzipped.
- First two columns must be chromosome and position (`CHR`, `POS` convention).
- Key format used internally: `CHR:POS` matching `AnnData.var_names`.

### 4.5 `call` output

- `.h5ad` with layers (priority written):
  - `ref`, `alt`, `others`, `coverage` (when present)
- `obs` additions:
  - `ref`, `alt`, `others` (row sums over filtered editing sites)
  - `CEI`
- `var` additions:
  - `is_editing_site`, `ref`, `Mismatch`, `filter_pass`
  - site-level columns: `<XY>_ref`, `<XY>_alt`, `<XY>_others` for selected editing type `X>Y`

`bam2mtx` also writes a sidecar file for capped-depth loci:
- `<output_stem>_skiped_sites.txt` (note current filename spelling in code).

---

## 5) Command-line usage

## 5.1 `bulk` (first-pass site discovery / pileup profiling)

```bash
redicat bulk input.bam \
  --output first_pass.tsv.gz \
  --threads 16 \
  --chunksize 100000 \
  --mapquality 255 \
  --min-depth 5
```

Use `--all` to emit all covered positions (not only editing-enriched candidates).

## 5.2 `bam2mtx` (BAM to sparse single-cell base matrices)

### Two-pass (recommended)

```bash
redicat bam2mtx \
  --bam input.bam \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --two-pass \
  --threads 16 \
  --stranded
```

### Single-pass with precomputed manifest

```bash
redicat bam2mtx \
  --bam input.bam \
  --tsv first_pass.tsv.gz \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --threads 16
```

## 5.3 `call` (editing annotation and quantification)

```bash
redicat call \
  --input base_counts.h5ad \
  --output editing_calls.h5ad \
  --fa reference.fa \
  --site-white-list rediportal_sites.tsv.gz \
  --editingtype ag \
  --max-other-threshold 0.1 \
  --min-edited-threshold 0.001 \
  --min-ref-threshold 0.001 \
  --min-coverage 2 \
  --threads 16
```

Dry-run validation:

```bash
redicat call --input base_counts.h5ad --output out.h5ad --fa reference.fa --site-white-list sites.tsv.gz --dry-run
```

---

## 6) Parameter reference and tuning guidance

## 6.1 `bulk` parameters

| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---:|---:|---|---|---|
| `reads` | path | required | indexed BAM/CRAM | Input alignments for per-position pileup | Use coordinate-sorted, indexed files |
| `--output/-o` | path | required | writable path | Output TSV(.gz) of pileup features | Keep compressed for large cohorts |
| `--threads/-t` | usize | `10` | `>=1` | Worker parallelism | Use physical cores; avoid oversubscription |
| `--chunksize/-c` | u32 | engine constant (`100000`) | `>=1` | Genomic tile size per task | Larger = less scheduling overhead, higher memory burst |
| `--min-baseq/-Q` | u8? | `30` effective | `0..255` | Low-quality bases become `N` | Raise in noisy libraries |
| `--mapquality/-q` | u8 | `255` | `0..255` | Read-level alignment confidence filter | Relax for aligners not using 255 for unique mapping |
| `--zero-base/-z` | bool | `false` | n/a | Output coordinate convention | Keep default 1-based for interoperability |
| `--max-depth/-D` | u32 | `10000` | `>=1` | Pileup cap; near-cap flagged | Increase for ultra-deep loci; monitor runtime |
| `--min-depth/-d` | u32 | `5` | `>=1` | Minimum effective depth | Increase for conservative site discovery |
| `--max-n-fraction/-n` | u32 | `10` | `>=1` | Ambiguous base tolerance (`N <= max(depth/n,2)`) | Lower value = stricter ambiguity control |
| `--all/-a` | bool | `false` | n/a | Emit all sites vs editing-enriched subset | Use `true` for QC/debug; `false` for candidate enrichment |
| `--editing-threshold/-et` | u32 | `10000` | `>=1` | Multi-base support criterion in candidate mode | Lower to increase sensitivity |
| `--allcontigs/-A` | bool | `false` | n/a | Include non-canonical contigs | Enable for decoys/spike-ins/custom assemblies |

## 6.2 `bam2mtx` parameters

| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---:|---:|---|---|---|
| `--bam/-b` | path | required | indexed BAM/CRAM | Input alignment file | Required |
| `--tsv` | path | optional | required unless `--two-pass` | Position manifest to quantify | Prefer `--two-pass` for schema safety |
| `--barcodes` | path | required | whitelist file | Restricts cells to known barcodes | Use filtered cell whitelist from upstream pipeline |
| `--two-pass` | bool | `false` | n/a | Auto-generate site manifest via internal `bulk` run | Recommended default |
| `--output/-o` | path | required | `.h5ad` destination | Output sparse AnnData | Required |
| `--threads/-t` | usize | `10` | `>=1` | Parallel processing threads | Balance with storage throughput |
| `--min-mapq/-q` | u8 | `255` | `0..255` | Minimum read mapping quality | Match aligner conventions |
| `--min-baseq/-Q` | u8 | `30` | `0..255` | Minimum base quality | Raise for stringent mismatch calling |
| `--min-depth/-d` | u32 | `3` | `>=1` | Minimum non-`N` depth | Increase to reduce sparse noise |
| `--max-n-fraction/-n` | u32 | `10` | `>=1` | Ambiguous-base tolerance denominator | Lower for stricter base quality context |
| `--editing-threshold/-et` | u32 | `10000` | `>=1` | Candidate support threshold in first pass | Lower when expecting low editing burden |
| `--stranded/-S` | bool | `false` | n/a | Preserve strand layers (`A0..C0`, `A1..C1`) | Strongly recommended for strand-specific libraries |
| `--max-depth/-D` | u32 | `655360` | `>=1` | Pileup depth cap in matrix extraction | Set near expected extreme depth |
| `--umi-tag` | string | `UB` | SAM tag name | UMI field for molecule deduplication | Adjust to chemistry (
`RX`, etc.) |
| `--cb-tag` | string | `CB` | SAM tag name | Cell barcode field | Adjust if pipeline uses alternate tags |
| `--reference/-r` | path | optional | required for CRAM | FASTA for CRAM decoding | Mandatory for CRAM workflows |
| `--chunksize/-c` | u32 | `100000` | `>=1` | Weight budget for normal loci chunks | Increase for throughput, decrease for memory control |
| `--chunk-size-max-depth` | u32 | `2` | `>=1` | Small chunk cap for near-max-depth loci | Keep low to avoid hotspot stalls |
| `--matrix-density` | f64 | `0.005` | `>0` practical | Sparse pre-allocation hint | Raise for dense targeted panels |
| `--allcontigs/-A` | bool | `false` | n/a | Include non-canonical contigs | Enable for non-standard references |

## 6.3 `call` parameters

| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---:|---:|---|---|---|
| `--input` | string path | required | `.h5ad` | Base-count AnnData input | Output of `bam2mtx` |
| `--output` | string path | required | `.h5ad` | Annotated editing output | Existing output is overwritten |
| `--fa` | string path | required | FASTA + `.fai` required | Reference base retrieval per site | Must match alignment reference build |
| `--site-white-list` | string path | required | TSV/TSV.GZ, CHR/POS leading columns | Known candidate editing sites | Use curated catalogs + project-specific blacklist logic |
| `--editingtype` | enum | `ag` | `ag/ac/at/ca/cg/ct` | Defines expected ref→alt and strand-aware complements | Use `ag` for A-to-I focused studies; `ct` for C-to-U hypotheses |
| `--max-other-threshold` | f64 | `0.1` | validated `0..1` | Max tolerated non-ref/non-alt fraction | Lower for specificity in noisy cohorts |
| `--min-edited-threshold` | f64 | `0.001` | validated `0..1` | Minimum edited-base fraction | Raise for stronger effect-size confidence |
| `--min-ref-threshold` | f64 | `0.001` | validated `0..1` | Minimum reference-base fraction | Raise to avoid low-reference unstable sites |
| `--chunksize/-c` | usize | `100000` | `>=1` | Pipeline chunking granularity | Decrease if memory constrained |
| `--threads/-t` | Option<usize> | internal fallback `2` | `>=1` recommended | Rayon pool size for pipeline | Use core count for full throughput |
| `--min-coverage` | u16 | `2` | `>=1` | Site-level minimum coverage | Increase for stringent single-cell confidence |
| `--verbose/-v` | bool | `false` | n/a | Verbose logging flag | Enable for debugging |
| `--dry-run` | bool | `false` | n/a | Validate inputs only | Always run before large cohorts |

### 6.4 Mismatch decision rule used in `call`

For each site with coverage $C$:

- $other\_max = \max(\lceil C \cdot max\_other\_threshold \rceil, 2)$
- $edited\_min = \max(\lceil C \cdot min\_edited\_threshold \rceil, 1)$
- $ref\_min = \max(\lceil C \cdot min\_ref\_threshold \rceil, 1)$

A site passes if:
1. reference base is valid for selected editing type,
2. reference count $\ge ref\_min$,
3. edited count $\ge edited\_min$,
4. sum of other bases $\le other\_max$,
5. site coverage $\ge min\_coverage$.

---

## 7) Reproducible workflow (recommended)

1. **Generate candidate loci**
   - `redicat bulk` with stringent quality filters.
2. **Build base-count matrix**
   - `redicat bam2mtx --two-pass` (or provide validated TSV).
3. **Call editing signal**
   - `redicat call` with matched reference build and curated whitelist.
4. **Post-analysis**
   - Use `obs['CEI']`, `var['filter_pass']`, and `<XY>_alt/<XY>_ref` for downstream statistics.

For publication-grade reproducibility, record:
- full command lines,
- software version (`redicat --version`),
- reference genome build and checksum,
- whitelist provenance and version,
- all threshold parameters.

---

## 8) Result interpretation

### 8.1 Cell-level metrics

- `ref`, `alt`, `others` in `obs`: aggregated counts across retained sites.
- `CEI`: cell editing index, interpreted as editing burden proxy:

$$
CEI = \frac{alt}{ref + alt}
$$

Practical guidance:
- Compare CEI across cell states/conditions with coverage-aware covariates.
- Exclude cells with very low total informative counts (`ref + alt`) before differential analyses.

### 8.2 Site-level metrics

In `var`, fields such as `AG_ref`, `AG_alt`, `AG_others` (for `editingtype=ag`) summarize cohort-level support at each site.

Recommended biological interpretation:
- prioritize sites with high `*_alt`, low `*_others`, and `filter_pass=true`;
- remove known SNP loci (dbSNP/gnomAD) in downstream curation;
- inspect strand consistency for expected editing class behavior.

---

## 9) Performance and complexity

Let:
- $R$ = number of aligned reads inspected,
- $P$ = number of queried positions,
- $U$ = number of unique (cell, UMI, site) molecules after filtering,
- $nnz$ = non-zeros in sparse matrices,
- $T$ = worker threads.

### 9.1 `bulk`
- Pileup traversal is approximately $O(R_{region})$ over processed chunks.
- Parallel scheduling reduces wall-time to roughly $\sim O(R/T)$ in balanced workloads.

### 9.2 `bam2mtx`
- Site processing: $O(R_{sites})$ for pileup reads intersecting selected loci.
- UMI consensus hash aggregation: expected $O(U)$.
- Sparse assembly: near $O(nnz)$ plus sorting/merge costs for triplet consolidation.

### 9.3 `call`
- Coverage/base summaries and sparse reductions: near $O(nnz)$.
- Site annotation/filtering/reference lookup: $O(P)$ (with caching effects for FASTA access).
- Overall dominated by sparse matrix operations and site count.

In practice, REDICAT is throughput-optimized for sparse single-cell matrices with multi-core scaling and bounded-memory spill strategies in matrix assembly.

---

## 10) Notes and current implementation caveats

- Canonical contig filtering defaults to `chr1..chr22, chrX, chrY, chrM`; use `--allcontigs` for non-canonical references.
- `bam2mtx --two-pass` internally sets first-pass `bulk` max depth to `8000`.
- `call` writes priority layers (`ref`, `alt`, `others`, `coverage`) and annotations; preserve intermediate matrices if additional layers are required for custom methods.