# REDICAT · RNA Editing Cellular Assessment Toolkit
**REDICAT** is a high-throughput Rust toolkit for exploring RNA editing events in both bulk and **single-cell sequencing** experiments. The command-line interface ships three production pipelines—`bulk`, `bam2mtx`, and `call`—all built on a shared library that emphasizes lock-free parallelism, sparse data structures, and deterministic output.

## Citation
If REDICAT supports your research, please cite:
> Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. *Cell Reports, 44*(7). https://doi.org/10.1016/j.celrep.2025.116009
## Highlights
- **Canonical-contig smart defaults:** By default every command operates on the primary human chromosomes (chr1–chr22, chrX, chrY, chrM). Add `--allcontigs` to opt into decoys, patches, or spike-ins when you need full genome coverage.
- **On-the-fly site vetting:** Depth, N-content, and editing-support thresholds mirror the `bulk` heuristics so only confident loci advance to matrix assembly.
- **Fearless concurrency:** Rayon thread pools, per-thread chunk reducers, and reader pooling keep pileup traversal and sparse matrix assembly saturated on modern CPUs.
- **Dynamic Rayon work queue:** `ParGranges` now flattens genomic tiles into a global work queue so `bulk` keeps every core busy even when only a handful of contigs remain in flight.
- **Parallel AnnData assembly:** `bam2mtx` now collects per-thread chunk results, merges them with a parallel sparse assembler, and writes `.h5ad` outputs without a single threaded bottleneck or extra global channels.
- **Run-aware sparse merging (New):** The AnnData converter now persists sorted triplet runs to disk and performs a streaming multi-way merge when finalising CSR layers, keeping peak RSS close to the sparse output size and preventing the single-threaded 30+ minute stall that previously occurred after "Processed 100%" log events.
- **Depth-aware chunking:** `bam2mtx` streams TSV manifests with Polars, accumulates position weights (`DEPTH + REF_SKIP + FAIL`) until the `--chunksize` budget is spent, and automatically falls back to the hotspot-friendly `--chunk-size-max-depth` batches (default 2) whenever `NEAR_MAX_DEPTH=true`.
- **Local-time observability:** Console logs now emit system time zone timestamps and guaranteed 30-minute status beacons that report remaining chunks alongside outstanding `NEAR_MAX_DEPTH` loci for long-running jobs.
- **Depth-cap skip audit:** Any site whose pileup depth meets or exceeds `--max-depth` is skipped before UMI aggregation, recorded to a sibling `<name>_skiped_sites.txt` manifest, and summarised in the final log output for easy triage.
- **Layered architecture:** The library is split into `core` (shared utilities & sparse ops), `engine` (parallel schedulers and position primitives), and `pipeline` (bam2mtx & call workflows), keeping reusable pieces lightweight and testable.
- **Sparse-first analytics:** All matrix work is written against CSR matrices with adaptive density hints so memory usage tracks the number of edited positions, not the theoretical genome size.
- **Strand-collapsed editing counts (New):** The `call` pipeline now merges complementary `*0`/`*1` strands before assigning counts to the `ref`, `alt`, and `others` matrices, guaranteeing strand-invariant allele totals.
- **Optimized sparse operations:** Uses two-pointer algorithms for sparse matrix operations (O(nnz) complexity) instead of HashMap-based approaches, with SmallVec for temporary allocations and FxHashMap for faster non-cryptographic hashing; see the sketch after this list.
- **Minimal memory overhead:** Arc clones are kept to minimum, performance-critical paths avoid unnecessary allocations, and streaming converters maintain bounded memory usage.
- **AnnData-native exports:** The toolkit produces `.h5ad` files that slot directly into Python-based workflows (Scanpy, scVI, Seurat via conversion).
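
To make the two-pointer pattern concrete, here is a minimal sketch (not REDICAT's actual code) of adding two index-sorted sparse vectors in O(nnz); the toolkit applies the same idea to CSR rows:
```rust
/// Add two sparse vectors whose (index, value) entries are sorted by index,
/// in a single O(nnz) pass with no HashMap. Illustrative sketch only.
fn sparse_add(a: &[(u32, u32)], b: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            std::cmp::Ordering::Less => { out.push(a[i]); i += 1; }
            std::cmp::Ordering::Greater => { out.push(b[j]); j += 1; }
            std::cmp::Ordering::Equal => {
                // Same index in both inputs: sum the values.
                out.push((a[i].0, a[i].1 + b[j].1));
                i += 1;
                j += 1;
            }
        }
    }
    // Drain whichever input still has entries left.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```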

## Repository Layout
```
src/
  main.rs            # CLI entry point and subcommand wiring
  commands/
    bulk/            # Bulk depth CLI (args, processor, orchestration)
    bam2mtx/         # Matrix converter CLI (args, workflow, engine)
    call/            # RNA editing pipeline CLI (args + pipeline)
    common.rs        # Shared CLI/runtime helpers (threads, contigs)
  lib/
    core/            # Concurrency, IO, errors, read filters, sparse utilities
    engine/          # Parallel genomic range scheduler + position structs
    pipeline/
      bam2mtx/       # BAM ➜ sparse matrix processing pipeline
      call/          # RNA editing analytics pipeline
```
## Installation
```bash
# Clone and build
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release
# Optional: install into Cargo's bin directory
cargo install --path .
```
Binary artifacts live at `target/release/redicat` after compilation.
## Quick Start
```bash
# Discover high-confidence editing sites on canonical chromosomes
redicat bulk sample.bam -o sites.tsv.gz --threads 16
# Generate single-cell mismatch matrices using a two-pass workflow
redicat bam2mtx --bam sample.bam --barcodes barcodes.tsv \
--output counts.h5ad --two-pass --threads 16
# Tune coverage heuristics with `--min-depth`, `--max-n-fraction`, and `--editing-threshold` to mirror bulk filtering when necessary.
# Run the RNA editing analysis pipeline on the produced AnnData matrix
redicat call --input counts.h5ad --output results.h5ad \
--fa GRCh38.fa --site-white-list sites.tsv.gz
```
Add `--allcontigs` to any command that should consider every contig in the BAM header instead of the canonical subset.
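
As an illustration of the canonical-contig default, here is a minimal sketch of the filtering rule (the function names are hypothetical, not REDICAT's API):
```rust
/// Hypothetical helper: true for the primary human contigs
/// (chr1..chr22, chrX, chrY, chrM) that every command targets by default.
fn is_canonical(contig: &str) -> bool {
    matches!(contig, "chrX" | "chrY" | "chrM")
        || contig
            .strip_prefix("chr")
            .and_then(|n| n.parse::<u8>().ok())
            .map_or(false, |n| (1..=22).contains(&n))
}

/// --allcontigs simply bypasses the canonical filter.
fn keep_contig(contig: &str, all_contigs: bool) -> bool {
    all_contigs || is_canonical(contig)
}
```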
## Subcommands
### `bulk`
Calculates per-base depth and nucleotide counts across a BAM/CRAM. Filtering options allow you to tune MAPQ, base quality, minimum depth, and an editing-specific heuristic. The revamped scheduler flattens genomic tiles into a unified Rayon work queue and forces small stealable tasks via `with_min_len(1)` + adaptive `with_max_len`, keeping CPU utilisation high for long-running bulk traversals. Results stream to a bgzip-compressed TSV with near-max-depth flags computed from the larger of the filtered depth or the full pileup support (`DEPTH + REF_SKIP + FAIL`).
The `NEAR_MAX_DEPTH` flag keys off that combined metric, so the 1% ceiling tracks whichever of the filtered or raw pileup coverage is higher; positions whose filtered depth reaches the `--max-depth` ceiling outright are also flagged, providing an audit trail for loci clipped by the depth guard.
Parts of `bulk` are adapted from the [`perbase`](https://github.com/sstadick/perbase) project.
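
A sketch of the `NEAR_MAX_DEPTH` heuristic as described above; the exact rounding and comparison order are assumptions:
```rust
/// Flag a site when the larger of the filtered depth and the full pileup
/// support (DEPTH + REF_SKIP + FAIL) comes within 1% of --max-depth, or when
/// the filtered depth reaches the cap outright. Illustrative sketch only.
fn near_max_depth(depth: u32, ref_skip: u32, fail: u32, max_depth: u32) -> bool {
    let support = depth.max(depth + ref_skip + fail);
    depth >= max_depth || f64::from(support) >= 0.99 * f64::from(max_depth)
}
```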
Option reference:
| Short | Long | Description | Default |
| ----- | --------------------- | ----------------------------------------------------- | --------------- |
| — | `<reads>` | Input indexed BAM/CRAM to analyze. | required |
| `-o` | `--output` | Output path (bgzip added when missing). | required |
| `-t` | `--threads` | Worker threads for pileup traversal. | `10` |
| `-c` | `--chunksize` | Ideal basepairs per worker (`ParGranges` chunk). | `500000` |
| `-Q` | `--min-baseq` | Base-quality floor; lower bases counted as `N`. | `30 (implicit)` |
| `-q` | `--mapquality` | Minimum mapping quality accepted. | `255` |
| `-z` | `--zero-base` | Emit 0-based coordinates instead of 1-based. | `false` |
| `-D` | `--max-depth` | Depth ceiling; positions within 1% of the cap are flagged as `NEAR_MAX_DEPTH`. | `8000` |
| `-d` | `--min-depth` | Minimum coverage required to report a site. | `10` |
| `-n` | `--max-n-fraction` | Maximum tolerated N fraction (depth / value). | `20` |
| `-a` | `--all` | Report all sites rather than editing-enriched subset. | `false` |
| `-et` | `--editing-threshold` | Depth divisor when checking for editing support. | `1000` |
| `-A` | `--allcontigs` | Traverse every contig instead of canonical set. | `false` |
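
The `--editing-threshold` heuristic can be pictured as follows; this is a hypothetical reading of "at least two bases exceed `depth / editing-threshold`", with integer division assumed:
```rust
/// A site is editing-enriched when at least two distinct bases are each
/// supported by more than depth / editing_threshold reads.
/// Sketch only; the rounding behaviour is an assumption.
fn has_editing_support(base_counts: [u32; 4], depth: u32, editing_threshold: u32) -> bool {
    let floor = depth / editing_threshold;
    base_counts.iter().filter(|&&count| count > floor).count() >= 2
}
```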
### `bam2mtx`
Converts barcoded BAMs into sparse AnnData matrices (per base, per barcode, strand-aware). Accepts precomputed site lists or can bootstrap one via `--two-pass`, which internally runs `bulk` with the same coverage filters. Parallel chunk workers now reduce their results locally, and a multi-threaded sparse assembler materialises the final CSR layers before the `.h5ad` file is written, avoiding the previous channel-based bottleneck and keeping memory predictable.
During long conversions the console prints system-local timestamps and, in addition to 5% progress beacons, guarantees a status snapshot every 30 minutes summarising remaining chunks and unprocessed `NEAR_MAX_DEPTH` loci so operators can watch backlog decay in real time. At completion the command emits a `<output>_skiped_sites.txt` sibling file listing the contig, 1-based position, and observed depth of every locus that was skipped because its pileup depth met or exceeded `--max-depth`.
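
The run-aware merge that finalises CSR layers can be sketched as a classic heap-based k-way merge. In REDICAT the sorted runs stream from disk; the in-memory version below, with an assumed `(row, col, value)` triplet layout, shows the shape of the algorithm:
```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// (row, col, value) triplet; each spilled run is pre-sorted by (row, col).
type Triplet = (u32, u32, u32);

/// Heap-based multi-way merge that sums duplicate (row, col) entries on the
/// fly, so memory holds one cursor per run rather than every triplet at once.
fn merge_runs(runs: Vec<Vec<Triplet>>) -> Vec<Triplet> {
    let mut heap = BinaryHeap::new();
    let mut cursors = vec![0usize; runs.len()];
    // Seed the heap with the head of every run (min-heap via Reverse).
    for (r, run) in runs.iter().enumerate() {
        if let Some(&(row, col, _)) = run.first() {
            heap.push(Reverse((row, col, r)));
        }
    }
    let mut out: Vec<Triplet> = Vec::new();
    while let Some(Reverse((row, col, r))) = heap.pop() {
        let value = runs[r][cursors[r]].2;
        match out.last_mut() {
            // Duplicate coordinate: accumulate instead of emitting twice.
            Some(last) if last.0 == row && last.1 == col => last.2 += value,
            _ => out.push((row, col, value)),
        }
        cursors[r] += 1;
        if let Some(&(nrow, ncol, _)) = runs[r].get(cursors[r]) {
            heap.push(Reverse((nrow, ncol, r)));
        }
    }
    out
}
```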
**Recent Memory Optimizations (v0.3.1):**
- **UMI String Interning**: Identical UMI sequences are now deduplicated in memory using `Arc<str>`, reducing memory usage by 50-70% for high-depth positions with redundant UMIs (sketched after this list).
- **Adaptive HashMap Sizing**: Hash maps for counts and UMI consensus now use dynamic capacity estimation based on whether a chunk contains high-depth positions, preventing over-allocation.
- **Aggressive Triplet Spilling**: The in-memory triplet threshold has been reduced from 500K to 100K (with a chunk-size multiplier of 2× instead of 32×), ensuring more frequent disk flushes and lower peak memory usage.
- **Enhanced Monitoring**: Each chunk now logs detailed statistics including position count, high-depth site count, and total weight, making it easier to identify memory-intensive workloads.
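
A minimal sketch of the interning idea, using only standard-library types (REDICAT's own implementation may differ):
```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Deduplicates identical UMI strings: each unique sequence is allocated
/// once and shared via Arc<str>, so repeat observations cost a pointer clone.
#[derive(Default)]
struct UmiInterner {
    pool: HashSet<Arc<str>>,
}

impl UmiInterner {
    fn intern(&mut self, umi: &str) -> Arc<str> {
        // Arc<str> borrows as &str, so we can probe the set without allocating.
        if let Some(existing) = self.pool.get(umi) {
            return Arc::clone(existing);
        }
        let interned: Arc<str> = Arc::from(umi);
        self.pool.insert(Arc::clone(&interned));
        interned
    }
}
```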
Highlights:
- TSV positions are filtered to canonical contigs unless `--allcontigs` is specified.
- Depth/N/editing thresholds drop low-quality loci before matrix assembly, matching the first-pass `bulk` defaults.
- `--two-pass` scouts sites with `bulk --max-depth 8000`, trimming runaway pileups before the heavy second pass.
- Reader pools and chunk-aware parallelism keep pileups cache-friendly while chunk staging bounds memory.
- Output matrices respect the density hint supplied via `--matrix-density`, and the streaming converter remaps cell/position indices on the fly before materialising CSR layers.
- Info-level progress logs report chunk-level completion percentage, detailed chunk statistics, and ETA so long runs stay observable.
Option reference:
| Short | Long | Description | Default |
| ----- | --------------------- | -------------------------------------------------------------- | -------- |
| `-b` | `--bam` | Input indexed BAM/CRAM file. | required |
| — | `--tsv` | Optional site list TSV (columns `CHR` and `POS`). | optional |
| — | `--barcodes` | Cell barcode whitelist (one per line). | required |
| — | `--two-pass` | Run `bulk` first to build a fresh site list. | `false` |
| `-o` | `--output` | Output AnnData (`.h5ad`) path. | required |
| `-t` | `--threads` | Worker threads for pileup + AnnData writing. | `10` |
| `-q` | `--min-mapq` | Minimum mapping quality per read. | `255` |
| `-Q` | `--min-baseq` | Minimum base quality; lower bases count as `N`. | `30` |
| `-d` | `--min-depth` | Minimum non-`N` coverage to keep a site. | `10` |
| `-n` | `--max-n-fraction` | Maximum tolerated ambiguous fraction (depth / value). | `20` |
| `-et` | `--editing-threshold` | Ensures at least two bases exceed `depth / editing-threshold`. | `1000` |
| `-D` | `--max-depth` | Cap on pileup traversal depth during matrix assembly. Sites with depth ≥ cap are skipped, logged, and mirrored to `<output>_skiped_sites.txt`. | `65536` |
| `-S` | `--stranded` | Treat UMIs as strand-aware. | `false` |
| — | `--umi-tag` | BAM tag containing UMI sequence. | `UB` |
| — | `--cb-tag` | BAM tag containing cell barcode. | `CB` |
| `-r` | `--reference` | Reference FASTA (required for CRAM). | optional |
| `-c` | `--chunksize` | Genomic position weight budget per processing chunk (applied to standard loci). | `100000` |
| — | `--chunk-size-max-depth` | Maximum hotspot chunk length when `NEAR_MAX_DEPTH` is `true` in the TSV; the low default keeps hotspot turnover fast. | `2` |
| — | `--matrix-density` | Hint for CSR buffer sizing. | `0.005` |
| `-A` | `--allcontigs` | Include decoys and noncanonical contigs. | `false` |
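
To make the two chunking knobs concrete, here is an illustrative planner under assumed semantics (REDICAT's real scheduler also handles contig boundaries and streaming input):
```rust
/// One manifest row; `weight` is DEPTH + REF_SKIP + FAIL from the TSV.
struct Site {
    weight: u64,
    near_max_depth: bool,
}

/// Pack normal sites until the --chunksize weight budget is spent, and batch
/// NEAR_MAX_DEPTH hotspots in groups of at most --chunk-size-max-depth.
fn plan_chunks(sites: &[Site], chunksize: u64, chunk_size_max_depth: usize) -> Vec<Vec<usize>> {
    let mut chunks: Vec<Vec<usize>> = Vec::new();
    let mut normal: Vec<usize> = Vec::new();
    let mut hotspots: Vec<usize> = Vec::new();
    let mut spent = 0u64;
    for (idx, site) in sites.iter().enumerate() {
        if site.near_max_depth {
            hotspots.push(idx);
            if hotspots.len() >= chunk_size_max_depth {
                chunks.push(std::mem::take(&mut hotspots));
            }
        } else {
            normal.push(idx);
            spent += site.weight;
            if spent >= chunksize {
                chunks.push(std::mem::take(&mut normal));
                spent = 0;
            }
        }
    }
    if !normal.is_empty() {
        chunks.push(normal);
    }
    if !hotspots.is_empty() {
        chunks.push(hotspots);
    }
    chunks
}
```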
### `call`
Executes the full RNA editing pipeline on `.h5ad` inputs: variant annotation, strand-aware ref/alt counting, CEI computation, and mismatch statistics. Validation steps ensure reference FASTA, white-list TSV, and input AnnData all exist and are readable before work begins.
Recent updates collapse opposing strand layers (`A0` + `A1`, etc.) before counts are distributed to the `ref`, `alt`, and `others` matrices. This guarantees that allelic support is identical regardless of BAM strand orientation and avoids spuriously split coverage in downstream metrics.
**Editing Logic (Updated):**
- `redicat call --editingtype` now enforces complementary substitutions for all six editing signatures. For example, `--editingtype AG` keeps A→G events alongside the strand-complementary T→C observations, while discarding loci whose reference base is neither A nor T. Equivalent complementary rules exist for AC, AT, CA, CG, and CT.
- **Optimized ref/alt/obs matrix calculation (v0.3.2+)**: The matrices use an efficient mask-based sparse operation approach inspired by the Python implementation, achieving O(nnz) complexity. For AG editing:
  - When `ref_base=A`: ref layer = A counts, alt layer = G counts, obs layer = T+C counts
  - When `ref_base=T`: ref layer = T counts, alt layer = C counts, obs layer = A+G counts
  - Sites with `ref_base=G` or `C` are excluded
- One-hot encoding masks are created for ref/alt/others positions, then applied via element-wise sparse multiplication to each base matrix, maintaining full sparsity and dramatically improving performance for large datasets (see the sketch after this list).
- Similar logic applies to all other editing types (AC, AT, CA, CG, CT), ensuring strand-invariant and biologically accurate allele counting with optimal efficiency.
- The site white-list contributes an `is_editing_site` flag inside `var`, while the mismatch thresholds (`--max-other-threshold`, `--min-coverage`, `--min-edited-threshold`, `--min-ref-threshold`) produce a `filter_pass` flag. Only loci marked with both flags feed the observation-level `ref`, `alt`, and `others` totals as well as CEI.
- The generated AnnData therefore includes a `var` table with `is_editing_site`, `filter_pass`, and `Mismatch` columns, and an `obs` table whose `ref`/`alt`/`others` columns and `CEI` metric reflect exclusively the confidently filtered editing sites.
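
A sketch of the mask-based step on an assumed CSR layout (a real implementation would operate on the toolkit's own sparse types): applying a boolean column mask to one base layer is a single O(nnz) pass, and the ref/alt/others layers then follow from masked copies plus sparse additions.
```rust
/// Minimal CSR container; fields mirror the usual indptr/indices/data layout.
struct Csr {
    indptr: Vec<usize>,
    indices: Vec<usize>,
    data: Vec<u32>,
}

/// Keep only entries whose column passes the mask (e.g. ref_base == 'A'),
/// i.e. an element-wise multiply by a one-hot column mask, in O(nnz).
fn apply_column_mask(m: &Csr, mask: &[bool]) -> Csr {
    let mut out = Csr { indptr: vec![0], indices: Vec::new(), data: Vec::new() };
    for row in 0..m.indptr.len() - 1 {
        for k in m.indptr[row]..m.indptr[row + 1] {
            if mask[m.indices[k]] {
                out.indices.push(m.indices[k]);
                out.data.push(m.data[k]);
            }
        }
        // Close the row regardless of how many entries survived.
        out.indptr.push(out.indices.len());
    }
    out
}
```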
Important options:
- `--editingtype` · choose among AG, AC, AT, CA, CG, CT editing signatures.
- `--min-coverage` · per-site coverage filter before CEI calculation.
- `--threads` · configure Rayon pool size (defaults to 2 for stability).
## Performance Notes
- **Sparse matrix column sums for site statistics (New v0.3.2):** Site-level mismatch statistics (XX_ref, XX_alt, XX_others) are now computed using efficient sparse matrix column sum operations directly on the ref/alt/others layers, achieving O(nnz) complexity and eliminating the need for DataFrame row iteration.
- **Parallel tree reduction for sparse matrix sums (New v0.3.2):** Coverage matrix calculation now uses a divide-and-conquer parallel tree reduction strategy, achieving O(log n) parallel depth instead of O(n) sequential operations. For 8 base layers, this replaces 7 sequential additions with 3 parallel levels (4-way parallelism at the first level), dramatically improving performance while maintaining full sparsity (see the sketch after this list).
- **Mask-based sparse matrix operations (New v0.3.2):** The `call` pipeline now uses one-hot encoding masks combined with element-wise sparse multiplication for ref/alt/obs matrix calculation, achieving O(nnz) complexity instead of O(n_obs × n_vars). This dramatically improves performance for large sparse datasets while maintaining full correctness and sparsity.
- **UMI String Interning (New):** Identical UMI sequences are deduplicated in memory using `Arc<str>`, reducing memory footprint by 50-70% for datasets with repetitive UMIs at high-depth positions.
- **Adaptive HashMap Capacity (New):** Hash maps dynamically adjust their initial capacity based on chunk characteristics (high-depth vs normal), preventing over-allocation while maintaining performance.
- **Parallel sparse reduction (New):** `bam2mtx` merges per-thread chunk outputs with a lock-free sparse assembler, so CSR materialisation scales with core count without staging everything in a single consumer thread.
- **Aggressive Triplet Spilling (New):** The streaming matrix builder now uses a lower spill threshold (100K triplets, minimum 50K cap) to flush data to disk more frequently, dramatically reducing peak memory usage during large conversions.
- **Depth-cap awareness (New):** `bulk` now marks any site whose filtered depth reaches the configured maximum as `NEAR_MAX_DEPTH`, complementing the existing 1% proximity heuristic and making depth clipping visible downstream.
- **Streaming CSR materialisation (New):** Triplet runs are merged with a heap-based iterator that deduplicates row/column pairs on the fly, so final layer construction stays parallel-friendly and never loads all triplets into RAM at once.
- **Enhanced Chunk Monitoring (New):** Each chunk logs position count, high-depth site count, and total weight, providing visibility into workload characteristics and helping diagnose memory-intensive regions.
- **Reader pooling:** Each worker thread checks out an `IndexedReader` from a pool, avoiding reopen/seek penalties when iterating across genomic tiles.
- **Triplet batching:** Sparse matrices are assembled from thread-local batches of `(row, col, value)` triplets, eliminating cross-thread contention, and the final merge operates fully in parallel without an intermediate channel stage.
- **Adaptive chunking:** Dual CLI knobs (`--chunksize` for the depth-weighted standard loci budget and `--chunk-size-max-depth` for hotspot batch caps, now defaulting to 2) govern the parallel granularity for `bam2mtx` (and the batch size for AnnData writes), while the refactored `ParGranges` scheduler flattens tiles into a single Rayon queue so `bulk` honors `--chunksize` without serial bottlenecks. Normal chunk weights now match the manifest’s `DEPTH + REF_SKIP + FAIL` metric for tighter alignment with bulk heuristics.
- **Chunk-level fetch & depth guards:** `bam2mtx` batches contiguous sites per contig, records observed depth, and—in `--two-pass` mode—relies on a first-pass `bulk` run capped at `--max-depth 8000` plus in-run pileup guards to prune oversaturated loci before dense pileups reach the matrix stage.
- **Progress-aware logging:** Chunk processors emit periodic `%` complete + ETA updates plus detailed per-chunk statistics so multi-hour conversions remain easy to follow from the console.
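
The tree reduction can be pictured with `rayon::join` over the sorted sparse vectors from the earlier two-pointer sketch (`sparse_add` as defined under Highlights); this illustrates the strategy, not the pipeline's exact code:
```rust
use rayon::join; // assumes the `rayon` crate is available

/// Sum k sparse layers with O(log k) parallel depth: split the slice,
/// recurse on both halves in parallel, then combine the partial sums
/// with an O(nnz) two-pointer addition.
fn tree_sum(layers: &[Vec<(u32, u32)>]) -> Vec<(u32, u32)> {
    match layers.len() {
        0 => Vec::new(),
        1 => layers[0].clone(),
        n => {
            let (left, right) = layers.split_at(n / 2);
            let (a, b) = join(|| tree_sum(left), || tree_sum(right));
            sparse_add(&a, &b) // the two-pointer sketch from Highlights
        }
    }
}
```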
## Testing & Data
```bash
# REDIportal3 white-list, for use with `call`
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/rediportal3.tsv.gz
# 10X 3' 20K_PBMC data
# chr22
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/chr22.bam
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/chr22.bam.bai
# cell barcodes
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/barcodes.tsv.gz
```
A minimal BAM (`test/chr22.bam`) and index are included for smoke tests. Run `cargo test` to execute unit tests that exercise pileup traversal and CLI argument parsing.