# REDICAT · RNA Editing Cellular Assessment Toolkit

**REDICAT** is a high-throughput Rust toolkit for exploring RNA editing events in both bulk and **single-cell sequencing** experiments. The command-line interface ships three production pipelines—`bulk`, `bam2mtx`, and `call`—all built on a shared library that emphasizes lock-free parallelism, sparse data structures, and deterministic output.

![](./imgs/redicat.png)

## Citation
If REDICAT supports your research, please cite:
> Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. *Cell Reports, 44*(7). https://doi.org/10.1016/j.celrep.2025.116009


## Highlights
- **Canonical-contig smart defaults:** By default every command operates on the primary human chromosomes (chr1–chr22, chrX, chrY, chrM). Add `--allcontigs` to opt into decoys, patches, or spike-ins when you need full genome coverage.
- **On-the-fly site vetting:** Depth, N-content, and editing-support thresholds mirror the `bulk` heuristics so only confident loci advance to matrix assembly.
- **Fearless concurrency:** Rayon thread pools, per-thread chunk reducers, and reader pooling keep pileup traversal and sparse matrix assembly saturated on modern CPUs.
- **Dynamic Rayon work queue:** `ParGranges` now flattens genomic tiles into a global work queue so `bulk` keeps every core busy even when only a handful of contigs remain in flight.
- **Parallel AnnData assembly:** `bam2mtx` now collects per-thread chunk results, merges them with a parallel sparse assembler, and writes `.h5ad` outputs without a single threaded bottleneck or extra global channels.
- **Run-aware sparse merging (New):** The AnnData converter now persists sorted triplet runs to disk and performs a streaming multi-way merge when finalising CSR layers, keeping peak RSS close to the sparse output size and preventing the single-threaded 30+ minute stall that previously occurred after "Processed 100%" log events.
- **Depth-aware chunking:** `bam2mtx` streams TSV manifests with Polars, accumulates position weights (`DEPTH + REF_SKIP + FAIL`) until the `--chunksize` budget is spent, and automatically falls back to the hotspot-friendly `--chunk-size-max-depth` batches (default 2) whenever `NEAR_MAX_DEPTH=true`.
- **Local-time observability:** Console logs now emit system time zone timestamps and guaranteed 30-minute status beacons that report remaining chunks alongside outstanding `NEAR_MAX_DEPTH` loci for long-running jobs.
- **Depth-cap skip audit:** Any site whose pileup depth meets or exceeds `--max-depth` is skipped before UMI aggregation, recorded to a sibling `<name>_skiped_sites.txt` manifest, and summarised in the final log output for easy triage.
- **Layered architecture:** The library is split into `core` (shared utilities & sparse ops), `engine` (parallel schedulers and position primitives), and `pipeline` (bam2mtx & call workflows), keeping reusable pieces lightweight and testable.
- **Sparse-first analytics:** All matrix work is written against CSR matrices with adaptive density hints so memory usage tracks the number of edited positions, not the theoretical genome size.
- **Strand-collapsed editing counts (New):** The `call` pipeline now merges complementary `*0`/`*1` strands before assigning counts to the `ref`, `alt`, and `others` matrices, guaranteeing strand-invariant allele totals.
- **Optimized sparse operations:** Uses two-pointer algorithms for sparse matrix operations (O(nnz) complexity) instead of HashMap-based approaches, with SmallVec for temporary allocations and FxHashMap for faster non-cryptographic hashing (see the sketch after this list).
- **Minimal memory overhead:** Arc clones are kept to a minimum, performance-critical paths avoid unnecessary allocations, and streaming converters maintain bounded memory usage.
- **AnnData-native exports:** The toolkit produces `.h5ad` files that slot directly into Python-based workflows (Scanpy, scVI, Seurat via conversion).
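
As a concrete illustration of the two-pointer approach named above, here is a minimal sketch that adds two sparse rows stored as parallel sorted `(column, value)` arrays; the names and types are illustrative, not REDICAT's internal API:

```rust
use std::cmp::Ordering;

/// Add two sparse rows kept as sorted column indices plus values.
/// A single forward pass over both sides: O(nnz_a + nnz_b), no hashing.
fn sparse_row_add(
    a_idx: &[usize], a_val: &[u32],
    b_idx: &[usize], b_val: &[u32],
) -> (Vec<usize>, Vec<u32>) {
    let (mut i, mut j) = (0, 0);
    let mut idx = Vec::with_capacity(a_idx.len() + b_idx.len());
    let mut val = Vec::with_capacity(a_val.len() + b_val.len());
    while i < a_idx.len() && j < b_idx.len() {
        match a_idx[i].cmp(&b_idx[j]) {
            Ordering::Less => { idx.push(a_idx[i]); val.push(a_val[i]); i += 1; }
            Ordering::Greater => { idx.push(b_idx[j]); val.push(b_val[j]); j += 1; }
            Ordering::Equal => {
                // Same column on both sides: merge into one entry.
                idx.push(a_idx[i]);
                val.push(a_val[i] + b_val[j]);
                i += 1;
                j += 1;
            }
        }
    }
    // Drain whichever side still has entries.
    idx.extend_from_slice(&a_idx[i..]);
    val.extend_from_slice(&a_val[i..]);
    idx.extend_from_slice(&b_idx[j..]);
    val.extend_from_slice(&b_val[j..]);
    (idx, val)
}
```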

![](./imgs/redicat-1.png)


## Repository Layout
```
src/
  main.rs                 # CLI entry point and subcommand wiring
  commands/
    bulk/                 # Bulk depth CLI (args, processor, orchestration)
    bam2mtx/              # Matrix converter CLI (args, workflow, engine)
    call/                 # RNA editing pipeline CLI (args + pipeline)
    common.rs             # Shared CLI/runtime helpers (threads, contigs)
  lib/
    core/                 # Concurrency, IO, errors, read filters, sparse utilities
    engine/               # Parallel genomic range scheduler + position structs
    pipeline/
      bam2mtx/            # BAM ➜ sparse matrix processing pipeline
      call/               # RNA editing analytics pipeline
```

## Installation
```bash
# Clone and build
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

# Optional: install into Cargo's bin directory
cargo install --path .
```

Binary artifacts live at `target/release/redicat` after compilation.

## Quick Start
```bash
# Discover high-confidence editing sites on canonical chromosomes
redicat bulk sample.bam -o sites.tsv.gz --threads 16

# Generate single-cell mismatch matrices using a two-pass workflow
redicat bam2mtx --bam sample.bam --barcodes barcodes.tsv \
                --output counts.h5ad --two-pass --threads 16

# Tune coverage heuristics with `--min-depth`, `--max-n-fraction`, and `--editing-threshold` to mirror bulk filtering when necessary.

# Run the RNA editing analysis pipeline on the produced AnnData matrix
redicat call --input counts.h5ad --output results.h5ad \
             --fa GRCh38.fa --site-white-list sites.tsv.gz
```

Add `--allcontigs` to any command that should consider every contig in the BAM header instead of the canonical subset.

## Subcommands
### `bulk`
Calculates per-base depth and nucleotide counts across a BAM/CRAM. Filtering options allow you to tune MAPQ, base quality, minimum depth, and an editing-specific heuristic. The revamped scheduler flattens genomic tiles into a unified Rayon work queue and forces small stealable tasks via `with_min_len(1)` + adaptive `with_max_len`, keeping CPU utilisation high for long-running bulk traversals. Results stream to a bgzip-compressed TSV. The `NEAR_MAX_DEPTH` flag keys off the larger of the filtered depth and the full pileup support (`DEPTH + REF_SKIP + FAIL`), so the 1% ceiling tracks whichever coverage metric is higher; positions whose filtered depth reaches the `--max-depth` ceiling are also flagged explicitly, providing an audit trail for loci clipped by the depth guard.
Parts of the `bulk` implementation were adapted from the [`perbase`](https://github.com/sstadick/perbase) project.
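
A minimal sketch of that flag heuristic, assuming illustrative struct and field names (this is not REDICAT's internal API):

```rust
/// Depth bookkeeping for one pileup position; field names are assumptions.
struct SiteDepths {
    filtered_depth: u32, // depth surviving MAPQ/base-quality filters
    ref_skip: u32,       // reads spliced over the position
    fail: u32,           // reads rejected by the filters
}

/// NEAR_MAX_DEPTH fires when the larger of the filtered depth and the full
/// pileup support (DEPTH + REF_SKIP + FAIL) comes within 1% of the
/// --max-depth cap, or when the filtered depth hits the cap outright.
fn near_max_depth(site: &SiteDepths, max_depth: u32) -> bool {
    let full_support = site.filtered_depth + site.ref_skip + site.fail;
    let effective = site.filtered_depth.max(full_support);
    effective as f64 >= 0.99 * max_depth as f64 || site.filtered_depth >= max_depth
}
```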

Option reference:

| Short | Long                  | Description                                           | Default         |
| ----- | --------------------- | ----------------------------------------------------- | --------------- |
| —     | `<reads>`             | Input indexed BAM/CRAM to analyze.                    | required        |
| `-o`  | `--output`            | Output path (bgzip added when missing).               | required        |
| `-t`  | `--threads`           | Worker threads for pileup traversal.                  | `10`            |
| `-c`  | `--chunksize`         | Ideal basepairs per worker (`ParGranges` chunk).      | `500000`        |
| `-Q`  | `--min-baseq`         | Base-quality floor; lower bases counted as `N`.       | `30 (implicit)` |
| `-q`  | `--mapquality`        | Minimum mapping quality accepted.                     | `255`           |
| `-z`  | `--zero-base`         | Emit 0-based coordinates instead of 1-based.          | `false`         |
| `-D`  | `--max-depth`         | Depth ceiling; positions within 1% of the cap are flagged as `NEAR_MAX_DEPTH`. | `8000`          |
| `-d`  | `--min-depth`         | Minimum coverage required to report a site.           | `10`            |
| `-n`  | `--max-n-fraction`    | `N`-count ceiling expressed as a divisor: sites with more than `depth / value` `N` bases are dropped. | `20`            |
| `-a`  | `--all`               | Report all sites rather than editing-enriched subset. | `false`         |
| `-et` | `--editing-threshold` | Depth divisor when checking for editing support.      | `1000`          |
| `-A`  | `--allcontigs`        | Traverse every contig instead of canonical set.       | `false`         |

### `bam2mtx`
Converts barcoded BAMs into sparse AnnData matrices (per base, per barcode, strand-aware). Accepts precomputed site lists or can bootstrap one via `--two-pass`, which internally runs `bulk` with the same filtering heuristics. Parallel chunk workers now reduce their results locally and a multi-threaded sparse assembler materialises the final CSR layers before the `.h5ad` file is written, avoiding the previous channel-based bottleneck and keeping memory predictable.

During long conversions the console prints system-local timestamps and, in addition to 5% progress beacons, guarantees a status snapshot every 30 minutes summarising remaining chunks and unprocessed `NEAR_MAX_DEPTH` loci so operators can watch backlog decay in real time. At completion the command emits a `<output>_skiped_sites.txt` sibling file listing the contig, 1-based position, and observed depth of every locus that was skipped because its pileup depth met or exceeded `--max-depth`.

**Recent Memory Optimizations (v0.3.1):**
- **UMI String Interning**: Identical UMI sequences are now deduplicated in memory using `Arc<str>`, reducing memory usage by 50-70% for high-depth positions with redundant UMIs (see the sketch after this list).
- **Adaptive HashMap Sizing**: Hash maps for counts and UMI consensus now use dynamic capacity estimation based on whether a chunk contains high-depth positions, preventing over-allocation.
- **Aggressive Triplet Spilling**: The in-memory triplet threshold has been reduced from 500K to 100K (with a chunk-size multiplier of 2× instead of 32×), ensuring more frequent disk flushes and lower peak memory usage.
- **Enhanced Monitoring**: Each chunk now logs detailed statistics including position count, high-depth site count, and total weight, making it easier to identify memory-intensive workloads.
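
A minimal sketch of the interning pattern behind the first bullet, using a plain `HashSet` pool (REDICAT's actual structure may differ, e.g. it may prefer `FxHashMap` as noted in the highlights):

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Dedupe UMI strings: every occurrence of the same sequence shares a
/// single Arc<str> allocation instead of owning its own String.
#[derive(Default)]
struct UmiInterner {
    pool: HashSet<Arc<str>>,
}

impl UmiInterner {
    fn intern(&mut self, umi: &str) -> Arc<str> {
        if let Some(existing) = self.pool.get(umi) {
            return Arc::clone(existing); // repeat sighting: refcount bump only
        }
        let fresh: Arc<str> = Arc::from(umi); // first sighting: allocate once
        self.pool.insert(Arc::clone(&fresh));
        fresh
    }
}
```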

Highlights:
- TSV positions are filtered to canonical contigs unless `--allcontigs` is specified.
- Depth/N/editing thresholds drop low-quality loci before matrix assembly, matching the first-pass `bulk` defaults.
- `--two-pass` scouts sites with `bulk --max-depth 8000`, trimming runaway pileups before the heavy second pass.
- Reader pools and chunk-aware parallelism keep pileups cache-friendly while chunk staging bounds memory; a sketch of the pooling pattern follows this list.
- Output matrices respect the density hint supplied via `--matrix-density`, and the streaming converter remaps cell/position indices on the fly before materialising CSR layers.
- Info-level progress logs report chunk-level completion percentage, detailed chunk statistics, and ETA so long runs stay observable.
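
A generic sketch of the reader-pool pattern mentioned above, with a plain `Mutex<Vec<R>>` standing in for REDICAT's pooled `IndexedReader` values:

```rust
use std::sync::Mutex;

/// A tiny checkout/return pool. A worker borrows a reader for a whole
/// chunk, so the underlying BAM handle is not reopened per genomic tile.
struct ReaderPool<R> {
    readers: Mutex<Vec<R>>,
}

impl<R> ReaderPool<R> {
    fn new(readers: Vec<R>) -> Self {
        Self { readers: Mutex::new(readers) }
    }

    /// Run `f` with a pooled reader; None if the pool is exhausted.
    fn with_reader<T>(&self, f: impl FnOnce(&mut R) -> T) -> Option<T> {
        let mut reader = self.readers.lock().unwrap().pop()?; // check out
        let result = f(&mut reader);
        self.readers.lock().unwrap().push(reader); // return to the pool
        Some(result)
    }
}
```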

Option reference:

| Short | Long                  | Description                                                    | Default  |
| ----- | --------------------- | -------------------------------------------------------------- | -------- |
| `-b`  | `--bam`               | Input indexed BAM/CRAM file.                                   | required |
| —     | `--tsv`               | Optional site list TSV (columns `CHR` and `POS`).              | optional |
| —     | `--barcodes`          | Cell barcode whitelist (one per line).                         | required |
| —     | `--two-pass`          | Run `bulk` first to build a fresh site list.                   | `false`  |
| `-o`  | `--output`            | Output AnnData (`.h5ad`) path.                                 | required |
| `-t`  | `--threads`           | Worker threads for pileup + AnnData writing.                   | `10`     |
| `-q`  | `--min-mapq`          | Minimum mapping quality per read.                              | `255`    |
| `-Q`  | `--min-baseq`         | Minimum base quality; lower bases count as `N`.                | `30`     |
| `-d`  | `--min-depth`         | Minimum non-`N` coverage to keep a site.                       | `10`     |
| `-n`  | `--max-n-fraction`    | Ambiguity ceiling expressed as a divisor: sites with more than `depth / value` `N` bases are dropped. | `20`     |
| `-et` | `--editing-threshold` | Ensures at least two bases exceed `depth / editing-threshold`. | `1000`   |
| `-D`  | `--max-depth`         | Cap on pileup traversal depth during matrix assembly. Sites with depth ≥ cap are skipped, logged, and mirrored to `<output>_skiped_sites.txt`. | `655360`  |
| `-S`  | `--stranded`          | Treat UMIs as strand-aware.                                    | `false`  |
| —     | `--umi-tag`           | BAM tag containing UMI sequence.                               | `UB`     |
| —     | `--cb-tag`            | BAM tag containing cell barcode.                               | `CB`     |
| `-r`  | `--reference`         | Reference FASTA (required for CRAM).                           | optional |
| `-c`  | `--chunksize`         | Genomic position weight budget per processing chunk (applied to standard loci). | `100000` |
| —     | `--chunk-size-max-depth` | Maximum hotspot chunk length when `NEAR_MAX_DEPTH` is `true` in the TSV; default lowered to 2 for faster hotspot turnover. | `2`     |
| —     | `--matrix-density`    | Hint for CSR buffer sizing.                                    | `0.005`  |
| `-A`  | `--allcontigs`        | Include decoys and noncanonical contigs.                       | `false`  |
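
A sketch of how the `--chunksize` weight budget and the `--chunk-size-max-depth` hotspot cap from the table above might interact; the types are illustrative assumptions:

```rust
/// One manifest row; field names are assumptions.
struct Site {
    weight: u64,          // DEPTH + REF_SKIP + FAIL from the bulk manifest
    near_max_depth: bool, // hotspot flag from the TSV
}

/// Pack site indices into chunks: standard loci accumulate weight until the
/// --chunksize budget is spent, while NEAR_MAX_DEPTH loci are grouped into
/// short hotspot batches capped at --chunk-size-max-depth sites.
fn chunk_sites(sites: &[Site], chunksize: u64, hotspot_cap: usize) -> Vec<Vec<usize>> {
    let mut chunks = Vec::new();
    let (mut normal, mut hotspot) = (Vec::new(), Vec::new());
    let mut spent = 0u64;
    for (i, site) in sites.iter().enumerate() {
        if site.near_max_depth {
            hotspot.push(i);
            if hotspot.len() >= hotspot_cap {
                chunks.push(std::mem::take(&mut hotspot));
            }
        } else {
            normal.push(i);
            spent += site.weight;
            if spent >= chunksize {
                chunks.push(std::mem::take(&mut normal));
                spent = 0;
            }
        }
    }
    // Flush any partially filled chunks.
    for leftover in [normal, hotspot] {
        if !leftover.is_empty() {
            chunks.push(leftover);
        }
    }
    chunks
}
```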

### `call`
Executes the full RNA editing pipeline on `.h5ad` inputs: variant annotation, strand-aware ref/alt counting, CEI computation, and mismatch statistics. Validation steps ensure reference FASTA, white-list TSV, and input AnnData all exist and are readable before work begins.

Recent updates collapse opposing strand layers (`A0` + `A1`, etc.) before counts are distributed to the `ref`, `alt`, and `others` matrices. This guarantees that allelic support is identical regardless of BAM strand orientation and avoids spuriously split coverage in downstream metrics.
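
A minimal sketch of that collapse, assuming a simple elementwise sum and shown on dense per-cell count slices for clarity (the pipeline itself operates on CSR layers):

```rust
/// Sum complementary strand layers elementwise: A = A0 + A1, and likewise
/// for C, G, and T, before counts are assigned to ref/alt/others.
fn collapse_strands(fwd: &[u32], rev: &[u32]) -> Vec<u32> {
    assert_eq!(fwd.len(), rev.len(), "strand layers must align");
    fwd.iter().zip(rev).map(|(f, r)| f + r).collect()
}
```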

Important options:
- `--editingtype` · choose among AG, AC, AT, CA, CG, CT editing signatures.
- `--min-coverage` · per-site coverage filter before CEI calculation.
- `--threads` · configure Rayon pool size (defaults to 2 for stability).

## Performance Notes
- **UMI String Interning (New):** Identical UMI sequences are deduplicated in memory using `Arc<str>`, reducing memory footprint by 50-70% for datasets with repetitive UMIs at high-depth positions.
- **Adaptive HashMap Capacity (New):** Hash maps dynamically adjust their initial capacity based on chunk characteristics (high-depth vs normal), preventing over-allocation while maintaining performance.
- **Parallel sparse reduction (New):** `bam2mtx` merges per-thread chunk outputs with a lock-free sparse assembler, so CSR materialisation scales with core count without staging everything in a single consumer thread.
- **Aggressive Triplet Spilling (New):** The streaming matrix builder now uses a lower spill threshold (100K triplets, minimum 50K cap) to flush data to disk more frequently, dramatically reducing peak memory usage during large conversions.
- **Depth-cap awareness (New):** `bulk` now marks any site whose filtered depth reaches the configured maximum as `NEAR_MAX_DEPTH`, complementing the existing 1% proximity heuristic and making depth clipping visible downstream.
- **Streaming CSR materialisation (New):** Triplet runs are merged with a heap-based iterator that deduplicates row/column pairs on the fly, so final layer construction stays parallel-friendly and never loads all triplets into RAM at once; a sketch follows this list.
- **Enhanced Chunk Monitoring (New):** Each chunk logs position count, high-depth site count, and total weight, providing visibility into workload characteristics and helping diagnose memory-intensive regions.
- **Reader pooling:** Each worker thread checks out an `IndexedReader` from a pool, avoiding reopen/seek penalties when iterating across genomic tiles.
- **Triplet batching:** Sparse matrices are assembled from thread-local batches of `(row, col, value)` triplets, eliminating cross-thread contention, and the final merge operates fully in parallel without an intermediate channel stage.
- **Adaptive chunking:** Dual CLI knobs (`--chunksize` for the depth-weighted standard loci budget and `--chunk-size-max-depth` for hotspot batch caps, now defaulting to 2) govern the parallel granularity for `bam2mtx` (and the batch size for AnnData writes), while the refactored `ParGranges` scheduler flattens tiles into a single Rayon queue so `bulk` honors `--chunksize` without serial bottlenecks. Normal chunk weights now match the manifest’s `DEPTH + REF_SKIP + FAIL` metric for tighter alignment with bulk heuristics.
- **Chunk-level fetch & depth guards:** `bam2mtx` batches contiguous sites per contig, records observed depth, and—in `--two-pass` mode—relies on a first-pass `bulk` run capped at `--max-depth 8000` plus in-run pileup guards to prune oversaturated loci before dense pileups reach the matrix stage.
- **Progress-aware logging:** Chunk processors emit periodic `%` complete + ETA updates plus detailed per-chunk statistics so multi-hour conversions remain easy to follow from the console.
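
A sketch of the heap-based triplet merge named in the list above; the runs are in-memory `Vec`s here for brevity, whereas the real converter streams sorted runs from disk and writes CSR layers instead of collecting:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// (row, col, value); each run is pre-sorted by (row, col).
type Triplet = (u32, u32, u32);

/// K-way merge of sorted runs via a min-heap, summing duplicate
/// (row, col) keys on the fly; only one triplet per run is buffered.
fn merge_runs(runs: Vec<Vec<Triplet>>) -> Vec<Triplet> {
    let mut cursors: Vec<_> = runs.into_iter().map(|r| r.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    for (k, cur) in cursors.iter_mut().enumerate() {
        if let Some((r, c, v)) = cur.next() {
            heap.push(Reverse((r, c, k, v))); // ordered by (row, col) first
        }
    }
    let mut out: Vec<Triplet> = Vec::new();
    while let Some(Reverse((row, col, k, val))) = heap.pop() {
        if let Some((r, c, v)) = cursors[k].next() {
            heap.push(Reverse((r, c, k, v))); // refill from the same run
        }
        match out.last_mut() {
            // Same (row, col) as the last emitted triplet: accumulate.
            Some(last) if last.0 == row && last.1 == col => last.2 += val,
            _ => out.push((row, col, val)),
        }
    }
    out
}
```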

## Testing & Data
```bash
# REDIportal3 white-list, for use with `call`
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/rediportal3.tsv.gz

# 10X 3' 20K_PBMC data
# chr22
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/chr22.bam
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/chr22.bam.bai
# cell barcodes
wget https://comics.med.sustech.edu.cn/static_data/redicat/data/barcodes.tsv.gz
```
A minimal BAM (`test/chr22.bam`) and index are included for smoke tests. Run `cargo test` to execute unit tests that exercise pileup traversal and CLI argument parsing.