redicat 0.3.11

REDICAT - RNA Editing Cellular Assessment Toolkit: A highly parallelized utility for analyzing RNA editing events in single-cell RNA-seq data
docs.rs failed to build redicat-0.3.11
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
Visit the last successful build: redicat-0.3.5

REDICAT

RNA Editing Cellular Assessment Toolkit

REDICAT is a high-performance Rust toolkit for RNA editing quantification in single-cell RNA-seq, with an emphasis on strand-aware mismatch logic, sparse-matrix efficiency, and reproducible end-to-end processing from indexed BAM/CRAM to analysis-ready AnnData (.h5ad).

Citation

If REDICAT supports your research, please cite:

Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. Cell Reports, 44(7). https://doi.org/10.1016/j.celrep.2025.116009

Installation

# Clone and build
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

# Optional: install into Cargo's bin directory
cargo install --path .

1) Scientific overview

RNA editing introduces biologically meaningful RNA–DNA mismatches (for example A-to-I observed as A>G, and C-to-U observed as C>T). In single-cell data, these events are sparse and confounded by sequencing noise, alignment artifacts, and SNPs. REDICAT addresses this by:

  • extracting per-cell, per-site strand-aware base counts from BAM/CRAM,
  • applying biologically informed mismatch filters,
  • generating reference/alternate/other sparse matrices,
  • computing cell-level editing metrics (including CEI), and
  • preserving metadata in standard AnnData format for downstream statistical analysis.

Innovation highlights

  • Parallel genomic scheduling with bounded-memory streaming.
  • UMI-level consensus deduplication to suppress conflicting molecules.
  • Strand-aware editing inference (AG, CT, and related mismatch classes).
  • Sparse linear-algebra-first implementation for scale.

2) Code architecture (from source-level review)

2.1 Top-level modules

  • src/main.rs: CLI entry point with subcommands: bulk, bam2mtx, call.
  • src/commands/*: user-facing command orchestration.
  • src/lib/core/*: shared infrastructure (errors, IO, sparse ops, filtering).
  • src/lib/engine/*: parallel genomic scheduler (par_granges) and position models.
  • src/lib/pipeline/bam2mtx/*: BAM/CRAM → sparse base-count AnnData conversion.
  • src/lib/pipeline/call/*: editing annotation/filtering/ref-alt matrix/CEI pipeline.

2.2 Runtime interaction logic

  1. bulk (src/commands/bulk)

    • Uses ParGranges scheduler to iterate genomic chunks in parallel.
    • BaseProcessor performs pileup counting and emits per-position records.
    • Output is gzipped TSV (or bgzip-compatible path handling).
  2. bam2mtx (src/commands/bam2mtx)

    • Reads position manifest TSV (or runs bulk first pass with --two-pass).
    • Splits positions into depth-aware chunks (special handling for near-max-depth loci).
    • OptimizedChunkProcessor performs per-chunk pileup traversal with:
      • mapping/base quality filtering,
      • barcode whitelist matching,
      • UMI consensus conflict suppression,
      • optional stranded counting.
    • AnnDataConverter assembles sparse layers (A1/T1/G1/C1 and optional A0/T0/G0/C0) and writes .h5ad.
  3. call (src/commands/call + src/lib/pipeline/call)

    • Loads input AnnData and known editing-site whitelist.
    • Annotates candidate sites, computes coverage/base summaries, adds reference base from FASTA.
    • Applies mismatch classification thresholds.
    • Constructs strand-aware ref/alt/others matrices.
    • Computes cell-level CEI = alt / (ref + alt).
    • Computes site-level mismatch summaries and writes output .h5ad.

2.3 Biological mismatch logic implemented

EditingType supports: ag, ac, at, ca, cg, ct.

For each site, REDICAT uses strand-aware rules:

  • expected reference bases on either strand,
  • expected alternate base conditional on strand-consistent reference,
  • rejection of sites with excess non-reference/non-alt support.

This operationalizes your stated biology:

  • A-to-I: forward A>G with reverse-complement signal,
  • C-to-U: forward C>T with reverse-complement signal, while retaining generalized mismatch classes for exploratory analyses.

3) Installation

3.1 Requirements

  • Linux/macOS (validated workflow shown for Linux).
  • Rust toolchain (recommended stable):
    • rustup, cargo, rustc.
  • Typical native build dependencies for rust-htslib/compression stack:
    • pkg-config, clang, zlib, bzip2, xz, curl, OpenSSL dev headers.
  • For CRAM input: reference FASTA should be available and indexed.

3.2 Build

git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

Binary path:

./target/release/redicat

Quick check:

./target/release/redicat --help
./target/release/redicat bam2mtx --help
./target/release/redicat call --help
./target/release/redicat bulk --help

4) Input/output data contracts

4.1 BAM/CRAM prerequisites

  • Coordinate-sorted and indexed alignment file.
  • Cell barcode tag (default CB) and UMI tag (default UB) present if single-cell mode is expected.

4.2 bulk output schema (TSV.gz)

Serialized PileupPosition includes columns:

  • CHR, POS, DEPTH, A, C, G, T, N, INS, DEL, REF_SKIP, FAIL, NEAR_MAX_DEPTH
  • Optional REF_BASE when available.

POS is 1-based by default (0-based with --zero-base).

4.3 Position manifest required by bam2mtx

bam2mtx expects at least these columns in TSV:

  • CHR, POS, DEPTH, INS, DEL, REF_SKIP, FAIL, NEAR_MAX_DEPTH

The easiest way to produce a valid manifest is --two-pass, which internally runs bulk.

4.4 Site whitelist required by call

  • Tab-separated file, plain or gzipped.
  • First two columns must be chromosome and position (CHR, POS convention).
  • Key format used internally: CHR:POS matching AnnData.var_names.

4.5 call output

  • .h5ad with layers (priority written):
    • ref, alt, others, coverage (when present)
  • obs additions:
    • ref, alt, others (row sums over filtered editing sites)
    • CEI
  • var additions:
    • is_editing_site, ref, Mismatch, filter_pass
    • site-level columns: <XY>_ref, <XY>_alt, <XY>_others for selected editing type X>Y

bam2mtx also writes a sidecar file for capped-depth loci:

  • <output_stem>_skipped_sites.txt.

5) Command-line usage

5.1 bulk (first-pass site discovery / pileup profiling)

redicat bulk input.bam \
  --output first_pass.tsv.gz \
  --threads 16 \
  --chunksize 100000 \
  --mapquality 255 \
  --min-depth 5

Use --all to emit all covered positions (not only editing-enriched candidates).

5.2 bam2mtx (BAM to sparse single-cell base matrices)

Two-pass (recommended)

redicat bam2mtx \
  --bam input.bam \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --two-pass \
  --threads 16 \
  --stranded

Single-pass with precomputed manifest

redicat bam2mtx \
  --bam input.bam \
  --tsv first_pass.tsv.gz \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --threads 16

5.3 call (editing annotation and quantification)

redicat call \
  --input base_counts.h5ad \
  --output editing_calls.h5ad \
  --fa reference.fa \
  --site-white-list rediportal_sites.tsv.gz \
  --editingtype ag \
  --max-other-threshold 0.1 \
  --min-edited-threshold 0.001 \
  --min-ref-threshold 0.001 \
  --min-coverage 2 \
  --threads 16

Dry-run validation:

redicat call --input base_counts.h5ad --output out.h5ad --fa reference.fa --site-white-list sites.tsv.gz --dry-run

6) Parameter reference and tuning guidance

6.1 bulk parameters

Parameter Type Default Range / constraints Function and biological meaning Tuning guidance
reads path required indexed BAM/CRAM Input alignments for per-position pileup Use coordinate-sorted, indexed files
--output/-o path required writable path Output TSV(.gz) of pileup features Keep compressed for large cohorts
--threads/-t usize 10 >=1 Worker parallelism Use physical cores; avoid oversubscription
--chunksize/-c u32 engine constant (500000) >=1 Genomic tile size per task Larger = less scheduling overhead, higher memory burst
--min-baseq/-Q u8? 30 effective 0..255 Low-quality bases become N Raise in noisy libraries
--mapquality/-q u8 255 0..255 Read-level alignment confidence filter Relax for aligners not using 255 for unique mapping
--zero-base/-z bool false n/a Output coordinate convention Keep default 1-based for interoperability
--max-depth/-D u32 10000 >=1 Pileup cap; near-cap flagged Increase for ultra-deep loci; monitor runtime
--min-depth/-d u32 5 >=1 Minimum effective depth Increase for conservative site discovery
--max-n-fraction/-n u32 10 >=1 Ambiguous base tolerance (N <= max(depth/n,2)) Larger value = stricter ambiguity control
--all/-a bool false n/a Emit all sites vs editing-enriched subset Use true for QC/debug; false for candidate enrichment
--editing-threshold/-et u32 10000 >=1 Multi-base support criterion in candidate mode Larger to increase sensitivity
--allcontigs/-A bool false n/a Include non-canonical contigs Enable for decoys/spike-ins/custom assemblies

6.2 bam2mtx parameters

Parameter Type Default Range / constraints Function and biological meaning Tuning guidance
--bam/-b path required indexed BAM/CRAM Input alignment file Required
--tsv path optional required unless --two-pass Position manifest to quantify Prefer --two-pass for schema safety; in two-pass mode output is normalized to .gz
--barcodes path required whitelist file Restricts cells to known barcodes Use filtered cell whitelist from upstream pipeline
--two-pass bool false n/a Auto-generate site manifest via internal bulk run Recommended default
--output/-o path required .h5ad destination Output sparse AnnData Required
--threads/-t usize 20 >=1 Parallel processing threads Balance with storage throughput
--min-mapq/-q u8 255 0..255 Minimum read mapping quality Match aligner conventions
--min-baseq/-Q u8 30 0..255 Minimum base quality Raise for stringent mismatch calling
--stranded/-S bool false n/a Preserve strand layers (A0..C0, A1..C1) Strongly recommended for strand-specific libraries
--max-depth/-D u32 65536 >=1 Pileup depth cap in matrix extraction Set near expected extreme depth
--umi-tag string UB SAM tag name UMI field for molecule deduplication Adjust to chemistry (
RX, etc.)
--cb-tag string CB SAM tag name Cell barcode field Adjust if pipeline uses alternate tags
--reference/-r path optional required for CRAM FASTA for CRAM decoding Mandatory for CRAM workflows
--chunksize/-c u32 100000 >=1 Weight budget for normal loci chunks Increase for throughput, decrease for memory control
--chunk-size-max-depth u32 2 >=1 Small chunk cap for near-max-depth loci Keep low to avoid hotspot stalls
--matrix-density f64 0.005 >0 practical Sparse pre-allocation hint Raise for dense targeted panels
--allcontigs/-A bool false n/a Include non-canonical contigs Enable for non-standard references

6.3 call parameters

Parameter Type Default Range / constraints Function and biological meaning Tuning guidance
--input string path required .h5ad Base-count AnnData input Output of bam2mtx
--output string path required .h5ad Annotated editing output Existing output is overwritten
--fa string path required FASTA + .fai required Reference base retrieval per site Must match alignment reference build
--site-white-list string path required TSV/TSV.GZ, CHR/POS leading columns Known candidate editing sites Use curated catalogs + project-specific blacklist logic
--editingtype enum ag ag/ac/at/ca/cg/ct Defines expected ref→alt and strand-aware complements Use ag for A-to-I focused studies; ct for C-to-U hypotheses
--max-other-threshold f64 0.1 validated 0..1 Max tolerated non-ref/non-alt fraction Lower for specificity in noisy cohorts
--min-edited-threshold f64 0.001 validated 0..1 Minimum edited-base fraction Raise for stronger effect-size confidence
--min-ref-threshold f64 0.001 validated 0..1 Minimum reference-base fraction Raise to avoid low-reference unstable sites
--chunksize/-c usize 100000 >=1 Pipeline chunking granularity Decrease if memory constrained
--threads/-t Option internal fallback 4 >=1 recommended Rayon pool size for pipeline Use core count for full throughput
--min-coverage u16 2 >=1 Site-level minimum coverage Increase for stringent single-cell confidence
--verbose/-v bool false n/a Verbose logging flag Enable for debugging
--dry-run bool false n/a Validate inputs only Always run before large cohorts

6.4 Mismatch decision rule used in call

For each site with coverage $C$:

  • $other_max = \max(\lceil C \cdot max_other_threshold \rceil, 2)$
  • $edited_min = \max(\lceil C \cdot min_edited_threshold \rceil, 1)$
  • $ref_min = \max(\lceil C \cdot min_ref_threshold \rceil, 1)$

A site passes if:

  1. reference base is valid for selected editing type,
  2. reference count $\ge ref_min$,
  3. edited count $\ge edited_min$,
  4. sum of other bases $\le other_max$,
  5. site coverage $\ge min_coverage$.

7) Reproducible workflow (recommended)

  1. Generate candidate loci
    • redicat bulk with stringent quality filters.
  2. Build base-count matrix
    • redicat bam2mtx --two-pass (or provide validated TSV).
  3. Call editing signal
    • redicat call with matched reference build and curated whitelist.
  4. Post-analysis
    • Use obs['CEI'], var['filter_pass'], and <XY>_alt/<XY>_ref for downstream statistics.

For publication-grade reproducibility, record:

  • full command lines,
  • software version (redicat --version),
  • reference genome build and checksum,
  • whitelist provenance and version,
  • all threshold parameters.

8) Result interpretation

8.1 Cell-level metrics

  • ref, alt, others in obs: aggregated counts across retained sites.
  • CEI: cell editing index, interpreted as editing burden proxy:

$$ CEI = \frac{alt}{ref + alt} $$

Practical guidance:

  • Compare CEI across cell states/conditions with coverage-aware covariates.
  • Exclude cells with very low total informative counts (ref + alt) before differential analyses.

8.2 Site-level metrics

In var, fields such as AG_ref, AG_alt, AG_others (for editingtype=ag) summarize cohort-level support at each site.

Recommended biological interpretation:

  • prioritize sites with high *_alt, low *_others, and filter_pass=true;
  • remove known SNP loci (dbSNP/gnomAD) in downstream curation;
  • inspect strand consistency for expected editing class behavior.

9) Performance and complexity

Let:

  • $R$ = number of aligned reads inspected,
  • $P$ = number of queried positions,
  • $U$ = number of unique (cell, UMI, site) molecules after filtering,
  • $nnz$ = non-zeros in sparse matrices,
  • $T$ = worker threads.

9.1 bulk

  • Pileup traversal is approximately $O(R_{region})$ over processed chunks.
  • Parallel scheduling reduces wall-time to roughly $\sim O(R/T)$ in balanced workloads.

9.2 bam2mtx

  • Site processing: $O(R_{sites})$ for pileup reads intersecting selected loci.
  • UMI consensus hash aggregation: expected $O(U)$.
  • Sparse assembly: near $O(nnz)$ plus sorting/merge costs for triplet consolidation.

9.3 call

  • Coverage/base summaries and sparse reductions: near $O(nnz)$.
  • Site annotation/filtering/reference lookup: $O(P)$ (with caching effects for FASTA access).
  • Overall dominated by sparse matrix operations and site count.

In practice, REDICAT is throughput-optimized for sparse single-cell matrices with multi-core scaling and bounded-memory spill strategies in matrix assembly.


10) Notes and current implementation caveats

  • Canonical contig filtering defaults to chr1..chr22, chrX, chrY, chrM; use --allcontigs for non-canonical references.
  • call writes priority layers (ref, alt, others, coverage) and annotations; preserve intermediate matrices if additional layers are required for custom methods.