REDICAT
RNA Editing Cellular Assessment Toolkit
REDICAT is a high-performance Rust toolkit for RNA editing quantification in single-cell RNA-seq, with an emphasis on strand-aware mismatch logic, sparse-matrix efficiency, and reproducible end-to-end processing from indexed BAM/CRAM to analysis-ready AnnData (.h5ad).

Citation
If REDICAT supports your research, please cite:
Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. Cell Reports, 44(7). https://doi.org/10.1016/j.celrep.2025.116009

Installation
# Clone and build
# Optional: install into Cargo's bin directory
1) Scientific overview
RNA editing introduces biologically meaningful RNA–DNA mismatches (for example A-to-I observed as A>G, and C-to-U observed as C>T). In single-cell data, these events are sparse and confounded by sequencing noise, alignment artifacts, and SNPs. REDICAT addresses this by:
- extracting per-cell, per-site strand-aware base counts from BAM/CRAM,
- applying biologically informed mismatch filters,
- generating reference/alternate/other sparse matrices,
- computing cell-level editing metrics (including CEI), and
- preserving metadata in standard AnnData format for downstream statistical analysis.
Innovation highlights
- Parallel genomic scheduling with bounded-memory streaming.
- UMI-level consensus deduplication to suppress conflicting molecules.
- Strand-aware editing inference (`AG`, `CT`, and related mismatch classes).
- Sparse linear-algebra-first implementation for scale.
2) Code architecture (from source-level review)
2.1 Top-level modules
- `src/main.rs`: CLI entry point with subcommands: `bulk`, `bam2mtx`, `call`.
- `src/commands/*`: user-facing command orchestration.
- `src/lib/core/*`: shared infrastructure (errors, IO, sparse ops, filtering).
- `src/lib/engine/*`: parallel genomic scheduler (`par_granges`) and position models.
- `src/lib/pipeline/bam2mtx/*`: BAM/CRAM → sparse base-count AnnData conversion.
- `src/lib/pipeline/call/*`: editing annotation/filtering/ref-alt matrix/CEI pipeline.
2.2 Runtime interaction logic
- `bulk` (`src/commands/bulk`)
  - Uses `ParGranges` scheduler to iterate genomic chunks in parallel.
  - `BaseProcessor` performs pileup counting and emits per-position records.
  - Output is gzipped TSV (or bgzip-compatible path handling).
- `bam2mtx` (`src/commands/bam2mtx`)
  - Reads position manifest TSV (or runs `bulk` first pass with `--two-pass`).
  - Splits positions into depth-aware chunks (special handling for near-max-depth loci).
  - `OptimizedChunkProcessor` performs per-chunk pileup traversal with:
    - mapping/base quality filtering,
    - barcode whitelist matching,
    - UMI consensus conflict suppression,
    - optional stranded counting.
  - `AnnDataConverter` assembles sparse layers (`A1`/`T1`/`G1`/`C1` and optional `A0`/`T0`/`G0`/`C0`) and writes `.h5ad`.
- `call` (`src/commands/call` + `src/lib/pipeline/call`)
  - Loads input AnnData and known editing-site whitelist.
  - Annotates candidate sites, computes coverage/base summaries, adds reference base from FASTA.
  - Applies mismatch classification thresholds.
  - Constructs strand-aware `ref`/`alt`/`others` matrices.
  - Computes cell-level `CEI = alt / (ref + alt)`.
  - Computes site-level mismatch summaries and writes output `.h5ad`.
2.3 Biological mismatch logic implemented
EditingType supports: ag, ac, at, ca, cg, ct.
For each site, REDICAT uses strand-aware rules:
- expected reference bases on either strand,
- expected alternate base conditional on strand-consistent reference,
- rejection of sites with excess non-reference/non-alt support.
This operationalizes the target biology:
- A-to-I: forward A>G with reverse-complement signal (T>C),
- C-to-U: forward C>T with reverse-complement signal (G>A),

while retaining generalized mismatch classes for exploratory analyses.
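The strand-aware rules above can be illustrated with a minimal Python sketch. It is not REDICAT's implementation: the function names `expected_pairs` and `classify` are hypothetical, and the sketch assumes that each editing type admits two reference bases (the forward base and its complement) with the alternate base complemented accordingly.

```python
# Hypothetical sketch of strand-aware ref/alt/others classification.
# Names and structure are illustrative assumptions, not REDICAT's API.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def expected_pairs(editing_type):
    """Map each admissible reference base to its expected alternate base.

    For "ag" the forward rule is A>G; a reverse-strand transcript shows
    the complemented rule T>C in forward-strand coordinates.
    """
    ref, alt = editing_type.upper()
    return {ref: alt, COMPLEMENT[ref]: COMPLEMENT[alt]}


def classify(ref_base, counts, editing_type):
    """Split A/C/G/T counts at one site into (ref, alt, others).

    Returns None when the reference base is not valid for the editing type.
    """
    pairs = expected_pairs(editing_type)
    ref_base = ref_base.upper()
    if ref_base not in pairs:
        return None
    alt_base = pairs[ref_base]
    ref_n = counts.get(ref_base, 0)
    alt_n = counts.get(alt_base, 0)
    others = sum(n for base, n in counts.items() if base not in (ref_base, alt_base))
    return ref_n, alt_n, others
```

With `editing_type="ag"`, a site whose reference base is `T` is treated as a reverse-strand candidate, so `C` counts are taken as the edited signal and `A`/`G` counts fall into `others`.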
3) Installation
3.1 Requirements
- Linux/macOS (validated workflow shown for Linux).
- Rust toolchain (recommended stable): `rustup`, `cargo`, `rustc`.
- Typical native build dependencies for `rust-htslib`/compression stack: `pkg-config`, `clang`, `zlib`, `bzip2`, `xz`, `curl`, OpenSSL dev headers.
- For CRAM input: reference FASTA should be available and indexed.
3.2 Build
Binary path:
Quick check:
4) Input/output data contracts
4.1 BAM/CRAM prerequisites
- Coordinate-sorted and indexed alignment file.
- Cell barcode tag (default `CB`) and UMI tag (default `UB`) present if single-cell mode is expected.
4.2 bulk output schema (TSV.gz)
Serialized PileupPosition includes columns:
- `CHR`, `POS`, `DEPTH`, `A`, `C`, `G`, `T`, `N`, `INS`, `DEL`, `REF_SKIP`, `FAIL`, `NEAR_MAX_DEPTH`
- Optional `REF_BASE` when available.
POS is 1-based by default (0-based with --zero-base).
4.3 Position manifest required by bam2mtx
bam2mtx expects at least these columns in TSV:
- `CHR`, `POS`, `DEPTH`, `INS`, `DEL`, `REF_SKIP`, `FAIL`, `NEAR_MAX_DEPTH`
The easiest way to produce a valid manifest is --two-pass, which internally runs bulk.
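When a manifest is supplied by hand instead, a quick pre-flight check of its header can catch schema problems early. The sketch below is a hypothetical helper (`missing_manifest_columns` is not part of REDICAT) and assumes the TSV starts with a header row that uses exactly the column names listed above.

```python
import csv
import io

# Column names taken from the list above; the header-row layout is an assumption.
REQUIRED = ("CHR", "POS", "DEPTH", "INS", "DEL", "REF_SKIP", "FAIL", "NEAR_MAX_DEPTH")


def missing_manifest_columns(handle):
    """Return the required column names that are absent from the TSV header."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED if name not in present]


# In-memory demonstration. For a real file, pass open(path, newline="")
# or gzip.open(path, "rt", newline="") for a gzipped manifest.
demo = io.StringIO(
    "CHR\tPOS\tDEPTH\tA\tC\tG\tT\tN\tINS\tDEL\tREF_SKIP\tFAIL\tNEAR_MAX_DEPTH\n"
)
print(missing_manifest_columns(demo))  # []
```

An empty result means every required column is present; extra columns such as the per-base counts are ignored by this check.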
4.4 Site whitelist required by call
- Tab-separated file, plain or gzipped.
- First two columns must be chromosome and position (`CHR`, `POS` convention).
- Key format used internally: `CHR:POS` matching `AnnData.var_names`.
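To check how many whitelist entries will match the matrix, the `CHR:POS` keys can be rebuilt outside REDICAT. This is a minimal sketch: `whitelist_keys` is a hypothetical helper, and skipping lines whose second column is not an integer (for example a header row) is an assumed policy, not documented behavior.

```python
import io


def whitelist_keys(handle):
    """Build CHR:POS keys from the first two tab-separated columns."""
    keys = set()
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        # Assumed policy: ignore blank, malformed, or header-like lines.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        keys.add(f"{fields[0]}:{fields[1]}")
    return keys


# In-memory demonstration; use gzip.open(path, "rt") for a gzipped whitelist.
demo = io.StringIO("chr1\t12345\tAlu\nchr2\t77\n")
print(sorted(whitelist_keys(demo)))  # ['chr1:12345', 'chr2:77']
```

Intersecting the returned set with the `var_names` of the input AnnData gives the number of whitelisted sites that are actually present in the matrix.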
4.5 call output
- `.h5ad` with layers (priority written):
  - `ref`, `alt`, `others`, `coverage` (when present)
- `obs` additions:
  - `ref`, `alt`, `others` (row sums over filtered editing sites)
  - `CEI`
- `var` additions:
  - `is_editing_site`, `ref`, `Mismatch`, `filter_pass`
  - site-level columns: `<XY>_ref`, `<XY>_alt`, `<XY>_others` for selected editing type `X>Y`
`bam2mtx` also writes a sidecar file for capped-depth loci:
- `<output_stem>_skiped_sites.txt` (note current filename spelling in code).
5) Command-line usage
5.1 bulk (first-pass site discovery / pileup profiling)
Use --all to emit all covered positions (not only editing-enriched candidates).
5.2 bam2mtx (BAM to sparse single-cell base matrices)
Two-pass (recommended)
Single-pass with precomputed manifest
5.3 call (editing annotation and quantification)
Dry-run validation:
6) Parameter reference and tuning guidance
6.1 bulk parameters
| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---|---|---|---|---|
| `reads` | path | required | indexed BAM/CRAM | Input alignments for per-position pileup | Use coordinate-sorted, indexed files |
| `--output/-o` | path | required | writable path | Output TSV(.gz) of pileup features | Keep compressed for large cohorts |
| `--threads/-t` | usize | `10` | `>=1` | Worker parallelism | Use physical cores; avoid oversubscription |
| `--chunksize/-c` | u32 | engine constant (`100000`) | `>=1` | Genomic tile size per task | Larger = less scheduling overhead, higher memory burst |
| `--min-baseq/-Q` | u8? | `30` effective | `0..255` | Low-quality bases become `N` | Raise in noisy libraries |
| `--mapquality/-q` | u8 | `255` | `0..255` | Read-level alignment confidence filter | Relax for aligners not using 255 for unique mapping |
| `--zero-base/-z` | bool | `false` | n/a | Output coordinate convention | Keep default 1-based for interoperability |
| `--max-depth/-D` | u32 | `10000` | `>=1` | Pileup cap; near-cap flagged | Increase for ultra-deep loci; monitor runtime |
| `--min-depth/-d` | u32 | `5` | `>=1` | Minimum effective depth | Increase for conservative site discovery |
| `--max-n-fraction/-n` | u32 | `10` | `>=1` | Ambiguous base tolerance (`N <= max(depth/n,2)`) | Higher value = stricter ambiguity control (the allowance is `depth/n`) |
| `--all/-a` | bool | `false` | n/a | Emit all sites vs editing-enriched subset | Use true for QC/debug; false for candidate enrichment |
| `--editing-threshold/-et` | u32 | `10000` | `>=1` | Multi-base support criterion in candidate mode | Lower to increase sensitivity |
| `--allcontigs/-A` | bool | `false` | n/a | Include non-canonical contigs | Enable for decoys/spike-ins/custom assemblies |
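The ambiguous-base rule for `--max-n-fraction`, `N <= max(depth/n,2)`, can be sketched as follows. The helper names are hypothetical, and integer (floor) division is an assumption: the formula does not say how `depth/n` is rounded.

```python
def n_allowance(depth, max_n_fraction=10):
    """Largest N count tolerated at a site under N <= max(depth/n, 2).

    Assumption: depth/n uses integer (floor) division.
    """
    return max(depth // max_n_fraction, 2)


def passes_n_filter(depth, n_count, max_n_fraction=10):
    """True when the observed N count is within the allowance."""
    return n_count <= n_allowance(depth, max_n_fraction)


print(n_allowance(100))      # 10
print(n_allowance(100, 20))  # 5
print(n_allowance(15))       # 2 (the floor of 2 applies at low depth)
```

Under this formula the allowance shrinks as the denominator grows, and the fixed floor of 2 dominates whenever `depth` is below `2 * n`.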
6.2 bam2mtx parameters
| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---|---|---|---|---|
| `--bam/-b` | path | required | indexed BAM/CRAM | Input alignment file | Required |
| `--tsv` | path | optional | required unless `--two-pass` | Position manifest to quantify | Prefer `--two-pass` for schema safety |
| `--barcodes` | path | required | whitelist file | Restricts cells to known barcodes | Use filtered cell whitelist from upstream pipeline |
| `--two-pass` | bool | `false` | n/a | Auto-generate site manifest via internal `bulk` run | Recommended default |
| `--output/-o` | path | required | `.h5ad` destination | Output sparse AnnData | Required |
| `--threads/-t` | usize | `10` | `>=1` | Parallel processing threads | Balance with storage throughput |
| `--min-mapq/-q` | u8 | `255` | `0..255` | Minimum read mapping quality | Match aligner conventions |
| `--min-baseq/-Q` | u8 | `30` | `0..255` | Minimum base quality | Raise for stringent mismatch calling |
| `--min-depth/-d` | u32 | `3` | `>=1` | Minimum non-`N` depth | Increase to reduce sparse noise |
| `--max-n-fraction/-n` | u32 | `10` | `>=1` | Ambiguous-base tolerance denominator | Raise for stricter ambiguity control (a larger denominator tolerates fewer `N` calls) |
| `--editing-threshold/-et` | u32 | `10000` | `>=1` | Candidate support threshold in first pass | Lower when expecting low editing burden |
| `--stranded/-S` | bool | `false` | n/a | Preserve strand layers (`A0..C0`, `A1..C1`) | Strongly recommended for strand-specific libraries |
| `--max-depth/-D` | u32 | `655360` | `>=1` | Pileup depth cap in matrix extraction | Set near expected extreme depth |
| `--umi-tag` | string | `UB` | SAM tag name | UMI field for molecule deduplication | Adjust to chemistry (`RX`, etc.) |
| `--cb-tag` | string | `CB` | SAM tag name | Cell barcode field | Adjust if pipeline uses alternate tags |
| `--reference/-r` | path | optional | required for CRAM | FASTA for CRAM decoding | Mandatory for CRAM workflows |
| `--chunksize/-c` | u32 | `100000` | `>=1` | Weight budget for normal loci chunks | Increase for throughput, decrease for memory control |
| `--chunk-size-max-depth` | u32 | `2` | `>=1` | Small chunk cap for near-max-depth loci | Keep low to avoid hotspot stalls |
| `--matrix-density` | f64 | `0.005` | `>0` practical | Sparse pre-allocation hint | Raise for dense targeted panels |
| `--allcontigs/-A` | bool | `false` | n/a | Include non-canonical contigs | Enable for non-standard references |
6.3 call parameters
| Parameter | Type | Default | Range / constraints | Function and biological meaning | Tuning guidance |
|---|---|---|---|---|---|
| `--input` | string path | required | `.h5ad` | Base-count AnnData input | Output of `bam2mtx` |
| `--output` | string path | required | `.h5ad` | Annotated editing output | Existing output is overwritten |
| `--fa` | string path | required | FASTA + `.fai` required | Reference base retrieval per site | Must match alignment reference build |
| `--site-white-list` | string path | required | TSV/TSV.GZ, CHR/POS leading columns | Known candidate editing sites | Use curated catalogs + project-specific blacklist logic |
| `--editingtype` | enum | `ag` | `ag/ac/at/ca/cg/ct` | Defines expected ref→alt and strand-aware complements | Use `ag` for A-to-I focused studies; `ct` for C-to-U hypotheses |
| `--max-other-threshold` | f64 | `0.1` | validated `0..1` | Max tolerated non-ref/non-alt fraction | Lower for specificity in noisy cohorts |
| `--min-edited-threshold` | f64 | `0.001` | validated `0..1` | Minimum edited-base fraction | Raise for stronger effect-size confidence |
| `--min-ref-threshold` | f64 | `0.001` | validated `0..1` | Minimum reference-base fraction | Raise to avoid low-reference unstable sites |
| `--chunksize/-c` | usize | `100000` | `>=1` | Pipeline chunking granularity | Decrease if memory constrained |
| `--threads/-t` | Option | internal fallback `2` | `>=1` recommended | Rayon pool size for pipeline | Use core count for full throughput |
| `--min-coverage` | u16 | `2` | `>=1` | Site-level minimum coverage | Increase for stringent single-cell confidence |
| `--verbose/-v` | bool | `false` | n/a | Verbose logging flag | Enable for debugging |
| `--dry-run` | bool | `false` | n/a | Validate inputs only | Always run before large cohorts |
6.4 Mismatch decision rule used in call
For each site with coverage $C$:
- $\mathrm{other\_max} = \max(\lceil C \cdot \mathrm{max\_other\_threshold} \rceil, 2)$
- $\mathrm{edited\_min} = \max(\lceil C \cdot \mathrm{min\_edited\_threshold} \rceil, 1)$
- $\mathrm{ref\_min} = \max(\lceil C \cdot \mathrm{min\_ref\_threshold} \rceil, 1)$

A site passes if:
- reference base is valid for selected editing type,
- reference count $\ge \mathrm{ref\_min}$,
- edited count $\ge \mathrm{edited\_min}$,
- sum of other bases $\le \mathrm{other\_max}$,
- site coverage $\ge \mathrm{min\_coverage}$.
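The decision rule can be written out as a small Python sketch for checking individual sites by hand. The function names are hypothetical, the defaults mirror the `call` parameter table, and the validity of the reference base is passed in as a flag rather than derived here.

```python
import math


def thresholds(coverage, max_other=0.1, min_edited=0.001, min_ref=0.001):
    """Return (other_max, edited_min, ref_min) for a site with this coverage."""
    other_max = max(math.ceil(coverage * max_other), 2)
    edited_min = max(math.ceil(coverage * min_edited), 1)
    ref_min = max(math.ceil(coverage * min_ref), 1)
    return other_max, edited_min, ref_min


def site_passes(coverage, ref_count, edited_count, other_count,
                ref_is_valid, min_coverage=2, **threshold_kwargs):
    """Apply the five pass conditions listed above to one site."""
    other_max, edited_min, ref_min = thresholds(coverage, **threshold_kwargs)
    return (
        ref_is_valid
        and ref_count >= ref_min
        and edited_count >= edited_min
        and other_count <= other_max
        and coverage >= min_coverage
    )


print(thresholds(100))  # (10, 1, 1)
print(thresholds(5))    # (2, 1, 1)
```

At the default fractions, the floors (2 for other bases, 1 for edited and reference bases) decide the outcome at low coverage, so a site with coverage 5 still tolerates up to two non-ref/non-alt bases.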
7) Reproducible workflow (recommended)
- Generate candidate loci
  - `redicat bulk` with stringent quality filters.
- Build base-count matrix
  - `redicat bam2mtx --two-pass` (or provide validated TSV).
- Call editing signal
  - `redicat call` with matched reference build and curated whitelist.
- Post-analysis
  - Use `obs['CEI']`, `var['filter_pass']`, and `<XY>_alt`/`<XY>_ref` for downstream statistics.
For publication-grade reproducibility, record:
- full command lines,
- software version (`redicat --version`),
- reference genome build and checksum,
- whitelist provenance and version,
- all threshold parameters.
8) Result interpretation
8.1 Cell-level metrics
- `ref`, `alt`, `others` in `obs`: aggregated counts across retained sites.
- `CEI`: cell editing index, interpreted as editing burden proxy:
$$ CEI = \frac{alt}{ref + alt} $$
Practical guidance:
- Compare CEI across cell states/conditions with coverage-aware covariates.
- Exclude cells with very low total informative counts (`ref + alt`) before differential analyses.
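The CEI formula and the low-count exclusion can be sketched in plain Python. The helper names and the cutoff `min_informative=20` are illustrative assumptions; REDICAT does not prescribe a cutoff, and returning `None` for cells without informative counts is a choice made for this sketch.

```python
def cei(ref, alt):
    """Cell editing index alt / (ref + alt), or None without informative counts."""
    informative = ref + alt
    if informative == 0:
        return None
    return alt / informative


def cells_for_analysis(cells, min_informative=20):
    """Keep cells whose ref + alt reaches the (hypothetical) cutoff.

    `cells` maps a barcode to its (ref, alt) totals; the result maps each
    retained barcode to its CEI.
    """
    return {
        barcode: cei(ref, alt)
        for barcode, (ref, alt) in cells.items()
        if ref + alt >= min_informative
    }


demo = {"AAAC": (90, 10), "AAAG": (3, 1), "AAAT": (0, 0)}
print(cells_for_analysis(demo))  # {'AAAC': 0.1}
```

The second cell has a nominal CEI of 0.25 but only four informative counts, which is the kind of unstable estimate the exclusion step is meant to remove.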
8.2 Site-level metrics
In var, fields such as AG_ref, AG_alt, AG_others (for editingtype=ag) summarize cohort-level support at each site.
Recommended biological interpretation:
- prioritize sites with high `*_alt`, low `*_others`, and `filter_pass=true`;
- remove known SNP loci (dbSNP/gnomAD) in downstream curation;
- inspect strand consistency for expected editing class behavior.
9) Performance and complexity
Let:
- $R$ = number of aligned reads inspected,
- $P$ = number of queried positions,
- $U$ = number of unique (cell, UMI, site) molecules after filtering,
- $nnz$ = non-zeros in sparse matrices,
- $T$ = worker threads.
9.1 bulk
- Pileup traversal is approximately $O(R_{region})$ over processed chunks.
- Parallel scheduling reduces wall-time to roughly $O(R/T)$ in balanced workloads.
9.2 bam2mtx
- Site processing: $O(R_{sites})$ for pileup reads intersecting selected loci.
- UMI consensus hash aggregation: expected $O(U)$.
- Sparse assembly: near $O(nnz)$ plus sorting/merge costs for triplet consolidation.
9.3 call
- Coverage/base summaries and sparse reductions: near $O(nnz)$.
- Site annotation/filtering/reference lookup: $O(P)$ (with caching effects for FASTA access).
- Overall dominated by sparse matrix operations and site count.
In practice, REDICAT is throughput-optimized for sparse single-cell matrices with multi-core scaling and bounded-memory spill strategies in matrix assembly.
10) Notes and current implementation caveats
- Canonical contig filtering defaults to `chr1..chr22, chrX, chrY, chrM`; use `--allcontigs` for non-canonical references.
- `bam2mtx --two-pass` internally sets first-pass `bulk` max depth to `8000`.
- `call` writes priority layers (`ref`, `alt`, `others`, `coverage`) and annotations; preserve intermediate matrices if additional layers are required for custom methods.