docs.rs failed to build redicat-0.3.8
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

Visit the last successful build: redicat-0.3.5

REDICAT

RNA Editing Cellular Assessment Toolkit

REDICAT is a high-performance Rust toolkit for RNA editing quantification in single-cell RNA-seq, with an emphasis on strand-aware mismatch logic, sparse-matrix efficiency, and reproducible end-to-end processing from indexed BAM/CRAM to analysis-ready AnnData (.h5ad).

Citation

If REDICAT supports your research, please cite:

Wei, T., Li, J., Lei, X., Lin, R., Wu, Q., Zhang, Z., Shuai, S., & Tian, R. (2025). Multimodal CRISPR screens uncover DDX39B as a global repressor of A-to-I RNA editing. Cell Reports, 44(7). https://doi.org/10.1016/j.celrep.2025.116009

Installation

# Clone and build
git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

# Optional: install into Cargo's bin directory
cargo install --path .

1) Scientific overview

RNA editing introduces biologically meaningful RNA–DNA mismatches (for example A-to-I observed as A>G, and C-to-U observed as C>T). In single-cell data, these events are sparse and confounded by sequencing noise, alignment artifacts, and SNPs. REDICAT addresses this by:

extracting per-cell, per-site strand-aware base counts from BAM/CRAM,
applying biologically informed mismatch filters,
generating reference/alternate/other sparse matrices,
computing cell-level editing metrics (including CEI), and
preserving metadata in standard AnnData format for downstream statistical analysis.

Innovation highlights

Parallel genomic scheduling with bounded-memory streaming.
UMI-level consensus deduplication to suppress conflicting molecules.
Strand-aware editing inference (AG, CT, and related mismatch classes).
Sparse linear-algebra-first implementation for scale.

2) Code architecture (from source-level review)

2.1 Top-level modules

src/main.rs: CLI entry point with subcommands: bulk, bam2mtx, call.
src/commands/*: user-facing command orchestration.
src/lib/core/*: shared infrastructure (errors, IO, sparse ops, filtering).
src/lib/engine/*: parallel genomic scheduler (par_granges) and position models.
src/lib/pipeline/bam2mtx/*: BAM/CRAM → sparse base-count AnnData conversion.
src/lib/pipeline/call/*: editing annotation/filtering/ref-alt matrix/CEI pipeline.

2.2 Runtime interaction logic

bulk (src/commands/bulk)
- Uses ParGranges scheduler to iterate genomic chunks in parallel.
- BaseProcessor performs pileup counting and emits per-position records.
- Output is gzipped TSV (or bgzip-compatible path handling).
bam2mtx (src/commands/bam2mtx)
- Reads position manifest TSV (or runs bulk first pass with --two-pass).
- Splits positions into depth-aware chunks (special handling for near-max-depth loci).
- OptimizedChunkProcessor performs per-chunk pileup traversal with:
  - mapping/base quality filtering,
  - barcode whitelist matching,
  - UMI consensus conflict suppression,
  - optional stranded counting.
- AnnDataConverter assembles sparse layers (A1/T1/G1/C1 and optional A0/T0/G0/C0) and writes .h5ad.
call (src/commands/call + src/lib/pipeline/call)
- Loads input AnnData and known editing-site whitelist.
- Annotates candidate sites, computes coverage/base summaries, adds reference base from FASTA.
- Applies mismatch classification thresholds.
- Constructs strand-aware ref/alt/others matrices.
- Computes cell-level CEI = alt / (ref + alt).
- Computes site-level mismatch summaries and writes output .h5ad.

2.3 Biological mismatch logic implemented

EditingType supports: ag, ac, at, ca, cg, ct.

For each site, REDICAT uses strand-aware rules:

expected reference bases on either strand,
expected alternate base conditional on strand-consistent reference,
rejection of sites with excess non-reference/non-alt support.

This operationalizes your stated biology:

A-to-I: forward A>G with reverse-complement signal,
C-to-U: forward C>T with reverse-complement signal, while retaining generalized mismatch classes for exploratory analyses.

3) Installation

3.1 Requirements

Linux/macOS (validated workflow shown for Linux).
Rust toolchain (recommended stable):
- rustup, cargo, rustc.
Typical native build dependencies for rust-htslib/compression stack:
- pkg-config, clang, zlib, bzip2, xz, curl, OpenSSL dev headers.
For CRAM input: reference FASTA should be available and indexed.

3.2 Build

git clone https://github.com/aStudyingTurtle/redicat.git
cd redicat
cargo build --release

Binary path:

./target/release/redicat

Quick check:

./target/release/redicat --help
./target/release/redicat bam2mtx --help
./target/release/redicat call --help
./target/release/redicat bulk --help

4) Input/output data contracts

4.1 BAM/CRAM prerequisites

Coordinate-sorted and indexed alignment file.
Cell barcode tag (default CB) and UMI tag (default UB) present if single-cell mode is expected.

4.2 `bulk` output schema (TSV.gz)

Serialized PileupPosition includes columns:

CHR, POS, DEPTH, A, C, G, T, N, INS, DEL, REF_SKIP, FAIL, NEAR_MAX_DEPTH
Optional REF_BASE when available.

POS is 1-based by default (0-based with --zero-base).

4.3 Position manifest required by `bam2mtx`

bam2mtx expects at least these columns in TSV:

CHR, POS, DEPTH, INS, DEL, REF_SKIP, FAIL, NEAR_MAX_DEPTH

The easiest way to produce a valid manifest is --two-pass, which internally runs bulk.

4.4 Site whitelist required by `call`

Tab-separated file, plain or gzipped.
First two columns must be chromosome and position (CHR, POS convention).
Key format used internally: CHR:POS matching AnnData.var_names.

4.5 `call` output

.h5ad with layers (priority written):
- ref, alt, others, coverage (when present)
obs additions:
- ref, alt, others (row sums over filtered editing sites)
- CEI
var additions:
- is_editing_site, ref, Mismatch, filter_pass
- site-level columns: <XY>_ref, <XY>_alt, <XY>_others for selected editing type X>Y

bam2mtx also writes a sidecar file for capped-depth loci:

<output_stem>_skiped_sites.txt (note current filename spelling in code).

5) Command-line usage

5.1 `bulk` (first-pass site discovery / pileup profiling)

redicat bulk input.bam \
  --output first_pass.tsv.gz \
  --threads 16 \
  --chunksize 100000 \
  --mapquality 255 \
  --min-depth 5

Use --all to emit all covered positions (not only editing-enriched candidates).

5.2 `bam2mtx` (BAM to sparse single-cell base matrices)

Two-pass (recommended)

redicat bam2mtx \
  --bam input.bam \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --two-pass \
  --threads 16 \
  --stranded

Single-pass with precomputed manifest

redicat bam2mtx \
  --bam input.bam \
  --tsv first_pass.tsv.gz \
  --barcodes barcodes.tsv.gz \
  --output base_counts.h5ad \
  --threads 16

5.3 `call` (editing annotation and quantification)

redicat call \
  --input base_counts.h5ad \
  --output editing_calls.h5ad \
  --fa reference.fa \
  --site-white-list rediportal_sites.tsv.gz \
  --editingtype ag \
  --max-other-threshold 0.1 \
  --min-edited-threshold 0.001 \
  --min-ref-threshold 0.001 \
  --min-coverage 2 \
  --threads 16

Dry-run validation:

redicat call --input base_counts.h5ad --output out.h5ad --fa reference.fa --site-white-list sites.tsv.gz --dry-run

6) Parameter reference and tuning guidance

6.1 `bulk` parameters

Parameter	Type	Default	Range / constraints	Function and biological meaning	Tuning guidance
`reads`	path	required	indexed BAM/CRAM	Input alignments for per-position pileup	Use coordinate-sorted, indexed files
`--output/-o`	path	required	writable path	Output TSV(.gz) of pileup features	Keep compressed for large cohorts
`--threads/-t`	usize	`10`	`>=1`	Worker parallelism	Use physical cores; avoid oversubscription
`--chunksize/-c`	u32	engine constant (`100000`)	`>=1`	Genomic tile size per task	Larger = less scheduling overhead, higher memory burst
`--min-baseq/-Q`	u8?	`30` effective	`0..255`	Low-quality bases become `N`	Raise in noisy libraries
`--mapquality/-q`	u8	`255`	`0..255`	Read-level alignment confidence filter	Relax for aligners not using 255 for unique mapping
`--zero-base/-z`	bool	`false`	n/a	Output coordinate convention	Keep default 1-based for interoperability
`--max-depth/-D`	u32	`10000`	`>=1`	Pileup cap; near-cap flagged	Increase for ultra-deep loci; monitor runtime
`--min-depth/-d`	u32	`5`	`>=1`	Minimum effective depth	Increase for conservative site discovery
`--max-n-fraction/-n`	u32	`10`	`>=1`	Ambiguous base tolerance (`N <= max(depth/n,2)`)	Lower value = stricter ambiguity control
`--all/-a`	bool	`false`	n/a	Emit all sites vs editing-enriched subset	Use `true` for QC/debug; `false` for candidate enrichment
`--editing-threshold/-et`	u32	`10000`	`>=1`	Multi-base support criterion in candidate mode	Lower to increase sensitivity
`--allcontigs/-A`	bool	`false`	n/a	Include non-canonical contigs	Enable for decoys/spike-ins/custom assemblies

6.2 `bam2mtx` parameters

Parameter	Type	Default	Range / constraints	Function and biological meaning	Tuning guidance
`--bam/-b`	path	required	indexed BAM/CRAM	Input alignment file	Required
`--tsv`	path	optional	required unless `--two-pass`	Position manifest to quantify	Prefer `--two-pass` for schema safety
`--barcodes`	path	required	whitelist file	Restricts cells to known barcodes	Use filtered cell whitelist from upstream pipeline
`--two-pass`	bool	`false`	n/a	Auto-generate site manifest via internal `bulk` run	Recommended default
`--output/-o`	path	required	`.h5ad` destination	Output sparse AnnData	Required
`--threads/-t`	usize	`10`	`>=1`	Parallel processing threads	Balance with storage throughput
`--min-mapq/-q`	u8	`255`	`0..255`	Minimum read mapping quality	Match aligner conventions
`--min-baseq/-Q`	u8	`30`	`0..255`	Minimum base quality	Raise for stringent mismatch calling
`--min-depth/-d`	u32	`3`	`>=1`	Minimum non-`N` depth	Increase to reduce sparse noise
`--max-n-fraction/-n`	u32	`10`	`>=1`	Ambiguous-base tolerance denominator	Lower for stricter base quality context
`--editing-threshold/-et`	u32	`10000`	`>=1`	Candidate support threshold in first pass	Lower when expecting low editing burden
`--stranded/-S`	bool	`false`	n/a	Preserve strand layers (`A0..C0`, `A1..C1`)	Strongly recommended for strand-specific libraries
`--max-depth/-D`	u32	`655360`	`>=1`	Pileup depth cap in matrix extraction	Set near expected extreme depth
`--umi-tag`	string	`UB`	SAM tag name	UMI field for molecule deduplication	Adjust to chemistry (
`RX`, etc.)
`--cb-tag`	string	`CB`	SAM tag name	Cell barcode field	Adjust if pipeline uses alternate tags
`--reference/-r`	path	optional	required for CRAM	FASTA for CRAM decoding	Mandatory for CRAM workflows
`--chunksize/-c`	u32	`100000`	`>=1`	Weight budget for normal loci chunks	Increase for throughput, decrease for memory control
`--chunk-size-max-depth`	u32	`2`	`>=1`	Small chunk cap for near-max-depth loci	Keep low to avoid hotspot stalls
`--matrix-density`	f64	`0.005`	`>0` practical	Sparse pre-allocation hint	Raise for dense targeted panels
`--allcontigs/-A`	bool	`false`	n/a	Include non-canonical contigs	Enable for non-standard references

6.3 `call` parameters

Parameter	Type	Default	Range / constraints	Function and biological meaning	Tuning guidance
`--input`	string path	required	`.h5ad`	Base-count AnnData input	Output of `bam2mtx`
`--output`	string path	required	`.h5ad`	Annotated editing output	Existing output is overwritten
`--fa`	string path	required	FASTA + `.fai` required	Reference base retrieval per site	Must match alignment reference build
`--site-white-list`	string path	required	TSV/TSV.GZ, CHR/POS leading columns	Known candidate editing sites	Use curated catalogs + project-specific blacklist logic
`--editingtype`	enum	`ag`	`ag/ac/at/ca/cg/ct`	Defines expected ref→alt and strand-aware complements	Use `ag` for A-to-I focused studies; `ct` for C-to-U hypotheses
`--max-other-threshold`	f64	`0.1`	validated `0..1`	Max tolerated non-ref/non-alt fraction	Lower for specificity in noisy cohorts
`--min-edited-threshold`	f64	`0.001`	validated `0..1`	Minimum edited-base fraction	Raise for stronger effect-size confidence
`--min-ref-threshold`	f64	`0.001`	validated `0..1`	Minimum reference-base fraction	Raise to avoid low-reference unstable sites
`--chunksize/-c`	usize	`100000`	`>=1`	Pipeline chunking granularity	Decrease if memory constrained
`--threads/-t`	Option	internal fallback `2`	`>=1` recommended	Rayon pool size for pipeline	Use core count for full throughput
`--min-coverage`	u16	`2`	`>=1`	Site-level minimum coverage	Increase for stringent single-cell confidence
`--verbose/-v`	bool	`false`	n/a	Verbose logging flag	Enable for debugging
`--dry-run`	bool	`false`	n/a	Validate inputs only	Always run before large cohorts

6.4 Mismatch decision rule used in `call`

For each site with coverage $C$:

$other_max = \max(\lceil C \cdot max_other_threshold \rceil, 2)$
$edited_min = \max(\lceil C \cdot min_edited_threshold \rceil, 1)$
$ref_min = \max(\lceil C \cdot min_ref_threshold \rceil, 1)$

A site passes if:

reference base is valid for selected editing type,
reference count $\ge ref_min$,
edited count $\ge edited_min$,
sum of other bases $\le other_max$,
site coverage $\ge min_coverage$.

7) Reproducible workflow (recommended)

Generate candidate loci
- redicat bulk with stringent quality filters.
Build base-count matrix
- redicat bam2mtx --two-pass (or provide validated TSV).
Call editing signal
- redicat call with matched reference build and curated whitelist.
Post-analysis
- Use obs['CEI'], var['filter_pass'], and <XY>_alt/<XY>_ref for downstream statistics.

For publication-grade reproducibility, record:

full command lines,
software version (redicat --version),
reference genome build and checksum,
whitelist provenance and version,
all threshold parameters.

8) Result interpretation

8.1 Cell-level metrics

ref, alt, others in obs: aggregated counts across retained sites.
CEI: cell editing index, interpreted as editing burden proxy:

$$ CEI = \frac{alt}{ref + alt} $$

Practical guidance:

Compare CEI across cell states/conditions with coverage-aware covariates.
Exclude cells with very low total informative counts (ref + alt) before differential analyses.

8.2 Site-level metrics

In var, fields such as AG_ref, AG_alt, AG_others (for editingtype=ag) summarize cohort-level support at each site.

Recommended biological interpretation:

prioritize sites with high *_alt, low *_others, and filter_pass=true;
remove known SNP loci (dbSNP/gnomAD) in downstream curation;
inspect strand consistency for expected editing class behavior.

9) Performance and complexity

Let:

$R$ = number of aligned reads inspected,
$P$ = number of queried positions,
$U$ = number of unique (cell, UMI, site) molecules after filtering,
$nnz$ = non-zeros in sparse matrices,
$T$ = worker threads.

9.1 `bulk`

Pileup traversal is approximately $O(R_{region})$ over processed chunks.
Parallel scheduling reduces wall-time to roughly $\sim O(R/T)$ in balanced workloads.

9.2 `bam2mtx`

Site processing: $O(R_{sites})$ for pileup reads intersecting selected loci.
UMI consensus hash aggregation: expected $O(U)$.
Sparse assembly: near $O(nnz)$ plus sorting/merge costs for triplet consolidation.

9.3 `call`

Coverage/base summaries and sparse reductions: near $O(nnz)$.
Site annotation/filtering/reference lookup: $O(P)$ (with caching effects for FASTA access).
Overall dominated by sparse matrix operations and site count.

In practice, REDICAT is throughput-optimized for sparse single-cell matrices with multi-core scaling and bounded-memory spill strategies in matrix assembly.

10) Notes and current implementation caveats

Canonical contig filtering defaults to chr1..chr22, chrX, chrY, chrM; use --allcontigs for non-canonical references.
bam2mtx --two-pass internally sets first-pass bulk max depth to 8000.
call writes priority layers (ref, alt, others, coverage) and annotations; preserve intermediate matrices if additional layers are required for custom methods.

redicat 0.3.8