Subsample-based splice-junction saturation analysis.
Algorithm (reimplemented from RSeQC junction_saturation.py):
- Parse BED12 gene models; extract all annotated splice sites (junction donor and acceptor positions) and annotated junctions (donor, acceptor pairs) into hash sets.
- Load all mapped, primary reads from the BAM.
- Shuffle all read indices once with a seeded ChaCha12 RNG.
- For each fraction F in [lower..upper] step S:
a. Take the first ⌊F% × total_reads⌋ indices from the shuffled order.
b. For each selected read, extract introns from the CIGAR N operations.
c. Classify each observed junction as:
- known: both donor and acceptor in the annotated junction set
- partial novel: one of donor/acceptor is annotated
- complete novel: neither is annotated
- Write one TSV file <prefix>.junction_saturation.txt with columns: pct\tknown\tpartial_novel\tcomplete_novel
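Steps b and c above can be sketched as follows. This is a minimal illustration rather than this module's actual code: the `CigarOp` alias, `JunctionClass` enum, and both function names are hypothetical, coordinates are assumed to be 0-based half-open, the donor and acceptor are taken as the intron start and end without regard to strand, and a single chromosome is assumed so that splice sites can be plain integers. The sketch classifies by splice-site membership only, so a junction whose two sites are annotated but never paired in the gene models is counted as known here; that choice is an assumption.

```rust
use std::collections::HashSet;

/// One CIGAR operation as (op character, length). Hypothetical representation.
type CigarOp = (char, u32);

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum JunctionClass {
    Known,
    PartialNovel,
    CompleteNovel,
}

/// Walk a read's CIGAR from its leftmost reference position and return every
/// intron (N operation) as a 0-based half-open (start, end) interval.
fn introns_from_cigar(pos: u64, cigar: &[CigarOp]) -> Vec<(u64, u64)> {
    let mut introns = Vec::new();
    let mut ref_pos = pos;
    for &(op, len) in cigar {
        let len = u64::from(len);
        match op {
            // N is a skipped region on the reference: record it as an intron.
            'N' => {
                introns.push((ref_pos, ref_pos + len));
                ref_pos += len;
            }
            // These operations consume reference bases.
            'M' | 'D' | '=' | 'X' => ref_pos += len,
            // I, S, H and P do not advance the reference position.
            _ => {}
        }
    }
    introns
}

/// Classify one intron by how many of its two ends are annotated splice sites.
fn classify(intron: (u64, u64), annotated_sites: &HashSet<u64>) -> JunctionClass {
    let donor_known = annotated_sites.contains(&intron.0);
    let acceptor_known = annotated_sites.contains(&intron.1);
    match (donor_known, acceptor_known) {
        (true, true) => JunctionClass::Known,
        (false, false) => JunctionClass::CompleteNovel,
        _ => JunctionClass::PartialNovel,
    }
}
```

For a read at position 100 with CIGAR 50M200N50M, `introns_from_cigar` yields the single interval (150, 350), which `classify` reports as known when both 150 and 350 are in the annotated site set.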
Using a single shuffle (prefix-based sampling) guarantees monotonicity:
the read set at fraction F1 is always a subset of the set at F2 > F1.
RSeQC’s subsampling is non-deterministic (Python random). We use a
seedable ChaCha12 RNG so results are reproducible when --seed is given.
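The prefix-sampling idea can be illustrated with a small standard-library sketch. To stay dependency-free it swaps in a SplitMix64 generator for ChaCha12, so it will not reproduce this module's actual shuffle order; `SplitMix64`, `shuffled_indices`, and `prefix_for_pct` are hypothetical names, and the modulo bias in the index draw is ignored for brevity.

```rust
/// SplitMix64: a tiny deterministic generator, used here only as a
/// stand-in for the seeded ChaCha12 RNG.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Shuffle the indices 0..n once with a Fisher-Yates pass driven by `seed`.
fn shuffled_indices(n: usize, seed: u64) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    let mut idx: Vec<usize> = (0..n).collect();
    for i in (1..n).rev() {
        // Pick j in 0..=i (slightly biased; acceptable for a sketch).
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    idx
}

/// Return the first floor(pct% × total) entries of the shuffled order.
fn prefix_for_pct(order: &[usize], pct: u32) -> &[usize] {
    let k = order.len() * pct as usize / 100;
    &order[..k.min(order.len())]
}
```

Because every fraction is a prefix of the same shuffled vector, the reads sampled at 5% are exactly the first part of those sampled at 10%, and calling `shuffled_indices` twice with the same seed returns the same order.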
Structs§
- FractionResult - Per-fraction junction counts.
- JunctionSaturationOpts - Options for junction saturation analysis.
Functions§
- run - Run junction saturation analysis.