8/25/19
- Fixed stats and added stats to groups. Still a little different than how umi_tools
does things, but I think it's reasonable
- To do: benchmark on some large datasets
- Clean up the code a bit
- Set it free
8/23/19
- First pass at adding stats, very much broken on the reads_unmapped
8/22/19
- Added first pass at support for paired end reads.
- Next, add tests for paired end reads? maybe not worth it
- Investigate diffs with umi_tools, especially around default settings for skipping / tlen
- Collect some stats similar to umi_tools
8/21/19
- Added test to make sure that the determine_umi step couldn't assign a umi to multiple
masters, which results in the read showing up twice in the group_only option.
- Restored rayon for run_dedup and run_group
- Check that the groups are still numbered correctly
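A minimal sketch of the kind of check the determine_umi test above performs, assuming the grouping result can be viewed as (master UMI, member UMIs) pairs. The function name and data shape are hypothetical and are not rumi's actual types.

```rust
use std::collections::HashMap;

/// Returns every UMI that was assigned to more than one master UMI.
/// `groups` pairs a master UMI with the UMIs collapsed into it
/// (hypothetical shape, for illustration only).
fn umis_with_multiple_masters<'a>(groups: &[(&'a str, Vec<&'a str>)]) -> Vec<&'a str> {
    // Maps each member UMI to the master it was most recently seen under.
    let mut assigned_to: HashMap<&str, &str> = HashMap::new();
    let mut doubled: Vec<&str> = Vec::new();
    for (master, members) in groups {
        for umi in members {
            // `insert` returns the previous master, if there was one.
            if let Some(prev) = assigned_to.insert(*umi, *master) {
                if prev != *master && !doubled.contains(umi) {
                    doubled.push(*umi);
                }
            }
        }
    }
    doubled
}
```

An empty return value means no read can show up twice in the group_only output because of a doubly assigned UMI.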
8/20/19
- Working on finding why rumi gets more reads on the example.bam file than umi_tools does.
It seems like umi_tools is in some cases pulling reads that are dist 3 away into groups
where they may not belong. The following is the output of compare_reads.pl:
```bash
$ perl ./scripts/compare_reads.pl (samtools view /mnt/d/dev/UMI-tools/tests_out/example_umitools.bam | psub) (samtools view /mnt/d/dev/UMI-tools/tests_out/example_rumi_deduped.bam | psub) (samtools view /mnt/d/dev/UMI-tools/tests_out/example_group.bam|psub) (samtools view /mnt/d/dev/UMI-tools/tests_out/example_rumi.bam | psub)
FOUND: SRR2057595.3345647_TTTGGTTTA 16 chr8 82003435 255 21M * 0 0 * * XA:i:2 MD:Z:1G2T16 NM:i:2 BX:Z:TTTGGTTTA UG:i:15127
EXPECTED: SRR2057595.3345647_TTTGGTTTA 16 chr8 82003435 255 21M * 0 0 * * XA:i:2 MD:Z:1G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.7255940_TGTGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TGTGGTTAC UG:i:15122
EXPECTED: SRR2057595.7255940_TGTGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.7609280_GCCGGTTTT 16 chr6 128748879 255 48M * 0 0 * * XA:i:1 MD:Z:0A47 NM:i:1 BX:Z:GCCGGTTTT UG:i:13709
EXPECTED: SRR2057595.7609280_GCCGGTTTT 16 chr6 128748879 255 48M * 0 0 * * XA:i:1 MD:Z:0A47 NM:i:1 UG:i:13685 BX:Z:GTAGGTTTC
FOUND: SRR2057595.5016607_GCAGGTTTA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:GCAGGTTTA UG:i:15129
EXPECTED: SRR2057595.5016607_GCAGGTTTA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.1514218_AAGGGTTAT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:AAGGGTTAT UG:i:15125
EXPECTED: SRR2057595.1514218_AAGGGTTAT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15060 BX:Z:ATGGGTTGA
FOUND: SRR2057595.897659_ATAGGTTTC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:ATAGGTTTC UG:i:15128
EXPECTED: SRR2057595.897659_ATAGGTTTC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15060 BX:Z:ATGGGTTGA
FOUND: SRR2057595.3245577_GGAGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:GGAGGTTCT UG:i:15130
EXPECTED: SRR2057595.3245577_GGAGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.13317470_TTGGGTTAA 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 BX:Z:TTGGGTTAA UG:i:10683
EXPECTED: SRR2057595.13317470_TTGGGTTAA 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 UG:i:10632 BX:Z:TCAGGTTCA
FOUND: SRR2057595.8903949_GCTGGTTCT 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:56T1C8 NM:i:2 BX:Z:GCTGGTTAT UG:i:3777
EXPECTED: SRR2057595.8903949_GCTGGTTCT 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:56T1C8 NM:i:2 UG:i:3735 BX:Z:ATGGGTTAT
FOUND: SRR2057595.6107476_TCGGGTTAC 0 chr11 83085100 255 67M * 0 0 * * XA:i:1 MD:Z:58T8 NM:i:1 BX:Z:TCGGGTTAC UG:i:2555
EXPECTED: SRR2057595.6107476_TCGGGTTAC 0 chr11 83085100 255 67M * 0 0 * * XA:i:1 MD:Z:58T8 NM:i:1 UG:i:2527 BX:Z:TTCGGTTGC
FOUND: SRR2057595.5405752_AACGGTTGG 0 chr1 72283620 255 67M * 0 0 * * XA:i:1 MD:Z:56G10 NM:i:1 BX:Z:AACGGTTGG UG:i:412
EXPECTED: SRR2057595.5405752_AACGGTTGG 0 chr1 72283620 255 67M * 0 0 * * XA:i:1 MD:Z:56G10 NM:i:1 UG:i:376 BX:Z:ATTGGTTCG
FOUND: SRR2057595.2806735_AAAGGTTCC 0 chr11 87275932 255 67M * 0 0 * * XA:i:2 MD:Z:28C5T32 NM:i:2 BX:Z:AAAGGTTCC UG:i:3102
EXPECTED: SRR2057595.2806735_AAAGGTTCC 0 chr11 87275932 255 67M * 0 0 * * XA:i:2 MD:Z:28C5T32 NM:i:2 UG:i:3097 BX:Z:GTAGGTTAC
FOUND: SRR2057595.8391205_GGGGGTTGT 16 chr3 96263946 255 39M * 0 0 * * XA:i:2 MD:Z:0T8C29 NM:i:2 BX:Z:GGGGGTTGT UG:i:10686
EXPECTED: SRR2057595.8391205_GGGGGTTGT 16 chr3 96263946 255 39M * 0 0 * * XA:i:2 MD:Z:0T8C29 NM:i:2 UG:i:10625 BX:Z:CTGGGTTGA
FOUND: SRR2057595.482451_TCGGGTTGG 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TCGGGTTGG UG:i:15110
EXPECTED: SRR2057595.482451_TCGGGTTGG 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.2938337_AATGGTTAC 16 chr9 65044379 255 27M * 0 0 * * XA:i:0 MD:Z:27 NM:i:0 BX:Z:AATGGTTAC UG:i:15687
EXPECTED: SRR2057595.2938337_AATGGTTAC 16 chr9 65044379 255 27M * 0 0 * * XA:i:0 MD:Z:27 NM:i:0 UG:i:15646 BX:Z:TCTGGTTTC
FOUND: SRR2057595.12752032_TTTGGTTGA 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:48C9C8 NM:i:2 BX:Z:TTTGGTTGA UG:i:3776
EXPECTED: SRR2057595.12752032_TTTGGTTGA 16 chr11 101507078 255 67M * 0 0 * * XA:i:2 MD:Z:48C9C8 NM:i:2 UG:i:3749 BX:Z:ATTGGTTCG
FOUND: SRR2057595.502927_CAAGGTTAA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:CAAGGTTAA UG:i:15120
EXPECTED: SRR2057595.502927_CAAGGTTAA 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.9402462_TATGGTTGG 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 BX:Z:TATGGTTGG UG:i:10684
EXPECTED: SRR2057595.9402462_TATGGTTGG 16 chr3 96263947 255 38M * 0 0 * * XA:i:1 MD:Z:11G26 NM:i:1 UG:i:10631 BX:Z:CATGGTTCT
FOUND: SRR2057595.4041993_CAAGGTTGA 0 chr3 96132476 255 38M * 0 0 * * XA:i:0 MD:Z:38 NM:i:0 BX:Z:CAAGGTTGA UG:i:10268
EXPECTED: SRR2057595.4041993_CAAGGTTGA 0 chr3 96132476 255 38M * 0 0 * * XA:i:0 MD:Z:38 NM:i:0 UG:i:10172 BX:Z:GGAGGTTAA
FOUND: SRR2057595.4828384_TCCGGTTCA 0 chr1 72290887 255 67M * 0 0 * * XA:i:1 MD:Z:37C29 NM:i:1 BX:Z:TCCGGTTCA UG:i:558
EXPECTED: SRR2057595.4828384_TCCGGTTCA 0 chr1 72290887 255 67M * 0 0 * * XA:i:1 MD:Z:37C29 NM:i:1 UG:i:517 BX:Z:CACGGTTTA
FOUND: SRR2057595.2554282_AGTGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:AGTGGTTCT UG:i:15134
EXPECTED: SRR2057595.2554282_AGTGGTTCT 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.4109672_TTGGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 BX:Z:TTGGGTTAC UG:i:15137
EXPECTED: SRR2057595.4109672_TTGGGTTAC 16 chr8 82003436 255 20M * 0 0 * * XA:i:2 MD:Z:0G2T16 NM:i:2 UG:i:15059 BX:Z:TGAGGTTGA
FOUND: SRR2057595.11515838_ATGGTTCTT 0 chr3 96132477 255 37M * 0 0 * * XA:i:0 MD:Z:37 NM:i:0 BX:Z:ATGGTTCTT UG:i:10287
EXPECTED: SRR2057595.11515838_ATGGTTCTT 0 chr3 96132477 255 37M * 0 0 * * XA:i:0 MD:Z:37 NM:i:0 UG:i:10267 BX:Z:ACGGTTACT
```
This accounts for 23 of the 26 extra reads from rumi. I don't know why umi_tools does this. I'm betting the other missing reads are in a similar vein.
- SRR2057595.3354975_CGGGTTGGT: rumi correctly uses the umi starting
with C since there are two reads with that umi. umi_tools uses the umi
with a frequency of only 1.
- SRR2057595.4915638_TTGGTTAAA: rumi correctly chooses the read with the
decided upon umi as the best read.
- SRR2057595.5405752_AACGGTTGG: rumi correctly leaves it as its own group.
umi_tools corrects it dist 3 away to ATTGGTTCG. I expect this to be
the end source of the 30 extra reads in rumi's output. What causes
this in umi_tools?
- Next up is to add a test for making sure that reads aren't doubled up on in the
determine_umi step. Then make that step better and faster.
- At this point I feel pretty confident in the calls that rumi makes.
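The "dist 3" figures above can be checked by counting mismatched positions between the read's UMI and the UMI umi_tools assigned. This is a small sketch that assumes "dist" means Hamming distance over equal-length UMIs; the function name is made up for illustration.

```rust
/// Number of positions at which two equal-length UMIs differ.
/// Returns None when the lengths differ, since Hamming distance is
/// only defined for equal-length sequences.
fn hamming(a: &str, b: &str) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count())
}
```

For example, `hamming("AACGGTTGG", "ATTGGTTCG")` gives `Some(3)` for the SRR2057595.5405752 read discussed above.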
8/18/19
- Updated tests to work with new BTreeMap structure and read_groups types
- Took second pass at a group_only option. Currently all reads are being collapsed
in the group_reads function. Need to figure out how to keep them around, ideally
without compromising the performance of the dedup procedure itself. group_only
can be slow. dedup must be fast
Later the same day
- I have an extra 3000 reads for no reason I can figure out. Also chrY is being ordered weirdly
...
- The determine_umi step was double adding some reads. That has been fixed, but it's an inefficient fix.
- group_only now outputs the right number of reads. The diffs between the two should
help figure out the remaining diffs in the dedup only.
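A rough sketch of the trade-off described in this entry: keep every read per position for group_only, but reduce to per-UMI counts for dedup. The `Read` struct, the `PosKey` alias, and both function names are hypothetical stand-ins, not rumi's real types.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal read record (illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Read {
    name: String,
    chrom: String,
    pos: u64,
    reverse: bool,
    umi: String,
}

/// Position key: (contig name, start position, is reverse strand).
type PosKey = (String, u64, bool);

/// Group reads by position and keep all of them, which is what a
/// group-only mode needs in order to tag every read.
fn group_reads(reads: Vec<Read>) -> BTreeMap<PosKey, Vec<Read>> {
    let mut groups: BTreeMap<PosKey, Vec<Read>> = BTreeMap::new();
    for r in reads {
        groups
            .entry((r.chrom.clone(), r.pos, r.reverse))
            .or_default()
            .push(r);
    }
    groups
}

/// Collapse one position's reads to per-UMI counts, which is all a
/// dedup-only pass needs before picking a representative read.
fn umi_counts(reads: &[Read]) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for r in reads {
        *counts.entry(r.umi.as_str()).or_insert(0) += 1;
    }
    counts
}
```

A BTreeMap iterates in key order, so contig names stored as strings sort lexicographically (chr10 before chr2). Whether that is related to the chrY ordering noted above is not established here.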
8/17/19
- Update tests to work with new read_groups types and BTreeMap
- Then I need to add the UG and BX tags to them for the group_only option
- I think that my missing reads stem from positional grouping, not from
the directional adjacency. Find a way to test this?
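One way to test the two stages separately is to unit test the adjacency rule on its own, with no positional grouping involved. The sketch below assumes the directional rule as umi_tools describes it: a parent UMI absorbs a child when they are within the edit threshold and count(parent) >= 2 * count(child) - 1. The function name and signature are hypothetical.

```rust
/// True when `parent` may absorb `child` under the directional rule:
/// the UMIs differ at no more than `max_dist` positions and
/// count(parent) >= 2 * count(child) - 1.
/// Each argument is a (UMI, read count) pair.
fn directional_edge(parent: (&str, u64), child: (&str, u64), max_dist: usize) -> bool {
    let (p_umi, p_count) = parent;
    let (c_umi, c_count) = child;
    if p_umi.len() != c_umi.len() {
        return false;
    }
    // Hamming distance between the two UMIs.
    let dist = p_umi
        .bytes()
        .zip(c_umi.bytes())
        .filter(|(x, y)| x != y)
        .count();
    // The count condition is rearranged to p + 1 >= 2 * c so that
    // unsigned arithmetic cannot underflow when c is 0.
    dist <= max_dist && p_count + 1 >= 2 * c_count
}
```

If this check behaves as expected on hand-built UMI counts, any remaining differences are more likely to come from how reads are bucketed by position.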