1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
//! Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related
//! sequences, typically small haploid genomes such as bacteria and viruses.
//!
//! SKA can only align SNPs further than the k-mer length apart,
//! and does not use a gap penalty approach or give alignment scores.
//! But the advantages are speed and flexibility, particularly the ability to
//! run on a reference-free manner (i.e. including accessory genome variation)
//! on both assemblies and reads.
//!
//! ## Details
//!
//! A split k-mer is a sequence of bases with the middle position removed. For
//! example, the split 9-mer pattern is `XXXX-XXXX`, and the split 9-mers of the
//! sequence `ACGAGAGTCTT` are:
//!
//! | Split k-mer | Middle base |
//! |-------------|-------------|
//! | `ACGAAGTC` | `G` |
//! | `CGAGGTCT` | `A` |
//! | `GAGATCTT` | `G` |
//!
//! Which is broadly how SKA represents sequences in `.skf` files, as a dictionary with split
//! k-mers as keys and the middle base as values. These dictionaries are merged
//! by exactly matching the split k-mers, such that the middle positions are aligned (but
//! unordered). Split k-mers can also be matched to those from a reference sequence
//! to give an ordered pseudoalignment.
//!
//! Various optimisations are used to make this as fast as possible. For a more thorough comparison with version 1.0 of SKA, see the
//! [github description](https://github.com/bacpop/ska.rust/blob/master/README.md).
//!
//! *NB As split k-mers are even lengths, it is possible that they are their
//! own reverse complement. The original version of ska would randomly pick a strand,
//! possibly introducing a SNP across samples. This version uses an ambiguous middle
//! base (W for A/T; S for C/G) to represent this case.*
//!
//! Tutorials:
//! - [From genomes to trees](https://www.bacpop.org/guides/building_trees_with_ska/).
//! - [Filtering options](https://www.bacpop.org/guides/snp_alignment_with_ska/).
//!
//! Papers:
//! - [Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis ](https://genome.cshlp.org/content/34/10/1661.abstract).
//! - [skalo: using SKA split k-mers with coloured de Brujin graphs to genotype indels](https://academic.oup.com/mbe/article/42/4/msaf077/8103706).
//!
//! Command line usage follows. For API documentation and usage, see the [end of this section](#api-usage).
//!
//! # Usage
//!
//! `.skf` files represent merged split k-mers from multiple sequence files. They
//! are created with `ska build`. You can subsequently use `ska align` to write
//! out an unordered alignment from these files, or `ska map` to write an alignment
//! ordered versus a reference sequence.
//!
//! Alternatively, both `ska align` and `ska map` can take input sequence directly to obtain output alignments
//! in a single command and without saving an `.skf` file. NB: This uses the default options
//! of `ska build`, so to change these you will need to run the alignment in two steps.
//!
//! Output from `ska align` and `ska map` is to STDOUT, so you can use a redirect `>` to save to a file or pipe `|`
//! to stream into another compatible program on STDIN. You can also add an output
//! file prefix directly with `-o` (for `ska build` this is required).
//!
//! ## Important notes
//!
//! - In this version of ska the k-mer length is the _total_ split k-mer size,
//! whereas in ska1 the k-mer length is half the split length. So the default of
//! k=15 in ska1 corresponds to choosing `-k 31` in this version.
//! - If you are using FASTQ input files (reads), these must be provided as two
//! deinterleaved files using the `-f` option to `ska build`. Providing them as
//! a single file will treat them as FASTA, and not correct sequencing errors.
//! - If you are running `ska map`, it is more memory efficient to run these one at
//! a time, rather than merging a single `.skf` file. A command for doing this
//! in parallel is listed below.
//!
//! ## Common options
//!
//! Version can be viewed by running `ska -V`.
//!
//! Details and progress messages are written on STDERR. You can see more logging
//! information by adding the verbose flag `-v`.
//!
//! ## ska build
//!
//! This command creates an `.skf` file from sequence (.fasta/.fasta.gz/.fastq/.fastq.gz) input.
//! K-mer size must be odd, greater than 5, and a maximum of 63. Smaller k-mers
//! are more sensitive and can find closer positions, but are less specific so
//! may lead to more repeated split k-mers and ambiguous bases. Using k <= 31 uses
//! 64-bit integers and may be faster than 31 < k <= 63, which uses 128-bits.
//!
//! This is typically the most computationally intensive step of `ska`, and
//! multiple `--threads` can be used to split the work over multiple CPU cores.
//!
//! Build from two input FASTA files with a k-mer size of 31:
//! ```bash
//! ska build -o seqs -k 31 assemblies/seq1.fa assemblies/seq2.fa
//! ```
//! This will assume sample names of `seq1` and `seq2` –- the base path and with known fastx
//! file extensions removed. If you know the strand,
//! for example when exclusively using reference sequences or single stranded viruses, add `--single-strand`
//! to ignore reverse complements.
//!
//! ### Read files
//!
//! To use FASTQ files, or specify sample names, or more easily input a larger number of input files,
//! you can provide a tab separated file list via `-f` instead of listing files. For example
//! a file called `input_sequence.txt` to describe `sample_1` (an assembly) and `sample_2` (paired reads):
//! ```text
//! sample_1 assemblies/seq1.fa
//! sample_2 reads/seq2_1.fastq.gz reads/seq2_2.fastq.gz
//! ```
//! You'd run:
//! ```bash
//! ska build -o seqs -f input_sequence.txt
//! ```
//! Options for filtering/error correction are:
//! - `--min-count`. Specify a minimum number of appearances a k-mer must have
//! to be included. This is an effective way of filtering sequencing errors if set
//! to at least three, but higher may be appropriate for higher coverage data.
//! A two-step blocked bloom and hash table filter is used for memory efficiency.
//! - `--proportion-reads`. Specify a proportion of reads to use to build the .skf file.
//! The value of this parameter must be between 0 and 1. If you have high coverage samples
//! this can be used to reduce the build time.
//! - `--qual-filter`. `none` do not filter based on quality scores.
//! `middle` (default) filter k-mers where the middle base is below the minimum quality.
//! `strict` filter k-mers where any base is below the minimum quality.
//! - `--min-qual`. Specify a minimum PHRED score to use in the filter.
//!
//! FASTQ files must be paired end. If you'd like to request more flexibility in
//! this regard please contact us.
//!
//! ### Threads
//!
//! The maximum threads actually used will be a power of two, so if you provided
//! `--threads 6` only four would be used. Additionally, at least ten samples per
//! thread are required so maximums are:
//!
//! | Samples | Maximum threads |
//! |---------|-----------------|
//! | 1-19 | 1 |
//! | 20-39 | 2 |
//! | 40-79 | 4 |
//!
//! and so on. Use `-v` to see a message with the number being used.
//!
//! Using more threads will increase memory use.
//!
//! You can also run blocks of samples independently (e.g. with snakemake or a
//! job array) then use `ska merge` to combine results.
//!
//! ## ska align
//!
//! Creates an alignment from a `.skf` file or sequence files. Sites (columns) are
//! in an arbitrary order. Two basic filters are available: `--min-freq` which
//! sets the maximum number of missing sites; `--filter` which can be set to
//! one of three options:
//! * `no-filter` -- no extra filtering
//! * `no-const` -- no constant sites
//! * `no-ambig-or-const` -- no constant sites, or sites where the only variable base is ambiguous
//!
//! `no-filter` may be useful for ascertainment bias correction in phylogenetic algorithms,
//! but note the flanking split k-mers will never be included. `no-const` is the default.
//! `no-ambig-or-const` can be used when you want to treat any ambiguity as an `N`.
//!
//! With an `.skf` file from `ska build`, constant sites, and no missing variants:
//! ```bash
//! ska align --min-freq 1 --filter no-filter -o seqs seqs.skf
//! ```
//!
//! Another example: directly from FASTA files, with default `build` and `align` settings,
//! and gzipping the output alignment
//! ```bash
//! ska align seq*.fa --threads 8 | gzip -c - > seqs.aln.gz
//! ```
//!
//! ## ska map
//!
//! Creates an alignment from a `.skf` file or sequence files by mapping its
//! split k-mers to split k-mers of a linear reference sequence. This produces pseudoalignment
//! with the same sites/columns as the reference sequence. Sites not mapped will
//! be output as missing '-'.
//!
//! Pass the FASTA file of the reference as the first argument, and the skf file as
//! the second argument:
//! ```bash
//! ska map ref.fa seqs.skf -o ref_mapped.aln
//! ```
//! Add `--repeat-mask` to mask any repeated split k-mers in the reference with 'N'.
//!
//! You can also get a VCF as output, which has rows as variants, and only has the
//! variable sites (but will include unmapped bases as missing).
//! An example command, also demonstrating that everything can be done from input sequences in a single command:
//! ```bash
//! ska map ref.fa seq1.fa seq2.fa -f vcf --threads 2 | bgzip -c - > seqs.vcf.gz
//! ```
//!
//! ## ska lo
//!
//! Converts split k-mers from a `.skf` file into a colored De Bruijn graph and infers indels from graph bubbles and SNPs from variant groups in
//! reference-free mode (as with `ska align`). SNPs are only composed of ATGC variants (no ambigous nucleotides). The same
//! filtering applies to indels.
//! Multithreading ('-t' argument) is not fully optimised - it usually takes 4 threads to halve runtimes.
//!
//! To generate a SNP alignment and an indel VCF file (here named 'test_snps.fas' and 'test_indels.vcf'):
//! ```bash
//! ska lo seqs.skf test
//! ```
//!
//! It can also position SNPs on a reference genome if provided using the '-r' argument. The reference genome should
//! be in FASTA format and composed of a unique sequence. skalo lo will then generate, in addition to the SNP alignment and the indel
//! VCF file, a SNP VCF file and a pseudo-genome alignment (as with `ska map`) that can be used for recombination analyses.
//! In such use case, we recommmend to increase the maximum proportion of missing data allowed per variant ('-m' argument), but not above 0.5.
//! ```bash
//! ska lo seqs.skf test -r reference.fas -m 0.4
//! ```
//! Please note that at the moment indels cannot be positioned on a reference genome.
//!
//!
//! ### Efficiency
//!
//! As `ska map` is independent for each input file, the most efficient way to
//! run this on a number of samples is with gnu parallel, for example:
//! ```bash
//! cat ${FILE_LIST} | parallel -j 8 "ID=\$(basename {} | sed 's/_a.*fasta/_temp_skf/g');
//! ska build -o \$ID -k 31 {} ;
//! MAPOUT=\$(basename {} | sed 's/_a.*fasta/_aln/g');
//! ska map $REFERENCE_LOC \${ID}.skf -o \${MAPOUT}.aln"
//! cat *.aln > ska_map.aln
//! ```
//!
//! ## ska distance
//!
//! Use to calculate distances between all samples within an `.skf` file. The output
//! will contain the number of SNP differences between all pairs of samples, as
//! well as the proportion of matching k-mers.
//!
//! ```bash
//! ska distance -o distances.txt seqs.skf
//! ```
//!
//! Mismatches is an estimate of the alignable genome fraction based on the unfiltered
//! k-mers.
//!
//! Also consider ambiguous bases by adding `--allow-ambiguous` flag, which will give partial SNP distances.
//! Note that ambiguous bases may overestimate distances due to repeat k-mers. Use `--min-freq` to
//! ignore k-mers only found in some samples at the population level (default = 0.0). F
//! or finer control over filtering, first run `ska weed`
//! on the input .skf.
//!
//! Multiple threads can be used, but this will only be effective with large numbers of samples.
//!
//! The companion script in `scripts/cluster_dists.py` (requires `networkx`) can
//! be used to make single linkage clusters from these distances at given thresholds,
//! and create a visualisation in [Microreact](https://microreact.org/):
//! ```bash
//! ska distance seqs.skf > distances.txt
//! python scripts/cluster_dists.py distances.txt --snps 20 --mismatches 0.05
//! ```
//! If you install `rapidnj` and `biopython` you can also draw an NJ tree from
//! these distances, which will be displayed in Microreact. Use your `--api-key`
//! to directly upload and get the URL printed on the terminal.
//!
//! ## ska merge
//!
//! Use to combine multiple `.skf` files into one, typically for subsequent use in `align` or `map`.
//! This may be particularly useful if `ska build` was run on multiple input files
//! one at a time (for example in a job array).
//!
//! ```bash
//! ska merge -o all_samples sample1.skf sample2.skf sample3.skf
//! ```
//!
//! ## ska delete
//!
//! Remove samples by name from an `.skf` file. All sample names must exist in the
//! file, or an error will be thrown. After removing samples, the input `.skf` file will be overwritten,
//! unless you provide a new one with `-o`.
//! Any split k-mers which have no observed missing bases after removal will also be deleted.
//!
//! ```bash
//! ska delete --skf-file all_samples.skf sample_1 sample_3
//! ```
//! If you wish to delete many samples, you can use `-f` to provide their names
//! by line via an input file.
//!
//! ## ska weed
//!
//! Remove (weed) split k-mers from an `.skf` file. The split k-mers to be removed
//! are generated from a FASTA file, which may for example contain known transposons or
//! other repeat sequences. As with `delete`, the input `.skf` file is overwritten by default.
//! Use `-o` to write the results to a new file instead of overwriting the input `.skf`.
//!
//! ```bash
//! ska weed all_samples.skf MGEs.fa
//! ```
//!
//! In addition, you can also use `ska weed` to filter split k-mers by proportion of samples
//! they appear in, and constant or amibguous middle cases, which does
//! not require a list of split k-mers to be removed (but both can be used together, if you wish).
//!
//! For example, you may want to reduce the size of an `.skf` file for online use.
//! You can do this by removing any constant sites or ambiguous-only s0tes (which are typically unused), and by hard-filtering
//! by frequency (i.e. before writing output):
//! ```bash
//! ska weed --filter no-ambig-or-const --min-freq 0.9 all_samples.skf
//! ```
//!
//! ## ska nk
//! Return information on the *n*umber of *k*-mers in an `.skf` file. This will
//! print on STDOUT:
//! - The version of ska used to build the file.
//! - The k-mer size.
//! - Number of bits used for the split k-mer (64 or 128).
//! - Whether reverse complements were used.
//! - The total number of split k-mers.
//! - The total number of samples.
//! - The sample names.
//! - The number of split k-mers found in each sample.
//!
//! ```bash
//! ska nk all_samples.skf
//! ska_version=0.3.1
//! k=21
//! k_bits=64
//! rc=true
//! k-mers=3228084
//! samples=28
//! sample_names=["19183_4#73", "12673_8#29", "12673_8#31", "12754_5#61", "12754_5#89", "12754_5#47", "12754_5#32", "12754_5#78", "12754_4#85", "12754_5#74", "19183_4#57", "12754_5#36", "19183_4#49", "19183_4#79", "19183_4#60", "12754_5#24", "12754_5#22", "12754_5#71", "12673_8#26", "12754_5#95", "12754_5#86", "12673_8#24", "19183_4#61", "12673_8#41", "12754_4#66", "12754_5#80", "12754_5#84", "19183_4#78"]
//! sample_kmers=[2872587, 2997448, 2949719, 2997496, 2997178, 2912749, 2996491, 2997221, 2949102, 2997454, 2914109, 2912237, 2872518, 2869957, 2872470, 2997992, 2997647, 2958512, 2998099, 2997290, 2950253, 3027707, 2997881, 2907920, 2911447, 2997644, 2944830, 2915080]
//! ```
//!
//! If you add `--full-info`, the split k-mer dictionary will be decoded and printed
//! to STDOUT after the baseline information, for example:
//! ```bash
//! TAAATATC TAAACGCC A,-
//! AGACTCTC TACAGCTA G,G
//! AAACCATC AAACACTC A,-
//! TTAAAAGA TCTCGTAC C,C
//! ```
//! This is tab-separated, showing the first and second half of the split k-mer
//! and the middle bases of each sample (comma seperated).
//!
//! NB: Only one split k-mer is shown even if the reverse complement was used.
//! These are not precisely canonical k-mers, as the encoding order `{A, C, T, G}` is used internally.
//! But if you can't find a sequence in your input file, you will find its reverse complement.
//!
//! ## ska cov
//!
//! Estimate a coverage cutoff for use with read data. This will count the split
//! k-mers in a pair of FASTQ samples, and create a histogram of these counts.
//! A mixture model is then fitted to this histogram using maximum likelihood,
//! which can give a suggested cutoff with noisy data.
//!
//! The cutoff will be printed to STDERR. Use `-v` to get additional information on the
//! optimisation progress and result. A table of counts and the fit will be printed
//! to STDOUT, which can then be plotted by the companion script in
//! `scripts/plot_cov.py` (requires `matplotlib`):
//! ```bash
//! ska cov reads_1.fastq.gz reads_2.fastq.gz -k 31 -v > cov_plot.txt
//! python scripts/plot_cov.py cov_plot.txt
//! ```
//!
//! The cutoff can be used with the `--min-count` parameter of `ska build`. For
//! a set of sequence experiments with similar characteristics it may be sufficient
//! to use the same cutoff. Alternatively `ska cov` can be run on every sample
//! independently (`gnu parallel` would be an efficient way to do this).
//!
//! ## ska lo
//!
//! Runs "skalo", a graph-based algorithm for inferring variants from WGS outbreak data.
//! Skalo takes as input a `.skf` file generated by the `ska build` command.
//!
//! To run skalo on a file named `my_file.skf` use the following command `ska lo my_file.skf skalo_results`.
//!
//! The skalo method
//! can be modified using the following flags:
//! * `-r` the path to a reference genome that the detected SNPs can be mapped onto
//! * `-m` the maximum fraction of missing data permissible for a site where a possible SNP has been detected (default = 0.2)
//! * `-d` the depth of the recursive path allowed during graph traversal (default = 4)
//! * `-n` maximum number of internal indel k-mers (default = 2)
//! * `-t` number of threads to use for parallel computation (default = 1)
//!
//!
//! # API usage
//!
//! See the submodule documentation linked below.
//!
//! To use `ska build` in other rust code:
//! ```rust
//! use ska::merge_ska_dict::{InputFastx, build_and_merge};
//! use ska::merge_ska_array::MergeSkaArray;
//! use ska::{QualOpts, QualFilter};
//!
//! // Build, merge
//! let input_files: [InputFastx; 2] = [("test1".to_string(),
//! "tests/test_files_in/test_1.fa".to_string(),
//! None),
//! ("test2".to_string(),
//! "tests/test_files_in/test_2.fa".to_string(),
//! None)];
//! let rc = true;
//! let k = 31;
//! let quality = QualOpts {min_count: 1, min_qual: 0, qual_filter: QualFilter::NoFilter};
//! let threads = 2;
//! let proportion_reads = None;
//! // NB u64 for k<=31, u128 for k<=63
//! let merged_dict =
//! build_and_merge::<u64>(&input_files, k, rc, &quality, threads, proportion_reads);
//!
//! // Save
//! let ska_array = MergeSkaArray::new(&merged_dict);
//! ska_array
//! .save(format!("merged.skf").as_str())
//! .expect("Failed to save output file");
//! ```
//!
//! To use `ska align` in other rust code:
//! ```rust
//! use ska::io_utils::{load_array, set_ostream};
//! use ska::cli::FilterType;
//!
//! // Load an .skf file
//! let threads = 4;
//! let input = vec!["tests/test_files_in/merge.skf".to_string()];
//! let mut ska_array = load_array::<u64>(&input, threads).expect("Could not open input as u64");
//!
//! // Apply filters
//! let min_count = 2;
//! let filter_ambig_as_missing = false;
//! let update_kmers = false;
//! let filter = FilterType::NoConst;
//! let ignore_const_gaps = false;
//! let ambig_mask = false;
//! ska_array.filter(min_count, filter_ambig_as_missing, &filter, ambig_mask, ignore_const_gaps, update_kmers);
//!
//! // Write out to stdout
//! let mut out_stream = set_ostream(&None);
//! ska_array
//! .write_fasta(&mut out_stream)
//! .expect("Couldn't write output fasta");
//! ```
//!
//! To use `ska map` in other rust code, see the example for [`ska_ref`].
//!
//! ## Other APIs
//!
//! It would be possible to make a python API using [`maturin`](https://github.com/PyO3/maturin).
//! Please submit a feature request if you would find this useful.
//!
use fmt;
use Instant;
use ValueEnum;
extern crate num_cpus;
use cratebuild_and_merge;
use crateRefSka;
use crateMergeSkaArray;
use crate*;
use crate*;
use crate*;
use crateCoverageHistogram;
use crateload_array;
use crateConfig;
/// Possible quality score filters when building with reads
/// Quality filtering options for FASTQ files