ska/lib.rs
1//! Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related
2//! sequences, typically small haploid genomes such as bacteria and viruses.
3//!
4//! SKA can only align SNPs further than the k-mer length apart,
5//! and does not use a gap penalty approach or give alignment scores.
6//! But the advantages are speed and flexibility, particularly the ability to
7//! run on a reference-free manner (i.e. including accessory genome variation)
8//! on both assemblies and reads.
9//!
10//! ## Details
11//!
12//! A split k-mer is a sequence of bases with the middle position removed. For
13//! example, the split 9-mer pattern is `XXXX-XXXX`, and the split 9-mers of the
14//! sequence `ACGAGAGTCTT` are:
15//!
16//! | Split k-mer | Middle base |
17//! |-------------|-------------|
18//! | `ACGAAGTC` | `G` |
19//! | `CGAGGTCT` | `A` |
20//! | `GAGATCTT` | `G` |
21//!
22//! Which is broadly how SKA represents sequences in `.skf` files, as a dictionary with split
23//! k-mers as keys and the middle base as values. These dictionaries are merged
24//! by exactly matching the split k-mers, such that the middle positions are aligned (but
25//! unordered). Split k-mers can also be matched to those from a reference sequence
26//! to give an ordered pseudoalignment.
27//!
28//! Various optimisations are used to make this as fast as possible. For a more thorough comparison with version 1.0 of SKA, see the
29//! [github description](https://github.com/bacpop/ska.rust/blob/master/README.md).
30//!
31//! *NB As split k-mers are even lengths, it is possible that they are their
32//! own reverse complement. The original version of ska would randomly pick a strand,
33//! possibly introducing a SNP across samples. This version uses an ambiguous middle
34//! base (W for A/T; S for C/G) to represent this case.*
35//!
36//! Tutorials:
37//! - [From genomes to trees](https://www.bacpop.org/guides/building_trees_with_ska/).
38//! - [Filtering options](https://www.bacpop.org/guides/snp_alignment_with_ska/).
39//!
40//! Papers:
41//! - [Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis ](https://genome.cshlp.org/content/34/10/1661.abstract).
42//! - [skalo: using SKA split k-mers with coloured de Brujin graphs to genotype indels](https://academic.oup.com/mbe/article/42/4/msaf077/8103706).
43//!
44//! Command line usage follows. For API documentation and usage, see the [end of this section](#api-usage).
45//!
46//! # Usage
47//!
48//! `.skf` files represent merged split k-mers from multiple sequence files. They
49//! are created with `ska build`. You can subsequently use `ska align` to write
50//! out an unordered alignment from these files, or `ska map` to write an alignment
51//! ordered versus a reference sequence.
52//!
53//! Alternatively, both `ska align` and `ska map` can take input sequence directly to obtain output alignments
54//! in a single command and without saving an `.skf` file. NB: This uses the default options
55//! of `ska build`, so to change these you will need to run the alignment in two steps.
56//!
57//! Output from `ska align` and `ska map` is to STDOUT, so you can use a redirect `>` to save to a file or pipe `|`
58//! to stream into another compatible program on STDIN. You can also add an output
59//! file prefix directly with `-o` (for `ska build` this is required).
60//!
61//! ## Important notes
62//!
63//! - In this version of ska the k-mer length is the _total_ split k-mer size,
64//! whereas in ska1 the k-mer length is half the split length. So the default of
65//! k=15 in ska1 corresponds to choosing `-k 31` in this version.
66//! - If you are using FASTQ input files (reads), these must be provided as two
67//! deinterleaved files using the `-f` option to `ska build`. Providing them as
68//! a single file will treat them as FASTA, and not correct sequencing errors.
69//! - If you are running `ska map`, it is more memory efficient to run these one at
70//! a time, rather than merging a single `.skf` file. A command for doing this
71//! in parallel is listed below.
72//!
73//! ## Common options
74//!
75//! Version can be viewed by running `ska -V`.
76//!
77//! Details and progress messages are written on STDERR. You can see more logging
78//! information by adding the verbose flag `-v`.
79//!
80//! ## ska build
81//!
82//! This command creates an `.skf` file from sequence (.fasta/.fasta.gz/.fastq/.fastq.gz) input.
83//! K-mer size must be odd, greater than 5, and a maximum of 63. Smaller k-mers
84//! are more sensitive and can find closer positions, but are less specific so
85//! may lead to more repeated split k-mers and ambiguous bases. Using k <= 31 uses
86//! 64-bit integers and may be faster than 31 < k <= 63, which uses 128-bits.
87//!
88//! This is typically the most computationally intensive step of `ska`, and
89//! multiple `--threads` can be used to split the work over multiple CPU cores.
90//!
91//! Build from two input FASTA files with a k-mer size of 31:
92//! ```bash
93//! ska build -o seqs -k 31 assemblies/seq1.fa assemblies/seq2.fa
94//! ```
95//! This will assume sample names of `seq1` and `seq2` –- the base path and with known fastx
96//! file extensions removed. If you know the strand,
97//! for example when exclusively using reference sequences or single stranded viruses, add `--single-strand`
98//! to ignore reverse complements.
99//!
100//! ### Read files
101//!
102//! To use FASTQ files, or specify sample names, or more easily input a larger number of input files,
103//! you can provide a tab separated file list via `-f` instead of listing files. For example
104//! a file called `input_sequence.txt` to describe `sample_1` (an assembly) and `sample_2` (paired reads):
105//! ```text
106//! sample_1 assemblies/seq1.fa
107//! sample_2 reads/seq2_1.fastq.gz reads/seq2_2.fastq.gz
108//! ```
109//! You'd run:
110//! ```bash
111//! ska build -o seqs -f input_sequence.txt
112//! ```
113//! Options for filtering/error correction are:
114//! - `--min-count`. Specify a minimum number of appearances a k-mer must have
115//! to be included. This is an effective way of filtering sequencing errors if set
116//! to at least three, but higher may be appropriate for higher coverage data.
117//! A two-step blocked bloom and hash table filter is used for memory efficiency.
118//! - `--proportion-reads`. Specify a proportion of reads to use to build the .skf file.
119//! The value of this parameter must be between 0 and 1. If you have high coverage samples
120//! this can be used to reduce the build time.
121//! - `--qual-filter`. `none` do not filter based on quality scores.
122//! `middle` (default) filter k-mers where the middle base is below the minimum quality.
123//! `strict` filter k-mers where any base is below the minimum quality.
124//! - `--min-qual`. Specify a minimum PHRED score to use in the filter.
125//!
126//! FASTQ files must be paired end. If you'd like to request more flexibility in
127//! this regard please contact us.
128//!
129//! ### Threads
130//!
131//! The maximum threads actually used will be a power of two, so if you provided
132//! `--threads 6` only four would be used. Additionally, at least ten samples per
133//! thread are required so maximums are:
134//!
135//! | Samples | Maximum threads |
136//! |---------|-----------------|
137//! | 1-19 | 1 |
138//! | 20-39 | 2 |
139//! | 40-79 | 4 |
140//!
141//! and so on. Use `-v` to see a message with the number being used.
142//!
143//! Using more threads will increase memory use.
144//!
145//! You can also run blocks of samples independently (e.g. with snakemake or a
146//! job array) then use `ska merge` to combine results.
147//!
148//! ## ska align
149//!
150//! Creates an alignment from a `.skf` file or sequence files. Sites (columns) are
151//! in an arbitrary order. Two basic filters are available: `--min-freq` which
152//! sets the maximum number of missing sites; `--filter` which can be set to
153//! one of three options:
154//! * `no-filter` -- no extra filtering
155//! * `no-const` -- no constant sites
156//! * `no-ambig-or-const` -- no constant sites, or sites where the only variable base is ambiguous
157//!
158//! `no-filter` may be useful for ascertainment bias correction in phylogenetic algorithms,
159//! but note the flanking split k-mers will never be included. `no-const` is the default.
160//! `no-ambig-or-const` can be used when you want to treat any ambiguity as an `N`.
161//!
162//! With an `.skf` file from `ska build`, constant sites, and no missing variants:
163//! ```bash
164//! ska align --min-freq 1 --filter no-filter -o seqs seqs.skf
165//! ```
166//!
167//! Another example: directly from FASTA files, with default `build` and `align` settings,
168//! and gzipping the output alignment
169//! ```bash
170//! ska align seq*.fa --threads 8 | gzip -c - > seqs.aln.gz
171//! ```
172//!
173//! ## ska map
174//!
175//! Creates an alignment from a `.skf` file or sequence files by mapping its
176//! split k-mers to split k-mers of a linear reference sequence. This produces pseudoalignment
177//! with the same sites/columns as the reference sequence. Sites not mapped will
178//! be output as missing '-'.
179//!
180//! Pass the FASTA file of the reference as the first argument, and the skf file as
181//! the second argument:
182//! ```bash
183//! ska map ref.fa seqs.skf -o ref_mapped.aln
184//! ```
185//! Add `--repeat-mask` to mask any repeated split k-mers in the reference with 'N'.
186//!
187//! You can also get a VCF as output, which has rows as variants, and only has the
188//! variable sites (but will include unmapped bases as missing).
189//! An example command, also demonstrating that everything can be done from input sequences in a single command:
190//! ```bash
191//! ska map ref.fa seq1.fa seq2.fa -f vcf --threads 2 | bgzip -c - > seqs.vcf.gz
192//! ```
193//!
194//! ## ska lo
195//!
196//! Converts split k-mers from a `.skf` file into a colored De Bruijn graph and infers indels from graph bubbles and SNPs from variant groups in
197//! reference-free mode (as with `ska align`). SNPs are only composed of ATGC variants (no ambigous nucleotides). The same
198//! filtering applies to indels.
199//! Multithreading ('-t' argument) is not fully optimised - it usually takes 4 threads to halve runtimes.
200//!
201//! To generate a SNP alignment and an indel VCF file (here named 'test_snps.fas' and 'test_indels.vcf'):
202//! ```bash
203//! ska lo seqs.skf test
204//! ```
205//!
206//! It can also position SNPs on a reference genome if provided using the '-r' argument. The reference genome should
207//! be in FASTA format and composed of a unique sequence. skalo lo will then generate, in addition to the SNP alignment and the indel
208//! VCF file, a SNP VCF file and a pseudo-genome alignment (as with `ska map`) that can be used for recombination analyses.
209//! In such use case, we recommmend to increase the maximum proportion of missing data allowed per variant ('-m' argument), but not above 0.5.
210//! ```bash
211//! ska lo seqs.skf test -r reference.fas -m 0.4
212//! ```
213//! Please note that at the moment indels cannot be positioned on a reference genome.
214//!
215//!
216//! ### Efficiency
217//!
218//! As `ska map` is independent for each input file, the most efficient way to
219//! run this on a number of samples is with gnu parallel, for example:
220//! ```bash
221//! cat ${FILE_LIST} | parallel -j 8 "ID=\$(basename {} | sed 's/_a.*fasta/_temp_skf/g');
222//! ska build -o \$ID -k 31 {} ;
223//! MAPOUT=\$(basename {} | sed 's/_a.*fasta/_aln/g');
224//! ska map $REFERENCE_LOC \${ID}.skf -o \${MAPOUT}.aln"
225//! cat *.aln > ska_map.aln
226//! ```
227//!
228//! ## ska distance
229//!
230//! Use to calculate distances between all samples within an `.skf` file. The output
231//! will contain the number of SNP differences between all pairs of samples, as
232//! well as the proportion of matching k-mers.
233//!
234//! ```bash
235//! ska distance -o distances.txt seqs.skf
236//! ```
237//!
238//! Mismatches is an estimate of the alignable genome fraction based on the unfiltered
239//! k-mers.
240//!
241//! Also consider ambiguous bases by adding `--allow-ambiguous` flag, which will give partial SNP distances.
242//! Note that ambiguous bases may overestimate distances due to repeat k-mers. Use `--min-freq` to
243//! ignore k-mers only found in some samples at the population level (default = 0.0). F
244//! or finer control over filtering, first run `ska weed`
245//! on the input .skf.
246//!
247//! Multiple threads can be used, but this will only be effective with large numbers of samples.
248//!
249//! The companion script in `scripts/cluster_dists.py` (requires `networkx`) can
250//! be used to make single linkage clusters from these distances at given thresholds,
251//! and create a visualisation in [Microreact](https://microreact.org/):
252//! ```bash
253//! ska distance seqs.skf > distances.txt
254//! python scripts/cluster_dists.py distances.txt --snps 20 --mismatches 0.05
255//! ```
256//! If you install `rapidnj` and `biopython` you can also draw an NJ tree from
257//! these distances, which will be displayed in Microreact. Use your `--api-key`
258//! to directly upload and get the URL printed on the terminal.
259//!
260//! ## ska merge
261//!
262//! Use to combine multiple `.skf` files into one, typically for subsequent use in `align` or `map`.
263//! This may be particularly useful if `ska build` was run on multiple input files
264//! one at a time (for example in a job array).
265//!
266//! ```bash
267//! ska merge -o all_samples sample1.skf sample2.skf sample3.skf
268//! ```
269//!
270//! ## ska delete
271//!
272//! Remove samples by name from an `.skf` file. All sample names must exist in the
273//! file, or an error will be thrown. After removing samples, the input `.skf` file will be overwritten,
274//! unless you provide a new one with `-o`.
275//! Any split k-mers which have no observed missing bases after removal will also be deleted.
276//!
277//! ```bash
278//! ska delete --skf-file all_samples.skf sample_1 sample_3
279//! ```
280//! If you wish to delete many samples, you can use `-f` to provide their names
281//! by line via an input file.
282//!
283//! ## ska weed
284//!
285//! Remove (weed) split k-mers from an `.skf` file. The split k-mers to be removed
286//! are generated from a FASTA file, which may for example contain known transposons or
287//! other repeat sequences. As with `delete`, the input `.skf` file is overwritten by default.
288//! Use `-o` to write the results to a new file instead of overwriting the input `.skf`.
289//!
290//! ```bash
291//! ska weed all_samples.skf MGEs.fa
292//! ```
293//!
294//! In addition, you can also use `ska weed` to filter split k-mers by proportion of samples
295//! they appear in, and constant or amibguous middle cases, which does
296//! not require a list of split k-mers to be removed (but both can be used together, if you wish).
297//!
298//! For example, you may want to reduce the size of an `.skf` file for online use.
299//! You can do this by removing any constant sites or ambiguous-only s0tes (which are typically unused), and by hard-filtering
300//! by frequency (i.e. before writing output):
301//! ```bash
302//! ska weed --filter no-ambig-or-const --min-freq 0.9 all_samples.skf
303//! ```
304//!
305//! ## ska nk
306//! Return information on the *n*umber of *k*-mers in an `.skf` file. This will
307//! print on STDOUT:
308//! - The version of ska used to build the file.
309//! - The k-mer size.
310//! - Number of bits used for the split k-mer (64 or 128).
311//! - Whether reverse complements were used.
312//! - The total number of split k-mers.
313//! - The total number of samples.
314//! - The sample names.
315//! - The number of split k-mers found in each sample.
316//!
317//! ```bash
318//! ska nk all_samples.skf
319//! ska_version=0.3.1
320//! k=21
321//! k_bits=64
322//! rc=true
323//! k-mers=3228084
324//! samples=28
325//! sample_names=["19183_4#73", "12673_8#29", "12673_8#31", "12754_5#61", "12754_5#89", "12754_5#47", "12754_5#32", "12754_5#78", "12754_4#85", "12754_5#74", "19183_4#57", "12754_5#36", "19183_4#49", "19183_4#79", "19183_4#60", "12754_5#24", "12754_5#22", "12754_5#71", "12673_8#26", "12754_5#95", "12754_5#86", "12673_8#24", "19183_4#61", "12673_8#41", "12754_4#66", "12754_5#80", "12754_5#84", "19183_4#78"]
326//! sample_kmers=[2872587, 2997448, 2949719, 2997496, 2997178, 2912749, 2996491, 2997221, 2949102, 2997454, 2914109, 2912237, 2872518, 2869957, 2872470, 2997992, 2997647, 2958512, 2998099, 2997290, 2950253, 3027707, 2997881, 2907920, 2911447, 2997644, 2944830, 2915080]
327//! ```
328//!
329//! If you add `--full-info`, the split k-mer dictionary will be decoded and printed
330//! to STDOUT after the baseline information, for example:
331//! ```bash
332//! TAAATATC TAAACGCC A,-
333//! AGACTCTC TACAGCTA G,G
334//! AAACCATC AAACACTC A,-
335//! TTAAAAGA TCTCGTAC C,C
336//! ```
337//! This is tab-separated, showing the first and second half of the split k-mer
338//! and the middle bases of each sample (comma seperated).
339//!
340//! NB: Only one split k-mer is shown even if the reverse complement was used.
341//! These are not precisely canonical k-mers, as the encoding order `{A, C, T, G}` is used internally.
342//! But if you can't find a sequence in your input file, you will find its reverse complement.
343//!
344//! ## ska cov
345//!
346//! Estimate a coverage cutoff for use with read data. This will count the split
347//! k-mers in a pair of FASTQ samples, and create a histogram of these counts.
348//! A mixture model is then fitted to this histogram using maximum likelihood,
349//! which can give a suggested cutoff with noisy data.
350//!
351//! The cutoff will be printed to STDERR. Use `-v` to get additional information on the
352//! optimisation progress and result. A table of counts and the fit will be printed
353//! to STDOUT, which can then be plotted by the companion script in
354//! `scripts/plot_cov.py` (requires `matplotlib`):
355//! ```bash
356//! ska cov reads_1.fastq.gz reads_2.fastq.gz -k 31 -v > cov_plot.txt
357//! python scripts/plot_cov.py cov_plot.txt
358//! ```
359//!
360//! The cutoff can be used with the `--min-count` parameter of `ska build`. For
361//! a set of sequence experiments with similar characteristics it may be sufficient
362//! to use the same cutoff. Alternatively `ska cov` can be run on every sample
363//! independently (`gnu parallel` would be an efficient way to do this).
364//!
365//! ## ska lo
366//!
367//! Runs "skalo", a graph-based algorithm for inferring variants from WGS outbreak data.
368//! Skalo takes as input a `.skf` file generated by the `ska build` command.
369//!
370//! To run skalo on a file named `my_file.skf` use the following command `ska lo my_file.skf skalo_results`.
371//!
372//! The skalo method
373//! can be modified using the following flags:
374//! * `-r` the path to a reference genome that the detected SNPs can be mapped onto
375//! * `-m` the maximum fraction of missing data permissible for a site where a possible SNP has been detected (default = 0.2)
376//! * `-d` the depth of the recursive path allowed during graph traversal (default = 4)
377//! * `-n` maximum number of internal indel k-mers (default = 2)
378//! * `-t` number of threads to use for parallel computation (default = 1)
379//!
380//!
381//! # API usage
382//!
383//! See the submodule documentation linked below.
384//!
385//! To use `ska build` in other rust code:
386//! ```rust
387//! use ska::merge_ska_dict::{InputFastx, build_and_merge};
388//! use ska::merge_ska_array::MergeSkaArray;
389//! use ska::{QualOpts, QualFilter};
390//!
391//! // Build, merge
392//! let input_files: [InputFastx; 2] = [("test1".to_string(),
393//! "tests/test_files_in/test_1.fa".to_string(),
394//! None),
395//! ("test2".to_string(),
396//! "tests/test_files_in/test_2.fa".to_string(),
397//! None)];
398//! let rc = true;
399//! let k = 31;
400//! let quality = QualOpts {min_count: 1, min_qual: 0, qual_filter: QualFilter::NoFilter};
401//! let threads = 2;
402//! let proportion_reads = None;
403//! // NB u64 for k<=31, u128 for k<=63
404//! let merged_dict =
405//! build_and_merge::<u64>(&input_files, k, rc, &quality, threads, proportion_reads);
406//!
407//! // Save
408//! let ska_array = MergeSkaArray::new(&merged_dict);
409//! ska_array
410//! .save(format!("merged.skf").as_str())
411//! .expect("Failed to save output file");
412//! ```
413//!
414//! To use `ska align` in other rust code:
415//! ```rust
416//! use ska::io_utils::{load_array, set_ostream};
417//! use ska::cli::FilterType;
418//!
419//! // Load an .skf file
420//! let threads = 4;
421//! let input = vec!["tests/test_files_in/merge.skf".to_string()];
422//! let mut ska_array = load_array::<u64>(&input, threads).expect("Could not open input as u64");
423//!
424//! // Apply filters
425//! let min_count = 2;
426//! let filter_ambig_as_missing = false;
427//! let update_kmers = false;
428//! let filter = FilterType::NoConst;
429//! let ignore_const_gaps = false;
430//! let ambig_mask = false;
431//! ska_array.filter(min_count, filter_ambig_as_missing, &filter, ambig_mask, ignore_const_gaps, update_kmers);
432//!
433//! // Write out to stdout
434//! let mut out_stream = set_ostream(&None);
435//! ska_array
436//! .write_fasta(&mut out_stream)
437//! .expect("Couldn't write output fasta");
438//! ```
439//!
440//! To use `ska map` in other rust code, see the example for [`ska_ref`].
441//!
442//! ## Other APIs
443//!
444//! It would be possible to make a python API using [`maturin`](https://github.com/PyO3/maturin).
445//! Please submit a feature request if you would find this useful.
446//!
447
448#![warn(missing_docs)]
449use std::fmt;
450use std::time::Instant;
451
452use clap::ValueEnum;
453extern crate num_cpus;
454
455pub mod merge_ska_dict;
456pub mod ska_dict;
457use crate::merge_ska_dict::build_and_merge;
458
459pub mod ska_ref;
460use crate::ska_ref::RefSka;
461pub mod merge_ska_array;
462use crate::merge_ska_array::MergeSkaArray;
463
464pub mod generic_modes;
465use crate::generic_modes::*;
466
467pub mod cli;
468use crate::cli::*;
469
470pub mod io_utils;
471use crate::io_utils::*;
472
473pub mod coverage;
474use crate::coverage::CoverageHistogram;
475
476pub mod skalo;
477use crate::io_utils::load_array;
478use crate::skalo::utils::Config;
479
480/// Possible quality score filters when building with reads
481#[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord, ValueEnum)]
482pub enum QualFilter {
483 /// Ignore quality scores in reads
484 NoFilter,
485 /// Filter middle bases below quality threshold
486 Middle,
487 /// Filter entire k-mer when any base below quality threshold
488 Strict,
489}
490
491impl fmt::Display for QualFilter {
492 fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
493 match *self {
494 Self::NoFilter => write!(f, "No quality filtering"),
495 Self::Middle => write!(f, "Middle base quality filtering"),
496 Self::Strict => write!(f, "Whole k-mer quality filtering"),
497 }
498 }
499}
500
501/// Quality filtering options for FASTQ files
502pub struct QualOpts {
503 /// Minimum k-mer count across reads to be added
504 pub min_count: u16,
505 /// Minimum base quality to be added
506 pub min_qual: u8,
507 /// [`QualFilter`]: apply quality across whole k-mer, just middle base, or not at all
508 pub qual_filter: QualFilter,
509}
510
511impl fmt::Display for QualOpts {
512 fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
513 write!(
514 f,
515 "min count: {}; minimum quality {} ({}); filter applied: {}",
516 self.min_count,
517 self.min_qual,
518 (self.min_qual + 33) as char,
519 self.qual_filter
520 )
521 }
522}
523
524#[doc(hidden)]
525pub fn main() {
526 let args = cli_args();
527 if args.verbose {
528 simple_logger::init_with_level(log::Level::Info).unwrap();
529 } else {
530 simple_logger::init_with_level(log::Level::Warn).unwrap();
531 }
532
533 eprintln!("SKA: Split K-mer Analysis (the alignment-free aligner)");
534 let start = Instant::now();
535 match &args.command {
536 Commands::Build {
537 seq_files,
538 file_list,
539 output,
540 k,
541 single_strand,
542 min_count,
543 min_qual,
544 qual_filter,
545 threads,
546 proportion_reads,
547 } => {
548 check_threads(*threads);
549
550 // Read input
551 let input_files = get_input_list(file_list, seq_files);
552 let rc = !*single_strand;
553 // Build, merge
554 // check for 64 or 128 bit representation
555 let mut quality = QualOpts {
556 min_count: 0,
557 min_qual: *min_qual,
558 qual_filter: *qual_filter,
559 };
560 if k.le(&31) {
561 log::info!("k={}: using 64-bit representation", *k);
562 quality.min_count =
563 kmer_min_cutoff::<u64>(min_count, &input_files, k, rc, args.verbose);
564 let merged_dict = build_and_merge::<u64>(
565 &input_files,
566 *k,
567 rc,
568 &quality,
569 *threads,
570 *proportion_reads,
571 );
572
573 // Save
574 save_skf(&merged_dict, output);
575 } else {
576 log::info!("k={}: using 128-bit representation", *k);
577 quality.min_count =
578 kmer_min_cutoff::<u128>(min_count, &input_files, k, rc, args.verbose);
579 let merged_dict = build_and_merge::<u128>(
580 &input_files,
581 *k,
582 rc,
583 &quality,
584 *threads,
585 *proportion_reads,
586 );
587
588 // Save
589 save_skf(&merged_dict, output);
590 }
591 }
592 Commands::Align {
593 input,
594 output,
595 min_freq,
596 filter_ambig_as_missing,
597 filter,
598 ambig_mask,
599 no_gap_only_sites,
600 threads,
601 } => {
602 check_threads(*threads);
603 if let Ok(mut ska_array) = load_array::<u64>(input, *threads) {
604 // In debug mode (cannot be set from CLI, give details)
605 log::debug!("{ska_array}");
606 align(
607 &mut ska_array,
608 output,
609 filter,
610 *ambig_mask,
611 *no_gap_only_sites,
612 *min_freq,
613 *filter_ambig_as_missing,
614 );
615 } else if let Ok(mut ska_array) = load_array::<u128>(input, *threads) {
616 // In debug mode (cannot be set from CLI, give details)
617 log::debug!("{ska_array}");
618 align(
619 &mut ska_array,
620 output,
621 filter,
622 *ambig_mask,
623 *no_gap_only_sites,
624 *min_freq,
625 *filter_ambig_as_missing,
626 );
627 } else {
628 panic!("Could not read input file(s): {input:?}");
629 }
630 }
631 Commands::Map {
632 reference,
633 input,
634 output,
635 format,
636 ambig_mask,
637 repeat_mask,
638 threads,
639 } => {
640 check_threads(*threads);
641
642 log::info!("Loading skf as dictionary");
643 if let Ok(ska_array) = load_array::<u64>(input, *threads) {
644 log::info!(
645 "Making skf of reference k={} rc={}",
646 ska_array.kmer_len(),
647 ska_array.rc()
648 );
649 let mut ska_ref = RefSka::<u64>::new(
650 ska_array.kmer_len(),
651 reference,
652 ska_array.rc(),
653 *ambig_mask,
654 *repeat_mask,
655 );
656 map(&ska_array, &mut ska_ref, output, format, *threads);
657 } else if let Ok(ska_array) = load_array::<u128>(input, *threads) {
658 log::info!(
659 "Making skf of reference k={} rc={}",
660 ska_array.kmer_len(),
661 ska_array.rc()
662 );
663 let mut ska_ref = RefSka::<u128>::new(
664 ska_array.kmer_len(),
665 reference,
666 ska_array.rc(),
667 *ambig_mask,
668 *repeat_mask,
669 );
670 map(&ska_array, &mut ska_ref, output, format, *threads);
671 } else {
672 panic!("Could not read input file(s): {input:?}");
673 }
674 }
675 Commands::Distance {
676 skf_file,
677 output,
678 min_freq,
679 allow_ambiguous,
680 threads,
681 } => {
682 check_threads(*threads);
683 let filter_ambiguous = !*allow_ambiguous;
684 if let Ok(mut ska_array) = MergeSkaArray::<u64>::load(skf_file) {
685 // In debug mode (cannot be set from CLI, give details)
686 log::debug!("{ska_array}");
687 distance(&mut ska_array, output, *min_freq, filter_ambiguous);
688 } else if let Ok(mut ska_array) = MergeSkaArray::<u128>::load(skf_file) {
689 // In debug mode (cannot be set from CLI, give details)
690 log::debug!("{ska_array}");
691 distance(&mut ska_array, output, *min_freq, filter_ambiguous);
692 } else {
693 panic!("Could not read input file(s): {skf_file}");
694 }
695 }
696 Commands::Merge { skf_files, output } => {
697 if skf_files.len() < 2 {
698 panic!("Need at least two files to merge");
699 }
700
701 log::info!("Loading first alignment");
702 if let Ok(first_array) = MergeSkaArray::<u64>::load(&skf_files[0]) {
703 merge(&first_array, &skf_files[1..], output);
704 } else if let Ok(first_array) = MergeSkaArray::<u128>::load(&skf_files[0]) {
705 merge(&first_array, &skf_files[1..], output);
706 } else {
707 panic!("Could not read input file: {}", skf_files[0]);
708 }
709 }
710 Commands::Delete {
711 skf_file,
712 output,
713 file_list,
714 names,
715 } => {
716 let input_files = get_input_list(file_list, names);
717 let input_names: Vec<&str> = input_files.iter().map(|t| &*t.0).collect();
718 let output_file = output.clone().unwrap_or(skf_file.to_string());
719 log::info!("Loading skf file");
720 if let Ok(mut ska_array) = MergeSkaArray::<u64>::load(skf_file) {
721 delete(&mut ska_array, &input_names, &output_file);
722 } else if let Ok(mut ska_array) = MergeSkaArray::<u128>::load(skf_file) {
723 delete(&mut ska_array, &input_names, &output_file);
724 } else {
725 panic!("Could not read input file: {skf_file}");
726 }
727 }
728 Commands::Weed {
729 skf_file,
730 weed_file,
731 output,
732 reverse,
733 min_freq,
734 filter_ambig_as_missing,
735 ambig_mask,
736 no_gap_only_sites,
737 filter,
738 } => {
739 log::info!("Loading skf file");
740 if let Ok(mut ska_array) = MergeSkaArray::<u64>::load(skf_file) {
741 weed(
742 &mut ska_array,
743 weed_file,
744 *reverse,
745 *min_freq,
746 *filter_ambig_as_missing,
747 filter,
748 *ambig_mask,
749 *no_gap_only_sites,
750 if output.is_none() {
751 skf_file
752 } else {
753 output.as_ref().unwrap().as_str()
754 },
755 );
756 } else if let Ok(mut ska_array) = MergeSkaArray::<u128>::load(skf_file) {
757 weed(
758 &mut ska_array,
759 weed_file,
760 *reverse,
761 *min_freq,
762 *filter_ambig_as_missing,
763 filter,
764 *ambig_mask,
765 *no_gap_only_sites,
766 if output.is_none() {
767 skf_file
768 } else {
769 output.as_ref().unwrap().as_str()
770 },
771 );
772 } else {
773 panic!("Could not read input file: {skf_file}");
774 }
775 }
776 Commands::Nk {
777 skf_file,
778 full_info,
779 } => {
780 if let Ok(ska_array) = MergeSkaArray::<u64>::load(skf_file) {
781 println!("{ska_array}");
782 if *full_info {
783 log::info!("Printing full info");
784 println!("{ska_array:?}");
785 }
786 } else if let Ok(ska_array) = MergeSkaArray::<u128>::load(skf_file) {
787 println!("{ska_array}");
788 if *full_info {
789 log::info!("Printing full info");
790 println!("{ska_array:?}");
791 }
792 } else {
793 panic!("Could not read input file: {skf_file}");
794 }
795 }
796 Commands::Cov {
797 fastq_fwd,
798 fastq_rev,
799 k,
800 single_strand,
801 } => {
802 // Build, merge
803 let rc = !*single_strand;
804 let cutoff;
805 if *k <= 31 {
806 log::info!("k={}: using 64-bit representation", *k);
807 let mut cov =
808 CoverageHistogram::<u64>::new(fastq_fwd, fastq_rev, *k, rc, args.verbose);
809 cutoff = cov.fit_histogram().expect("Couldn't fit coverage model");
810 cov.plot_hist();
811 } else {
812 log::info!("k={}: using 128-bit representation", *k);
813 let mut cov =
814 CoverageHistogram::<u128>::new(fastq_fwd, fastq_rev, *k, rc, args.verbose);
815 cutoff = cov.fit_histogram().expect("Couldn't fit coverage model");
816 cov.plot_hist();
817 }
818 eprintln!("Estimated cutoff\t{cutoff}");
819 }
820 Commands::Lo {
821 input_skf,
822 reference,
823 output,
824 missing,
825 depth,
826 indel_kmers,
827 threads,
828 } => {
829 let config = Config {
830 input_file: input_skf.clone(),
831 output_name: output.clone(),
832 max_missing: *missing,
833 max_depth: *depth,
834 max_indel_kmers: *indel_kmers,
835 nb_threads: *threads,
836 reference_genome: reference.clone(),
837 };
838
839 if let Ok(ska_array) = load_array::<u64>(std::slice::from_ref(input_skf), *threads) {
840 log::info!("Reading file {input_skf}");
841 log::info!("Using 64-bit representation");
842 skalo(ska_array, config);
843 } else if let Ok(ska_array) =
844 load_array::<u128>(std::slice::from_ref(input_skf), *threads)
845 {
846 log::info!("Reading file {input_skf}");
847 log::info!("Using 128-bit representation");
848 skalo(ska_array, config);
849 } else {
850 panic!("Could not read input file(s): {input_skf:?}");
851 }
852 }
853 }
854 let end = Instant::now();
855
856 eprintln!("SKA done in {}s", end.duration_since(start).as_secs());
857 eprintln!("⬛⬜⬛⬜⬛⬜⬛");
858 eprintln!("⬜⬛⬜⬛⬜⬛⬜");
859 log::info!("Complete");
860}