ska/
lib.rs

1//! Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related
2//! sequences, typically small haploid genomes such as bacteria and viruses.
3//!
4//! SKA can only align SNPs further than the k-mer length apart,
5//! and does not use a gap penalty approach or give alignment scores.
6//! But the advantages are speed and flexibility, particularly the ability to
7//! run on a reference-free manner (i.e. including accessory genome variation)
8//! on both assemblies and reads.
9//!
10//! ## Details
11//!
12//! A split k-mer is a sequence of bases with the middle position removed. For
13//! example, the split 9-mer pattern is `XXXX-XXXX`, and the split 9-mers of the
14//! sequence `ACGAGAGTCTT` are:
15//!
16//! | Split k-mer | Middle base |
17//! |-------------|-------------|
18//! | `ACGAAGTC`  | `G`         |
19//! | `CGAGGTCT`  | `A`         |
20//! | `GAGATCTT`  | `G`         |
21//!
22//! Which is broadly how SKA represents sequences in `.skf` files, as a dictionary with split
23//! k-mers as keys and the middle base as values. These dictionaries are merged
24//! by exactly matching the split k-mers, such that the middle positions are aligned (but
25//! unordered). Split k-mers can also be matched to those from a reference sequence
26//! to give an ordered pseudoalignment.
27//!
28//! Various optimisations are used to make this as fast as possible. For a more thorough comparison with version 1.0 of SKA, see the
29//! [github description](https://github.com/bacpop/ska.rust/blob/master/README.md).
30//!
31//! *NB As split k-mers are even lengths, it is possible that they are their
32//! own reverse complement. The original version of ska would randomly pick a strand,
33//! possibly introducing a SNP across samples. This version uses an ambiguous middle
34//! base (W for A/T; S for C/G) to represent this case.*
35//!
36//! Tutorials:
37//! - [From genomes to trees](https://www.bacpop.org/guides/building_trees_with_ska/).
38//! - [Filtering options](https://www.bacpop.org/guides/snp_alignment_with_ska/).
39//!
40//! Papers:
41//! - [Seamless, rapid, and accurate analyses of outbreak genomic data using split k-mer analysis ](https://genome.cshlp.org/content/34/10/1661.abstract).
42//! - [skalo: using SKA split k-mers with coloured de Brujin graphs to genotype indels](https://academic.oup.com/mbe/article/42/4/msaf077/8103706).
43//!
44//! Command line usage follows. For API documentation and usage, see the [end of this section](#api-usage).
45//!
46//! # Usage
47//!
48//! `.skf` files represent merged split k-mers from multiple sequence files. They
49//! are created with `ska build`. You can subsequently use `ska align` to write
50//! out an unordered alignment from these files, or `ska map` to write an alignment
51//! ordered versus a reference sequence.
52//!
53//! Alternatively, both `ska align` and `ska map` can take input sequence directly to obtain output alignments
54//! in a single command and without saving an `.skf` file. NB: This uses the default options
55//! of `ska build`, so to change these you will need to run the alignment in two steps.
56//!
57//! Output from `ska align` and `ska map` is to STDOUT, so you can use a redirect `>` to save to a file or pipe `|`
58//! to stream into another compatible program on STDIN. You can also add an output
59//! file prefix directly with `-o` (for `ska build` this is required).
60//!
61//! ## Important notes
62//!
63//! - In this version of ska the k-mer length is the _total_ split k-mer size,
64//!   whereas in ska1 the k-mer length is half the split length. So the default of
65//!   k=15 in ska1 corresponds to choosing `-k 31` in this version.
66//! - If you are using FASTQ input files (reads), these must be provided as two
67//!   deinterleaved files using the `-f` option to `ska build`. Providing them as
68//!   a single file will treat them as FASTA, and not correct sequencing errors.
69//! - If you are running `ska map`, it is more memory efficient to run these one at
70//!   a time, rather than merging a single `.skf` file. A command for doing this
71//!   in parallel is listed below.
72//!
73//! ## Common options
74//!
75//! Version can be viewed by running `ska -V`.
76//!
77//! Details and progress messages are written on STDERR. You can see more logging
78//! information by adding the verbose flag `-v`.
79//!
80//! ## ska build
81//!
82//! This command creates an `.skf` file from sequence (.fasta/.fasta.gz/.fastq/.fastq.gz) input.
83//! K-mer size must be odd, greater than 5, and a maximum of 63. Smaller k-mers
84//! are more sensitive and can find closer positions, but are less specific so
85//! may lead to more repeated split k-mers and ambiguous bases. Using k <= 31 uses
86//! 64-bit integers and may be faster than 31 < k <= 63, which uses 128-bits.
87//!
88//! This is typically the most computationally intensive step of `ska`, and
89//! multiple `--threads` can be used to split the work over multiple CPU cores.
90//!
91//! Build from two input FASTA files with a k-mer size of 31:
92//! ```bash
93//! ska build -o seqs -k 31 assemblies/seq1.fa assemblies/seq2.fa
94//! ```
95//! This will assume sample names of `seq1` and `seq2` –- the base path and with known fastx
96//! file extensions removed. If you know the strand,
97//! for example when exclusively using reference sequences or single stranded viruses, add `--single-strand`
98//! to ignore reverse complements.
99//!
100//! ### Read files
101//!
102//! To use FASTQ files, or specify sample names, or more easily input a larger number of input files,
103//! you can provide a tab separated file list via `-f` instead of listing files. For example
104//! a file called `input_sequence.txt` to describe `sample_1` (an assembly) and `sample_2` (paired reads):
105//! ```text
106//! sample_1    assemblies/seq1.fa
107//! sample_2    reads/seq2_1.fastq.gz   reads/seq2_2.fastq.gz
108//! ```
109//! You'd run:
110//! ```bash
111//! ska build -o seqs -f input_sequence.txt
112//! ```
113//! Options for filtering/error correction are:
114//! - `--min-count`. Specify a minimum number of appearances a k-mer must have
115//!   to be included. This is an effective way of filtering sequencing errors if set
116//!   to at least three, but higher may be appropriate for higher coverage data.
117//!   A two-step blocked bloom and hash table filter is used for memory efficiency.
118//! - `--proportion-reads`. Specify a proportion of reads to use to build the .skf file.
119//!   The value of this parameter must be between 0 and 1. If you have high coverage samples
120//!   this can be used to reduce the build time.
121//! - `--qual-filter`. `none` do not filter based on quality scores.
122//!   `middle` (default) filter k-mers where the middle base is below the minimum quality.
123//!   `strict` filter k-mers where any base is below the minimum quality.
124//! - `--min-qual`. Specify a minimum PHRED score to use in the filter.
125//!
126//! FASTQ files must be paired end. If you'd like to request more flexibility in
127//! this regard please contact us.
128//!
129//! ### Threads
130//!
131//! The maximum threads actually used will be a power of two, so if you provided
132//! `--threads 6` only four would be used. Additionally, at least ten samples per
133//! thread are required so maximums are:
134//!
135//! | Samples | Maximum threads |
136//! |---------|-----------------|
137//! | 1-19    | 1               |
138//! | 20-39   | 2               |
139//! | 40-79   | 4               |
140//!
141//! and so on. Use `-v` to see a message with the number being used.
142//!
143//! Using more threads will increase memory use.
144//!
145//! You can also run blocks of samples independently (e.g. with snakemake or a
146//! job array) then use `ska merge` to combine results.
147//!
148//! ## ska align
149//!
150//! Creates an alignment from a `.skf` file or sequence files. Sites (columns) are
151//! in an arbitrary order. Two basic filters are available: `--min-freq` which
152//! sets the maximum number of missing sites; `--filter` which can be set to
153//! one of three options:
154//! * `no-filter` -- no extra filtering
155//! * `no-const` -- no constant sites
156//! * `no-ambig-or-const` -- no constant sites, or sites where the only variable base is ambiguous
157//!
158//! `no-filter` may be useful for ascertainment bias correction in phylogenetic algorithms,
159//! but note the flanking split k-mers will never be included. `no-const` is the default.
160//! `no-ambig-or-const` can be used when you want to treat any ambiguity as an `N`.
161//!
162//! With an `.skf` file from `ska build`, constant sites, and no missing variants:
163//! ```bash
164//! ska align --min-freq 1 --filter no-filter -o seqs seqs.skf
165//! ```
166//!
167//! Another example: directly from FASTA files, with default `build` and `align` settings,
168//! and gzipping the output alignment
169//! ```bash
170//! ska align seq*.fa --threads 8 | gzip -c - > seqs.aln.gz
171//! ```
172//!
173//! ## ska map
174//!
175//! Creates an alignment from a `.skf` file or sequence files by mapping its
176//! split k-mers to split k-mers of a linear reference sequence. This produces pseudoalignment
177//! with the same sites/columns as the reference sequence. Sites not mapped will
178//! be output as missing '-'.
179//!
180//! Pass the FASTA file of the reference as the first argument, and the skf file as
181//! the second argument:
182//! ```bash
183//! ska map ref.fa seqs.skf -o ref_mapped.aln
184//! ```
185//! Add `--repeat-mask` to mask any repeated split k-mers in the reference with 'N'.
186//!
187//! You can also get a VCF as output, which has rows as variants, and only has the
188//! variable sites (but will include unmapped bases as missing).
189//! An example command, also demonstrating that everything can be done from input sequences in a single command:
190//! ```bash
191//! ska map ref.fa seq1.fa seq2.fa -f vcf --threads 2 | bgzip -c - > seqs.vcf.gz
192//! ```
193//!
194//! ## ska lo
195//!
196//! Converts split k-mers from a `.skf` file into a colored De Bruijn graph and infers indels from graph bubbles and SNPs from variant groups in
197//! reference-free mode (as with `ska align`). SNPs are only composed of ATGC variants (no ambigous nucleotides). The same
198//! filtering applies to indels.
199//! Multithreading ('-t' argument) is not fully optimised - it usually takes 4 threads to halve runtimes.
200//!
201//! To generate a SNP alignment and an indel VCF file (here named 'test_snps.fas' and 'test_indels.vcf'):
202//! ```bash
203//! ska lo seqs.skf test
204//! ```
205//!
206//! It can also position SNPs on a reference genome if provided using the '-r' argument. The reference genome should
207//! be in FASTA format and composed of a unique sequence. skalo lo will then generate, in addition to the SNP alignment and the indel
208//! VCF file, a SNP VCF file and a pseudo-genome alignment (as with `ska map`) that can be used for recombination analyses.
209//! In such use case, we recommmend to increase the maximum proportion of missing data allowed per variant ('-m' argument), but not above 0.5.
210//! ```bash
211//! ska lo seqs.skf test -r reference.fas -m 0.4
212//! ```
213//! Please note that at the moment indels cannot be positioned on a reference genome.
214//!
215//!
216//! ### Efficiency
217//!
218//! As `ska map` is independent for each input file, the most efficient way to
219//! run this on a number of samples is with gnu parallel, for example:
220//! ```bash
221//! cat ${FILE_LIST} | parallel -j 8 "ID=\$(basename {} | sed 's/_a.*fasta/_temp_skf/g');
222//! ska build -o \$ID -k 31 {} ;
223//! MAPOUT=\$(basename {} | sed 's/_a.*fasta/_aln/g');
224//!  ska map $REFERENCE_LOC \${ID}.skf -o \${MAPOUT}.aln"
225//! cat *.aln > ska_map.aln
226//! ```
227//!
228//! ## ska distance
229//!
230//! Use to calculate distances between all samples within an `.skf` file. The output
231//! will contain the number of SNP differences between all pairs of samples, as
232//! well as the proportion of matching k-mers.
233//!
234//! ```bash
235//! ska distance -o distances.txt seqs.skf
236//! ```
237//!
238//! Mismatches is an estimate of the alignable genome fraction based on the unfiltered
239//! k-mers.
240//!
241//! Also consider ambiguous bases by adding `--allow-ambiguous` flag, which will give partial SNP distances.
242//! Note that ambiguous bases may overestimate distances due to repeat k-mers. Use `--min-freq` to
243//! ignore k-mers only found in some samples at the population level (default = 0.0). F
244//! or finer control over filtering, first run `ska weed`
245//! on the input .skf.
246//!
247//! Multiple threads can be used, but this will only be effective with large numbers of samples.
248//!
249//! The companion script in `scripts/cluster_dists.py` (requires `networkx`) can
250//! be used to make single linkage clusters from these distances at given thresholds,
251//! and create a visualisation in [Microreact](https://microreact.org/):
252//! ```bash
253//! ska distance seqs.skf > distances.txt
254//! python scripts/cluster_dists.py distances.txt --snps 20 --mismatches 0.05
255//! ```
256//! If you install `rapidnj` and `biopython` you can also draw an NJ tree from
257//! these distances, which will be displayed in Microreact. Use your `--api-key`
258//! to directly upload and get the URL printed on the terminal.
259//!
260//! ## ska merge
261//!
262//! Use to combine multiple `.skf` files into one, typically for subsequent use in `align` or `map`.
263//! This may be particularly useful if `ska build` was run on multiple input files
264//! one at a time (for example in a job array).
265//!
266//! ```bash
267//! ska merge -o all_samples sample1.skf sample2.skf sample3.skf
268//! ```
269//!
270//! ## ska delete
271//!
272//! Remove samples by name from an `.skf` file. All sample names must exist in the
273//! file, or an error will be thrown. After removing samples, the input `.skf` file will be overwritten,
274//! unless you provide a new one with `-o`.
275//! Any split k-mers which have no observed missing bases after removal will also be deleted.
276//!
277//! ```bash
278//! ska delete --skf-file all_samples.skf sample_1 sample_3
279//! ```
280//! If you wish to delete many samples, you can use `-f` to provide their names
281//! by line via an input file.
282//!
283//! ## ska weed
284//!
285//! Remove (weed) split k-mers from an `.skf` file. The split k-mers to be removed
286//! are generated from a FASTA file, which may for example contain known transposons or
287//! other repeat sequences. As with `delete`, the input `.skf` file is overwritten by default.
288//! Use `-o` to write the results to a new file instead of overwriting the input `.skf`.
289//!
290//! ```bash
291//! ska weed all_samples.skf MGEs.fa
292//! ```
293//!
294//! In addition, you can also use `ska weed` to filter split k-mers by proportion of samples
295//! they appear in, and constant or amibguous middle cases, which does
296//! not require a list of split k-mers to be removed (but both can be used together, if you wish).
297//!
298//! For example, you may want to reduce the size of an `.skf` file for online use.
299//! You can do this by removing any constant sites or ambiguous-only s0tes (which are typically unused), and by hard-filtering
300//! by frequency (i.e. before writing output):
301//! ```bash
302//! ska weed --filter no-ambig-or-const --min-freq 0.9 all_samples.skf
303//! ```
304//!
305//! ## ska nk
306//! Return information on the *n*umber of *k*-mers in an `.skf` file. This will
307//! print on STDOUT:
308//! - The version of ska used to build the file.
309//! - The k-mer size.
310//! - Number of bits used for the split k-mer (64 or 128).
311//! - Whether reverse complements were used.
312//! - The total number of split k-mers.
313//! - The total number of samples.
314//! - The sample names.
315//! - The number of split k-mers found in each sample.
316//!
317//! ```bash
318//! ska nk all_samples.skf
319//! ska_version=0.3.1
320//! k=21
321//! k_bits=64
322//! rc=true
323//! k-mers=3228084
324//! samples=28
325//! sample_names=["19183_4#73", "12673_8#29", "12673_8#31", "12754_5#61", "12754_5#89", "12754_5#47", "12754_5#32", "12754_5#78", "12754_4#85", "12754_5#74", "19183_4#57", "12754_5#36", "19183_4#49", "19183_4#79", "19183_4#60", "12754_5#24", "12754_5#22", "12754_5#71", "12673_8#26", "12754_5#95", "12754_5#86", "12673_8#24", "19183_4#61", "12673_8#41", "12754_4#66", "12754_5#80", "12754_5#84", "19183_4#78"]
326//! sample_kmers=[2872587, 2997448, 2949719, 2997496, 2997178, 2912749, 2996491, 2997221, 2949102, 2997454, 2914109, 2912237, 2872518, 2869957, 2872470, 2997992, 2997647, 2958512, 2998099, 2997290, 2950253, 3027707, 2997881, 2907920, 2911447, 2997644, 2944830, 2915080]
327//! ```
328//!
329//! If you add `--full-info`, the split k-mer dictionary will be decoded and printed
330//! to STDOUT after the baseline information, for example:
331//! ```bash
332//! TAAATATC        TAAACGCC        A,-
333//! AGACTCTC        TACAGCTA        G,G
334//! AAACCATC        AAACACTC        A,-
335//! TTAAAAGA        TCTCGTAC        C,C
336//! ```
337//! This is tab-separated, showing the first and second half of the split k-mer
338//! and the middle bases of each sample (comma seperated).
339//!
340//! NB: Only one split k-mer is shown even if the reverse complement was used.
341//! These are not precisely canonical k-mers, as the encoding order `{A, C, T, G}` is used internally.
342//! But if you can't find a sequence in your input file, you will find its reverse complement.
343//!
344//! ## ska cov
345//!
346//! Estimate a coverage cutoff for use with read data. This will count the split
347//! k-mers in a pair of FASTQ samples, and create a histogram of these counts.
348//! A mixture model is then fitted to this histogram using maximum likelihood,
349//! which can give a suggested cutoff with noisy data.
350//!
351//! The cutoff will be printed to STDERR. Use `-v` to get additional information on the
352//! optimisation progress and result. A table of counts and the fit will be printed
353//! to STDOUT, which can then be plotted by the companion script in
354//! `scripts/plot_cov.py` (requires `matplotlib`):
355//! ```bash
356//! ska cov reads_1.fastq.gz reads_2.fastq.gz -k 31 -v > cov_plot.txt
357//! python scripts/plot_cov.py cov_plot.txt
358//! ```
359//!
360//! The cutoff can be used with the `--min-count` parameter of `ska build`. For
361//! a set of sequence experiments with similar characteristics it may be sufficient
362//! to use the same cutoff. Alternatively `ska cov` can be run on every sample
363//! independently (`gnu parallel` would be an efficient way to do this).
364//!
365//! ## ska lo
366//!
367//! Runs "skalo", a graph-based algorithm for inferring variants from WGS outbreak data.
368//! Skalo takes as input a `.skf` file generated by the `ska build` command.
369//!
370//! To run skalo on a file named `my_file.skf` use the following command `ska lo my_file.skf skalo_results`.
371//!
372//! The skalo method
373//! can be modified using the following flags:
374//! * `-r` the path to a reference genome that the detected SNPs can be mapped onto
375//! * `-m` the maximum fraction of missing data permissible for a site where a possible SNP has been detected (default = 0.2)
376//! * `-d` the depth of the recursive path allowed during graph traversal (default = 4)
377//! * `-n` maximum number of internal indel k-mers (default = 2)
378//! * `-t` number of threads to use for parallel computation (default = 1)
379//!
380//!
381//! # API usage
382//!
383//! See the submodule documentation linked below.
384//!
385//! To use `ska build` in other rust code:
386//! ```rust
387//! use ska::merge_ska_dict::{InputFastx, build_and_merge};
388//! use ska::merge_ska_array::MergeSkaArray;
389//! use ska::{QualOpts, QualFilter};
390//!
391//! // Build, merge
392//! let input_files: [InputFastx; 2] = [("test1".to_string(),
393//!                                      "tests/test_files_in/test_1.fa".to_string(),
394//!                                      None),
395//!                                     ("test2".to_string(),
396//!                                      "tests/test_files_in/test_2.fa".to_string(),
397//!                                      None)];
398//! let rc = true;
399//! let k = 31;
400//! let quality = QualOpts {min_count: 1, min_qual: 0, qual_filter: QualFilter::NoFilter};
401//! let threads = 2;
402//! let proportion_reads = None;
403//! // NB u64 for k<=31, u128 for k<=63
404//! let merged_dict =
405//!     build_and_merge::<u64>(&input_files, k, rc, &quality, threads, proportion_reads);
406//!
407//! // Save
408//! let ska_array = MergeSkaArray::new(&merged_dict);
409//! ska_array
410//!     .save(format!("merged.skf").as_str())
411//!     .expect("Failed to save output file");
412//! ```
413//!
414//! To use `ska align` in other rust code:
415//! ```rust
416//! use ska::io_utils::{load_array, set_ostream};
417//! use ska::cli::FilterType;
418//!
419//! // Load an .skf file
420//! let threads = 4;
421//! let input = vec!["tests/test_files_in/merge.skf".to_string()];
422//! let mut ska_array = load_array::<u64>(&input, threads).expect("Could not open input as u64");
423//!
424//! // Apply filters
425//! let min_count = 2;
426//! let filter_ambig_as_missing = false;
427//! let update_kmers = false;
428//! let filter = FilterType::NoConst;
429//! let ignore_const_gaps = false;
430//! let ambig_mask = false;
431//! ska_array.filter(min_count, filter_ambig_as_missing, &filter, ambig_mask, ignore_const_gaps, update_kmers);
432//!
433//! // Write out to stdout
434//! let mut out_stream = set_ostream(&None);
435//! ska_array
436//!     .write_fasta(&mut out_stream)
437//!     .expect("Couldn't write output fasta");
438//! ```
439//!
440//! To use `ska map` in other rust code, see the example for [`ska_ref`].
441//!
442//! ## Other APIs
443//!
444//! It would be possible to make a python API using [`maturin`](https://github.com/PyO3/maturin).
445//! Please submit a feature request if you would find this useful.
446//!
447
448#![warn(missing_docs)]
449use std::fmt;
450use std::time::Instant;
451
452use clap::ValueEnum;
453extern crate num_cpus;
454
455pub mod merge_ska_dict;
456pub mod ska_dict;
457use crate::merge_ska_dict::build_and_merge;
458
459pub mod ska_ref;
460use crate::ska_ref::RefSka;
461pub mod merge_ska_array;
462use crate::merge_ska_array::MergeSkaArray;
463
464pub mod generic_modes;
465use crate::generic_modes::*;
466
467pub mod cli;
468use crate::cli::*;
469
470pub mod io_utils;
471use crate::io_utils::*;
472
473pub mod coverage;
474use crate::coverage::CoverageHistogram;
475
476pub mod skalo;
477use crate::io_utils::load_array;
478use crate::skalo::utils::Config;
479
480/// Possible quality score filters when building with reads
481#[derive(Copy, Clone, Debug, PartialEq, Eq, PartialOrd, Ord, ValueEnum)]
482pub enum QualFilter {
483    /// Ignore quality scores in reads
484    NoFilter,
485    /// Filter middle bases below quality threshold
486    Middle,
487    /// Filter entire k-mer when any base below quality threshold
488    Strict,
489}
490
491impl fmt::Display for QualFilter {
492    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
493        match *self {
494            Self::NoFilter => write!(f, "No quality filtering"),
495            Self::Middle => write!(f, "Middle base quality filtering"),
496            Self::Strict => write!(f, "Whole k-mer quality filtering"),
497        }
498    }
499}
500
501/// Quality filtering options for FASTQ files
502pub struct QualOpts {
503    /// Minimum k-mer count across reads to be added
504    pub min_count: u16,
505    /// Minimum base quality to be added
506    pub min_qual: u8,
507    /// [`QualFilter`]: apply quality across whole k-mer, just middle base, or not at all
508    pub qual_filter: QualFilter,
509}
510
511impl fmt::Display for QualOpts {
512    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
513        write!(
514            f,
515            "min count: {}; minimum quality {} ({}); filter applied: {}",
516            self.min_count,
517            self.min_qual,
518            (self.min_qual + 33) as char,
519            self.qual_filter
520        )
521    }
522}
523
524#[doc(hidden)]
525pub fn main() {
526    let args = cli_args();
527    if args.verbose {
528        simple_logger::init_with_level(log::Level::Info).unwrap();
529    } else {
530        simple_logger::init_with_level(log::Level::Warn).unwrap();
531    }
532
533    eprintln!("SKA: Split K-mer Analysis (the alignment-free aligner)");
534    let start = Instant::now();
535    match &args.command {
536        Commands::Build {
537            seq_files,
538            file_list,
539            output,
540            k,
541            single_strand,
542            min_count,
543            min_qual,
544            qual_filter,
545            threads,
546            proportion_reads,
547        } => {
548            check_threads(*threads);
549
550            // Read input
551            let input_files = get_input_list(file_list, seq_files);
552            let rc = !*single_strand;
553            // Build, merge
554            // check for 64 or 128 bit representation
555            let mut quality = QualOpts {
556                min_count: 0,
557                min_qual: *min_qual,
558                qual_filter: *qual_filter,
559            };
560            if k.le(&31) {
561                log::info!("k={}: using 64-bit representation", *k);
562                quality.min_count =
563                    kmer_min_cutoff::<u64>(min_count, &input_files, k, rc, args.verbose);
564                let merged_dict = build_and_merge::<u64>(
565                    &input_files,
566                    *k,
567                    rc,
568                    &quality,
569                    *threads,
570                    *proportion_reads,
571                );
572
573                // Save
574                save_skf(&merged_dict, output);
575            } else {
576                log::info!("k={}: using 128-bit representation", *k);
577                quality.min_count =
578                    kmer_min_cutoff::<u128>(min_count, &input_files, k, rc, args.verbose);
579                let merged_dict = build_and_merge::<u128>(
580                    &input_files,
581                    *k,
582                    rc,
583                    &quality,
584                    *threads,
585                    *proportion_reads,
586                );
587
588                // Save
589                save_skf(&merged_dict, output);
590            }
591        }
592        Commands::Align {
593            input,
594            output,
595            min_freq,
596            filter_ambig_as_missing,
597            filter,
598            ambig_mask,
599            no_gap_only_sites,
600            threads,
601        } => {
602            check_threads(*threads);
603            if let Ok(mut ska_array) = load_array::<u64>(input, *threads) {
604                // In debug mode (cannot be set from CLI, give details)
605                log::debug!("{ska_array}");
606                align(
607                    &mut ska_array,
608                    output,
609                    filter,
610                    *ambig_mask,
611                    *no_gap_only_sites,
612                    *min_freq,
613                    *filter_ambig_as_missing,
614                );
615            } else if let Ok(mut ska_array) = load_array::<u128>(input, *threads) {
616                // In debug mode (cannot be set from CLI, give details)
617                log::debug!("{ska_array}");
618                align(
619                    &mut ska_array,
620                    output,
621                    filter,
622                    *ambig_mask,
623                    *no_gap_only_sites,
624                    *min_freq,
625                    *filter_ambig_as_missing,
626                );
627            } else {
628                panic!("Could not read input file(s): {input:?}");
629            }
630        }
631        Commands::Map {
632            reference,
633            input,
634            output,
635            format,
636            ambig_mask,
637            repeat_mask,
638            threads,
639        } => {
640            check_threads(*threads);
641
642            log::info!("Loading skf as dictionary");
643            if let Ok(ska_array) = load_array::<u64>(input, *threads) {
644                log::info!(
645                    "Making skf of reference k={} rc={}",
646                    ska_array.kmer_len(),
647                    ska_array.rc()
648                );
649                let mut ska_ref = RefSka::<u64>::new(
650                    ska_array.kmer_len(),
651                    reference,
652                    ska_array.rc(),
653                    *ambig_mask,
654                    *repeat_mask,
655                );
656                map(&ska_array, &mut ska_ref, output, format, *threads);
657            } else if let Ok(ska_array) = load_array::<u128>(input, *threads) {
658                log::info!(
659                    "Making skf of reference k={} rc={}",
660                    ska_array.kmer_len(),
661                    ska_array.rc()
662                );
663                let mut ska_ref = RefSka::<u128>::new(
664                    ska_array.kmer_len(),
665                    reference,
666                    ska_array.rc(),
667                    *ambig_mask,
668                    *repeat_mask,
669                );
670                map(&ska_array, &mut ska_ref, output, format, *threads);
671            } else {
672                panic!("Could not read input file(s): {input:?}");
673            }
674        }
675        Commands::Distance {
676            skf_file,
677            output,
678            min_freq,
679            allow_ambiguous,
680            threads,
681        } => {
682            check_threads(*threads);
683            let filter_ambiguous = !*allow_ambiguous;
684            if let Ok(mut ska_array) = MergeSkaArray::<u64>::load(skf_file) {
685                // In debug mode (cannot be set from CLI, give details)
686                log::debug!("{ska_array}");
687                distance(&mut ska_array, output, *min_freq, filter_ambiguous);
688            } else if let Ok(mut ska_array) = MergeSkaArray::<u128>::load(skf_file) {
689                // In debug mode (cannot be set from CLI, give details)
690                log::debug!("{ska_array}");
691                distance(&mut ska_array, output, *min_freq, filter_ambiguous);
692            } else {
693                panic!("Could not read input file(s): {skf_file}");
694            }
695        }
696        Commands::Merge { skf_files, output } => {
697            if skf_files.len() < 2 {
698                panic!("Need at least two files to merge");
699            }
700
701            log::info!("Loading first alignment");
702            if let Ok(first_array) = MergeSkaArray::<u64>::load(&skf_files[0]) {
703                merge(&first_array, &skf_files[1..], output);
704            } else if let Ok(first_array) = MergeSkaArray::<u128>::load(&skf_files[0]) {
705                merge(&first_array, &skf_files[1..], output);
706            } else {
707                panic!("Could not read input file: {}", skf_files[0]);
708            }
709        }
710        Commands::Delete {
711            skf_file,
712            output,
713            file_list,
714            names,
715        } => {
716            let input_files = get_input_list(file_list, names);
717            let input_names: Vec<&str> = input_files.iter().map(|t| &*t.0).collect();
718            let output_file = output.clone().unwrap_or(skf_file.to_string());
719            log::info!("Loading skf file");
720            if let Ok(mut ska_array) = MergeSkaArray::<u64>::load(skf_file) {
721                delete(&mut ska_array, &input_names, &output_file);
722            } else if let Ok(mut ska_array) = MergeSkaArray::<u128>::load(skf_file) {
723                delete(&mut ska_array, &input_names, &output_file);
724            } else {
725                panic!("Could not read input file: {skf_file}");
726            }
727        }
728        Commands::Weed {
729            skf_file,
730            weed_file,
731            output,
732            reverse,
733            min_freq,
734            filter_ambig_as_missing,
735            ambig_mask,
736            no_gap_only_sites,
737            filter,
738        } => {
739            log::info!("Loading skf file");
740            if let Ok(mut ska_array) = MergeSkaArray::<u64>::load(skf_file) {
741                weed(
742                    &mut ska_array,
743                    weed_file,
744                    *reverse,
745                    *min_freq,
746                    *filter_ambig_as_missing,
747                    filter,
748                    *ambig_mask,
749                    *no_gap_only_sites,
750                    if output.is_none() {
751                        skf_file
752                    } else {
753                        output.as_ref().unwrap().as_str()
754                    },
755                );
756            } else if let Ok(mut ska_array) = MergeSkaArray::<u128>::load(skf_file) {
757                weed(
758                    &mut ska_array,
759                    weed_file,
760                    *reverse,
761                    *min_freq,
762                    *filter_ambig_as_missing,
763                    filter,
764                    *ambig_mask,
765                    *no_gap_only_sites,
766                    if output.is_none() {
767                        skf_file
768                    } else {
769                        output.as_ref().unwrap().as_str()
770                    },
771                );
772            } else {
773                panic!("Could not read input file: {skf_file}");
774            }
775        }
776        Commands::Nk {
777            skf_file,
778            full_info,
779        } => {
780            if let Ok(ska_array) = MergeSkaArray::<u64>::load(skf_file) {
781                println!("{ska_array}");
782                if *full_info {
783                    log::info!("Printing full info");
784                    println!("{ska_array:?}");
785                }
786            } else if let Ok(ska_array) = MergeSkaArray::<u128>::load(skf_file) {
787                println!("{ska_array}");
788                if *full_info {
789                    log::info!("Printing full info");
790                    println!("{ska_array:?}");
791                }
792            } else {
793                panic!("Could not read input file: {skf_file}");
794            }
795        }
796        Commands::Cov {
797            fastq_fwd,
798            fastq_rev,
799            k,
800            single_strand,
801        } => {
802            // Build, merge
803            let rc = !*single_strand;
804            let cutoff;
805            if *k <= 31 {
806                log::info!("k={}: using 64-bit representation", *k);
807                let mut cov =
808                    CoverageHistogram::<u64>::new(fastq_fwd, fastq_rev, *k, rc, args.verbose);
809                cutoff = cov.fit_histogram().expect("Couldn't fit coverage model");
810                cov.plot_hist();
811            } else {
812                log::info!("k={}: using 128-bit representation", *k);
813                let mut cov =
814                    CoverageHistogram::<u128>::new(fastq_fwd, fastq_rev, *k, rc, args.verbose);
815                cutoff = cov.fit_histogram().expect("Couldn't fit coverage model");
816                cov.plot_hist();
817            }
818            eprintln!("Estimated cutoff\t{cutoff}");
819        }
820        Commands::Lo {
821            input_skf,
822            reference,
823            output,
824            missing,
825            depth,
826            indel_kmers,
827            threads,
828        } => {
829            let config = Config {
830                input_file: input_skf.clone(),
831                output_name: output.clone(),
832                max_missing: *missing,
833                max_depth: *depth,
834                max_indel_kmers: *indel_kmers,
835                nb_threads: *threads,
836                reference_genome: reference.clone(),
837            };
838
839            if let Ok(ska_array) = load_array::<u64>(std::slice::from_ref(input_skf), *threads) {
840                log::info!("Reading file {input_skf}");
841                log::info!("Using 64-bit representation");
842                skalo(ska_array, config);
843            } else if let Ok(ska_array) =
844                load_array::<u128>(std::slice::from_ref(input_skf), *threads)
845            {
846                log::info!("Reading file {input_skf}");
847                log::info!("Using 128-bit representation");
848                skalo(ska_array, config);
849            } else {
850                panic!("Could not read input file(s): {input_skf:?}");
851            }
852        }
853    }
854    let end = Instant::now();
855
856    eprintln!("SKA done in {}s", end.duration_since(start).as_secs());
857    eprintln!("⬛⬜⬛⬜⬛⬜⬛");
858    eprintln!("⬜⬛⬜⬛⬜⬛⬜");
859    log::info!("Complete");
860}
ska/lib.rs

ska/
lib.rs