Crate ska

source ·
Expand description

Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related sequences, typically small haploid genomes such as bacteria and viruses.

SKA can only align SNPs further than the k-mer length apart, and does not use a gap penalty approach or give alignment scores. But the advantages are speed and flexibility, particularly the ability to run on a reference-free manner (i.e. including accessory genome variation) on both assemblies and reads.

§Details

A split k-mer is a sequence of bases with the middle position removed. For example, the split 9-mer pattern is XXXX-XXXX, and the split 9-mers of the sequence ACGAGAGTCTT are:

Split k-merMiddle base
ACGAAGTCG
CGAGGTCTA
GAGATCTTG

Which is broadly how SKA represents sequences in .skf files, as a dictionary with split k-mers as keys and the middle base as values. These dictionaries are merged by exactly matching the split k-mers, such that the middle positions are aligned (but unordered). Split k-mers can also be matched to those from a reference sequence to give an ordered pseudoalignment.

Various optimisations are used to make this as fast as possible. For a more thorough comparison with version 1.0 of SKA, see the github description.

NB As split k-mers are even lengths, it is possible that they are their own reverse complement. The original version of ska would randomly pick a strand, possibly introducing a SNP across samples. This version uses an ambiguous middle base (W for A/T; S for C/G) to represent this case.

Tutorials:

Command line usage follows. For API documentation and usage, see the end of this section.

§Usage

.skf files represent merged split k-mers from multiple sequence files. They are created with ska build. You can subsequently use ska align to write out an unordered alignment from these files, or ska map to write an alignment ordered versus a reference sequence.

Alternatively, both ska align and ska map can take input sequence directly to obtain output alignments in a single command and without saving an .skf file. NB: This uses the default options of ska build, so to change these you will need to run the alignment in two steps.

Output from ska align and ska map is to STDOUT, so you can use a redirect > to save to a file or pipe | to stream into another compatible program on STDIN. You can also add an output file prefix directly with -o (for ska build this is required).

§Important notes

  • In this version of ska the k-mer length is the total split k-mer size, whereas in ska1 the k-mer length is half the split length. So the default of k=15 in ska1 corresponds to choosing -k 31 in this version.
  • If you are using FASTQ input files (reads), these must be provided as two deinterleaved files using the -f option to ska build. Providing them as a single file will treat them as FASTA, and not correct sequencing errors.
  • If you are running ska map, it is more memory efficient to run these one at a time, rather than merging a single .skf file. A command for doing this in parallel is listed below.

§Common options

Version can be viewed by running ska -V.

Details and progress messages are written on STDERR. You can see more logging information by adding the verbose flag -v.

§ska build

This command creates an .skf file from sequence (.fasta/.fasta.gz/.fastq/.fastq.gz) input. K-mer size must be odd, greater than 5, and a maximum of 63. Smaller k-mers are more sensitive and can find closer positions, but are less specific so may lead to more repeated split k-mers and ambiguous bases. Using k <= 31 uses 64-bit integers and may be faster than 31 < k <= 63, which uses 128-bits.

This is typically the most computationally intensive step of ska, and multiple --threads can be used to split the work over multiple CPU cores.

Build from two input FASTA files with a k-mer size of 31:

ska build -o seqs -k 31 assemblies/seq1.fa assemblies/seq2.fa

This will assume sample names of seq1 and seq2 –- the base path and with known fastx file extensions removed. If you know the strand, for example when exclusively using reference sequences or single stranded viruses, add --single-strand to ignore reverse complements.

§Read files

To use FASTQ files, or specify sample names, or more easily input a larger number of input files, you can provide a tab separated file list via -f instead of listing files. For example a file called input_sequence.txt to describe sample_1 (an assembly) and sample_2 (paired reads):

sample_1    assemblies/seq1.fa
sample_2    reads/seq2_1.fastq.gz   reads/seq2_2.fastq.gz

You’d run:

ska build -o seqs -f input_sequence.txt

Options for filtering/error correction are:

  • --min-count. Specify a minimum number of appearances a k-mer must have to be included. This is an effective way of filtering sequencing errors if set to at least three, but higher may be appropriate for higher coverage data. A two-step blocked bloom and hash table filter is used for memory efficiency.
  • --qual-filter. none do not filter based on quality scores. middle (default) filter k-mers where the middle base is below the minimum quality. strict filter k-mers where any base is below the minimum quality.
  • --min-qual. Specify a minimum PHRED score to use in the filter.

FASTQ files must be paired end. If you’d like to request more flexibility in this regard please contact us.

§Threads

The maximum threads actually used will be a power of two, so if you provided --threads 6 only four would be used. Additionally, at least ten samples per thread are required so maximums are:

SamplesMaximum threads
1-191
20-392
40-794

and so on. Use -v to see a message with the number being used.

Using more threads will increase memory use.

You can also run blocks of samples independently (e.g. with snakemake or a job array) then use ska merge to combine results.

§ska align

Creates an alignment from a .skf file or sequence files. Sites (columns) are in an arbitrary order. Two basic filters are available: --min-freq which sets the maximum number of missing sites; --filter which can be set to one of three options:

  • no-filter – no extra filtering
  • no-const – no constant sites
  • no-ambig-or-const – no constant sites, or sites where the only variable base is ambiguous

no-filter may be useful for ascertainment bias correction in phylogenetic algorithms, but note the flanking split k-mers will never be included. no-const is the default. no-ambig-or-const can be used when you want to treat any ambiguity as an N.

With an .skf file from ska build, constant sites, and no missing variants:

ska align --min-freq 1 --filter no-filter -o seqs seqs.skf

Another example: directly from FASTA files, with default build and align settings, and gzipping the output alignment

ska align seq*.fa --threads 8 | gzip -c - > seqs.aln.gz

§ska map

Create an alignment from a .skf file or sequence files by mapping its split k-mers to split k-mers of a linear reference sequence. This produces pseudoalignment with the same sites/columns as the reference sequence. Sites not mapped will be output as missing ‘-’.

Pass the FASTA file of the reference as the first argument, and the skf file as the second argument:

ska map ref.fa seqs.skf -o ref_mapped.aln

Add --repeat-mask to mask any repeated split k-mers in the reference with ‘N’.

You can also get a VCF as output, which has rows as variants, and only has the variable sites (but will include unmapped bases as missing). An example command, also demonstrating that everything can be done from input sequences in a single command:

ska map ref.fa seq1.fa seq2.fa -f vcf --threads 2 | bgzip -c - > seqs.vcf.gz

§Efficiency

As ska map is independent for each input file, the most efficient way to run this on a number of samples is with gnu parallel, for example:

cat ${FILE_LIST} | parallel -j 8 "ID=\$(basename {} | sed 's/_a.*fasta/_temp_skf/g');
ska build -o \$ID -k 31 {} ;
MAPOUT=\$(basename {} | sed 's/_a.*fasta/_aln/g');
 ska map $REFERENCE_LOC \${ID}.skf -o \${MAPOUT}.aln"
cat *.aln > ska_map.aln

§ska distance

Use to calculate distances between all samples within an .skf file. The output will contain the number of SNP differences between all pairs of samples, as well as the proportion of matching k-mers.

ska distance -o distances.txt seqs.skf

Ignore ambiguous bases by adding --filter-ambiguous flag, and --min-freq to ignore k-mers only found in some samples. Multiple threads can be used, but this will only be effective with large numbers of samples.

The companion script in scripts/cluster_dists.py (requires networkx) can be used to make single linkage clusters from these distances at given thresholds, and create a visualisation in Microreact:

ska distance seqs.skf > distances.txt
python scripts/cluster_dists.py distances.txt --snps 20 --mismatches 0.05

If you install rapidnj and biopython you can also draw an NJ tree from these distances, which will be displayed in Microreact. Use your --api-key to directly upload and get the URL printed on the terminal.

§ska merge

Use to combine multiple .skf files into one, typically for subsequent use in align or map. This may be particularly useful if ska build was run on multiple input files one at a time (for example in a job array).

ska merge -o all_samples sample1.skf sample2.skf sample3.skf

§ska delete

Remove samples by name from an .skf file. All sample names must exist in the file, or an error will be thrown. After removing samples, the input .skf file will be overwritten. Any split k-mers which have no observed missing bases after removal will also be deleted.

ska delete all_samples.skf sample_1 sample_3

If you wish to delete many samples, you can use -f to provide their names by line via an input file.

§ska weed

Remove (weed) split k-mers from an .skf file. The split k-mers to be removed are generated from a FASTA file, which may for example contain known transposons or other repeat sequences. As with delete, the input .skf file is overwritten by default. Use -o to write the results to a new file instead of overwriting the input .skf.

ska weed all_samples.skf MGEs.fa

In addition, you can also use ska weed to filter split k-mers by proportion of samples they appear in, and constant or amibguous middle cases, which does not require a list of split k-mers to be removed (but both can be used together, if you wish).

For example, you may want to reduce the size of an .skf file for online use. You can do this by removing any constant sites or ambiguous-only s0tes (which are typically unused), and by hard-filtering by frequency (i.e. before writing output):

ska weed --filter no-ambig-or-const --min-freq 0.9 all_samples.skf

§ska nk

Return information on the number of k-mers in an .skf file. This will print on STDOUT:

  • The version of ska used to build the file.
  • The k-mer size.
  • Number of bits used for the split k-mer (64 or 128).
  • Whether reverse complements were used.
  • The total number of split k-mers.
  • The total number of samples.
  • The sample names.
  • The number of split k-mers found in each sample.
ska nk all_samples.skf
ska_version=0.3.1
k=21
k_bits=64
rc=true
k-mers=3228084
samples=28
sample_names=["19183_4#73", "12673_8#29", "12673_8#31", "12754_5#61", "12754_5#89", "12754_5#47", "12754_5#32", "12754_5#78", "12754_4#85", "12754_5#74", "19183_4#57", "12754_5#36", "19183_4#49", "19183_4#79", "19183_4#60", "12754_5#24", "12754_5#22", "12754_5#71", "12673_8#26", "12754_5#95", "12754_5#86", "12673_8#24", "19183_4#61", "12673_8#41", "12754_4#66", "12754_5#80", "12754_5#84", "19183_4#78"]
sample_kmers=[2872587, 2997448, 2949719, 2997496, 2997178, 2912749, 2996491, 2997221, 2949102, 2997454, 2914109, 2912237, 2872518, 2869957, 2872470, 2997992, 2997647, 2958512, 2998099, 2997290, 2950253, 3027707, 2997881, 2907920, 2911447, 2997644, 2944830, 2915080]

If you add --full-info, the split k-mer dictionary will be decoded and printed to STDOUT after the baseline information, for example:

TAAATATC        TAAACGCC        A,-
AGACTCTC        TACAGCTA        G,G
AAACCATC        AAACACTC        A,-
TTAAAAGA        TCTCGTAC        C,C

This is tab-separated, showing the first and second half of the split k-mer and the middle bases of each sample (comma seperated).

NB: Only one split k-mer is shown even if the reverse complement was used. These are not precisely canonical k-mers, as the encoding order {A, C, T, G} is used internally. But if you can’t find a sequence in your input file, you will find its reverse complement.

§ska cov

Estimate a coverage cutoff for use with read data. This will count the split k-mers in a pair of FASTQ samples, and create a histogram of these counts. A mixture model is then fitted to this histogram using maximum likelihood, which can give a suggested cutoff with noisy data.

The cutoff will be printed to STDERR. Use -v to get additional information on the optimisation progress and result. A table of counts and the fit will be printed to STDOUT, which can then be plotted by the companion script in scripts/plot_cov.py (requires matplotlib):

ska cov reads_1.fastq.gz reads_2.fastq.gz -k 31 -v > cov_plot.txt
python scripts/plot_cov.py cov_plot.txt

The cutoff can be used with the --min-count parameter of ska build. For a set of sequence experiments with similar characteristics it may be sufficient to use the same cutoff. Alternatively ska cov can be run on every sample independently (gnu parallel would be an efficient way to do this).

§API usage

See the submodule documentation linked below.

To use ska build in other rust code:

use ska::merge_ska_dict::{InputFastx, build_and_merge};
use ska::merge_ska_array::MergeSkaArray;
use ska::{QualOpts, QualFilter};

// Build, merge
let input_files: [InputFastx; 2] = [("test1".to_string(),
                                     "tests/test_files_in/test_1.fa".to_string(),
                                     None),
                                    ("test2".to_string(),
                                     "tests/test_files_in/test_2.fa".to_string(),
                                     None)];
let rc = true;
let k = 17;
let quality = QualOpts {min_count: 1, min_qual: 0, qual_filter: QualFilter::NoFilter};
let threads = 2;
// NB u64 for k<=31, u128 for k<=63
let merged_dict =
    build_and_merge::<u64>(&input_files, k, rc, &quality, threads);

// Save
let ska_array = MergeSkaArray::new(&merged_dict);
ska_array
    .save(format!("merged.skf").as_str())
    .expect("Failed to save output file");

To use ska align in other rust code:

use ska::io_utils::{load_array, set_ostream};
use ska::cli::FilterType;

// Load an .skf file
let threads = 4;
let input = vec!["tests/test_files_in/merge.skf".to_string()];
let mut ska_array = load_array::<u64>(&input, threads).expect("Could not open input as u64");

// Apply filters
let min_count = 2;
let filter_ambig_as_missing = false;
let update_kmers = false;
let filter = FilterType::NoConst;
let ignore_const_gaps = false;
let ambig_mask = false;
ska_array.filter(min_count, filter_ambig_as_missing, &filter, ambig_mask, ignore_const_gaps, update_kmers);

// Write out to stdout
let mut out_stream = set_ostream(&None);
ska_array
    .write_fasta(&mut out_stream)
    .expect("Couldn't write output fasta");

To use ska map in other rust code, see the example for ska_ref.

§Other APIs

It would be possible to make a python API using maturin. Please submit a feature request if you would find this useful.

Modules§

  • Command line interface, built using crate::clap with Derive
  • Tools for estimating a count cutoff with FASTQ input.
  • Main control of most CLI functions, generic over u64 and u128.
  • Common helper functions for parsing file input, loading, and setting output
  • Main type for working with multiple samples.
  • Type for combining split-kmers from multiple SkaDict.
  • Type for building a split k-mer dictionary from a fasta/fastq file.
  • Types for listing split-kmers from one input file (usually a FASTA reference sequence).

Structs§

  • Quality filtering options for FASTQ files

Enums§

  • Possible quality score filters when building with reads