[][src]Module bio::alignment::sparse

Calculate 'sparse' alignments from kmer matches. Can be much faster than Smith-Waterman for long string, when a large enough k is used. Complexity: O(n * log(n)) for a pair of strings with n k-kmer matches. This approach is useful for generating an approximate 'backbone' alignments between two long sequences, for example in long-read alignment or genome-genome alignment. The backbone alignment can be used as-is, or can serve as a guide for a banded alignment. By tuning k so that len(query) + len(reference) < 4^k, the number of false positive kmer matches is kept small, resulting in very fast run times for long strings.

Example

use bio::alignment::sparse::*;

let s1 =   b"ACGTACGATAGGTA";
let s2 = b"TTACGTACGATAGGTATT";
let k = 8;
let matches = find_kmer_matches(s1, s2, k);
let sparse_al = lcskpp(&matches, k);
let match_path: Vec<(u32,u32)> = sparse_al.path.iter().map(|i| matches[*i]).collect();
assert_eq!(match_path, vec![(0,2), (1,3), (2,4), (3,5), (4,6), (5,7), (6,8)]);
assert_eq!(sparse_al.score, 14);

Structs

SparseAlignmentResult

Result of a sparse alignment

Functions

expand_kmer_matches
find_kmer_matches

Find all matches of length k between two strings, using a q-gram index. For very long reference strings, it may be more efficient to use and FMD index to generate the matches. Note that this method is mainly for demonstration & testing purposes. For aligning many query sequences against the same reference, you should reuse the QGramIndex of the reference.

find_kmer_matches_seq1_hashed
find_kmer_matches_seq2_hashed
hash_kmers

Creates a HashMap containing all the k-mers in the sequence. FxHasher is used as the hash function instead of the inbuilt one. A good rolling hash function should speed up the code.

lcskpp

Sparse DP routine for Longest Common Subsequence in length k substrings. Also known of LCSk++ From LCSk++: Practical similarity metric for long strings. Filip Pavetić, Goran Žužić, Mile Šikić Paper here :https://arxiv.org/abs/1407.2407. Original implementation here: https://github.com/fpavetic/lcskpp

sdpkpp

Sparse DP routine generalizing LCSk++ method above to penalize alignment gaps. A gap is an unknown combination of mismatch, insertion and deletions, and incurs a penalty of gap_open + d * gap_extend, where d is the distance along the diagonal of the gap.

sdpkpp_union_lcskpp_path

Type Definitions

HashMapFx