[−][src]Module bio::pattern_matching::myers
Myers bit-parallel approximate pattern matching algorithm. Finds all matches up to a given edit distance. The pattern has to fit into a bitvector, and is thus limited to 64 or (since stable Rust version 1.26) to 128 symbols. Complexity: O(n)
Traceback allows obtaining the starting position and the alignment path of the hit. Its implementation is somehow similar to the one by Edlib (Šošić and Šikić 2017), although there can be small differences when there is more than one possible alignment path with then same edit distance at a position: Edlib prefers to make insertions and deletions to the pattern (query) over substitutions (Insertion > Deletion > Substitution) while our implementation prefers substitutions (Substitution > Insertion > Deletion).
Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM (JACM) 46, 395–415.
Šošić, M., and Šikić, M. (2017). Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance. Bioinformatics 33, 1394–1395.
Example
Iterating over matches in pairs of (end, distance)
using u64
as bitvector type:
use bio::pattern_matching::myers::Myers; let text = b"CGGTCCTGAGGGATTAGCAC"; let pattern = b"TCCTAGGGC"; let myers = Myers::<u64>::new(pattern); let occ: Vec<_> = myers.find_all_end(text, 2).collect(); assert_eq!(occ, [(11, 2), (12, 2)]);
Starting with stable Rust 1.26, it is also possible to use u128
as bitvector
(Myers::<u128>
), which enables longer patterns, but is somewhat slower.
Long patterns
With the default implementation, query (pattern) length is limited by the size of the
bit-vector; 64 symbols for Myers::<u64>
. Patterns longer than 128 symbols (when using
u128
as bit-vector) can only be handled by using the block-based Myers implementation,
which lives in the long
submodule. An example:
use bio::pattern_matching::myers::{Myers, long}; let text = b"CGGTCCTGAGGGATTAGCAC"; let pattern = b"TCCTAGGGC"; let myers_64 = Myers::<u64>::new(pattern); let occ_64: Vec<_> = myers_64.find_all_end(text, 2).collect(); // the pattern of length 9 is too long to fit into a single `u8` bit-vector // (panics!) // let myers_8 = Myers::<u8>::new(pattern); // However, we can use the block-based implementation with `u8` bit-vectors let myers_long_8 = long::Myers::<u8>::new(pattern); let occ_long_8: Vec<_> = myers_long_8 .find_all_end(text, 2) .map(|(end, dist)| (end, dist as u8)) .collect(); assert_eq!(occ_64, occ_long_8);
Note that u8
just used for demonstration, using u64
is still the best in most cases.
Obtaining the starting position of a match
The Myers::find_all
method provides an iterator over tuples of (start, end, distance)
.
Calculating the starting position requires finding the alignment path, therefore this is
slower than Myers::find_all_end
. Note that the end positions differ from above by one.
This is intentional, as the iterator returns a range rather an index, and ranges in Rust
do not include the end position by default.
use bio::pattern_matching::myers::Myers; let text = b"CGGTCCTGAGGGATTAGCAC"; let pattern = b"TCCTAGGGC"; let mut myers = Myers::<u64>::new(pattern); let occ: Vec<_> = myers.find_all(text, 2).collect(); assert_eq!(occ, [(3, 12, 2), (3, 13, 2)]);
Obtaining alignments
FullMatches
returned by Myers::find_all()
also provide a method
for obtaining an alignment path:
use bio::pattern_matching::myers::Myers; use bio::alignment::Alignment; let text = b"CGGTCCTGAGGGATTAGCAC"; let pattern = b"TCCTAGGGC"; let mut myers = Myers::<u64>::new(pattern); // create an 'empty' alignment instance, which can be reused let mut aln = Alignment::default(); let mut matches = myers.find_all(text, 3); while matches.next_alignment(&mut aln) { //println!("Hit fond in range: {}..{} (distance: {})", aln.ystart, aln.yend, aln.score); //println!("{}", aln.pretty(pattern, text)); }
Output:
Hit fond in range: 3..10 (distance: 3) TCCTAGGGC ||||+|\|+ TCCTCCT-GAG-GGATTAGCAC Hit fond in range: 3..11 (distance: 3) TCCTAGGGC ||||+|\|\ TCCTCCT-GAGGGATTAGCAC Hit fond in range: 3..12 (distance: 2) TCCT-AGGGC ||||x||||+ TCCTCCTGAGGG-ATTAGCAC Hit fond in range: 3..13 (distance: 2) TCCT-AGGGC ||||x||||\ TCCTCCTGAGGGATTAGCAC ... (truncated)
Note that the Alignment
instance is only created
once and then reused. Because the Myers algorithm is very fast, the allocation necessary for
Alignment::operations
can have a non-negligible impact on performance; and thus, recycling
makes sense.
Finding the best hit
In many cases, only the match with the smallest edit distance is actually of interest.
Calculating an alignment for every hit is therefore not necessary.
LazyMatches
returned by Myers::find_all_lazy()
provide an iterator over tuples of (end, distance)
like Myers::find_all_end()
, but
additionally keep the data necessary for calculating the alignment path later at any desired
position. Storing the data itself has a slight performance impact and requires more memory
compared to Myers::find_all_end()
[O(n) as opposed to O(m + k)]. Still the following code
is faster than using FullMatches
:
use bio::pattern_matching::myers::Myers; use bio::alignment::Alignment; let text = b"CGGTCCTGAGGGATTAGCAC"; let pattern = b"TCCTAGGGC"; let mut myers = Myers::<u64>::new(pattern); let mut aln = Alignment::default(); let mut matches = myers.find_all_lazy(text, 2); // first, find the best hit let (best_end, _) = matches .by_ref() .min_by_key(|&(_, dist)| dist) .unwrap(); // now calculate the alignment matches.alignment_at(best_end, &mut aln); println!("Best alignment at {}..{} (distance: {})", aln.ystart, aln.yend, aln.score); println!("{}", aln.pretty(pattern, text));
Output:
Best alignment at 3..12 (distance: 2) TCCT-AGGGC ||||x||||+ TCCTCCTGAGGG-ATTAGCAC
Actually as seen in the previous chapters, there are two hits with the same distance of 2. It may make sense to consider both of them.
Dealing with ambiguities
Matching multiple or all symbols at once can be achieved using MyersBuilder
. This example
allows N
in the search pattern to match all four DNA bases in the text:
use bio::pattern_matching::myers::MyersBuilder; let text = b"GTCTGATCTTACC"; let pattern = b"TGATCNT"; let myers = MyersBuilder::new() .ambig(b'N', b"ACGT") .build_64(pattern); assert_eq!(myers.distance(text), 0);
For more examples see the documentation of MyersBuilder
.
Modules
long | Block-based version of the algorithm, which does not restrict pattern length. |
Structs
FullMatches | Iterator over tuples of starting position, end position and distance of matches. In addition, methods for obtaining the hit alignment path are provided. |
LazyMatches | Iterator over tuples of end position and distance of matches. In addition, methods for obtaining the hit alignment path are provided. |
Matches | Iterator over pairs of end positions and distance of matches. |
Myers | Myers algorithm. |
MyersBuilder | Builds a Myers instance, allowing to specify ambiguities. |
Traits
BitVec | This trait must be implemented for integer types serving as bit vectors. Only unsigned integers will work correctly. |
DistType | Trait for types that should be used to store the distance score when using the simple
Myers algorithm (not the block-based one, which always uses |