Module bio::pattern_matching::myers
source · Expand description
Myers bit-parallel approximate pattern matching algorithm. Finds all matches up to a given edit distance. The pattern has to fit into a bitvector, and is thus limited to 64 or (since stable Rust version 1.26) to 128 symbols. Complexity: O(n)
Example
Iterating over matches in pairs of (end, distance)
using u64
as bitvector type:
use bio::pattern_matching::myers::Myers;
let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";
let myers = Myers::<u64>::new(pattern);
let occ: Vec<_> = myers.find_all_end(text, 2).collect();
assert_eq!(occ, [(11, 2), (12, 2)]);
Starting with stable Rust 1.26, it is also possible to use u128
as bitvector (Myers128
),
which enables longer patterns, but is somewhat slower.
Obtaining the starting position of a match
The Myers::find_all
method provides an iterator over tuples of (start, end, distance)
.
Calculating the starting position requires finding the alignment path, therefore this is
slower than Myers::find_all_end
. Note that the end positions differ from above by one.
This is intentional, as the iterator returns a range rather an index, and ranges in Rust
do not include the end position by default.
use bio::pattern_matching::myers::Myers;
let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";
let mut myers = Myers::<u64>::new(pattern);
let occ: Vec<_> = myers.find_all(text, 2).collect();
assert_eq!(occ, [(3, 12, 2), (3, 13, 2)]);
Obtaining alignments
FullMatches
returned by Myers::find_all()
also provide a method
for obtaining an alignment path:
use bio::pattern_matching::myers::Myers;
use bio::alignment::Alignment;
let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";
let mut myers = Myers::<u64>::new(pattern);
// create an 'empty' alignment instance, which can be reused
let mut aln = Alignment::default();
let mut matches = myers.find_all(text, 3);
while matches.next_alignment(&mut aln) {
println!("Hit fond in range: {}..{} (distance: {})", aln.ystart, aln.yend, aln.score);
println!("{}", aln.pretty(pattern, text));
}
Output:
Hit fond in range: 3..10 (distance: 3) TCCTAGGGC ||||+|\|+ TCCTCCT-GAG-GGATTAGCAC Hit fond in range: 3..11 (distance: 3) TCCTAGGGC ||||+|\|\ TCCTCCT-GAGGGATTAGCAC Hit fond in range: 3..12 (distance: 2) TCCT-AGGGC ||||x||||+ TCCTCCTGAGGG-ATTAGCAC Hit fond in range: 3..13 (distance: 2) TCCT-AGGGC ||||x||||\ TCCTCCTGAGGGATTAGCAC ... (truncated)
Note that the Alignment
instance is only created
once and then reused. Because the Myers algorithm is very fast, the allocation necessary for
Alignment::operations
can have a non-negligible impact on performance; and thus, recycling
makes sense.
Finding the best hit
In many cases, only the match with the smallest edit distance is actually of interest.
Calculating an alignment for every hit is therefore not necessary.
LazyMatches
returned by Myers::find_all_lazy()
provide an iterator over tuples of (end, distance)
like Myers::find_all_end()
, but
additionally keep the data necessary for calculating the alignment path later at any desired
position. Storing the data itself has a slight performance impact and requires more memory
compared to Myers::find_all_end()
[O(n) as opposed to O(m + k)]. Still the following code
is faster than using FullMatches
:
use bio::pattern_matching::myers::Myers;
use bio::alignment::Alignment;
let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";
let mut myers = Myers::<u64>::new(pattern);
let mut aln = Alignment::default();
let mut matches = myers.find_all_lazy(text, 2);
// first, find the best hit
let (best_end, _) = matches
.by_ref()
.min_by_key(|&(_, dist)| dist)
.unwrap();
// now calculate the alignment
matches.alignment_at(best_end, &mut aln);
println!("Best alignment at {}..{} (distance: {})", aln.ystart, aln.yend, aln.score);
println!("{}", aln.pretty(pattern, text));
Output:
Best alignment at 3..12 (distance: 2) TCCT-AGGGC ||||x||||+ TCCTCCTGAGGG-ATTAGCAC
Actually as seen in the previous chapters, there are two hits with the same distance of 2. It may make sense to consider both of them.
Dealing with ambiguities
Matching multiple or all symbols at once can be achieved using MyersBuilder
. This example
allows N
in the search pattern to match all four DNA bases in the text:
use bio::pattern_matching::myers::MyersBuilder;
let text = b"GTCTGATCTTACC";
let pattern = b"TGATCNT";
let myers = MyersBuilder::new()
.ambig(b'N', b"ACGT")
.build_64(pattern);
assert_eq!(myers.distance(text), 0);
For more examples see the documentation of MyersBuilder
.