Module bio::pattern_matching::myers

source ·
Expand description

Myers bit-parallel approximate pattern matching algorithm. Finds all matches up to a given edit distance. The pattern has to fit into a bitvector, and is thus limited to 64 or (since stable Rust version 1.26) to 128 symbols. Complexity: O(n)

Example

Iterating over matches in pairs of (end, distance) using u64 as bitvector type:

use bio::pattern_matching::myers::Myers;

let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";

let myers = Myers::<u64>::new(pattern);
let occ: Vec<_> = myers.find_all_end(text, 2).collect();

assert_eq!(occ, [(11, 2), (12, 2)]);

Starting with stable Rust 1.26, it is also possible to use u128 as bitvector (Myers128), which enables longer patterns, but is somewhat slower.

Obtaining the starting position of a match

The Myers::find_all method provides an iterator over tuples of (start, end, distance). Calculating the starting position requires finding the alignment path, therefore this is slower than Myers::find_all_end. Note that the end positions differ from above by one. This is intentional, as the iterator returns a range rather an index, and ranges in Rust do not include the end position by default.

use bio::pattern_matching::myers::Myers;

let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";

let mut myers = Myers::<u64>::new(pattern);
let occ: Vec<_> = myers.find_all(text, 2).collect();

assert_eq!(occ, [(3, 12, 2), (3, 13, 2)]);

Obtaining alignments

FullMatches returned by Myers::find_all() also provide a method for obtaining an alignment path:

use bio::pattern_matching::myers::Myers;
use bio::alignment::Alignment;

let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";

let mut myers = Myers::<u64>::new(pattern);
// create an 'empty' alignment instance, which can be reused
let mut aln = Alignment::default();

let mut matches = myers.find_all(text, 3);
while matches.next_alignment(&mut aln) {
    println!("Hit fond in range: {}..{} (distance: {})", aln.ystart, aln.yend, aln.score);
    println!("{}", aln.pretty(pattern, text));
}

Output:

Hit fond in range: 3..10 (distance: 3)
   TCCTAGGGC
   ||||+|\|+
TCCTCCT-GAG-GGATTAGCAC

Hit fond in range: 3..11 (distance: 3)
   TCCTAGGGC
   ||||+|\|\
TCCTCCT-GAGGGATTAGCAC

Hit fond in range: 3..12 (distance: 2)
   TCCT-AGGGC
   ||||x||||+
TCCTCCTGAGGG-ATTAGCAC

Hit fond in range: 3..13 (distance: 2)
   TCCT-AGGGC
   ||||x||||\
TCCTCCTGAGGGATTAGCAC

... (truncated)

Note that the Alignment instance is only created once and then reused. Because the Myers algorithm is very fast, the allocation necessary for Alignment::operations can have a non-negligible impact on performance; and thus, recycling makes sense.

Finding the best hit

In many cases, only the match with the smallest edit distance is actually of interest. Calculating an alignment for every hit is therefore not necessary. LazyMatches returned by Myers::find_all_lazy() provide an iterator over tuples of (end, distance) like Myers::find_all_end(), but additionally keep the data necessary for calculating the alignment path later at any desired position. Storing the data itself has a slight performance impact and requires more memory compared to Myers::find_all_end() [O(n) as opposed to O(m + k)]. Still the following code is faster than using FullMatches:

use bio::pattern_matching::myers::Myers;
use bio::alignment::Alignment;

let text = b"CGGTCCTGAGGGATTAGCAC";
let pattern = b"TCCTAGGGC";

let mut myers = Myers::<u64>::new(pattern);
let mut aln = Alignment::default();

let mut matches = myers.find_all_lazy(text, 2);

// first, find the best hit
let (best_end, _) = matches
    .by_ref()
    .min_by_key(|&(_, dist)| dist)
    .unwrap();

// now calculate the alignment
matches.alignment_at(best_end, &mut aln);
println!("Best alignment at {}..{} (distance: {})", aln.ystart, aln.yend, aln.score);
println!("{}", aln.pretty(pattern, text));

Output:

Best alignment at 3..12 (distance: 2)
   TCCT-AGGGC
   ||||x||||+
TCCTCCTGAGGG-ATTAGCAC

Actually as seen in the previous chapters, there are two hits with the same distance of 2. It may make sense to consider both of them.

Dealing with ambiguities

Matching multiple or all symbols at once can be achieved using MyersBuilder. This example allows N in the search pattern to match all four DNA bases in the text:

use bio::pattern_matching::myers::MyersBuilder;

let text = b"GTCTGATCTTACC";
let pattern = b"TGATCNT";

let myers = MyersBuilder::new()
    .ambig(b'N', b"ACGT")
    .build_64(pattern);
assert_eq!(myers.distance(text), 0);

For more examples see the documentation of MyersBuilder.

Structs

Iterator over tuples of starting position, end position and distance of matches. In addition, methods for obtaining the hit alignment path are provided.
Iterator over tuples of end position and distance of matches. In addition, methods for obtaining the hit alignment path are provided.
Iterator over pairs of end positions and distance of matches.
Myers algorithm.
Builds a Myers instance, allowing to specify ambiguities.

Traits

This trait must be implemented for integer types serving as bit vectors. Only unsigned integers will work correctly.