Module bio::pattern_matching
source · Expand description
This module contains various useful pattern matching algorithms. The implementations are based on the lecture notes “Algorithmen auf Sequenzen”, Kopczynski, Marschall, Martin and Rahmann, 2008 - 2015.
- Algorithm of Horspool: fastest for a sufficiently large alphabet
- Shift And algorithm: fast for patterns with less than 64 symbols and very small alphabets.
- BNDM algorithm: fast for patterns with less than 64 symbols.
- BOM algorithm: fast for long patterns and small alphabet.
- KMP algorithm: the classical ancestor.
- Ukkonens algorithm: approximate pattern matching with dynamic programming.
- Myers algorithm: linear-time approximate pattern matching with edit distance for small patterns
Another fast pattern matching algorithm is available in the twoway crate: https://crates.io/crates/twoway
Modules
Backward nondeterministic DAWG matching (BNDM).
Best-case complexity: O(n / m) with pattern of length m <= 64 and text of length n.
Worst case complexity: O(n * m).
Backward oracle matching algorithm.
Best-case complexity: O(n / m) with pattern of length m and text of length n.
Worst case complexity: O(n * m).
Algorithm of Horspool.
Window-based, similar to but faster than Boyer-Moore.
Algorithm of Knuth Morris and Pratt.
Constructs an automaton recognizing the pattern, and scans linearly over
a text of length n. Complexity: O(n).
The transition function delta is simulated via the lps-function, that assigns to each position
q in the pattern the longest prefix of the pattern that is suffix of pattern[..q+1].
Then, in the NFA for the pattern, active states after reading position q are
{q, lps(q), lps(lps(q)), … 0}.
Myers bit-parallel approximate pattern matching algorithm.
Finds all matches up to a given edit distance. The pattern has to fit into a bitvector,
and is thus limited to 64 or (since stable Rust version 1.26) to 128 symbols.
Complexity: O(n)
Create a weight matrix representing a set of aligned reference sequences
that constitute a motif, and use this matrix to scan query sequences for
occurrences of this motif.
Complexity: O(n*m) for motif length n and query length m
ShiftAnd
algorithm for pattern matching.
Patterns may contain at most 64 symbols.
Complexity: O(n) with text length n.Bounded version of Ukkonens DP algorithm for approximate pattern matching.
Complexity: O(n * k) on random texts.