[][src]Module bio::pattern_matching

This module contains various useful pattern matching algorithms. The implementations are based on the lecture notes "Algorithmen auf Sequenzen", Kopczynski, Marschall, Martin and Rahmann, 2008 - 2015.

  • Algorithm of Horspool: fastest for a sufficiently large alphabet
  • Shift And algorithm: fast for patterns with less than 64 symbols and very small alphabets.
  • BNDM algorithm: fast for patterns with less than 64 symbols.
  • BOM algorithm: fast for long patterns and small alphabet.
  • KMP algorithm: the classical ancestor.
  • Ukkonens algorithm: approximate pattern matching with dynamic programming.
  • Myers algorithm: linear-time approximate pattern matching with edit distance for small patterns

Another fast pattern matching algorithm is available in the twoway crate: https://crates.io/crates/twoway

Modules

bndm

Backward nondeterministic DAWG matching (BNDM). Best-case complexity: O(n / m) with pattern of length m <= 64 and text of length n. Worst case complexity: O(n * m).

bom

Backward oracle matching algorithm. Best-case complexity: O(n / m) with pattern of length m and text of length n. Worst case complexity: O(n * m).

horspool

Algorithm of Horspool. Window-based, similar to but faster than Boyer-Moore.

kmp

Algorithm of Knuth Morris and Pratt. Constructs an automaton recognizing the pattern, and scans linearly over a text of length n. Complexity: O(n). The transition function delta is simulated via the lps-function, that assigns to each position q in the pattern the longest prefix of the pattern that is suffix of pattern[..q+1]. Then, in the NFA for the pattern, active states after reading position q are {q, lps(q), lps(lps(q)), ... 0}.

myers

Myers bit-parallel approximate pattern matching algorithm. Finds all matches up to a given edit distance. The pattern has to fit into a bitvector, and is thus limited to 64 or (since stable Rust version 1.26) to 128 symbols. Complexity: O(n)

pssm

Create a weight matrix representing a set of aligned reference sequences that constitute a motif, and use this matrix to scan query sequences for occurrences of this motif. Complexity: O(n*m) for motif length n and query length m

shift_and

ShiftAnd algorithm for pattern matching. Patterns may contain at most 64 symbols. Complexity: O(n) with text length n.

ukkonen

Bounded version of Ukkonens DP algorithm for approximate pattern matching. Complexity: O(n * k) on random texts.