Module pattern_matching

Module pattern_matching 

Source
Expand description

This module contains various useful pattern matching algorithms. The implementations are based on the lecture notes “Algorithmen auf Sequenzen”, Kopczynski, Marschall, Martin and Rahmann, 2008 - 2015.

  • Algorithm of Horspool: fastest for a sufficiently large alphabet
  • Shift And algorithm: fast for patterns with less than 64 symbols and very small alphabets.
  • BNDM algorithm: fast for patterns with less than 64 symbols.
  • BOM algorithm: fast for long patterns and small alphabet.
  • KMP algorithm: the classical ancestor.
  • Ukkonens algorithm: approximate pattern matching with dynamic programming.
  • Myers algorithm: linear-time approximate pattern matching with edit distance.

Another library that provides heavily optimized routines for string search primitives is memchr: https://crates.io/crates/memchr

Modules§

bndm
Backward nondeterministic DAWG matching (BNDM). Best-case complexity: O(n / m) with pattern of length m <= 64 and text of length n. Worst case complexity: O(n * m).
bom
Backward oracle matching algorithm. Best-case complexity: O(n / m) with pattern of length m and text of length n. Worst case complexity: O(n * m).
horspool
Algorithm of Horspool. Window-based, similar to but faster than Boyer-Moore.
kmp
Algorithm of Knuth Morris and Pratt. Constructs an automaton recognizing the pattern, and scans linearly over a text of length n. Complexity: O(n). The transition function delta is simulated via the lps-function, that assigns to each position q in the pattern the longest prefix of the pattern that is suffix of pattern[..q+1]. Then, in the NFA for the pattern, active states after reading position q are {q, lps(q), lps(lps(q)), … 0}.
myers
Myers bit-parallel approximate pattern matching algorithm. Finds all matches up to a given edit distance k and supports ambiguous matching. Methods for obtaining the alignment path are provided. The simple and fast implementation works with patterns of up to 64 or 128 symbols, depending on the bit vector type used. The block-based implementation (long submodule) supports patterns of arbitrary length.
pssm
Create a weight matrix representing a set of aligned reference sequences that constitute a motif, and use this matrix to scan query sequences for occurrences of this motif. Complexity: O(n*m) for motif length n and query length m
shift_and
ShiftAnd algorithm for pattern matching. Patterns may contain at most 64 symbols. Complexity: O(n) with text length n.
ukkonen
Bounded version of Ukkonens DP algorithm for approximate pattern matching. Complexity: O(n * k) on random texts of length n.