Module bio::pattern_matching

source ·
Expand description

This module contains various useful pattern matching algorithms. The implementations are based on the lecture notes “Algorithmen auf Sequenzen”, Kopczynski, Marschall, Martin and Rahmann, 2008 - 2015.

  • Algorithm of Horspool: fastest for a sufficiently large alphabet
  • Shift And algorithm: fast for patterns with less than 64 symbols and very small alphabets.
  • BNDM algorithm: fast for patterns with less than 64 symbols.
  • BOM algorithm: fast for long patterns and small alphabet.
  • KMP algorithm: the classical ancestor.
  • Ukkonens algorithm: approximate pattern matching with dynamic programming.
  • Myers algorithm: linear-time approximate pattern matching with edit distance for small patterns

Another library that provides heavily optimized routines for string search primitives is memchr: https://crates.io/crates/memchr

Modules§

  • Backward nondeterministic DAWG matching (BNDM). Best-case complexity: O(n / m) with pattern of length m <= 64 and text of length n. Worst case complexity: O(n * m).
  • Backward oracle matching algorithm. Best-case complexity: O(n / m) with pattern of length m and text of length n. Worst case complexity: O(n * m).
  • Algorithm of Horspool. Window-based, similar to but faster than Boyer-Moore.
  • Algorithm of Knuth Morris and Pratt. Constructs an automaton recognizing the pattern, and scans linearly over a text of length n. Complexity: O(n). The transition function delta is simulated via the lps-function, that assigns to each position q in the pattern the longest prefix of the pattern that is suffix of pattern[..q+1]. Then, in the NFA for the pattern, active states after reading position q are {q, lps(q), lps(lps(q)), … 0}.
  • Myers bit-parallel approximate pattern matching algorithm. Finds all matches up to a given edit distance. The pattern has to fit into a bitvector, and is thus limited to 64 or (since stable Rust version 1.26) to 128 symbols. Complexity: O(n)
  • Create a weight matrix representing a set of aligned reference sequences that constitute a motif, and use this matrix to scan query sequences for occurrences of this motif. Complexity: O(n*m) for motif length n and query length m
  • ShiftAnd algorithm for pattern matching. Patterns may contain at most 64 symbols. Complexity: O(n) with text length n.
  • Bounded version of Ukkonens DP algorithm for approximate pattern matching. Complexity: O(n * k) on random texts.