Expand description
This module contains various useful pattern matching algorithms. The implementations are based on the lecture notes “Algorithmen auf Sequenzen”, Kopczynski, Marschall, Martin and Rahmann, 2008 - 2015.
- Algorithm of Horspool: fastest for a sufficiently large alphabet
- Shift And algorithm: fast for patterns with less than 64 symbols and very small alphabets.
- BNDM algorithm: fast for patterns with less than 64 symbols.
- BOM algorithm: fast for long patterns and small alphabet.
- KMP algorithm: the classical ancestor.
- Ukkonens algorithm: approximate pattern matching with dynamic programming.
- Myers algorithm: linear-time approximate pattern matching with edit distance.
Another library that provides heavily optimized routines for string search primitives is memchr: https://crates.io/crates/memchr
Modules§
- bndm
- Backward nondeterministic DAWG matching (BNDM). Best-case complexity: O(n / m) with pattern of length m <= 64 and text of length n. Worst case complexity: O(n * m).
- bom
- Backward oracle matching algorithm. Best-case complexity: O(n / m) with pattern of length m and text of length n. Worst case complexity: O(n * m).
- horspool
- Algorithm of Horspool. Window-based, similar to but faster than Boyer-Moore.
- kmp
- Algorithm of Knuth Morris and Pratt. Constructs an automaton recognizing the pattern, and scans linearly over a text of length n. Complexity: O(n). The transition function delta is simulated via the lps-function, that assigns to each position q in the pattern the longest prefix of the pattern that is suffix of pattern[..q+1]. Then, in the NFA for the pattern, active states after reading position q are {q, lps(q), lps(lps(q)), … 0}.
- myers
- Myers bit-parallel approximate pattern matching algorithm.
Finds all matches up to a given edit distance k and supports ambiguous matching.
Methods for obtaining the alignment path are provided.
The simple and fast implementation works with patterns of up to
64 or 128 symbols, depending on the bit vector type used.
The block-based implementation (
longsubmodule) supports patterns of arbitrary length. - pssm
- Create a weight matrix representing a set of aligned reference sequences that constitute a motif, and use this matrix to scan query sequences for occurrences of this motif. Complexity: O(n*m) for motif length n and query length m
- shift_
and ShiftAndalgorithm for pattern matching. Patterns may contain at most 64 symbols. Complexity: O(n) with text length n.- ukkonen
- Bounded version of Ukkonens DP algorithm for approximate pattern matching. Complexity: O(n * k) on random texts of length n.