Skip to main content

Module suffix_automaton

Module suffix_automaton 

Source
Expand description

Suffix automaton (directed acyclic word graph, DAWG).

References:

  • Anselm Blumer, Janet Blumer, David Haussler, Andrzej Ehrenfeucht, M. T. Chen & Joel Seiferas, “The smallest automaton recognizing the subwords of a text”, Theoretical Computer Science 40, 1985, pp. 31–55.
  • Maxime Crochemore, “Transducers and repetitions”, TCS 45, 1986 — the linear online construction with state cloning on split, used here.

§What a suffix automaton is

The suffix automaton of a string s is the smallest deterministic finite automaton (with at most 2|s| − 1 states for |s| ≥ 2) whose recognised language is exactly the set of substrings of s. Every state v carries:

  • len[v] — the length of the longest string in v’s endpos-equivalence class (two substrings are equivalent when they end at the same set of positions in s), and
  • link[v] — the suffix link to the state of the longest proper suffix of v’s strings that lies in a different endpos class.

The strings recognised at state v are precisely the suffixes of its longest string whose lengths lie in (len[link[v]], len[v]].

§Online construction

Characters are appended one at a time. Extending by c creates a new state cur with len[cur] = len[last] + 1 and walks the suffix-link chain from last adding c-transitions to cur. The interesting case is cloning: when the chain reaches a state p that already has a c-transition to some q with len[q] != len[p] + 1, a copy clone of q is inserted with len[clone] = len[p] + 1, redirecting the affected transitions and suffix links. This split is what keeps the automaton minimal; it is exercised directly by the tests (e.g. "abcbc").

§Provided queries

Input is treated as raw bytes (&[u8]); for ASCII this matches the usual character-level notion of substring.

Structs§

SuffixAutomaton
A suffix automaton (DAWG) built online from a byte string.

Functions§

longest_common_substring
Longest common substring of a and b.