Expand description
Suffix automaton (directed acyclic word graph, DAWG).
References:
- Anselm Blumer, Janet Blumer, David Haussler, Andrzej Ehrenfeucht, M. T. Chen & Joel Seiferas, “The smallest automaton recognizing the subwords of a text”, Theoretical Computer Science 40, 1985, pp. 31–55.
- Maxime Crochemore, “Transducers and repetitions”, TCS 45, 1986 — the linear online construction with state cloning on split, used here.
§What a suffix automaton is
The suffix automaton of a string s is the smallest deterministic finite
automaton (with at most 2|s| − 1 states for |s| ≥ 2) whose recognised
language is exactly the set of substrings of s. Every state v carries:
len[v]— the length of the longest string inv’s endpos-equivalence class (two substrings are equivalent when they end at the same set of positions ins), andlink[v]— the suffix link to the state of the longest proper suffix ofv’s strings that lies in a different endpos class.
The strings recognised at state v are precisely the suffixes of its
longest string whose lengths lie in (len[link[v]], len[v]].
§Online construction
Characters are appended one at a time. Extending by c creates a new state
cur with len[cur] = len[last] + 1 and walks the suffix-link chain from
last adding c-transitions to cur. The interesting case is cloning:
when the chain reaches a state p that already has a c-transition to some
q with len[q] != len[p] + 1, a copy clone of q is inserted with
len[clone] = len[p] + 1, redirecting the affected transitions and suffix
links. This split is what keeps the automaton minimal; it is exercised
directly by the tests (e.g. "abcbc").
§Provided queries
SuffixAutomaton::contains— substring membership.SuffixAutomaton::distinct_substring_count— number of distinct non-empty substrings,Σ_v (len[v] − len[link[v]]).SuffixAutomaton::occurrences— number of (possibly overlapping) occurrences of a pattern, via the size of each state’s endpos set.longest_common_substring— the longest common substring of two strings by running the automaton of one over the other.
Input is treated as raw bytes (&[u8]); for ASCII this matches the usual
character-level notion of substring.
Structs§
- Suffix
Automaton - A suffix automaton (DAWG) built online from a byte string.
Functions§
- longest_
common_ substring - Longest common substring of
aandb.