Expand description
Suffix array (SA-IS) with Kasai’s LCP array and SA binary-search.
References:
- Ge Nong, Sen Zhang & Wai Hong Chan, “Linear Suffix Array Construction by Almost Pure Induced-Sorting”, Data Compression Conference (DCC), 2009, pp. 193–202 — the SA-IS algorithm implemented here.
- Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa & Kunsoo Park, “Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications”, CPM 2001, LNCS 2089, pp. 181–192 — the LCP array.
§Suffix array
The suffix array of a string s of length n is the permutation
sa[0..n] of 0..n that lists the starting positions of all suffixes of
s in lexicographic order: s[sa[0]..] < s[sa[1]..] < … < s[sa[n-1]..].
SA-IS builds it in O(n) time and O(n) space by:
- classifying each position as S-type (its suffix is lexicographically smaller than the next position’s suffix) or L-type (larger);
- identifying LMS (“left-most S”) positions — an S-type position whose predecessor is L-type — which split the string into LMS-substrings;
- sorting the LMS-substrings by two induced-sort passes (sort L-types from sorted S/LMS, then S-types from sorted L-types);
- naming the sorted LMS-substrings and recursing on the shorter name string if any two share a name, then inducing the final order from the recursive answer.
A unique sentinel smaller than every real byte is appended internally so the last suffix is always the unique smallest; the returned array excludes it.
§LCP array (Kasai)
lcp[i] (for 1 ≤ i < n) is the length of the longest common prefix of the
two adjacent suffixes s[sa[i-1]..] and s[sa[i]..]; lcp[0] = 0 by
convention. Kasai’s algorithm computes all of them in O(n) using the rank
(inverse SA) array and the observation that the LCP of consecutive text
positions can only shrink by one as we advance the suffix start.
§Pattern search
Because suffixes are sorted, every occurrence of a pattern p corresponds to
a contiguous range of the suffix array (the suffixes that have p as a
prefix). SuffixArray::search locates that range with two binary searches
and returns the sorted text positions.
Inputs are raw bytes (&[u8]); the alphabet is the full byte range.
Structs§
- Suffix
Array - A suffix array together with its source length and (lazily-built) LCP array.