Skip to main content

Module suffix_array

Module suffix_array 

Source
Expand description

Suffix array (SA-IS) with Kasai’s LCP array and SA binary-search.

References:

  • Ge Nong, Sen Zhang & Wai Hong Chan, “Linear Suffix Array Construction by Almost Pure Induced-Sorting”, Data Compression Conference (DCC), 2009, pp. 193–202 — the SA-IS algorithm implemented here.
  • Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa & Kunsoo Park, “Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications”, CPM 2001, LNCS 2089, pp. 181–192 — the LCP array.

§Suffix array

The suffix array of a string s of length n is the permutation sa[0..n] of 0..n that lists the starting positions of all suffixes of s in lexicographic order: s[sa[0]..] < s[sa[1]..] < … < s[sa[n-1]..].

SA-IS builds it in O(n) time and O(n) space by:

  1. classifying each position as S-type (its suffix is lexicographically smaller than the next position’s suffix) or L-type (larger);
  2. identifying LMS (“left-most S”) positions — an S-type position whose predecessor is L-type — which split the string into LMS-substrings;
  3. sorting the LMS-substrings by two induced-sort passes (sort L-types from sorted S/LMS, then S-types from sorted L-types);
  4. naming the sorted LMS-substrings and recursing on the shorter name string if any two share a name, then inducing the final order from the recursive answer.

A unique sentinel smaller than every real byte is appended internally so the last suffix is always the unique smallest; the returned array excludes it.

§LCP array (Kasai)

lcp[i] (for 1 ≤ i < n) is the length of the longest common prefix of the two adjacent suffixes s[sa[i-1]..] and s[sa[i]..]; lcp[0] = 0 by convention. Kasai’s algorithm computes all of them in O(n) using the rank (inverse SA) array and the observation that the LCP of consecutive text positions can only shrink by one as we advance the suffix start.

Because suffixes are sorted, every occurrence of a pattern p corresponds to a contiguous range of the suffix array (the suffixes that have p as a prefix). SuffixArray::search locates that range with two binary searches and returns the sorted text positions.

Inputs are raw bytes (&[u8]); the alphabet is the full byte range.

Structs§

SuffixArray
A suffix array together with its source length and (lazily-built) LCP array.