pub struct SuffixTable<'s, 't> { /* private fields */ }
Expand description
A suffix table is a sequence of lexicographically sorted suffixes.
The lifetimes 's
and 't
(respectively) refer to the text and suffix
indices when borrowed.
This is distinct from a suffix array in that it only contains suffix indices. It has no “enhanced” information like the inverse suffix table or least-common-prefix lengths (LCP array). This representation limits what you can do (and how fast), but it uses very little memory (4 bytes per character in the text).
§Construction
Suffix array construction is done in O(n)
time and in O(kn)
space,
where k
is the number of unique characters in the text. (More details
below.) The specific algorithm implemented is from
(Nong et al., 2009),
but I actually used the description found in
(Shrestha et al., 2014),
because it is much more accessible to someone who is not used to reading
algorithms papers.
The main thrust of the algorithm is that of “reduce and conquer.” Namely,
it reduces the problem of finding lexicographically sorted suffixes to a
smaller subproblem, and solves it recursively. The subproblem is to find
the suffix array of a smaller string, where that string is composed by
naming contiguous regions of the original text. If there are any duplicate
names, then the algorithm proceeds recursively. If there are no duplicate
names (base case), then the suffix array of the subproblem is already
computed. In essence, this “inductively sorts” suffixes of the original
text with several linear scans over the text. Because of the number of
linear scans, the performance of construction is heavily tied to cache
performance (and this is why u32
is used to represent the suffix index
instead of a u64
).
The space usage is roughly 6
bytes per character. (The optimal bound is
5
bytes per character, although that may be for a small constant
alphabet.) 4
bytes comes from the suffix array itself. The extra 2
bytes comes from storing the suffix type of each character (1
byte) and
information about bin boundaries, where the number of bins is equal to
the number of unique characters in the text. This doesn’t formally imply
another byte of overhead, but in practice, the alphabet can get quite large
when solving the subproblems mentioned above (even if the alphabet of the
original text is very small).
Implementations§
Source§impl<'s, 't> SuffixTable<'s, 't>
impl<'s, 't> SuffixTable<'s, 't>
Sourcepub fn new<S>(text: S) -> SuffixTable<'s, 't>
pub fn new<S>(text: S) -> SuffixTable<'s, 't>
Creates a new suffix table for text
in O(n)
time and O(kn)
space, where k
is the size of the alphabet in the text.
The table stores either S
or a &S
and a lexicographically sorted
list of suffixes. Each suffix is represented by a 32 bit integer and
is a byte index into text
.
§Panics
Panics if the text
contains more than 2^32 - 1
bytes. This
restriction is mostly artificial; there’s no fundamental reason why
suffix array construction algorithm can’t use a u64
. Nevertheless,
u32
was chosen for performance reasons. The performance of the
construction algorithm is highly dependent on cache performance, which
is degraded with a bigger number type. u32
strikes a nice balance; it
gets good performance while allowing most reasonably sized documents
(~4GB).
Sourcepub fn from_parts<S, T>(text: S, table: T) -> SuffixTable<'s, 't>
pub fn from_parts<S, T>(text: S, table: T) -> SuffixTable<'s, 't>
Creates a new suffix table from an existing list of lexicographically sorted suffix indices.
Note that the invariant that table
must be a suffix table of text
is not checked! If it isn’t, this will cause other operations on a
suffix table to fail in weird ways.
This fails if the number of characters in text
does not equal the
number of suffixes in table
.
Sourcepub fn into_parts(self) -> (Cow<'s, str>, Cow<'t, [u32]>)
pub fn into_parts(self) -> (Cow<'s, str>, Cow<'t, [u32]>)
Extract the parts of a suffix table.
This is useful to avoid copying when the suffix table is part of an intermediate computation.
Sourcepub fn len(&self) -> usize
pub fn len(&self) -> usize
Returns the number of suffixes in the table.
Alternatively, this is the number of bytes in the text.
Sourcepub fn suffix_bytes(&self, i: usize) -> &[u8] ⓘ
pub fn suffix_bytes(&self, i: usize) -> &[u8] ⓘ
Returns the suffix bytes starting at index i
.
Sourcepub fn contains(&self, query: &str) -> bool
pub fn contains(&self, query: &str) -> bool
Returns true if and only if query
is in text.
This runs in O(mlogn)
time, where m == query.len()
and
n == self.len()
. (As far as this author knows, this is the best known
bound for a plain suffix table.)
You should prefer this over positions
when you only need to test
existence (because it is faster).
§Example
Build a suffix array of some text and test existence of a substring:
use suffix::SuffixTable;
let sa = SuffixTable::new("The quick brown fox.");
assert!(sa.contains("quick"));
Sourcepub fn positions(&self, query: &str) -> &[u32]
pub fn positions(&self, query: &str) -> &[u32]
Returns an unordered list of positions where query
starts in text
.
This runs in O(mlogn)
time, where m == query.len()
and
n == self.len()
. (As far as this author knows, this is the best known
bound for a plain suffix table.)
Positions are byte indices into text
.
If you just need to test existence, then use contains
since it is
faster.
§Example
Build a suffix array of some text and find all occurrences of a substring:
use suffix::SuffixTable;
let sa = SuffixTable::new("The quick brown fox was very quick.");
assert_eq!(sa.positions("quick"), &[4, 29]);
Sourcepub fn any_position(&self, query: &str) -> Option<u32>
pub fn any_position(&self, query: &str) -> Option<u32>
Returns an arbitrary one of the positions where query
starts in
text
.
This runs in O(mlogn)
time just like contains
and
positions
, but the constant factor is that of
contains
, which is the more efficient of the two.
The return value is a byte indices into text
.
§Example
use suffix::SuffixTable;
let sa = SuffixTable::new("The quick brown fox was very quick.");
let position = sa.any_position("quick");
assert!(position == Some(4) || position == Some(29));
Trait Implementations§
Source§impl<'s, 't> Clone for SuffixTable<'s, 't>
impl<'s, 't> Clone for SuffixTable<'s, 't>
Source§fn clone(&self) -> SuffixTable<'s, 't>
fn clone(&self) -> SuffixTable<'s, 't>
1.0.0 · Source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source
. Read more