Struct suffix::SuffixTable
[−]
[src]
pub struct SuffixTable<'s, 't> { /* fields omitted */ }
A suffix table is a sequence of lexicographically sorted suffixes.
The lifetimes 's
and 't
(respectively) refer to the text and suffix
indices when borrowed.
This is distinct from a suffix array in that it only contains suffix indices. It has no "enhanced" information like the inverse suffix table or least-common-prefix lengths (LCP array). This representation limits what you can do (and how fast), but it uses very little memory (4 bytes per character in the text).
Construction
Suffix array construction is done in O(n)
time and in O(kn)
space,
where k
is the number of unique characters in the text. (More details
below.) The specific algorithm implemented is from
(Nong et al., 2009),
but I actually used the description found in
(Shrestha et al., 2014),
because it is much more accessible to someone who is not used to reading
algorithms papers.
The main thrust of the algorithm is that of "reduce and conquer." Namely,
it reduces the problem of finding lexicographically sorted suffixes to a
smaller subproblem, and solves it recursively. The subproblem is to find
the suffix array of a smaller string, where that string is composed by
naming contiguous regions of the original text. If there are any duplicate
names, then the algorithm procedes recursively. If there are no duplicate
names (base case), then the suffix array of the subproblem is already
computed. In essence, this "inductively sorts" suffixes of the original
text with several linear scans over the text. Because of the number of
linear scans, the performance of construction is heavily tied to cache
performance (and this is why u32
is used to represent the suffix index
instead of a u64
).
The space usage is roughly 6
bytes per character. (The optimal bound is
5
bytes per character, although that may be for a small constant
alphabet.) 4
bytes comes from the suffix array itself. The extra 2
bytes comes from storing the suffix type of each character (1
byte) and
information about bin boundaries, where the number of bins is equal to
the number of unique characters in the text. This doesn't formally imply
another byte of overhead, but in practice, the alphabet can get quite large
when solving the subproblems mentioned above (even if the alphabet of the
original text is very small).
Methods
impl<'s, 't> SuffixTable<'s, 't>
[src]
fn new<S>(text: S) -> SuffixTable<'s, 't> where
S: Into<Cow<'s, str>>,
S: Into<Cow<'s, str>>,
Creates a new suffix table for text
in O(n)
time and O(kn)
space, where k
is the size of the alphabet in the text.
The table stores either S
or a &S
and a lexicographically sorted
list of suffixes. Each suffix is represented by a 32 bit integer and
is a byte index into text
.
Panics
Panics if the text
contains more than 2^32 - 1
bytes. This
restriction is mostly artificial; there's no fundamental reason why
suffix array construction algorithm can't use a u64
. Nevertheless,
u32
was chosen for performance reasons. The performance of the
construction algorithm is highly dependent on cache performance, which
is degraded with a bigger number type. u32
strikes a nice balance; it
gets good performance while allowing most reasonably sized documents
(~4GB).
fn from_parts<S, T>(text: S, table: T) -> SuffixTable<'s, 't> where
S: Into<Cow<'s, str>>,
T: Into<Cow<'t, [u32]>>,
S: Into<Cow<'s, str>>,
T: Into<Cow<'t, [u32]>>,
Creates a new suffix table from an existing list of lexicographically sorted suffix indices.
Note that the invariant that table
must be a suffix table of text
is not checked! If it isn't, this will cause other operations on a
suffix table to fail in weird ways.
This fails if the number of characters in text
does not equal the
number of suffixes in table
.
fn into_parts(self) -> (Cow<'s, str>, Cow<'t, [u32]>)
Extract the parts of a suffix table.
This is useful to avoid copying when the suffix table is part of an intermediate computation.
fn lcp_lens(&self) -> Vec<u32>
Computes the LCP array.
fn table(&self) -> &[u32]
Return the suffix table.
fn text(&self) -> &str
Return the text.
fn len(&self) -> usize
Returns the number of suffixes in the table.
Alternatively, this is the number of bytes in the text.
fn is_empty(&self) -> bool
Returns true
iff self.len() == 0
.
fn suffix(&self, i: usize) -> &str
Returns the suffix at index i
.
fn suffix_bytes(&self, i: usize) -> &[u8]
Returns the suffix bytes starting at index i
.
fn contains(&self, query: &str) -> bool
Returns true if and only if query
is in text.
This runs in O(mlogn)
time, where m == query.len()
and
n == self.len()
. (As far as this author knows, this is the best known
bound for a plain suffix table.)
You should prefer this over positions
when you only need to test
existence (because it is faster).
Example
Build a suffix array of some text and test existence of a substring:
use suffix::SuffixTable; let sa = SuffixTable::new("The quick brown fox."); assert!(sa.contains("quick"));
fn positions(&self, query: &str) -> &[u32]
Returns an unordered list of positions where query
starts in text
.
This runs in O(mlogn)
time, where m == query.len()
and
n == self.len()
. (As far as this author knows, this is the best known
bound for a plain suffix table.)
Positions are byte indices into text
.
If you just need to test existence, then use contains
since it is
faster.
Example
Build a suffix array of some text and find all occurrences of a substring:
use suffix::SuffixTable; let sa = SuffixTable::new("The quick brown fox was very quick."); assert_eq!(sa.positions("quick"), &[4, 29]);
Trait Implementations
impl<'s, 't> Clone for SuffixTable<'s, 't>
[src]
fn clone(&self) -> SuffixTable<'s, 't>
Returns a copy of the value. Read more
fn clone_from(&mut self, source: &Self)
1.0.0
Performs copy-assignment from source
. Read more
impl<'s, 't> Eq for SuffixTable<'s, 't>
[src]
impl<'s, 't> PartialEq for SuffixTable<'s, 't>
[src]
fn eq(&self, __arg_0: &SuffixTable<'s, 't>) -> bool
This method tests for self
and other
values to be equal, and is used by ==
. Read more
fn ne(&self, __arg_0: &SuffixTable<'s, 't>) -> bool
This method tests for !=
.