Struct suffix::SuffixTable [] [src]

pub struct SuffixTable<'s> {
    // some fields omitted
}

A suffix table is a sequence of lexicographically sorted suffixes.

The lifetime 's refers to the text when borrowed.

This is distinct from a suffix array in that it only contains suffix indices. It has no "enhanced" information like the inverse suffix table or least-common-prefix lengths (LCP array). This representation limits what you can do (and how fast), but it uses very little memory (4 bytes per character in the text).

Construction

Suffix array construction is done in O(n) time and in O(kn) space, where k is the number of unique characters in the text. (More details below.) The specific algorithm implemented is from (Nong et al., 2009), but I actually used the description found in (Shrestha et al., 2014), because it is much more accessible to someone who is not used to reading algorithms papers.

The main thrust of the algorithm is that of "reduce and conquer." Namely, it reduces the problem of finding lexicographically sorted suffixes to a smaller subproblem, and solves it recursively. The subproblem is to find the suffix array of a smaller string, where that string is composed by naming contiguous regions of the original text. If there are any duplicate names, then the algorithm procedes recursively. If there are no duplicate names (base case), then the suffix array of the subproblem is already computed. In essence, this "inductively sorts" suffixes of the original text with several linear scans over the text. Because of the number of linear scans, the performance of construction is heavily tied to cache performance (and this is why u32 is used to represent the suffix index instead of a u64).

The space usage is roughly 6 bytes per character. (The optimal bound is 5 bytes per character, although that may be for a small constant alphabet.) 4 bytes comes from the suffix array itself. The extra 2 bytes comes from storing the suffix type of each character (1 byte) and information about bin boundaries, where the number of bins is equal to the number of unique characters in the text. This doesn't formally imply another byte of overhead, but in practice, the alphabet can get quite large when solving the subproblems mentioned above (even if the alphabet of the original text is very small).

Methods

impl<'s> SuffixTable<'s>
[src]

fn new<S>(text: S) -> SuffixTable<'s> where S: Into<Cow<'s, str>>

Creates a new suffix table for text in O(n) time and O(kn) space, where k is the size of the alphabet in the text.

The table stores either S or a &S and a lexicographically sorted list of suffixes. Each suffix is represented by a 32 bit integer and is a byte index into text.

Panics

Panics if the text contains more than 2^32 - 1 bytes. This restriction is mostly artificial; there's no fundamental reason why suffix array construction algorithm can't use a u64. Nevertheless, u32 was chosen for performance reasons. The performance of the construction algorithm is highly dependent on cache performance, which is degraded with a bigger number type. u32 strikes a nice balance; it gets good performance while allowing most reasonably sized documents (~4GB).

fn from_parts<'t, S, T>(text: S, table: T) -> SuffixTable<'s> where S: Into<Cow<'s, str>>, T: Into<Cow<'t, [u32]>>

Creates a new suffix table from an existing list of lexicographically sorted suffix indices.

Note that the invariant that table must be a suffix table of text is not checked! If it isn't, this will cause other operations on a suffix table to fail in weird ways.

Note that if table is borrowed (i.e., a &[u8]), then it is copied.

This fails if the number of characters in text does not equal the number of suffixes in table.

fn into_parts(self) -> (Cow<'s, str>, Vec<u32>)

Extract the parts of a suffix table.

This is useful to avoid copying when the suffix table is part of an intermediate computation.

fn lcp_lens(&self) -> Vec<u32>

Computes the LCP array in linear time and linear space.

fn table(&self) -> &[u32]

Return the suffix table.

fn text(&self) -> &str

Return the text.

fn len(&self) -> usize

Returns the number of suffixes in the table.

Alternatively, this is the number of characters in the text.

fn is_empty(&self) -> bool

Returns true iff self.len() == 0.

fn suffix(&self, i: usize) -> &str

Returns the suffix at index i.

fn contains(&self, query: &str) -> bool

Returns true if and only if query is in text.

This runs in O(mlogn) time, where m == query.len() and n == self.len(). (As far as this author knows, this is the best known bound for a plain suffix table.)

You should prefer this over positions when you only need to test existence (because it is faster).

Example

Build a suffix array of some text and test existence of a substring:

use suffix::SuffixTable;

let sa = SuffixTable::new("The quick brown fox.");
assert!(sa.contains("quick"));

fn positions(&self, query: &str) -> &[u32]

Returns an unordered list of positions where query starts in text.

This runs in O(mlogn) time, where m == query.len() and n == self.len(). (As far as this author knows, this is the best known bound for a plain suffix table.)

Positions are byte indices into text.

If you just need to test existence, then use contains since it is faster.

Example

Build a suffix array of some text and find all occurrences of a substring:

use suffix::SuffixTable;

let sa = SuffixTable::new("The quick brown fox was very quick.");
assert_eq!(sa.positions("quick"), &[4, 29]);

Trait Implementations

impl<'s> PartialEq for SuffixTable<'s>
[src]

fn eq(&self, __arg_0: &SuffixTable<'s>) -> bool

This method tests for self and other values to be equal, and is used by ==. Read more

fn ne(&self, __arg_0: &SuffixTable<'s>) -> bool

This method tests for !=.

impl<'s> Eq for SuffixTable<'s>
[src]

impl<'s> Clone for SuffixTable<'s>
[src]

fn clone(&self) -> SuffixTable<'s>

Returns a copy of the value. Read more

fn clone_from(&mut self, source: &Self)
1.0.0

Performs copy-assignment from source. Read more

impl<'s> Debug for SuffixTable<'s>
[src]

fn fmt(&self, f: &mut Formatter) -> Result

Formats the value using the given formatter.