Struct SuffixTable

Source
pub struct SuffixTable<'s, 't> { /* private fields */ }
Expand description

A suffix table is a sequence of lexicographically sorted suffixes.

The lifetimes 's and 't (respectively) refer to the text and suffix indices when borrowed.

This is distinct from a suffix array in that it only contains suffix indices. It has no “enhanced” information like the inverse suffix table or least-common-prefix lengths (LCP array). This representation limits what you can do (and how fast), but it uses very little memory (4 bytes per character in the text).

§Construction

Suffix array construction is done in O(n) time and in O(kn) space, where k is the number of unique characters in the text. (More details below.) The specific algorithm implemented is from (Nong et al., 2009), but I actually used the description found in (Shrestha et al., 2014), because it is much more accessible to someone who is not used to reading algorithms papers.

The main thrust of the algorithm is that of “reduce and conquer.” Namely, it reduces the problem of finding lexicographically sorted suffixes to a smaller subproblem, and solves it recursively. The subproblem is to find the suffix array of a smaller string, where that string is composed by naming contiguous regions of the original text. If there are any duplicate names, then the algorithm proceeds recursively. If there are no duplicate names (base case), then the suffix array of the subproblem is already computed. In essence, this “inductively sorts” suffixes of the original text with several linear scans over the text. Because of the number of linear scans, the performance of construction is heavily tied to cache performance (and this is why u32 is used to represent the suffix index instead of a u64).

The space usage is roughly 6 bytes per character. (The optimal bound is 5 bytes per character, although that may be for a small constant alphabet.) 4 bytes comes from the suffix array itself. The extra 2 bytes comes from storing the suffix type of each character (1 byte) and information about bin boundaries, where the number of bins is equal to the number of unique characters in the text. This doesn’t formally imply another byte of overhead, but in practice, the alphabet can get quite large when solving the subproblems mentioned above (even if the alphabet of the original text is very small).

Implementations§

Source§

impl<'s, 't> SuffixTable<'s, 't>

Source

pub fn new<S>(text: S) -> SuffixTable<'s, 't>
where S: Into<Cow<'s, str>>,

Creates a new suffix table for text in O(n) time and O(kn) space, where k is the size of the alphabet in the text.

The table stores either S or a &S and a lexicographically sorted list of suffixes. Each suffix is represented by a 32 bit integer and is a byte index into text.

§Panics

Panics if the text contains more than 2^32 - 1 bytes. This restriction is mostly artificial; there’s no fundamental reason why suffix array construction algorithm can’t use a u64. Nevertheless, u32 was chosen for performance reasons. The performance of the construction algorithm is highly dependent on cache performance, which is degraded with a bigger number type. u32 strikes a nice balance; it gets good performance while allowing most reasonably sized documents (~4GB).

Source

pub fn from_parts<S, T>(text: S, table: T) -> SuffixTable<'s, 't>
where S: Into<Cow<'s, str>>, T: Into<Cow<'t, [u32]>>,

Creates a new suffix table from an existing list of lexicographically sorted suffix indices.

Note that the invariant that table must be a suffix table of text is not checked! If it isn’t, this will cause other operations on a suffix table to fail in weird ways.

This fails if the number of characters in text does not equal the number of suffixes in table.

Source

pub fn into_parts(self) -> (Cow<'s, str>, Cow<'t, [u32]>)

Extract the parts of a suffix table.

This is useful to avoid copying when the suffix table is part of an intermediate computation.

Source

pub fn lcp_lens(&self) -> Vec<u32>

Computes the LCP array.

Source

pub fn table(&self) -> &[u32]

Return the suffix table.

Source

pub fn text(&self) -> &str

Return the text.

Source

pub fn len(&self) -> usize

Returns the number of suffixes in the table.

Alternatively, this is the number of bytes in the text.

Source

pub fn is_empty(&self) -> bool

Returns true iff self.len() == 0.

Source

pub fn suffix(&self, i: usize) -> &str

Returns the suffix at index i.

Source

pub fn suffix_bytes(&self, i: usize) -> &[u8]

Returns the suffix bytes starting at index i.

Source

pub fn contains(&self, query: &str) -> bool

Returns true if and only if query is in text.

This runs in O(mlogn) time, where m == query.len() and n == self.len(). (As far as this author knows, this is the best known bound for a plain suffix table.)

You should prefer this over positions when you only need to test existence (because it is faster).

§Example

Build a suffix array of some text and test existence of a substring:

use suffix::SuffixTable;

let sa = SuffixTable::new("The quick brown fox.");
assert!(sa.contains("quick"));
Source

pub fn positions(&self, query: &str) -> &[u32]

Returns an unordered list of positions where query starts in text.

This runs in O(mlogn) time, where m == query.len() and n == self.len(). (As far as this author knows, this is the best known bound for a plain suffix table.)

Positions are byte indices into text.

If you just need to test existence, then use contains since it is faster.

§Example

Build a suffix array of some text and find all occurrences of a substring:

use suffix::SuffixTable;

let sa = SuffixTable::new("The quick brown fox was very quick.");
assert_eq!(sa.positions("quick"), &[4, 29]);
Source

pub fn any_position(&self, query: &str) -> Option<u32>

Returns an arbitrary one of the positions where query starts in text.

This runs in O(mlogn) time just like contains and positions, but the constant factor is that of contains, which is the more efficient of the two.

The return value is a byte indices into text.

§Example
use suffix::SuffixTable;

let sa = SuffixTable::new("The quick brown fox was very quick.");
let position = sa.any_position("quick");
assert!(position == Some(4) || position == Some(29));

Trait Implementations§

Source§

impl<'s, 't> Clone for SuffixTable<'s, 't>

Source§

fn clone(&self) -> SuffixTable<'s, 't>

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl<'s, 't> Debug for SuffixTable<'s, 't>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl<'s, 't> PartialEq for SuffixTable<'s, 't>

Source§

fn eq(&self, other: &SuffixTable<'s, 't>) -> bool

Tests for self and other values to be equal, and is used by ==.
1.0.0 · Source§

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
Source§

impl<'s, 't> Eq for SuffixTable<'s, 't>

Source§

impl<'s, 't> StructuralPartialEq for SuffixTable<'s, 't>

Auto Trait Implementations§

§

impl<'s, 't> Freeze for SuffixTable<'s, 't>

§

impl<'s, 't> RefUnwindSafe for SuffixTable<'s, 't>

§

impl<'s, 't> Send for SuffixTable<'s, 't>

§

impl<'s, 't> Sync for SuffixTable<'s, 't>

§

impl<'s, 't> Unpin for SuffixTable<'s, 't>

§

impl<'s, 't> UnwindSafe for SuffixTable<'s, 't>

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.