Skip to main content

SbwtIndex

Struct SbwtIndex 

Source
pub struct SbwtIndex<SS: SubsetSeq> { /* private fields */ }
Expand description

The SBWT index data structure. Construct with SbwtIndexBuilder. For the SubsetSeq trait implementation, we recommend using the bit matrix implementation SubsetMatrix.

§SBWT index

The SBWT index is a compressed index for searching for k-mers in a set of k-mers. It can be seen a version of the FM-index on sets of k-mers. Here we give a brief overview of the data structure and key concepts that are helpful for understanding the API. For further details, see the paper.

§SBWT graph

To understand the SBWT index, it’s helpful to first understand the SBWT graph. The node-centric de Bruijn graph of a set of k-mers R (also called a k-spectrum) is a directed graph where the nodes are the k-mers in R, and there is an edge from x to y if x[1..k) = y[0..k-1). The label of the edge is the last character of y. The SBWT graph is a modified version of the node-centric de Bruijn graph. There are two modifications:

  1. We add a set R’ of “dummy nodes”. First, the source set R’ ⊆ R is the subset of k-mers in R that do not have an incoming edge in the de Bruijn graph, that is x ∈ R’ iff x[0..k-1) is not the suffix of any k-mer in R. The set of dummy nodes is the set of all proper prefixes of the k-mers in R’. We pad the dummy nodes with dollar symbols from the left so that all dummy nodes have length k. The set R ∪ R’ is called the padded k-spectrum. We add all de Bruijn graph edges between overlaps of the padded k-spectrum, that is, for every x ∈ R ∪ R’ and y ∈ R ∪ R’ such that x[1..k) = y[0..k-1), we add an edge from x to y in the SBWT graph. As a special case, the empty string $^k is always included in R’, and the incoming self-loop edge labeled with $ is not included.
  2. For every node in R ∪ R’ that has more than one incoming edge, we delete all incoming edges except the one that comes from the colexicographically smallest k-mer. A k-mer x is colexicographically smaller than k-mer y iff the reverse of x is lexicographically smaller than the reverse of y.

For example, below is the SBWT graph of all 4-mers of strings [“TGTTTG”, “TTGCTAT”, “ACGTAGTATAT”, “TGTAAA”]. The source node set is shown in blue, the dummy node set R’ in orange, and the de Bruijn graph edges that have been removed are shown in red.

SBWT graph

§SBWT definition

The SBWT is a sequence of subsets of {A,C,G,T} such that the i-th subset contains the edge labels of outgoing edges from the i-th k-mer in the SBWT graph in colexicographic order of the k-mers. If we take the example from above and stack the nodes in colexicographic order, we get:

SBWT graph

The SBWT subset sequence in this example is the sequence of outgoing edge label sets read from top to bottom: {A,T}, {C}, {}, {A}, {T}, {T}, {A,G,T}, {}, {}, {G}, {T}, {T}, {T}, {T}, {C}, {G}, {A}, {}, {}, {A}, {A}, {A}, {A,T}, {T}, {G}. The key property that makes the SBWT index work is that the i-th outgoing edge with a given label on the right column in the picture is the same edge as the i-th incoming edge with the same label on the left column. Thanks to this property, once the subset sequnce has been constructed, the k-mers and the graph can be discarded, with no loss of information. That is, the k-mers and the graph can be reconstructed from the subset sequence alone.

Given the subset sequence, there is also an algorithm to search for a k-mer in O(k) time. The algorithm is essentially the same as the search algorithm on the FM-index. Given a k-mer P[0..k], at iteration i, we have the range of rows in the picture whose k-mers have P[0..i) as a suffix. When i = 0, we have the empty suffix P[0..0), which is a suffix of all k-mers. To update the range for the next iteration, we follow the edges labeled with the next character of the pattern, using the key property mentioned above, as shown in the figure below for query k-mer TATA. This update step can be implemented efficiently by preprocessing the SBWT set sequence for subset rank queries. See this paper for more on how to implement these rank queries. We provide an implementation based on a bit matrix at SubsetMatrix. Any struct implementing the SubsetSeq trait can be used to query the SBWT.

SBWT graph

Implementations§

Source§

impl<SS: SubsetSeq> SbwtIndex<SS>

Source

pub fn n_kmers(&self) -> usize

Number of k-mers in the index, not counting dummy k-mers.

Source

pub fn n_sets(&self) -> usize

Number of sets in the SBWT.

Source

pub fn k(&self) -> usize

Length of the k-mers in the index.

Source

pub fn char_idx(&self, c: u8) -> usize

Maps A,C,G,T to 0,1,2,3. Panics if c is not a DNA character.

Source

pub fn alphabet(&self) -> &[u8]

Returns the alphabet ACGT of the index in ascii.

Source

pub fn interval_of_empty_string(&self) -> Range<usize>

Source

pub fn serialize<W: Write>(&self, out: &mut W) -> Result<usize>

Writes the index to the writer and retuns the number of bytes written. The array can later be loaded with SbwtIndex::load.

Source

pub fn load<R: Read>(input: &mut R) -> Result<Self>

Loads an index that was previously serialized with SbwtIndex::serialize.

Source

pub fn inlabel(&self, i: usize) -> Option<u8>

Returns the edge label on the incoming edge to the i-th node in colexicographic order in the SBWT graph (not the same graph as the de Bruijn graph!), or None if the node has no incoming edge. This is well-defined since every node in the SBWT graph has at most 1 incoming edge.

Source

pub fn build_select(&mut self)

Build select support for the SBWT subset sequence. This is required for SbwtIndex::inverse_lf_step and SbwtIndex::access_kmer.

Source

pub fn lf_step(&self, i: usize, char_idx: usize) -> usize

A low-level function returning C[char_idx] + SBWT.rank(char_idx, i).

Source

pub fn inverse_lf_step(&self, i: usize) -> Option<usize>

The inverse function of SbwtIndex::lf_step. Requires that the select support has been built. Returns None if called on the source node of the SBWT graph.

Source

pub fn push_kmer_to_vec(&self, colex_rank: usize, buf: &mut Vec<u8>)

Requires that the select support has been built. Accesses the k-mer associated with the node with colexicographic rank colex_rank. Note that this k-mer may contain dollar symbols if it’s a dummy node. The k-mer string is pushed to buf.

Source

pub fn access_kmer(&self, colex_rank: usize) -> Vec<u8>

Requires that the select support has been built. Accesses the k-mer associated with the node with colexicographic rank colex_rank. Note that this k-mer may contain dollar symbols if it’s a dummy node. The k-mer is retuned as a Vec<u8>. To push to an existing buffer for better performance, see SbwtIndex::push_kmer_to_vec.

Source

pub fn search(&self, pattern: &[u8]) -> Option<Range<usize>>

Returns the colexicographic interval of the pattern, if found. The pattern is given in ascii characters.

Source

pub fn search_from( &self, interval: Range<usize>, pattern: &[u8], ) -> Option<Range<usize>>

Searches the pattern starting from the given colexicographic interval. Returns the colexicographic interval after searching the pattern, if found. The pattern is given in ascii characters.

Source

pub fn reconstruct_padded_spectrum(&self, n_threads: usize) -> Vec<u8>
where Self: Sync,

Reconstruct all k-mers in the padded k-spectrum in the data structure. The reconstructed k-mers concatenated into a single ascii string in colexicographic order. The i-th k-mer starts at position i*k in the string.

Source

pub fn set_lookup_table(&mut self, prefix_lookup_table: PrefixLookupTable)

Set the prefix lookup table of the data structure.

Source

pub fn get_lookup_table(&self) -> &PrefixLookupTable

Get the prefix lookup table of the data structure.

Source

pub fn sbwt(&self) -> &SS

Get the SBWT set sequence of the data structure.

Source

pub fn from_subset_seq( subset_rank: SS, n_kmers: usize, k: usize, precalc_prefix_length: usize, ) -> Self

Construct from existing crate::subsetseq::SubsetSeq.

Source

pub fn push_all_labels_forward( &self, labels_in: &[u8], labels_out: &mut [u8], n_threads: usize, )
where Self: Sync,

Push labels forward along the SBWT graph. The effect is this: if labels_in is a list of labels, one for each node in the SBWT in colex order, then the output is those labels pushed forward along the edges in the SBWT graph. Those nodes that do not have a predecessor (just the root of the graph) get a dollar.

Source

pub fn push_all_labels_forward_compact( &self, labels_in: &CompactIntVector<3>, labels_out: &mut CompactIntVector<3>, n_threads: usize, )
where Self: Sync,

A version of SbwtIndex::push_all_labels_forward that stores the labels as compact integer vectors with elements {0,1, …, 4}. The symbol 0 is the “dollar” and symbols 1,2,3,4 are A,C,G and T. (see the documentation of SbwtIndex::push_all_labels_forward.

Source

pub fn build_last_column(&self) -> Vec<u8>

Build the last column of the SBWT matrix.

Source

pub fn build_last_column_compact(&self) -> CompactIntVector<3>

Build the last column of the SBWT matrix.

Source

pub fn compute_dummy_node_marks(&self) -> BitVec

Compute a bit vector marking the dummy nodes in the SBWT

Trait Implementations§

Source§

impl<SS: Clone + SubsetSeq> Clone for SbwtIndex<SS>

Source§

fn clone(&self) -> SbwtIndex<SS>

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl<SS: Debug + SubsetSeq> Debug for SbwtIndex<SS>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl<SS: SubsetSeq> ExtendRight for SbwtIndex<SS>

Source§

fn extend_right(&self, I: Range<usize>, c: u8) -> Range<usize>

Right extensions implemented time O(t), where t is the time for a rank query in the subset rank query implementation of SS. This is O(1) for SubsetMatrix.

Source§

impl<SS: PartialEq + SubsetSeq> PartialEq for SbwtIndex<SS>

Source§

fn eq(&self, other: &SbwtIndex<SS>) -> bool

Tests for self and other values to be equal, and is used by ==.
1.0.0 · Source§

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
Source§

impl<SS: Eq + SubsetSeq> Eq for SbwtIndex<SS>

Source§

impl<SS: SubsetSeq> StructuralPartialEq for SbwtIndex<SS>

Auto Trait Implementations§

§

impl<SS> Freeze for SbwtIndex<SS>
where SS: Freeze,

§

impl<SS> RefUnwindSafe for SbwtIndex<SS>
where SS: RefUnwindSafe,

§

impl<SS> Send for SbwtIndex<SS>
where SS: Send,

§

impl<SS> Sync for SbwtIndex<SS>
where SS: Sync,

§

impl<SS> Unpin for SbwtIndex<SS>
where SS: Unpin,

§

impl<SS> UnsafeUnpin for SbwtIndex<SS>
where SS: UnsafeUnpin,

§

impl<SS> UnwindSafe for SbwtIndex<SS>
where SS: UnwindSafe,

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> Conv for T

Source§

fn conv<T>(self) -> T
where Self: Into<T>,

Converts self into T using Into<T>. Read more
Source§

impl<T> FmtForward for T

Source§

fn fmt_binary(self) -> FmtBinary<Self>
where Self: Binary,

Causes self to use its Binary implementation when Debug-formatted.
Source§

fn fmt_display(self) -> FmtDisplay<Self>
where Self: Display,

Causes self to use its Display implementation when Debug-formatted.
Source§

fn fmt_lower_exp(self) -> FmtLowerExp<Self>
where Self: LowerExp,

Causes self to use its LowerExp implementation when Debug-formatted.
Source§

fn fmt_lower_hex(self) -> FmtLowerHex<Self>
where Self: LowerHex,

Causes self to use its LowerHex implementation when Debug-formatted.
Source§

fn fmt_octal(self) -> FmtOctal<Self>
where Self: Octal,

Causes self to use its Octal implementation when Debug-formatted.
Source§

fn fmt_pointer(self) -> FmtPointer<Self>
where Self: Pointer,

Causes self to use its Pointer implementation when Debug-formatted.
Source§

fn fmt_upper_exp(self) -> FmtUpperExp<Self>
where Self: UpperExp,

Causes self to use its UpperExp implementation when Debug-formatted.
Source§

fn fmt_upper_hex(self) -> FmtUpperHex<Self>
where Self: UpperHex,

Causes self to use its UpperHex implementation when Debug-formatted.
Source§

fn fmt_list(self) -> FmtList<Self>
where &'a Self: for<'a> IntoIterator,

Formats each item in a sequence. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pipe for T
where T: ?Sized,

Source§

fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> R
where Self: Sized,

Pipes by value. This is generally the method you want to use. Read more
Source§

fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> R
where R: 'a,

Borrows self and passes that borrow into the pipe function. Read more
Source§

fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> R
where R: 'a,

Mutably borrows self and passes that borrow into the pipe function. Read more
Source§

fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
where Self: Borrow<B>, B: 'a + ?Sized, R: 'a,

Borrows self, then passes self.borrow() into the pipe function. Read more
Source§

fn pipe_borrow_mut<'a, B, R>( &'a mut self, func: impl FnOnce(&'a mut B) -> R, ) -> R
where Self: BorrowMut<B>, B: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.borrow_mut() into the pipe function. Read more
Source§

fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
where Self: AsRef<U>, U: 'a + ?Sized, R: 'a,

Borrows self, then passes self.as_ref() into the pipe function.
Source§

fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
where Self: AsMut<U>, U: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.as_mut() into the pipe function.
Source§

fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
where Self: Deref<Target = T>, T: 'a + ?Sized, R: 'a,

Borrows self, then passes self.deref() into the pipe function.
Source§

fn pipe_deref_mut<'a, T, R>( &'a mut self, func: impl FnOnce(&'a mut T) -> R, ) -> R
where Self: DerefMut<Target = T> + Deref, T: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.deref_mut() into the pipe function.
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> Tap for T

Source§

fn tap(self, func: impl FnOnce(&Self)) -> Self

Immutable access to a value. Read more
Source§

fn tap_mut(self, func: impl FnOnce(&mut Self)) -> Self

Mutable access to a value. Read more
Source§

fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

Immutable access to the Borrow<B> of a value. Read more
Source§

fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

Mutable access to the BorrowMut<B> of a value. Read more
Source§

fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

Immutable access to the AsRef<R> view of a value. Read more
Source§

fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

Mutable access to the AsMut<R> view of a value. Read more
Source§

fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

Immutable access to the Deref::Target of a value. Read more
Source§

fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

Mutable access to the Deref::Target of a value. Read more
Source§

fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self

Calls .tap() only in debug builds, and is erased in release builds.
Source§

fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self

Calls .tap_mut() only in debug builds, and is erased in release builds.
Source§

fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

Calls .tap_borrow() only in debug builds, and is erased in release builds.
Source§

fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

Calls .tap_borrow_mut() only in debug builds, and is erased in release builds.
Source§

fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

Calls .tap_ref() only in debug builds, and is erased in release builds.
Source§

fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

Calls .tap_ref_mut() only in debug builds, and is erased in release builds.
Source§

fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

Calls .tap_deref() only in debug builds, and is erased in release builds.
Source§

fn tap_deref_mut_dbg<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

Calls .tap_deref_mut() only in debug builds, and is erased in release builds.
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T> TryConv for T

Source§

fn try_conv<T>(self) -> Result<T, Self::Error>
where Self: TryInto<T>,

Attempts to convert self into T using TryInto<T>. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> Erased for T