Struct SbwtIndex

pub struct SbwtIndex<SS: SubsetSeq> { /* private fields */ }

The SBWT index data structure. Construct with SbwtIndexBuilder. For the SubsetSeq trait implementation, we recommend using the bit matrix implementation SubsetMatrix.

§SBWT index

The SBWT index is a compressed index for searching for k-mers in a set of k-mers. It can be seen as a version of the FM-index on sets of k-mers. Here we give a brief overview of the data structure and key concepts that are helpful for understanding the API. For further details, see the paper.

§SBWT graph

To understand the SBWT index, it’s helpful to first understand the SBWT graph. The node-centric de Bruijn graph of a set of k-mers R (also called a k-spectrum) is a directed graph where the nodes are the k-mers in R, and there is an edge from x to y if x[1..k) = y[0..k-1). The label of the edge is the last character of y. The SBWT graph is a modified version of the node-centric de Bruijn graph. There are two modifications:

1. We add a set R’ of “dummy nodes”. First, let the source set be the subset of k-mers in R that do not have an incoming edge in the de Bruijn graph, that is, x is a source k-mer iff x[0..k-1) is not a suffix of any k-mer in R. The set of dummy nodes R’ is the set of all proper prefixes of the source k-mers. We pad the dummy nodes with dollar symbols from the left so that all dummy nodes have length k. The set R ∪ R’ is called the padded k-spectrum. We add all de Bruijn graph edges between overlaps of the padded k-spectrum, that is, for every x ∈ R ∪ R’ and y ∈ R ∪ R’ such that x[1..k) = y[0..k-1), we add an edge from x to y in the SBWT graph. As a special case, the padded empty string $^k is always included in R’, and its incoming self-loop edge labeled with $ is not included.
2. For every node in R ∪ R’ that has more than one incoming edge, we delete all incoming edges except the one that comes from the colexicographically smallest k-mer. A k-mer x is colexicographically smaller than a k-mer y iff the reverse of x is lexicographically smaller than the reverse of y.
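As a minimal sketch of the first modification, the padded k-spectrum can be built naively with the standard library alone. The helper names `kmers`, `colex_key` and `padded_spectrum` are hypothetical and are not part of this crate's API; the sketch assumes k ≥ 1, ASCII input, and relies on `$` sorting before `A`, `C`, `G`, `T` in ASCII.

```rust
use std::collections::BTreeSet;

/// Collect the distinct k-mers of the input strings (the set R).
fn kmers(seqs: &[&str], k: usize) -> BTreeSet<String> {
    let mut set = BTreeSet::new();
    for s in seqs {
        for w in s.as_bytes().windows(k) {
            set.insert(String::from_utf8(w.to_vec()).unwrap());
        }
    }
    set
}

/// Sort key for colexicographic order: the reversed string.
fn colex_key(x: &str) -> String {
    x.chars().rev().collect()
}

/// Return the padded k-spectrum (R plus the dummy nodes) in colexicographic order.
fn padded_spectrum(seqs: &[&str], k: usize) -> Vec<String> {
    let r = kmers(seqs, k);
    // All (k-1)-suffixes of the k-mers in R.
    let suffixes: BTreeSet<&str> = r.iter().map(|x| &x[1..]).collect();
    let mut nodes: BTreeSet<String> = r.iter().cloned().collect();
    // The all-dollar node is always included.
    nodes.insert("$".repeat(k));
    // A k-mer is a source if its (k-1)-prefix is not a suffix of any k-mer in R.
    for x in r.iter().filter(|x| !suffixes.contains(&x.as_str()[..k - 1])) {
        // Add every proper prefix of the source, padded with '$' from the left.
        for len in 0..k {
            nodes.insert(format!("{}{}", "$".repeat(k - len), &x[..len]));
        }
    }
    let mut sorted: Vec<String> = nodes.into_iter().collect();
    sorted.sort_by_key(|x| colex_key(x));
    sorted
}

fn main() {
    let seqs = ["TGTTTG", "TTGCTAT", "ACGTAGTATAT", "TGTAAA"];
    for (rank, node) in padded_spectrum(&seqs, 4).iter().enumerate() {
        println!("{rank:2} {node}");
    }
}
```

This quadratic-space toy is only meant to make the definitions concrete; the index itself is constructed with SbwtIndexBuilder.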

For example, below is the SBWT graph of all 4-mers of strings [“TGTTTG”, “TTGCTAT”, “ACGTAGTATAT”, “TGTAAA”]. The source node set is shown in blue, the dummy node set R’ in orange, and the de Bruijn graph edges that have been removed are shown in red.

[Figure: SBWT graph of the example 4-mers]

§SBWT definition

The SBWT is a sequence of subsets of {A,C,G,T} such that the i-th subset contains the edge labels of outgoing edges from the i-th k-mer in the SBWT graph in colexicographic order of the k-mers. If we take the example from above and stack the nodes in colexicographic order, we get:

[Figure: nodes of the SBWT graph stacked in colexicographic order]
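The definition can be sketched naively as follows. The constant `NODES` is the padded 4-spectrum of the example, written out by hand in colexicographic order, and `subset_sequence` is a hypothetical helper, not a function of this crate. The sketch uses the fact that nodes sharing the same (k-1)-suffix are adjacent in colexicographic order, so only the first node of each such run keeps its outgoing edges.

```rust
use std::collections::HashSet;

/// Padded 4-spectrum of the example, in colexicographic order.
const NODES: [&str; 25] = [
    "$$$$", "$$$A", "TAAA", "GTAA", "TATA", "GCTA", "AGTA", "CGTA", "TGTA",
    "$$AC", "TTGC", "GTAG", "$ACG", "$$TG", "TTTG", "$$$T", "ATAT", "CTAT",
    "GTAT", "TGCT", "TAGT", "ACGT", "$TGT", "TGTT", "GTTT",
];

/// Compute the subset sequence; each subset is written as a string over "ACGT".
fn subset_sequence(nodes: &[&str]) -> Vec<String> {
    let present: HashSet<&str> = nodes.iter().copied().collect();
    let mut sets = Vec::with_capacity(nodes.len());
    for (i, &x) in nodes.iter().enumerate() {
        let suffix = &x[1..];
        // Among nodes with the same suffix, only the colex-smallest keeps its edges.
        let keeps_edges = i == 0 || &nodes[i - 1][1..] != suffix;
        let mut labels = String::new();
        if keeps_edges {
            for c in "ACGT".chars() {
                // There is an edge labeled c iff suffix + c is a node.
                let target = format!("{suffix}{c}");
                if present.contains(target.as_str()) {
                    labels.push(c);
                }
            }
        }
        sets.push(labels);
    }
    sets
}

fn main() {
    for (node, set) in NODES.iter().zip(subset_sequence(&NODES)) {
        println!("{node} {{{set}}}");
    }
}
```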

The SBWT subset sequence in this example is the sequence of outgoing edge label sets read from top to bottom: {A,T}, {C}, {}, {A}, {T}, {T}, {A,G,T}, {}, {}, {G}, {T}, {T}, {T}, {T}, {C}, {G}, {A}, {}, {}, {A}, {A}, {A}, {A,T}, {T}, {G}. The key property that makes the SBWT index work is that the i-th outgoing edge with a given label on the right column in the picture is the same edge as the i-th incoming edge with the same label on the left column. Thanks to this property, once the subset sequence has been constructed, the k-mers and the graph can be discarded, with no loss of information. That is, the k-mers and the graph can be reconstructed from the subset sequence alone.
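To illustrate the reconstruction claim, here is a minimal sketch that recovers the k-mer at a given colexicographic rank from the subset sequence alone by walking edges backwards. `SETS` and `kmer_at` are hypothetical names used only for this illustration, and the linear scans stand in for the rank and select structures a real implementation would use.

```rust
/// The example subset sequence; each subset is written as a string over "ACGT".
const SETS: [&str; 25] = [
    "AT", "C", "", "A", "T", "T", "AGT", "", "", "G", "T", "T", "T",
    "T", "C", "G", "A", "", "", "A", "A", "A", "AT", "T", "G",
];

/// Reconstruct the k-mer at colexicographic rank `i` from the subset sequence.
fn kmer_at(sets: &[&str], k: usize, mut i: usize) -> String {
    let mut reversed = String::new();
    for _ in 0..k {
        if i == 0 {
            // Row 0 is the all-dollar node, which has no incoming edge.
            reversed.push('$');
            continue;
        }
        // Rows 1.. are grouped by incoming label: A-rows, then C, G, T.
        let mut start = 1; // first row whose incoming label is `c`
        for c in "ACGT".chars() {
            // Rows that have an outgoing edge labeled `c`.
            let positions: Vec<usize> =
                (0..sets.len()).filter(|&j| sets[j].contains(c)).collect();
            if i < start + positions.len() {
                reversed.push(c);
                // Key property: the j-th incoming c-edge is the j-th outgoing c-edge.
                i = positions[i - start];
                break;
            }
            start += positions.len();
        }
    }
    reversed.chars().rev().collect()
}

fn main() {
    for i in 0..SETS.len() {
        println!("{i:2} {}", kmer_at(&SETS, 4, i));
    }
}
```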

Given the subset sequence, there is also an algorithm to search for a k-mer in O(k) time. The algorithm is essentially the same as the search algorithm on the FM-index. Given a k-mer P[0..k), at iteration i, we have the range of rows in the picture whose k-mers have P[0..i) as a suffix. When i = 0, we have the empty suffix P[0..0), which is a suffix of all k-mers. To update the range for the next iteration, we follow the edges labeled with the next character of the pattern, using the key property mentioned above, as shown in the figure below for query k-mer TATA. This update step can be implemented efficiently by preprocessing the SBWT set sequence for subset rank queries. See this paper for more on how to implement these rank queries. We provide an implementation based on a bit matrix at SubsetMatrix. Any struct implementing the SubsetSeq trait can be used to query the SBWT.

[Figure: search for the query k-mer TATA in the SBWT graph]
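The search loop can be sketched on the example subset sequence as follows. This is an illustration under stated assumptions, not the crate's implementation: `SETS`, `rank` and `search` are hypothetical names, and the naive `rank` scans the sequence in linear time, so the sketch does not achieve the O(k) bound that a preprocessed subset rank structure such as SubsetMatrix is designed for.

```rust
use std::ops::Range;

/// The example subset sequence; each subset is written as a string over "ACGT".
const SETS: [&str; 25] = [
    "AT", "C", "", "A", "T", "T", "AGT", "", "", "G", "T", "T", "T",
    "T", "C", "G", "A", "", "", "A", "A", "A", "AT", "T", "G",
];

/// Naive subset rank: how many of the first `j` subsets contain `c`.
fn rank(sets: &[&str], c: char, j: usize) -> usize {
    sets[..j].iter().filter(|s| s.contains(c)).count()
}

/// Return the range of rows whose k-mers have `pattern` as a suffix, if any.
fn search(sets: &[&str], pattern: &str) -> Option<Range<usize>> {
    // The empty suffix matches every row.
    let (mut l, mut r) = (0, sets.len());
    for c in pattern.chars() {
        // First row whose incoming label is `c`: one row for the all-dollar
        // node, plus one row per edge with a smaller label.
        let start = 1 + "ACGT"
            .chars()
            .take_while(|&a| a != c)
            .map(|a| rank(sets, a, sets.len()))
            .sum::<usize>();
        // Follow the c-labeled edges leaving the current range.
        l = start + rank(sets, c, l);
        r = start + rank(sets, c, r);
        if l >= r {
            return None;
        }
    }
    Some(l..r)
}

fn main() {
    println!("{:?}", search(&SETS, "TATA")); // Some(4..5)
    println!("{:?}", search(&SETS, "AAAA")); // None
}
```

For TATA the range evolves as 0..25, 15..25, 4..9, 16..19 and finally 4..5, which is the row of TATA in the colexicographic order of the example.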

Implementations§

impl<SS: SubsetSeq> SbwtIndex<SS>

pub fn n_kmers(&self) -> usize

pub fn n_sets(&self) -> usize

pub fn k(&self) -> usize

pub fn char_idx(&self, c: u8) -> usize

pub fn alphabet(&self) -> &[u8]

pub fn interval_of_empty_string(&self) -> Range<usize>

pub fn serialize<W: Write>(&self, out: &mut W) -> Result<usize>

pub fn load<R: Read>(input: &mut R) -> Result<Self>

pub fn inlabel(&self, i: usize) -> Option<u8>

pub fn build_select(&mut self)

pub fn lf_step(&self, i: usize, char_idx: usize) -> usize

pub fn inverse_lf_step(&self, i: usize) -> Option<usize>

pub fn push_kmer_to_vec(&self, colex_rank: usize, buf: &mut Vec<u8>)

pub fn access_kmer(&self, colex_rank: usize) -> Vec<u8>

pub fn search(&self, pattern: &[u8]) -> Option<Range<usize>>

pub fn search_from(&self, interval: Range<usize>, pattern: &[u8]) -> Option<Range<usize>>

pub fn reconstruct_padded_spectrum(&self, n_threads: usize) -> Vec<u8>
where Self: Sync,

pub fn set_lookup_table(&mut self, prefix_lookup_table: PrefixLookupTable)

pub fn get_lookup_table(&self) -> &PrefixLookupTable

pub fn sbwt(&self) -> &SS

pub fn from_subset_seq(subset_rank: SS, n_kmers: usize, k: usize, precalc_prefix_length: usize) -> Self

pub fn push_all_labels_forward(&self, labels_in: &[u8], labels_out: &mut [u8], n_threads: usize)
where Self: Sync,

pub fn push_all_labels_forward_compact(&self, labels_in: &CompactIntVector<3>, labels_out: &mut CompactIntVector<3>, n_threads: usize)
where Self: Sync,

pub fn build_last_column(&self) -> Vec<u8>

pub fn build_last_column_compact(&self) -> CompactIntVector<3>

pub fn compute_dummy_node_marks(&self) -> BitVec

Trait Implementations§

impl<SS: Clone + SubsetSeq> Clone for SbwtIndex<SS>

fn clone(&self) -> SbwtIndex<SS>

fn clone_from(&mut self, source: &Self)

impl<SS: Debug + SubsetSeq> Debug for SbwtIndex<SS>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl<SS: SubsetSeq> ExtendRight for SbwtIndex<SS>

fn extend_right(&self, I: Range<usize>, c: u8) -> Range<usize>

impl<SS: PartialEq + SubsetSeq> PartialEq for SbwtIndex<SS>

fn eq(&self, other: &SbwtIndex<SS>) -> bool

fn ne(&self, other: &Rhs) -> bool

impl<SS: Eq + SubsetSeq> Eq for SbwtIndex<SS>

impl<SS: SubsetSeq> StructuralPartialEq for SbwtIndex<SS>

Auto Trait Implementations§

impl<SS> Freeze for SbwtIndex<SS>
where SS: Freeze,

impl<SS> RefUnwindSafe for SbwtIndex<SS>
where SS: RefUnwindSafe,

impl<SS> Send for SbwtIndex<SS>
where SS: Send,

impl<SS> Sync for SbwtIndex<SS>
where SS: Sync,

impl<SS> Unpin for SbwtIndex<SS>
where SS: Unpin,

impl<SS> UnsafeUnpin for SbwtIndex<SS>
where SS: UnsafeUnpin,

impl<SS> UnwindSafe for SbwtIndex<SS>
where SS: UnwindSafe,

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<T> Conv for T

fn conv<T>(self) -> T
where Self: Into<T>,

impl<T> FmtForward for T

fn fmt_binary(self) -> FmtBinary<Self>
where Self: Binary,

fn fmt_display(self) -> FmtDisplay<Self>
where Self: Display,

fn fmt_lower_exp(self) -> FmtLowerExp<Self>
where Self: LowerExp,

fn fmt_lower_hex(self) -> FmtLowerHex<Self>
where Self: LowerHex,

fn fmt_octal(self) -> FmtOctal<Self>
where Self: Octal,

fn fmt_pointer(self) -> FmtPointer<Self>
where Self: Pointer,

fn fmt_upper_exp(self) -> FmtUpperExp<Self>
where Self: UpperExp,

fn fmt_upper_hex(self) -> FmtUpperHex<Self>
where Self: UpperHex,

fn fmt_list(self) -> FmtList<Self>
where &'a Self: for<'a> IntoIterator,

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<T> Pipe for T
where T: ?Sized,

fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> R
where Self: Sized,

fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> R
where R: 'a,

fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> R
where R: 'a,

fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
where Self: Borrow<B>, B: 'a + ?Sized, R: 'a,

fn pipe_borrow_mut<'a, B, R>( &'a mut self, func: impl FnOnce(&'a mut B) -> R, ) -> R
where Self: BorrowMut<B>, B: 'a + ?Sized, R: 'a,

fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
where Self: AsRef<U>, U: 'a + ?Sized, R: 'a,

fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
where Self: AsMut<U>, U: 'a + ?Sized, R: 'a,

fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
where Self: Deref<Target = T>, T: 'a + ?Sized, R: 'a,

fn pipe_deref_mut<'a, T, R>( &'a mut self, func: impl FnOnce(&'a mut T) -> R, ) -> R
where Self: DerefMut<Target = T> + Deref, T: 'a + ?Sized, R: 'a,

fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

fn tap_deref_mut_dbg<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

impl<T> ToOwned for T
where T: Clone,

fn try_conv<T>(self) -> Result<T, Self::Error>
where Self: TryInto<T>,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

impl<V, T> VZip<V> for T
where V: MultiLane<T>,