pub struct SbwtIndex<SS: SubsetSeq> { /* private fields */ }Expand description
The SBWT index data structure. Construct with SbwtIndexBuilder. For the SubsetSeq trait implementation, we recommend using the bit matrix implementation SubsetMatrix.
§SBWT index
The SBWT index is a compressed index for searching for k-mers in a set of k-mers. It can be seen a version of the FM-index on sets of k-mers. Here we give a brief overview of the data structure and key concepts that are helpful for understanding the API. For further details, see the paper.
§SBWT graph
To understand the SBWT index, it’s helpful to first understand the SBWT graph. The node-centric de Bruijn graph of a set of k-mers R (also called a k-spectrum) is a directed graph where the nodes are the k-mers in R, and there is an edge from x to y if x[1..k) = y[0..k-1). The label of the edge is the last character of y. The SBWT graph is a modified version of the node-centric de Bruijn graph. There are two modifications:
- We add a set R’ of “dummy nodes”. First, the source set R’ ⊆ R is the subset of k-mers in R that do not have an incoming edge in the de Bruijn graph, that is x ∈ R’ iff x[0..k-1) is not the suffix of any k-mer in R. The set of dummy nodes is the set of all proper prefixes of the k-mers in R’. We pad the dummy nodes with dollar symbols from the left so that all dummy nodes have length k. The set R ∪ R’ is called the padded k-spectrum. We add all de Bruijn graph edges between overlaps of the padded k-spectrum, that is, for every x ∈ R ∪ R’ and y ∈ R ∪ R’ such that x[1..k) = y[0..k-1), we add an edge from x to y in the SBWT graph. As a special case, the empty string $^k is always included in R’, and the incoming self-loop edge labeled with $ is not included.
- For every node in R ∪ R’ that has more than one incoming edge, we delete all incoming edges except the one that comes from the colexicographically smallest k-mer. A k-mer x is colexicographically smaller than k-mer y iff the reverse of x is lexicographically smaller than the reverse of y.
For example, below is the SBWT graph of all 4-mers of strings [“TGTTTG”, “TTGCTAT”, “ACGTAGTATAT”, “TGTAAA”]. The source node set is shown in blue, the dummy node set R’ in orange, and the de Bruijn graph edges that have been removed are shown in red.
§SBWT definition
The SBWT is a sequence of subsets of {A,C,G,T} such that the i-th subset contains the edge labels of outgoing edges from the i-th k-mer in the SBWT graph in colexicographic order of the k-mers. If we take the example from above and stack the nodes in colexicographic order, we get:
The SBWT subset sequence in this example is the sequence of outgoing edge label sets read from top to bottom: {A,T}, {C}, {}, {A}, {T}, {T}, {A,G,T}, {}, {}, {G}, {T}, {T}, {T}, {T}, {C}, {G}, {A}, {}, {}, {A}, {A}, {A}, {A,T}, {T}, {G}. The key property that makes the SBWT index work is that the i-th outgoing edge with a given label on the right column in the picture is the same edge as the i-th incoming edge with the same label on the left column. Thanks to this property, once the subset sequnce has been constructed, the k-mers and the graph can be discarded, with no loss of information. That is, the k-mers and the graph can be reconstructed from the subset sequence alone.
Given the subset sequence, there is also an algorithm to search for a k-mer in O(k) time. The algorithm is essentially the same as the search algorithm on the FM-index. Given a k-mer P[0..k], at iteration i, we have the range of rows in the picture whose k-mers have P[0..i) as a suffix. When i = 0, we have the empty suffix P[0..0), which is a suffix of all k-mers. To update the range for the next iteration, we follow the edges labeled with the next character of the pattern, using the key property mentioned above, as shown in the figure below for query k-mer TATA. This update step can be implemented efficiently by preprocessing the SBWT set sequence for subset rank queries. See this paper for more on how to implement these rank queries. We provide an implementation based on a bit matrix at SubsetMatrix. Any struct implementing the SubsetSeq trait can be used to query the SBWT.
Implementations§
Source§impl<SS: SubsetSeq> SbwtIndex<SS>
impl<SS: SubsetSeq> SbwtIndex<SS>
Sourcepub fn char_idx(&self, c: u8) -> usize
pub fn char_idx(&self, c: u8) -> usize
Maps A,C,G,T to 0,1,2,3. Panics if c is not a DNA character.
pub fn interval_of_empty_string(&self) -> Range<usize>
Sourcepub fn serialize<W: Write>(&self, out: &mut W) -> Result<usize>
pub fn serialize<W: Write>(&self, out: &mut W) -> Result<usize>
Writes the index to the writer and retuns the number of bytes written. The array can later be loaded with SbwtIndex::load.
Sourcepub fn load<R: Read>(input: &mut R) -> Result<Self>
pub fn load<R: Read>(input: &mut R) -> Result<Self>
Loads an index that was previously serialized with SbwtIndex::serialize.
Sourcepub fn inlabel(&self, i: usize) -> Option<u8>
pub fn inlabel(&self, i: usize) -> Option<u8>
Returns the edge label on the incoming edge to the i-th node in colexicographic order in the SBWT graph (not the same graph as the de Bruijn graph!), or None if the node has no incoming edge. This is well-defined since every node in the SBWT graph has at most 1 incoming edge.
Sourcepub fn build_select(&mut self)
pub fn build_select(&mut self)
Build select support for the SBWT subset sequence. This is required for SbwtIndex::inverse_lf_step and SbwtIndex::access_kmer.
Sourcepub fn lf_step(&self, i: usize, char_idx: usize) -> usize
pub fn lf_step(&self, i: usize, char_idx: usize) -> usize
A low-level function returning C[char_idx] + SBWT.rank(char_idx, i).
Sourcepub fn inverse_lf_step(&self, i: usize) -> Option<usize>
pub fn inverse_lf_step(&self, i: usize) -> Option<usize>
The inverse function of SbwtIndex::lf_step.
Requires that the select support has been built.
Returns None if called on the source node of the SBWT graph.
Sourcepub fn push_kmer_to_vec(&self, colex_rank: usize, buf: &mut Vec<u8>)
pub fn push_kmer_to_vec(&self, colex_rank: usize, buf: &mut Vec<u8>)
Requires that the select support has been built.
Accesses the k-mer associated with the node with colexicographic rank
colex_rank. Note that this k-mer may contain dollar symbols if it’s
a dummy node. The k-mer string is pushed to buf.
Sourcepub fn access_kmer(&self, colex_rank: usize) -> Vec<u8> ⓘ
pub fn access_kmer(&self, colex_rank: usize) -> Vec<u8> ⓘ
Requires that the select support has been built.
Accesses the k-mer associated with the node with colexicographic rank
colex_rank. Note that this k-mer may contain dollar symbols if it’s
a dummy node. The k-mer is retuned as a Vec<u8>. To push to an existing
buffer for better performance, see SbwtIndex::push_kmer_to_vec.
Sourcepub fn search(&self, pattern: &[u8]) -> Option<Range<usize>>
pub fn search(&self, pattern: &[u8]) -> Option<Range<usize>>
Returns the colexicographic interval of the pattern, if found. The pattern is given in ascii characters.
Sourcepub fn search_from(
&self,
interval: Range<usize>,
pattern: &[u8],
) -> Option<Range<usize>>
pub fn search_from( &self, interval: Range<usize>, pattern: &[u8], ) -> Option<Range<usize>>
Searches the pattern starting from the given colexicographic interval. Returns the colexicographic interval after searching the pattern, if found. The pattern is given in ascii characters.
Sourcepub fn reconstruct_padded_spectrum(&self, n_threads: usize) -> Vec<u8> ⓘwhere
Self: Sync,
pub fn reconstruct_padded_spectrum(&self, n_threads: usize) -> Vec<u8> ⓘwhere
Self: Sync,
Reconstruct all k-mers in the padded k-spectrum in the data structure. The reconstructed k-mers concatenated into a single ascii string in colexicographic order. The i-th k-mer starts at position i*k in the string.
Sourcepub fn set_lookup_table(&mut self, prefix_lookup_table: PrefixLookupTable)
pub fn set_lookup_table(&mut self, prefix_lookup_table: PrefixLookupTable)
Set the prefix lookup table of the data structure.
Sourcepub fn get_lookup_table(&self) -> &PrefixLookupTable
pub fn get_lookup_table(&self) -> &PrefixLookupTable
Get the prefix lookup table of the data structure.
Sourcepub fn from_subset_seq(
subset_rank: SS,
n_kmers: usize,
k: usize,
precalc_prefix_length: usize,
) -> Self
pub fn from_subset_seq( subset_rank: SS, n_kmers: usize, k: usize, precalc_prefix_length: usize, ) -> Self
Construct from existing crate::subsetseq::SubsetSeq.
Sourcepub fn push_all_labels_forward(
&self,
labels_in: &[u8],
labels_out: &mut [u8],
n_threads: usize,
)where
Self: Sync,
pub fn push_all_labels_forward(
&self,
labels_in: &[u8],
labels_out: &mut [u8],
n_threads: usize,
)where
Self: Sync,
Push labels forward along the SBWT graph.
The effect is this: if labels_in is a list of labels, one for each node in the SBWT in colex order, then the output
is those labels pushed forward along the edges in the SBWT graph. Those nodes that do not have a predecessor
(just the root of the graph) get a dollar.
Sourcepub fn push_all_labels_forward_compact(
&self,
labels_in: &CompactIntVector<3>,
labels_out: &mut CompactIntVector<3>,
n_threads: usize,
)where
Self: Sync,
pub fn push_all_labels_forward_compact(
&self,
labels_in: &CompactIntVector<3>,
labels_out: &mut CompactIntVector<3>,
n_threads: usize,
)where
Self: Sync,
A version of SbwtIndex::push_all_labels_forward that stores the labels as compact integer vectors with elements {0,1, …, 4}. The symbol 0 is the “dollar” and symbols 1,2,3,4 are A,C,G and T. (see the documentation of SbwtIndex::push_all_labels_forward.
Sourcepub fn build_last_column(&self) -> Vec<u8> ⓘ
pub fn build_last_column(&self) -> Vec<u8> ⓘ
Build the last column of the SBWT matrix.
Sourcepub fn build_last_column_compact(&self) -> CompactIntVector<3>
pub fn build_last_column_compact(&self) -> CompactIntVector<3>
Build the last column of the SBWT matrix.
Sourcepub fn compute_dummy_node_marks(&self) -> BitVec ⓘ
pub fn compute_dummy_node_marks(&self) -> BitVec ⓘ
Compute a bit vector marking the dummy nodes in the SBWT
Trait Implementations§
Source§impl<SS: SubsetSeq> ExtendRight for SbwtIndex<SS>
impl<SS: SubsetSeq> ExtendRight for SbwtIndex<SS>
Source§fn extend_right(&self, I: Range<usize>, c: u8) -> Range<usize>
fn extend_right(&self, I: Range<usize>, c: u8) -> Range<usize>
Right extensions implemented time O(t), where t is the time for a rank query in the
subset rank query implementation of SS. This is O(1) for SubsetMatrix.
impl<SS: Eq + SubsetSeq> Eq for SbwtIndex<SS>
impl<SS: SubsetSeq> StructuralPartialEq for SbwtIndex<SS>
Auto Trait Implementations§
impl<SS> Freeze for SbwtIndex<SS>where
SS: Freeze,
impl<SS> RefUnwindSafe for SbwtIndex<SS>where
SS: RefUnwindSafe,
impl<SS> Send for SbwtIndex<SS>where
SS: Send,
impl<SS> Sync for SbwtIndex<SS>where
SS: Sync,
impl<SS> Unpin for SbwtIndex<SS>where
SS: Unpin,
impl<SS> UnsafeUnpin for SbwtIndex<SS>where
SS: UnsafeUnpin,
impl<SS> UnwindSafe for SbwtIndex<SS>where
SS: UnwindSafe,
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> CloneToUninit for Twhere
T: Clone,
Source§impl<T> FmtForward for T
impl<T> FmtForward for T
Source§fn fmt_binary(self) -> FmtBinary<Self>where
Self: Binary,
fn fmt_binary(self) -> FmtBinary<Self>where
Self: Binary,
self to use its Binary implementation when Debug-formatted.Source§fn fmt_display(self) -> FmtDisplay<Self>where
Self: Display,
fn fmt_display(self) -> FmtDisplay<Self>where
Self: Display,
self to use its Display implementation when
Debug-formatted.Source§fn fmt_lower_exp(self) -> FmtLowerExp<Self>where
Self: LowerExp,
fn fmt_lower_exp(self) -> FmtLowerExp<Self>where
Self: LowerExp,
self to use its LowerExp implementation when
Debug-formatted.Source§fn fmt_lower_hex(self) -> FmtLowerHex<Self>where
Self: LowerHex,
fn fmt_lower_hex(self) -> FmtLowerHex<Self>where
Self: LowerHex,
self to use its LowerHex implementation when
Debug-formatted.Source§fn fmt_octal(self) -> FmtOctal<Self>where
Self: Octal,
fn fmt_octal(self) -> FmtOctal<Self>where
Self: Octal,
self to use its Octal implementation when Debug-formatted.Source§fn fmt_pointer(self) -> FmtPointer<Self>where
Self: Pointer,
fn fmt_pointer(self) -> FmtPointer<Self>where
Self: Pointer,
self to use its Pointer implementation when
Debug-formatted.Source§fn fmt_upper_exp(self) -> FmtUpperExp<Self>where
Self: UpperExp,
fn fmt_upper_exp(self) -> FmtUpperExp<Self>where
Self: UpperExp,
self to use its UpperExp implementation when
Debug-formatted.Source§fn fmt_upper_hex(self) -> FmtUpperHex<Self>where
Self: UpperHex,
fn fmt_upper_hex(self) -> FmtUpperHex<Self>where
Self: UpperHex,
self to use its UpperHex implementation when
Debug-formatted.Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§impl<T> Pipe for Twhere
T: ?Sized,
impl<T> Pipe for Twhere
T: ?Sized,
Source§fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> Rwhere
Self: Sized,
fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> Rwhere
Self: Sized,
Source§fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> Rwhere
R: 'a,
fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> Rwhere
R: 'a,
self and passes that borrow into the pipe function. Read moreSource§fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> Rwhere
R: 'a,
fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> Rwhere
R: 'a,
self and passes that borrow into the pipe function. Read moreSource§fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
Source§fn pipe_borrow_mut<'a, B, R>(
&'a mut self,
func: impl FnOnce(&'a mut B) -> R,
) -> R
fn pipe_borrow_mut<'a, B, R>( &'a mut self, func: impl FnOnce(&'a mut B) -> R, ) -> R
Source§fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
self, then passes self.as_ref() into the pipe function.Source§fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
self, then passes self.as_mut() into the pipe
function.Source§fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
self, then passes self.deref() into the pipe function.Source§impl<T> Pointable for T
impl<T> Pointable for T
Source§impl<T> Tap for T
impl<T> Tap for T
Source§fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
Borrow<B> of a value. Read moreSource§fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
BorrowMut<B> of a value. Read moreSource§fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
AsRef<R> view of a value. Read moreSource§fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
AsMut<R> view of a value. Read moreSource§fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
Deref::Target of a value. Read moreSource§fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
Deref::Target of a value. Read moreSource§fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self
fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self
.tap() only in debug builds, and is erased in release builds.Source§fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self
fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self
.tap_mut() only in debug builds, and is erased in release
builds.Source§fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
.tap_borrow() only in debug builds, and is erased in release
builds.Source§fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
.tap_borrow_mut() only in debug builds, and is erased in release
builds.Source§fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
.tap_ref() only in debug builds, and is erased in release
builds.Source§fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
.tap_ref_mut() only in debug builds, and is erased in release
builds.Source§fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
.tap_deref() only in debug builds, and is erased in release
builds.