Expand description
§Introduction
This crate contains an implementation of the Bit Matrix SBWT data structure, as described in Small Searchable k-Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform, for the DNA alphabet ACGT. A CLI for the main features of the library can be found at https://github.com/jnalanko/sbwt-rs-cli. The data structure uses a variant of the Burrows-Wheeler transform to compress a set of k-mers in way that allows fast lookup queries. If the input k-mers are consecutive k-mers from longer underlying sequences, the index takes typically around 5 bits per distinct k-mer, supporting k-mer lookup queries at a speed of around 1 μs / k-mer on modern hardware.
Queries can be further sped up by using the Longest common suffix array, (see here) taking roughly log(k) bits of space per k-mer. The LCS array also enables the computation of k-bounded matching statistics and shortest frequency-bounded suffixes (both described here) Finally, the crate provides an interface for traversing the node-centric de Bruijn graph of the k-mers.
§API Quick start
use sbwt::*;
use std::io::BufReader;
use std::io::BufWriter;
use std::fs::File;
use std::path::Path;
// Build the sbwt
let seqs: Vec<&[u8]> = vec![b"AACTGACTGATCGTCTTGACTCGTTTATCTACGGT", b"ACTGACAGCTCTGCGATGCGA"];
let seq_stream = sbwt::SliceSeqStream::new(seqs.as_slice());
let (sbwt, lcs) = SbwtIndexBuilder::new()
.k(6).n_threads(4).build_lcs(true).add_rev_comp(true)
.algorithm(BitPackedKmerSorting::new()
.mem_gb(2)
.dedup_batches(false)
.temp_dir(Path::new("./temp")))
.run(seq_stream);
// Query a k-mer
let query_kmer = b"GACTCG";
if sbwt.search(query_kmer).is_some() {
println!("{} is found", &String::from_utf8_lossy(query_kmer));
}
// Query all k-mers of a longer query using k-bounded matching statistics
let lcs = lcs.unwrap(); // Ok because we used build_lcs(true)
let streaming_index = StreamingIndex::new(&sbwt, &lcs);
let long_query = b"TGATACGTCTTAGTGACTCGTTT";
for (i, (len, range)) in streaming_index.matching_statistics(long_query).iter().enumerate() {
// Kmer ending at long_query[i] exists iff len == k
println!("Longest match ending at {} has length {} and colex range {:?}", i, len, range);
}
// Write index to disk for later use
sbwt.serialize(&mut BufWriter::new(File::create("index.sbwt").unwrap())).unwrap();
lcs.serialize(&mut BufWriter::new(File::create("index.lcs").unwrap())).unwrap();
§Reverse complements and input preprocessing
If your input is very large, you may want to preprocess it to remove duplicate k-mers reduce construction time, memory and disk space. We recommend using a specialized tool such as GGCAT, Cuttlefish or BCALM2 for this purpose.
This crate considers a k-mer distinct from its reverse complement. In the use case where a k-mer is considered equal to its reverse complement, you need to either feed both directions to the index, or feed just one direction but query each k-mer both ways. The former approach can be easily implemented by enabling reverse complements in the builder. For the latter approach, we recommend using one of the unitig construction tools mentioned above to turn the data into canonical unitigs containing each k-mer in only one orientation. After this, it is very important to reorient the unitigs to maintain a consistent orientation for neighboring k-mers since as explained here, each k-mer without a direct predecessor increases the index size. The payoff of this approach is up to two times smaller index size compared to indexing both orientations explicitly, with the drawback that now queries need to be run in both orientations.
§De Bruijn graph operations
With the addition of two extra bitvectors, the data structure supports traversal on the node-centric de-Bruijn graph of the input k-mers. The set of nodes is the set of distinct k-mers, and there is an edge from x to y iff x[1..k) = y[0..k-1). The label of the edge is the last character of y. The graph is not aware of reverse complements, so if you want to traverse the bi-directed de Bruijn graph that includes edges where one or both of the endpoints are reverse complemented, you need to take care of that logic yourself. See Dbg for the API.
§Construction algorithms
The crate currently provides one construction algorithm: bitpacked k-mer sorting. This extracts all k-mers in the input, packs each k-mer to 2k bits, and sorts them in parallel. This requires O(nk) time and disk space, so it is suitable only for small k. There is still a lot of room for optimization in the implementation. For larger k, a suffix-array-based construction algorithm is planned.
§Details on the space usage of the index
The index exploits overlaps between k-mers to encode them in small space.
We say that a k-mer x is a source k-mer if it has no incoming edges in the
node-centric de Bruijn graph of the input k-mers S, that is, there does not
exist a k-mer y ∈ S such that y[1..k) = x[0..k-1).
The number of bits in SbwtIndex<SubsetMatrix>
is 5(n + n’) plus a small constant,
where n is the number
of distinct k-mers in the dataset, and n’ is the number of nodes in the trie
of all prefixes of length k-1 of all source k-mers (See here for more
details on the inner workings of the SBWT to understand what is going on). When the k-mers are from
biological sequences, and unitigs are oriented consistently, the term n’ is typically negligible, but if the k-mers
are for example a randomly sampled subset, then the benefit of overlaps is
lost, and the term n’ dominates. In the worst case, n’ can be up to n(k-1) + 1.
§Limitations
The implementation only supports the DNA alphabet ACGT. For best compression, the input k-mers should originate from a longer underlying sequence so that sbwt is able to exploit the overlaps for better compression. For non-overlapping k-mer sets, a simple hash table is likely a better choice.
Modules§
Structs§
- A construction algorithm based on sorting of bit-packed k-mers using temporary disk space.
- An array that stores the lengths of the longest common suffixes of consecutive k-mers in colexicographic order.
- A table storing the SBWT intervals of all 4^p possible p-mers.
- The SBWT index data structure. Construct with SbwtIndexBuilder. For the SubsetSeq trait implementation, we recommend using the bit matrix implementation SubsetMatrix.
- A builder for constructing an SBWT index.
- Creates a crate::SeqStream out of a slice of ASCII sequences.
- An index that uses right extensions and left contractions to find matches in a streaming fashion.
- An implementation of SubsetSeq with a matrix of sigma indicator bit vectors: the i-th bit of the j-th bit vector is 1 if and only if the i-th subset contains the j-th character. Rank and select queries are reduced to bit vector rank and select queries on the indicator bit vectors.
- Creates a crate::SeqStream out of a slice of ascii vectors.
Enums§
- An enum listing SbwtIndex types built on different subset rank implementations provided in this crate. Currently, only the SubsetMatrix structure is supported, but in the future there might be more.
Traits§
- Contracting a search pattern from the left.
- Extending a search pattern to the right.
- Any struct implementing this interface can be used in SbwtIndexBuilder to construct the SBWT index and optionally the LCS array.
- A stream of ASCII-encoded DNA-sequences. This is not necessarily a standard Rust iterator because we want to support streaming sequences from disk, which is not possible with a regular iterator due to lifetime constraints of the Iterator trait.
- This trait represents a sequence of subsets from alphabet {0, 1, …, sigma-1}, where sigma is the alphabet size. The trait provides access to the subsets and rank and select queries for the elements inside the subsets.
Functions§
- Loads an index that was previously serialized either with the C++ API https://github.com/algbio/SBWT, or the associated CLI. Supports only version v0.1.
- Loads an index that was stored with write_sbwt_index_variant. This includes a type identifier to load the correct subset rank variant.
- Given an iterator of canonical unitigs of the node-centric de Bruijn graph of order k, returns a vector of orientations, one for each sequence, aiming to minimize the number of unitigs which do not have an incoming edge in the de Bruijn graph of order k. Canonical unitigs means that each k-mer occur in only one orientation. The input can be a VecSeqStream, a SliceSeqStream or any struct implementing SeqStream.
- Reverses the given ASCII DNA sequence and replaces each nucleotide with its complement.
- Loads an index that is wrapped in an enum describing the used subset rank structure type. The format includes a type identifier so the correct variant can later be loaded with load_sbwt_index_variant.