Crate debruijn


§debruijn: a De Bruijn graph library for DNA sequences in Rust.

This library provides tools for efficiently constructing De Bruijn graphs (dBG) from DNA sequences, tracking arbitrary metadata associated with kmers in the graph, and performing path-compression of unbranched graph paths to improve speed and reduce memory consumption.

Most applications of debruijn will follow this general workflow:

  1. You generate a set of sequences to make a dBG from.
  2. You pass those sequences to the filter_kmers function, which converts the sequences into kmers, while tracking ‘metadata’ about each kmer in a very customizable way. The metadata could be read count, a set of colors, a set of read counts split by haplotype, a UMI count, etc.
  3. The library will convert the kmers to a compressed dBG. You can also customize the rules for how to compress the dBG and how to ‘combine’ the per-kmer metadata.

You can then use the final compressed dBG however you like. The library includes some methods for simplifying and rebuilding the graph, though these are less developed.
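To make steps 1 and 2 of the workflow concrete, here is a minimal sketch of the idea in plain Rust. It does not use this crate's API: `count_and_filter_kmers` is a hypothetical helper, the metadata is assumed to be a simple occurrence count, and kmers are kept as byte slices rather than packed integers. It panics if `k` is 0.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of debruijn): count every kmer across the
/// input sequences and keep those seen at least `min_count` times.
/// The count plays the role of the per-kmer metadata.
fn count_and_filter_kmers(seqs: &[&[u8]], k: usize, min_count: u32) -> Vec<(Vec<u8>, u32)> {
    let mut counts: HashMap<&[u8], u32> = HashMap::new();
    for seq in seqs {
        // `windows(k)` yields every length-k substring; sequences shorter
        // than k contribute nothing.
        for kmer in seq.windows(k) {
            *counts.entry(kmer).or_insert(0) += 1;
        }
    }
    let mut kept: Vec<(Vec<u8>, u32)> = counts
        .into_iter()
        .filter(|&(_, n)| n >= min_count)
        .map(|(kmer, n)| (kmer.to_vec(), n))
        .collect();
    // Sort so the result does not depend on hash-map iteration order.
    kept.sort();
    kept
}
```

For the two reads `ACGTAC` and `CGTACG` with `k = 4` and `min_count = 2`, only `CGTA` and `GTAC` survive, each with a count of 2. A real pipeline would then link the surviving kmers by their single-base extensions and compress unbranched paths, which this sketch leaves out.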

§Examples

All the data structures in debruijn-rs are specialized to the 4-base DNA alphabet. They use a 2-bit packed encoding of bases into integer types and provide efficient methods for reverse complement, enumerating kmers from longer sequences, and transferring data between sequences.
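The following is a minimal sketch of what 2-bit packing and reverse complement can look like, written in plain Rust without the crate. The function names are hypothetical, the packing order (first base in the highest-order used bits) is an assumption made for illustration, and the loop-based reverse complement favors clarity over speed.

```rust
/// Map an ASCII base to a 2-bit code (A=0, C=1, G=2, T=3); `None` otherwise.
fn encode_base(b: u8) -> Option<u64> {
    match b {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack up to 32 bases into a u64, two bits per base.
/// Returns `None` for longer inputs or non-ACGT characters.
fn pack_kmer(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &b in seq {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Reverse complement of a packed kmer of length `k` (at most 32).
fn reverse_complement(packed: u64, k: usize) -> u64 {
    let mut fwd = packed;
    let mut rc = 0u64;
    for _ in 0..k {
        // With A=0, C=1, G=2, T=3 the complement of a base is 3 - base.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Unpack a kmer of length `k` back into an ASCII string.
fn unpack_kmer(packed: u64, k: usize) -> String {
    (0..k)
        .rev()
        .map(|i| b"ACGT"[((packed >> (2 * i)) & 3) as usize] as char)
        .collect()
}
```

Under these assumptions `AACG` packs to the integer 6, and its reverse complement unpacks to `CGTT`. Because each base occupies two bits, a 32-mer fits in a single `u64`, which is what makes comparisons and hashing of kmers cheap.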

§Encodings

Most methods for ingesting sequence data into the library have a form named ‘bytes’, which expects bases encoded as the integers 0,1,2,3, and a separate form named ‘ascii’, which expects bases encoded as the ASCII letters A,C,G,T.
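The relationship between the two encodings can be sketched as a pair of conversions. The helpers below are hypothetical stand-ins written for illustration, not the crate's own functions, and they assume the mapping A=0, C=1, G=2, T=3 with any other input rejected.

```rust
/// Convert ASCII bases (A, C, G, T) to the 0..=3 integer form.
/// Returns `None` if any other character is present.
fn ascii_to_bytes(ascii: &[u8]) -> Option<Vec<u8>> {
    ascii
        .iter()
        .map(|&c| match c {
            b'A' => Some(0u8),
            b'C' => Some(1u8),
            b'G' => Some(2u8),
            b'T' => Some(3u8),
            _ => None,
        })
        .collect()
}

/// Convert bases in the 0..=3 integer form back to ASCII letters.
/// Returns `None` if any value is greater than 3.
fn bytes_to_ascii(bytes: &[u8]) -> Option<Vec<u8>> {
    bytes
        .iter()
        .map(|&b| b"ACGT".get(b as usize).copied())
        .collect()
}
```

With these definitions, `ascii_to_bytes(b"GATTACA")` yields `[2, 0, 3, 3, 0, 1, 0]`, and `bytes_to_ascii` maps that vector back to `GATTACA`.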

Modules§

clean_graph
DeBruijn graph simplification routines. Currently tip-removal is implemented.
compression
Create compressed DeBruijn graphs from uncompressed DeBruijn graphs, or a collection of disjoint DeBruijn graphs.
dna_string
A 2-bit encoding of arbitrary length DNA sequences.
filter
Methods for converting sequences into kmers, filtering observed kmers before De Bruijn graph construction, and summarizing ‘color’ annotations.
graph
Containers for path-compressed De Bruijn graphs
kmer
Represent kmers with statically known length in compact integer types
msp
Methods for minimum substring partitioning of a DNA string
neighbors
vmer
Variable-length DNA strings packed into fixed-size structs.

Structs§

DnaBytes
A newtype wrapper around a Vec<u8> with implementations of the Mer and Vmer traits.
DnaSlice
A newtype wrapper around a &[u8] with implementations of the Mer and Vmer traits.
Exts
Store single-base extensions for a DNA Debruijn graph.
KmerExtsIter
Iterate over the (Kmer, Exts) tuples of a sequence and its extensions efficiently
KmerIter
Iterate over the Kmers of a DNA sequence efficiently
MerIter
Iterator over bases of a DNA sequence (bases will be unpacked into bytes).

Enums§

Dir
Direction of motion in a DeBruijn graph

Traits§

Kmer
Encapsulates a Kmer sequence with statically known K.
Mer
Trait for interacting with DNA sequences
MerImmut
An immutable interface to a Mer sequence.
Vmer
A DNA sequence with run-time variable length, up to a statically known maximum length

Functions§

base_to_bits
Convert an ASCII-encoded DNA base to a 2-bit representation
bits_to_ascii
Convert a 2-bit representation of a base to a char
bits_to_base
Convert a 2-bit representation of a base to a char
complement
The complement of a 2-bit encoded base
dna_only_base_to_bits
is_valid_base
Convert an ASCII-encoded DNA base to a 2-bit representation