Crate bio

Expand description

Rust-bio, a bioinformatics library for Rust.

This library provides implementations of many algorithms and data structures that are useful for bioinformatics. All provided implementations are rigorously tested via continuous integration.

For getting started with using rust-bio, see the Getting started section below. For navigating the documentation of the available modules, see the Modules section below. If you want to contribute to rust-bio, see the Contribute section in the repo.

Currently, rust-bio provides

most major pattern matching algorithms,
a convenient alphabet implementation,
pairwise alignment,
suffix arrays,
the Burrows-Wheeler-transform (BWT)
the Full-text index in Minute space index (FM-index),
FMD-Index for finding supermaximal exact matches,
a q-gram index,
utilities to work with PSSMs,
an open reading frame (ORF) search algorithm,
a rank/select data structure,
serde support for all data structures when built with nightly feature,
readers and writers for FASTQ, FASTA and BED,
helper functions for combinatorics and dealing with log probabilities,
an implementation of the Hidden Markov Model and related algorithms.

For reading and writing SAM/BAM/CRAM, VCF/BCF files or tabix indexed files, have a look at rust-htslib.

Getting started

We explain how to use Rust-Bio step-by-step. Users who already have experience with Rust can skip right to Step 3: Use Rust-Bio in your project. Users who already know rust-bio might want to jump right into the modules docs

Step 1: Setting up Rust

Rust can be installed following the instruction for rustup.

Step 2: Setting up a new Rust project

Since Rust-Bio is a library, you need to setup your own new Rust project to use Rust-Bio. With Rust, projects and their dependencies are managed with the builtin package manager Cargo. To create a new Rust project, issue

cargo new hello_world --bin
cd hello_world

in your terminal. The flag --bin tells Cargo to create an executable project instead of a library. In this section of the Rust docs, you find details about what Cargo just created for you.

Your new project can be compiled with

cargo build

If dependencies in your project are out of date, update with

cargo update

Execute the compiled code with

cargo run

If you are new to Rust, we suggest to proceed with learning Rust via the Rust docs.

Step 3: Use Rust-Bio in your project

To use Rust-Bio in your Rust project, add the following to your Cargo.toml

[dependencies]
bio = "*"

and import the crate from your source code:

extern crate bio;

Example: FM-index and FASTQ

An example of using rust-bio:

// Import some modules
use bio::alphabets;
use bio::data_structures::bwt::{bwt, less, Occ};
use bio::data_structures::fmindex::{BackwardSearchResult, FMIndex, FMIndexable};
use bio::data_structures::suffix_array::suffix_array;
use bio::io::fastq;
use bio::io::fastq::FastqRead;
use std::io;

// a given text
let text = b"ACAGCTCGATCGGTA$";
let pattern = b"ATCG";

// Create an FM-Index for the given text.

// instantiate an alphabet
let alphabet = alphabets::dna::iupac_alphabet();
// calculate a suffix array
let sa = suffix_array(text);
// calculate the Burrows-Wheeler-transform
let bwt = bwt(text, &sa);
// calculate the vectors less and Occ (occurrences)
let less = less(&bwt, &alphabet);
let occ = Occ::new(&bwt, 3, &alphabet);
// set up FMIndex
let fmindex = FMIndex::new(&bwt, &less, &occ);
// do a backwards search for the pattern
let interval = fmindex.backward_search(pattern.iter());
let mut partial_match_len = 0;
// get the locations where the pattern matched (completely in this case).
let positions = match interval {
    BackwardSearchResult::Complete(saint) => saint.occ(&sa),
    BackwardSearchResult::Partial(saint, l) => {
        partial_match_len = l;
        saint.occ(&sa)
    }
    BackwardSearchResult::Absent => Vec::new(),
};
// Iterate over a FASTQ file, use the alphabet to validate read
// sequences and search for exact matches in the FM-Index.

// create FASTQ reader
let mut reader = fastq::Reader::new(io::stdin());
let mut record = fastq::Record::new();
let mut partial_match_len = 0;
reader.read(&mut record).expect("Failed to parse record");
while !record.is_empty() {
    let check = record.check();
    if check.is_err() {
        panic!("I got a rubbish record!")
    }
    // obtain sequence
    let seq = record.seq();
    // check, whether seq is in the expected alphabet
    if alphabet.is_word(seq) {
        let interval = fmindex.backward_search(seq.iter());
        // get the positions where seq matched completely
        // or where the maximal matching suffix of seq occurred.
        let positions = match interval {
            BackwardSearchResult::Complete(saint) => saint.occ(&sa),
            BackwardSearchResult::Partial(saint, l) => {
                partial_match_len = l;
                saint.occ(&sa)
            }
            BackwardSearchResult::Absent => Vec::new(),
        };
    }
    reader.read(&mut record).expect("Failed to parse record");
}

Documentation and further examples for each module can be found in the module descriptions below.

Example: Multithreaded

use bio::alphabets;
use bio::data_structures::bwt::{bwt, less, Occ};
use bio::data_structures::fmindex::{BackwardSearchResult,FMIndex, FMIndexable};
use bio::data_structures::suffix_array::suffix_array;
use std::sync::Arc;
use std::thread;

let text = b"ACGGATGCTGGATCGGATCGCGCTAGCTA$";
let patterns = vec![b"ACCG", b"TGCT"];

// Create an FM-Index for a given text.
let alphabet = alphabets::dna::iupac_alphabet();
let sa = suffix_array(text);
let bwt = Arc::new(bwt(text, &sa));
let less = Arc::new(less(bwt.as_ref(), &alphabet));
let occ = Arc::new(Occ::new(bwt.as_ref(), 3, &alphabet));
let fmindex = Arc::new(FMIndex::new(bwt, less, occ));

// Spawn threads to perform backward searches for each interval
let interval_calculators = patterns
    .into_iter()
    .map(|pattern| {
        let fmindex = fmindex.clone();
        thread::spawn(move || fmindex.backward_search(pattern.iter()))
    })
    .collect::<Vec<_>>();

// Loop through the results, extracting the positions array for each pattern
for interval_calculator in interval_calculators {
    let positions = match interval_calculator.join().unwrap() {
        BackwardSearchResult::Complete(saint) => saint.occ(&sa),
        _ => Vec::new()
    };
}

Documentation and further examples for each module can be found in the module descriptions below.

Benchmarks

Since Rust-Bio is based on a compiled language, similar performance to C/C++ based libraries can be expected. Indeed, we find the pattern matching algorithms of Rust-Bio to perform in the range of the C++ library Seqan:

Algorithm	Rust-Bio	Seqan
BNDM	77ms	80ms
Horspool	122ms	125ms
BOM	103ms	107ms
Shift-And	241ms	545ms

We measured 10000 iterations of searching pattern GCGCGTACACACCGCCCG in the sequence of the hg38 MT chromosome. Initialization time of each algorithm for the given pattern was included in each iteration. Benchmarks were conducted with Cargo bench for Rust-Bio and Python timeit for Seqan on an Intel Core i5-3427U CPU. Benchmarking Seqan from Python timeit entails an overhead of 1.46ms for calling a C++ binary. This overhead was subtracted from above Seqan run times. Note that this benchmark only compares the two libraries to exemplify that Rust-Bio has comparable speed to C++ libraries: all used algorithms have their advantages for specific text and pattern structures and lengths (see the pattern matching section in the documentation)./!

Modules

alignment

Various alignment and distance computing algorithms.

alphabets

Implementation of alphabets and useful utilities.

data_structures

Various useful data structures.

io

Readers and writers for common bioinformatics file formats.

pattern_matching

This module contains various useful pattern matching algorithms. The implementations are based on the lecture notes “Algorithmen auf Sequenzen”, Kopczynski, Marschall, Martin and Rahmann, 2008 - 2015.

scores

seq_analysis

Sequence analysis algorithms.

stats

Mathematical and statistical tools.

utils

Common utilities.