bio/
lib.rs

1// Copyright 2014-2016 Johannes Köster.
2// Licensed under the MIT license (http://opensource.org/licenses/MIT)
3// This file may not be copied, modified, or distributed
4// except according to those terms.
5
6#![doc(
7    html_logo_url = "https://raw.githubusercontent.com/rust-bio/rust-bio/master/img/bioferris.svg",
8    html_favicon_url = "https://raw.githubusercontent.com/rust-bio/rust-bio/master/img/bioferris.svg"
9)]
10
11//! # Rust-bio, a bioinformatics library for Rust.
12//! This library provides implementations of many algorithms and data structures
13//! that are useful for bioinformatics.
14//! All provided implementations are rigorously tested via continuous
15//! integration.
16//!
17//! For **getting started** with using `rust-bio`, see [the `Getting started` section below](#getting-started).
18//! For navigating the documentation of the available modules, see [the `Modules` section below](#modules).
19//! If you want to contribute to `rust-bio`, see [the `Contribute` section in the repo](https://github.com/rust-bio/rust-bio#contribute).
20//!
21//! Currently, rust-bio provides
22//!
23//! * most major pattern matching algorithms,
24//! * a convenient alphabet implementation,
25//! * pairwise alignment,
26//! * suffix arrays,
27//! * the [Burrows-Wheeler-transform (BWT)](https://www.semanticscholar.org/paper/A-Block-sorting-Lossless-Data-Compression-Algorithm-Burrows-Wheeler/af56e6d4901dcd0f589bf969e604663d40f1be5d)
28//! * the [Full-text index in Minute space index (FM-index)](https://doi.org/10.1109/SFCS.2000.892127),
29//! * FMD-Index for finding supermaximal exact matches,
30//! * a q-gram index,
31//! * utilities to work with [PSSMs](https://en.wikipedia.org/wiki/Position_weight_matrix),
32//! * an open reading frame (ORF) search algorithm,
33//! * a rank/select data structure,
34//! * [serde](https://github.com/serde-rs/serde) support for all data structures when built with `nightly` feature,
35//! * readers and writers for FASTQ, FASTA and BED,
36//! * helper functions for combinatorics and dealing with log probabilities,
37//! * an implementation of the Hidden Markov Model and related algorithms.
38//!
39//! For reading and writing SAM/BAM/CRAM, VCF/BCF files or tabix indexed files, have a look at [rust-htslib](https://docs.rs/rust-htslib).
40//!
41//! # Getting started
42//!
43//! We explain how to use Rust-Bio step-by-step.
44//! Users who already have experience with Rust can skip right to [Step 3: Use Rust-Bio in your project](https://docs.rs/bio/#step-3-use-rust-bio-in-your-project).
45//! Users who already know `rust-bio` might want to jump right into the [modules docs](https://docs.rs/bio/#modules)
46//!
47//! ## Step 1: Setting up Rust
48//!
49//! Rust can be installed following the instruction for [rustup](https://rustup.rs/).
50//!
51//!
52//! ## Step 2: Setting up a new Rust project
53//!
54//! Since Rust-Bio is a library, you need to setup your own new Rust project to use Rust-Bio.
55//! With Rust, projects and their dependencies are managed with the builtin package manager [Cargo](https://doc.rust-lang.org/cargo/index.html).
56//! To create a new Rust project, issue
57//!
58//! ```bash
59//! cargo new hello_world --bin
60//! cd hello_world
61//! ```
62//! in your terminal. The flag `--bin` tells Cargo to create an executable project instead of a library.
63//! In [this section](https://doc.rust-lang.org/nightly/book/hello-cargo.html#a-new-project) of the Rust docs, you find details about what Cargo just created for you.
64//!
65//! Your new project can be compiled with
66//! ```bash
67//! cargo build
68//! ```
69//! If dependencies in your project are out of date, update with
70//! ```bash
71//! cargo update
72//! ```
73//! Execute the compiled code with
74//! ```bash
75//! cargo run
76//! ```
77//! If you are new to Rust, we suggest to proceed with [learning Rust](https://www.rust-lang.org/learn) via the Rust docs.
78//!
79//! ## Step 3: Use Rust-Bio in your project
80//!
81//! To use Rust-Bio in your Rust project, add the following to your `Cargo.toml`
82//!
83//! ```toml
84//! [dependencies]
85//! bio = "1"
86//! ```
87//!
88//! Now Rust-Bio modules can be used directly in your source code, for example:
89//!
90//! ```rust
91//! use bio::io::fastq;
92//! ```
93//!
94//! ## Example: FM-index and FASTQ
95//!
96//! An example of using `rust-bio`:
97//!
98//! ```no_run
99//! // Import some modules
100//! use bio::alphabets;
101//! use bio::data_structures::bwt::{bwt, less, Occ};
102//! use bio::data_structures::fmindex::{BackwardSearchResult, FMIndex, FMIndexable};
103//! use bio::data_structures::suffix_array::suffix_array;
104//! use bio::io::fastq;
105//! use bio::io::fastq::FastqRead;
106//! use std::io;
107//!
108//! // a given text
109//! let text = b"ACAGCTCGATCGGTA$";
110//! let pattern = b"ATCG";
111//!
112//! // Create an FM-Index for the given text.
113//!
114//! // instantiate an alphabet
115//! let alphabet = alphabets::dna::iupac_alphabet();
116//! // calculate a suffix array
117//! let sa = suffix_array(text);
118//! // calculate the Burrows-Wheeler-transform
119//! let bwt = bwt(text, &sa);
120//! // calculate the vectors less and Occ (occurrences)
121//! let less = less(&bwt, &alphabet);
122//! let occ = Occ::new(&bwt, 3, &alphabet);
123//! // set up FMIndex
124//! let fmindex = FMIndex::new(&bwt, &less, &occ);
125//! // do a backwards search for the pattern
126//! let interval = fmindex.backward_search(pattern.iter());
127//! let mut partial_match_len = 0;
128//! // get the locations where the pattern matched (completely in this case).
129//! let positions = match interval {
130//!     BackwardSearchResult::Complete(saint) => saint.occ(&sa),
131//!     BackwardSearchResult::Partial(saint, l) => {
132//!         partial_match_len = l;
133//!         saint.occ(&sa)
134//!     }
135//!     BackwardSearchResult::Absent => Vec::new(),
136//! };
137//! // Iterate over a FASTQ file, use the alphabet to validate read
138//! // sequences and search for exact matches in the FM-Index.
139//!
140//! // create FASTQ reader
141//! let mut reader = fastq::Reader::new(io::stdin());
142//! let mut record = fastq::Record::new();
143//! let mut partial_match_len = 0;
144//! reader.read(&mut record).expect("Failed to parse record");
145//! while !record.is_empty() {
146//!     let check = record.check();
147//!     if check.is_err() {
148//!         panic!("I got a rubbish record!")
149//!     }
150//!     // obtain sequence
151//!     let seq = record.seq();
152//!     // check, whether seq is in the expected alphabet
153//!     if alphabet.is_word(seq) {
154//!         let interval = fmindex.backward_search(seq.iter());
155//!         // get the positions where seq matched completely
156//!         // or where the maximal matching suffix of seq occurred.
157//!         let positions = match interval {
158//!             BackwardSearchResult::Complete(saint) => saint.occ(&sa),
159//!             BackwardSearchResult::Partial(saint, l) => {
160//!                 partial_match_len = l;
161//!                 saint.occ(&sa)
162//!             }
163//!             BackwardSearchResult::Absent => Vec::new(),
164//!         };
165//!     }
166//!     reader.read(&mut record).expect("Failed to parse record");
167//! }
168//! ```
169//!
170//! Documentation and further examples for each module can be found in the module descriptions below.
171//!
172//!
173//! ## Example: Multithreaded
174//!
175//! ```rust
176//! use bio::alphabets;
177//! use bio::data_structures::bwt::{bwt, less, Occ};
178//! use bio::data_structures::fmindex::{BackwardSearchResult, FMIndex, FMIndexable};
179//! use bio::data_structures::suffix_array::suffix_array;
180//! use std::sync::Arc;
181//! use std::thread;
182//!
183//! let text = b"ACGGATGCTGGATCGGATCGCGCTAGCTA$";
184//! let patterns = vec![b"ACCG", b"TGCT"];
185//!
186//! // Create an FM-Index for a given text.
187//! let alphabet = alphabets::dna::iupac_alphabet();
188//! let sa = suffix_array(text);
189//! let bwt = Arc::new(bwt(text, &sa));
190//! let less = Arc::new(less(bwt.as_ref(), &alphabet));
191//! let occ = Arc::new(Occ::new(bwt.as_ref(), 3, &alphabet));
192//! let fmindex = Arc::new(FMIndex::new(bwt, less, occ));
193//!
194//! // Spawn threads to perform backward searches for each interval
195//! let interval_calculators = patterns
196//!     .into_iter()
197//!     .map(|pattern| {
198//!         let fmindex = fmindex.clone();
199//!         thread::spawn(move || fmindex.backward_search(pattern.iter()))
200//!     })
201//!     .collect::<Vec<_>>();
202//!
203//! // Loop through the results, extracting the positions array for each pattern
204//! for interval_calculator in interval_calculators {
205//!     let positions = match interval_calculator.join().unwrap() {
206//!         BackwardSearchResult::Complete(saint) => saint.occ(&sa),
207//!         _ => Vec::new(),
208//!     };
209//! }
210//! ```
211//!
212//! Documentation and further examples for each module can be found in the module descriptions below.
213//!
214//! # Benchmarks
215//!
216//! Since Rust-Bio is based on a compiled language, similar performance to C/C++ based libraries can be expected. Indeed, we find the pattern matching algorithms of Rust-Bio to perform in the range of the C++ library Seqan:
217//!
218//! | Algorithm | Rust-Bio | Seqan   |
219//! | --------- | -------: | ------: |
220//! | BNDM      | 77ms     | 80ms    |
221//! | Horspool  | 122ms    | 125ms   |
222//! | BOM       | 103ms    | 107ms   |
223//! | Shift-And | 241ms    | 545ms   |
224//!
225//! We measured 10000 iterations of searching pattern `GCGCGTACACACCGCCCG` in the sequence of the hg38 MT chromosome.
226//! Initialization time of each algorithm for the given pattern was included in each iteration. Benchmarks were conducted with *Cargo bench* for Rust-Bio and *Python timeit* for Seqan on an Intel Core i5-3427U CPU.
227//! Benchmarking Seqan from *Python timeit* entails an overhead of 1.46ms for calling a C++ binary. This overhead was subtracted from above Seqan run times.
228//! Note that this benchmark only compares the two libraries to exemplify that Rust-Bio has comparable speed to C++ libraries: all used algorithms have their advantages for specific text and pattern structures and lengths (see [the pattern matching section in the documentation](https://docs.rs/bio/0.28.2/bio/pattern_matching/index.html))./!
229
230#[macro_use]
231extern crate approx;
232
233#[macro_use]
234extern crate custom_derive;
235
236#[macro_use]
237extern crate newtype_derive;
238
239#[macro_use]
240extern crate serde_derive;
241
242#[macro_use]
243extern crate strum_macros;
244
245#[cfg(feature = "phylogeny")]
246#[macro_use]
247extern crate pest_derive;
248
249pub mod alignment;
250pub mod alphabets;
251pub mod data_structures;
252pub mod io;
253pub mod pattern_matching;
254pub mod scores;
255pub mod seq_analysis;
256pub mod stats;
257pub mod utils;
258pub use bio_types;