1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
/*! # LT FM-Index
`lt-fm-index` is library for locate and count nucleotide and amino acid sequence string.
`lt-fm-index` use lookup table (LT) in count table
**CAVEAT!** This `crate` is not stable. Functions can be changed without notification.
## Description
- Fm-index is a data structure used for pattern matching.
- `LT` is precalculated count table containing all kmer occurrences.
- With `LT`, you can find the first k-mer pattern at once.
## Features
- `LtFmIndex` is generated from `Text`
- `LtFmIndex` have two functions for `Pattern`
- count: Count the number of times the `Pattern` appears in `Text`.
- locate: Locate the start index in which the `Pattern` appears in `Text`.
- Supports **four** types of text.
- `NucleotideOnly` supports a text with only genetic nucleotide sequence (ACGT).
- `NucleotideWithNoise` supports a text containing non-nucleotide sequence.
- `AminoacidOnly` supports a text with only amino acid sequence.
- `AminoacidWithNoise` supports a text containing non-amino acid sequence.
- The last character of each text type is treated as a wildcard.
- The last characters of each text type are *T*, *_*, *Y* and *_*.
- Wildcard is assigned to all non-supported characters.
- For example, in `NucleotideOnly`, pattern of *ACGTXYZ* can be matched with *ACGTTTT*. Because *X*, *Y* and *Z* are not in *ACG* (nucleotide except *T*). And `lt-fm-index` generated with text of *ACGTXYZ* indexes the text as *ACGTTTT*.
- BWT is stored with rank count tables in every 64 or 128 intervals.
## Examples
### 1. Use `LtFmIndex` to count and locate pattern.
```rust
use lt_fm_index::{FmIndex, LtFmIndexConfig};
// (1) Define configuration for lt-fm-index
let config = LtFmIndexConfig::for_nucleotide()
.with_noise()
.change_kmer_size(4).unwrap()
.change_sampling_ratio(4).unwrap()
.change_bwt_interval_to_128();
// (2) Generate fm-index with text
let text = b"CTCCGTACACCTGTTTCGTATCGGANNNN".to_vec();
let lt_fm_index = config.generate(text).unwrap(); // text is consumed
// (3) Match with pattern
let pattern = b"TA".to_vec();
// - count
let count = lt_fm_index.count(&pattern);
assert_eq!(count, 2);
// - locate
let locations = lt_fm_index.locate(&pattern);
assert_eq!(locations, vec![5,18]);
```
### 2. Write and read `LtFmIndex`
```rust
use lt_fm_index::{LtFmIndexConfig, LtFmIndexAll, IO};
// (1) Generate `FmIndex`
let config = LtFmIndexConfig::for_nucleotide();
let text = b"CTCCGTACACCTGTTTCGTATCGGA".to_vec();
let lt_fm_index = config.generate(text).unwrap(); // text is consumed
// (2) Write fm-index to buffer (or file path)
let mut buffer = Vec::new();
lt_fm_index.write_to(&mut buffer).unwrap();
// (3) Read fm-index from buffer (or file path)
let lt_fm_index_buf = LtFmIndexAll::read_from(&buffer[..]).unwrap();
assert_eq!(lt_fm_index, lt_fm_index_buf);
```
## Repository
[https://github.com/baku4/lt-fm-index](https://github.com/baku4/lt-fm-index)
## Doc
[https://docs.rs/lt-fm-index/](https://docs.rs/lt-fm-index/)
## Reference
- Ferragina, P., et al. (2004). An Alphabet-Friendly FM-Index, Springer Berlin Heidelberg: 150-160.
- Anderson, T. and T. J. Wheeler (2021). An optimized FM-index library for nucleotide and amino acid search, Cold Spring Harbor Laboratory.
- Wang, Y., X. Li, D. Zang, G. Tan and N. Sun (2018). Accelerating FM-index Search for Genomic Data Processing, ACM.
- Yuta Mori. [`libdivsufsort`](https://github.com/y-256/libdivsufsort)
*/
// Error propagation
use anyhow::Result;
use anyhow::bail as error_msg;
// Serialization
use serde::{Serialize, Deserialize, de::DeserializeOwned};
mod structure;
mod proto;
pub mod use_case;
mod io;
mod config;
#[doc(hidden)]
pub mod deprecated;
#[doc(hidden)]
pub mod tests;
/// [Text] to generate [LtFmIndex]
pub type Text = Vec<u8>;
/// [Pattern] to use in [LtFmIndex]
pub type Pattern<'a> = &'a [u8];
/// [FmIndex] can count and locate [Pattern].
pub trait FmIndex {
/// Count the number of times the [Pattern] appears.
fn count(&self, pattern: Pattern) -> u64;
/// Locate the start index in which the [Pattern] appears. The result vector is not sorted.
fn locate(&self, pattern: Pattern) -> Vec<u64>;
}
pub use use_case::LtFmIndexAll;
pub use config::LtFmIndexConfig;
pub use io::IO;