LtFmIndex
lt-fm-index is a library to (1) locate or (2) count a pattern in a large text of nucleotide or amino acid sequences.
Description
- FmIndex is a data structure for exact pattern matching.
- LtFmIndex is an FmIndex with a lookup table that holds the precalculated counts of k-mer occurrences.
- The lookup table can locate the first k-mer of a pattern at once.
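To make the role of the lookup table concrete, here is a minimal, std-only sketch of a toy FM-index that precomputes the suffix-array range of every k-mer in the text. It is not the crate's implementation: `ToyLtFmIndex` and all of its methods are hypothetical names, the construction is naive and only suitable for small inputs, and the sketch assumes the text contains no zero byte. In this sketch the table covers the k-mer that the backward search would process first, which is the last k characters of the pattern.

```rust
use std::collections::HashMap;

/// Toy FM-index with a k-mer lookup table (hypothetical, naive construction).
struct ToyLtFmIndex {
    bwt: Vec<u8>,                             // Burrows-Wheeler transform of text + sentinel
    c: HashMap<u8, usize>,                    // c[ch]: characters in the text smaller than ch
    k: usize,                                 // k-mer size of the lookup table
    lookup: HashMap<Vec<u8>, (usize, usize)>, // k-mer -> suffix-array range
}

impl ToyLtFmIndex {
    fn new(text: &[u8], k: usize) -> Self {
        // Append a sentinel byte (0) that is assumed to be absent from the text.
        let mut t = text.to_vec();
        t.push(0);
        let n = t.len();
        // Naive suffix array: sort suffix start positions lexicographically.
        let mut sa: Vec<usize> = (0..n).collect();
        sa.sort_by(|&a, &b| t[a..].cmp(&t[b..]));
        // BWT: the character that precedes each sorted suffix.
        let bwt: Vec<u8> = sa.iter().map(|&i| t[(i + n - 1) % n]).collect();
        let mut c = HashMap::new();
        for &ch in &t {
            c.entry(ch)
                .or_insert_with(|| t.iter().filter(|&&x| x < ch).count());
        }
        let mut index = ToyLtFmIndex { bwt, c, k, lookup: HashMap::new() };
        // Precompute the range of every k-mer that occurs in the text.
        if k > 0 && text.len() >= k {
            for kmer in text.windows(k) {
                if !index.lookup.contains_key(kmer) {
                    let range = index.backward_search(kmer, (0, n));
                    index.lookup.insert(kmer.to_vec(), range);
                }
            }
        }
        index
    }

    /// Number of occurrences of `ch` in bwt[..i] (naive scan).
    fn occ(&self, ch: u8, i: usize) -> usize {
        self.bwt[..i].iter().filter(|&&x| x == ch).count()
    }

    /// Narrow `start` by the characters of `pat`, last character first.
    fn backward_search(&self, pat: &[u8], start: (usize, usize)) -> (usize, usize) {
        let (mut lo, mut hi) = start;
        for &ch in pat.iter().rev() {
            let base = match self.c.get(&ch) {
                Some(&b) => b,
                None => return (0, 0), // character absent from the text
            };
            lo = base + self.occ(ch, lo);
            hi = base + self.occ(ch, hi);
            if lo >= hi {
                return (0, 0);
            }
        }
        (lo, hi)
    }

    /// Count occurrences of `pat`, using the table when `pat` has at least k characters.
    fn count(&self, pat: &[u8]) -> usize {
        let n = self.bwt.len();
        let (lo, hi) = if self.k > 0 && pat.len() >= self.k {
            let (head, tail) = pat.split_at(pat.len() - self.k);
            match self.lookup.get(tail) {
                // One table lookup replaces k backward-search steps.
                Some(&range) => self.backward_search(head, range),
                None => (0, 0), // k-mer never occurs, so the pattern cannot occur
            }
        } else {
            self.backward_search(pat, (0, n))
        };
        hi - lo
    }
}
```

The design point is that a pattern of length m needs m - k backward-search steps after the table lookup instead of m, at the cost of storing one range per k-mer.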
Features
- LtFmIndex is built from Text (Vec<u8>).
- LtFmIndex has two functions.
  - count: Count the number of times the Pattern (&[u8]) appears in the Text.
  - locate: Locate the start indices at which the Pattern appears in the Text.
- Four types of Text are supported.
  - NucleotideOnly: consists of {ACG*}
  - NucleotideWithNoise: consists of {ACGT*}
  - AminoacidOnly: consists of {ACDEFGHIKLMNPQRSTVW*}
  - AminoacidWithNoise: consists of {ACDEFGHIKLMNPQRSTVWY*}
- The * of each type is treated as a wildcard that stands for any character not listed in that type's alphabet.
  - For example,
    - If the TextType is NucleotideOnly, LtFmIndex stores the text of ACGTXYZ as ACG****.
    - If the TextType is NucleotideWithNoise, LtFmIndex stores the same text (ACGTXYZ) as ACGT***.
    - If the indexed text is ACGT***, the patterns of ACGTXXX, ACGT@@@, and ACGTX@# give the same result.
- Using the fastbwt feature can accelerate indexing, but it needs cmake to build libdivsufsort and cannot be built as WASM.
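The wildcard behaviour described in the list above can be sketched with a small std-only helper. `normalize` is a hypothetical function written for illustration, not part of the crate's API, and it makes no claim about how the crate encodes text internally. It only reproduces the documented mapping: every character outside the chosen alphabet becomes `*`.

```rust
/// Replace every byte that is not in `alphabet` with the wildcard `*`.
/// Hypothetical helper for illustration only.
fn normalize(text: &[u8], alphabet: &[u8]) -> Vec<u8> {
    text.iter()
        .map(|ch| if alphabet.contains(ch) { *ch } else { b'*' })
        .collect()
}
```

With the alphabet `ACG`, the text `ACGTXYZ` becomes `ACG****`; with `ACGT`, it becomes `ACGT***`. The patterns `ACGTXXX`, `ACGT@@@`, and `ACGTX@#` all normalize to `ACGT***`, which is why they give the same result.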
Examples
1. Use LtFmIndex to count and locate a pattern.
use lt_fm_index::LtFmIndexBuilder;
// (1) Define builder for lt-fm-index
let builder = LtFmIndexBuilder::new()
    .text_type_is_inferred()
    .set_suffix_array_sampling_ratio(2).unwrap() // 2 is an illustrative ratio
    .set_lookup_table_kmer_size(4).unwrap(); // 4 is an illustrative k-mer size
// (2) Generate lt-fm-index with text
let text = b"CTCCGTACACCTGTTTCGTATCGGANNNN".to_vec();
let lt_fm_index = builder.build(text).unwrap(); // text is consumed
// (3) Match with pattern
let pattern = b"TA".to_vec();
// - count
let count = lt_fm_index.count(&pattern);
assert_eq!(count, 2);
// - locate
let locations = lt_fm_index.locate(&pattern);
assert_eq!(locations, vec![5, 18]);
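2. Save and load LtFmIndex
use lt_fm_index::{LtFmIndex, LtFmIndexBuilder};
// (1) Generate lt-fm-index
let text = b"CTCCGTACACCTGTTTCGTATCGGA".to_vec();
let lt_fm_index_to_save = LtFmIndexBuilder::new().build(text).unwrap();
// (2) Save lt-fm-index to buffer
let mut buffer = Vec::new();
lt_fm_index_to_save.save_to(&mut buffer).unwrap();
// (3) Load lt-fm-index from buffer
// (assumes load_from is an associated function of LtFmIndex that takes a reader)
let lt_fm_index_loaded = LtFmIndex::load_from(&buffer[..]).unwrap();
assert_eq!(lt_fm_index_to_save, lt_fm_index_loaded);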
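As a cross-check for the count and locate results in the first example, a naive linear scan gives the same answers without any index. The sketch below uses only the standard library; `naive_locate` and `naive_count` are hypothetical names and are not part of lt-fm-index. It does not apply any wildcard mapping, so it only agrees with the index for patterns made of characters from the alphabet.

```rust
/// Start indices of every (possibly overlapping) occurrence of `pattern` in `text`.
fn naive_locate(text: &[u8], pattern: &[u8]) -> Vec<usize> {
    // An empty pattern is treated as having no occurrences in this sketch.
    if pattern.is_empty() || pattern.len() > text.len() {
        return Vec::new();
    }
    text.windows(pattern.len())
        .enumerate()
        .filter(|(_, window)| *window == pattern)
        .map(|(i, _)| i)
        .collect()
}

/// Number of occurrences of `pattern` in `text`.
fn naive_count(text: &[u8], pattern: &[u8]) -> usize {
    naive_locate(text, pattern).len()
}
```

For the text `CTCCGTACACCTGTTTCGTATCGGANNNN` and the pattern `TA`, the scan finds two occurrences, at indices 5 and 18. A scan like this costs time proportional to the text length for every query, which is the cost the index is designed to avoid.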
Repository
https://github.com/baku4/lt-fm-index
Doc
Reference
- Ferragina, P., et al. (2004). An Alphabet-Friendly FM-Index, Springer Berlin Heidelberg: 150-160.
- Anderson, T. and T. J. Wheeler (2021). An optimized FM-index library for nucleotide and amino acid search, Cold Spring Harbor Laboratory.
- Wang, Y., X. Li, D. Zang, G. Tan and N. Sun (2018). Accelerating FM-index Search for Genomic Data Processing, ACM.
- Yuta Mori. libdivsufsort.