Crate lt_fm_index
source ·Expand description
LtFmIndex
lt-fm-index
is a library to (1) locate or (2) count the pattern in the large text of nucleotide and amino acid sequences.
Description
- FmIndex is a data structure for exact pattern matching.
LtFmIndex
is FmIndex using lookup table, the precalculated count of k-mer occurrences.- The lookup table can locate the first k-mer of pattern at once.
Features
LtFmIndex
is built fromText
(Vec<u8>
).LtFmIndex
have two functions.count
: Count the number of times thePattern
(&[u8]
) appears in theText
.locate
: Locate the start index in which thePattern
appears in theText
.
- Four types of
Text
are supported.NucleotideOnly
: consists of {ACG*}NucleotideWithNoise
: consists of {ACGT*}AminoacidOnly
: consists of {ACDEFGHIKLMNPQRSTVW*}AminoacidWithNoise
: consists of {ACDEFGHIKLMNPQRSTVWY*}
- The
*
of each type is treated as a wildcard that can be matched with any characters.- For example,
- If the TextType is
NucleotideOnly
,LtFmIndex
stores the text of ACGTXYZ as ACG****. - If the TextType is
NucleotideWithNoise
,LtFmIndex
stores the same text (ACGTXYZ) as ACGT*** - If the indexed text is ACGT***, the patterns of ACGTXXX, ACGT@@@, and ACGTX@# give the same result.
- If the TextType is
- For example,
- Using
fastbwt
feature can accelerate the indexing, but needscmake
to buildlibdivsufsort
and cannot be built as WASM.
Examples
1. Use LtFmIndex
to count and locate a pattern.
use lt_fm_index::LtFmIndexBuilder;
// (1) Define builder for lt-fm-index
let builder = LtFmIndexBuilder::new()
.text_type_is_inferred()
.set_suffix_array_sampling_ratio(2).unwrap()
.set_lookup_table_kmer_size(4).unwrap();
// (2) Generate lt-fm-index with text
let text = b"CTCCGTACACCTGTTTCGTATCGGANNNN".to_vec();
let lt_fm_index = builder.build(text).unwrap(); // text is consumed
// (3) Match with pattern
let pattern = b"TA".to_vec();
// - count
let count = lt_fm_index.count(&pattern);
assert_eq!(count, 2);
// - locate
let locations = lt_fm_index.locate(&pattern);
assert_eq!(locations, vec![5,18]);
2. Save and load LtFmIndex
use lt_fm_index::{LtFmIndex, LtFmIndexBuilder};
// (1) Generate lt-fm-index
let text = b"CTCCGTACACCTGTTTCGTATCGGA".to_vec();
let lt_fm_index_to_save = LtFmIndexBuilder::new().build(text).unwrap();
// (2) Save lt-fm-index to buffer
let mut buffer = Vec::new();
lt_fm_index_to_save.save_to(&mut buffer).unwrap();
// (3) Load lt-fm-index from buffer
let lt_fm_index_loaded = LtFmIndex::load_from(&buffer[..]).unwrap();
assert_eq!(lt_fm_index_to_save, lt_fm_index_loaded);
Repository
https://github.com/baku4/lt-fm-index
Doc
Reference
- Ferragina, P., et al. (2004). An Alphabet-Friendly FM-Index, Springer Berlin Heidelberg: 150-160.
- Anderson, T. and T. J. Wheeler (2021). An optimized FM-index library for nucleotide and amino acid search, Cold Spring Harbor Laboratory.
- Wang, Y., X. Li, D. Zang, G. Tan and N. Sun (2018). Accelerating FM-index Search for Genomic Data Processing, ACM.
- Yuta Mori.
libdivsufsort
Modules
Errors
Structs
The index to (1) count or (2) locate the pattern in the indexed text.
The safe and concise builder for LtFmIndex
Enums
Bwt block size marker
Using the larger block size makes the index small but slightly slow.
Text type marker