lt-fm-index 0.6.0

Fm-index using k-mer lookup table for exact pattern matching
Documentation

LtFmIndex

CI crates.io

lt-fm-index is a library to (1) locate or (2) count the pattern in the large text of nucleotide and amino acid sequences.

Description

  • FmIndex is a data structure for exact pattern matching.
  • LtFmIndex is FmIndex using lookup table, the precalculated count of k-mer occurrences.
    • The lookup table can locate the first k-mer of pattern at once.

Features

  • LtFmIndex is built from Text (Vec<u8>).
  • LtFmIndex have two functions.
    1. count: Count the number of times the Pattern (&[u8]) appears in the Text.
    2. locate: Locate the start index in which the Pattern appears in the Text.
  • Four types of Text are supported.
    • NucleotideOnly: consists of {ACG*}
    • NucleotideWithNoise: consists of {ACGT*}
    • AminoacidOnly: consists of {ACDEFGHIKLMNPQRSTVW*}
    • AminoacidWithNoise: consists of {ACDEFGHIKLMNPQRSTVWY*}
  • The * of each type is treated as a wildcard that can be matched with any characters.
    • For example,
      • If the TextType is NucleotideOnly, LtFmIndex stores the text of ACGTXYZ as ACG****.
      • If the TextType is NucleotideWithNoise, LtFmIndex stores the same text (ACGTXYZ) as ACGT***
      • If the indexed text is ACGT***, the patterns of ACGTXXX, ACGT@@@, and ACGTX@# give the same result.
  • Using fastbwt feature can accelerate the indexing, but needs cmake to build libdivsufsort and cannot be built as WASM.

Examples

1. Use LtFmIndex to count and locate a pattern.

use lt_fm_index::LtFmIndexBuilder;

// (1) Define builder for lt-fm-index
let builder = LtFmIndexBuilder::new()
    .text_type_is_inferred()
    .set_suffix_array_sampling_ratio(2).unwrap()
    .set_lookup_table_kmer_size(4).unwrap();

// (2) Generate lt-fm-index with text
let text = b"CTCCGTACACCTGTTTCGTATCGGANNNN".to_vec();
let lt_fm_index = builder.build(text).unwrap(); // text is consumed

// (3) Match with pattern
let pattern = b"TA".to_vec();
//   - count
let count = lt_fm_index.count(&pattern);
assert_eq!(count, 2);
//   - locate
let locations = lt_fm_index.locate(&pattern);
assert_eq!(locations, vec![5,18]);

2. Save and load LtFmIndex

use lt_fm_index::{LtFmIndex, LtFmIndexBuilder};

// (1) Generate lt-fm-index
let text = b"CTCCGTACACCTGTTTCGTATCGGA".to_vec();
let lt_fm_index_to_save = LtFmIndexBuilder::new().build(text).unwrap();

// (2) Save lt-fm-index to buffer
let mut buffer = Vec::new();
lt_fm_index_to_save.save_to(&mut buffer).unwrap();

// (3) Load lt-fm-index from buffer
let lt_fm_index_loaded = LtFmIndex::load_from(&buffer[..]).unwrap();

assert_eq!(lt_fm_index_to_save, lt_fm_index_loaded);

Repository

https://github.com/baku4/lt-fm-index

Doc

https://docs.rs/lt-fm-index/

Reference

  • Ferragina, P., et al. (2004). An Alphabet-Friendly FM-Index, Springer Berlin Heidelberg: 150-160.
  • Anderson, T. and T. J. Wheeler (2021). An optimized FM-index library for nucleotide and amino acid search, Cold Spring Harbor Laboratory.
  • Wang, Y., X. Li, D. Zang, G. Tan and N. Sun (2018). Accelerating FM-index Search for Genomic Data Processing, ACM.
  • Yuta Mori. libdivsufsort