Crate lt_fm_index

source ·
Expand description

LtFmIndex

CI crates.io

lt-fm-index is a library to (1) locate or (2) count the pattern in the large text of nucleotide and amino acid sequences.

Description

  • FmIndex is a data structure for exact pattern matching.
  • LtFmIndex is FmIndex using lookup table, the precalculated count of k-mer occurrences.
    • The lookup table can locate the first k-mer of pattern at once.

Features

  • LtFmIndex is built from Text (Vec<u8>).
  • LtFmIndex have two functions.
    1. count: Count the number of times the Pattern (&[u8]) appears in the Text.
    2. locate: Locate the start index in which the Pattern appears in the Text.
  • Four types of Text are supported.
    • NucleotideOnly: consists of {ACG*}
    • NucleotideWithNoise: consists of {ACGT*}
    • AminoacidOnly: consists of {ACDEFGHIKLMNPQRSTVW*}
    • AminoacidWithNoise: consists of {ACDEFGHIKLMNPQRSTVWY*}
  • The * of each type is treated as a wildcard that can be matched with any characters.
    • For example,
      • If the TextType is NucleotideOnly, LtFmIndex stores the text of ACGTXYZ as ACG****.
      • If the TextType is NucleotideWithNoise, LtFmIndex stores the same text (ACGTXYZ) as ACGT***
      • If the indexed text is ACGT***, the patterns of ACGTXXX, ACGT@@@, and ACGTX@# give the same result.
  • Using fastbwt feature can accelerate the indexing, but needs cmake to build libdivsufsort and cannot be built as WASM.

Examples

1. Use LtFmIndex to count and locate a pattern.

use lt_fm_index::LtFmIndexBuilder;

// (1) Define builder for lt-fm-index
let builder = LtFmIndexBuilder::new()
    .text_type_is_inferred()
    .set_suffix_array_sampling_ratio(2).unwrap()
    .set_lookup_table_kmer_size(4).unwrap();

// (2) Generate lt-fm-index with text
let text = b"CTCCGTACACCTGTTTCGTATCGGANNNN".to_vec();
let lt_fm_index = builder.build(text).unwrap(); // text is consumed

// (3) Match with pattern
let pattern = b"TA".to_vec();
//   - count
let count = lt_fm_index.count(&pattern);
assert_eq!(count, 2);
//   - locate
let locations = lt_fm_index.locate(&pattern);
assert_eq!(locations, vec![5,18]);

2. Save and load LtFmIndex

use lt_fm_index::{LtFmIndex, LtFmIndexBuilder};

// (1) Generate lt-fm-index
let text = b"CTCCGTACACCTGTTTCGTATCGGA".to_vec();
let lt_fm_index_to_save = LtFmIndexBuilder::new().build(text).unwrap();

// (2) Save lt-fm-index to buffer
let mut buffer = Vec::new();
lt_fm_index_to_save.save_to(&mut buffer).unwrap();

// (3) Load lt-fm-index from buffer
let lt_fm_index_loaded = LtFmIndex::load_from(&buffer[..]).unwrap();

assert_eq!(lt_fm_index_to_save, lt_fm_index_loaded);

Repository

https://github.com/baku4/lt-fm-index

Doc

https://docs.rs/lt-fm-index/

Reference

  • Ferragina, P., et al. (2004). An Alphabet-Friendly FM-Index, Springer Berlin Heidelberg: 150-160.
  • Anderson, T. and T. J. Wheeler (2021). An optimized FM-index library for nucleotide and amino acid search, Cold Spring Harbor Laboratory.
  • Wang, Y., X. Li, D. Zang, G. Tan and N. Sun (2018). Accelerating FM-index Search for Genomic Data Processing, ACM.
  • Yuta Mori. libdivsufsort

Modules

Errors

Structs

The index to (1) count or (2) locate the pattern in the indexed text.
The safe and concise builder for LtFmIndex

Enums

Bwt block size marker Using the larger block size makes the index small but slightly slow.
Text type marker