Expand description
This crate provides implementations of FM-Index and its variants.
FM-Index, originally proposed by Paolo Ferragina and Giovanni Manzini 1, is a compressed full-text index which supports the following queries:
count
: Given a pattern string, counts the number of its occurrences.locate
: Given a pattern string, lists all positions of its occurrences.extract
: Given an integer, gets the character of the text at that position.
fm-index
crate does not support the third query (extracting a character
from arbitrary position). Instead, it provides backward/forward iterators
that return the text characters starting from a search result.
§Implementations
This section describes the implementations of FM-Index and its variants.
Throughout this documentation, we use the following notations:
- n: the size of a text string
- σ: the number of bits needed to represent the characters in the text.
This can be controlled by
Text::max_character
.
§FM-Index (FMIndex
, FMIndexWithLocate
)
This is a standard FM-index suitable for single-text use cases. It offers a good balance between space efficiency and query performance, making it a strong default choice for most applications.
The implementation is based on Ferragina and Manzini’s original work 1. The suffix array is constructed using the SA-IS algorithm 2 in O(n) time.
The data structure consists of the following components:
- A wavelet matrix (
vers_vecs::WaveletMatrix
) that stores the Burrows–Wheeler Transform (BWT) of the text. Its space complexity is O(n log σ) bits. - A
Vec<usize>
of length O(σ), which stores the number of characters smaller than a given character in the text. - (Only in
FMIndexWithLocate
) A sampled suffix array of length O(n / 2^l), used to determine the positions of pattern occurrences.
§Run-Length FM-Index (RLFMIndex
, RLFMIndexWithLocate
)
This variant is optimized for highly repetitive texts. It offers better compression than the standard FM-index when the Burrows–Wheeler Transform contains long runs of repeated characters. However, it is generally slower in query performance.
The implementation is based on run-length encoded FM-indexing 3. The suffix array is constructed with the SA-IS algorithm 2 in O(n) time.
The data structure consists of the following components:
- A wavelet matrix (
vers_vecs::WaveletMatrix
) that stores the run heads of the BWT. The space complexity is O(r log σ) bits, where r is the number of runs in the BWT. - A succinct bit vector that encodes the run lengths of the BWT.
- A second succinct bit vector that encodes the run lengths of the BWT when sorted by the lexicographic order of run heads.
- A
Vec<usize>
of length O(σ), which stores the number of characters smaller than a given character in the run heads. - (Only in
RLFMIndexWithLocate
) A sampled suffix array of length O(n / 2^l), used to determine the positions of pattern occurrences.
§FM-Index for Multiple Texts (FMIndexMultiPieces
, FMIndexMultiPiecesWithLocate
)
This index is designed for multiple texts (text pieces) separated by a null character (\0
).
The implementation is based on SXSI 4.
Each text piece is assigned a unique ID (PieceId
).
The index supports locating the text piece ID for each search result.
It also supports searching for patterns that are prefixes or suffixes of
individual text pieces.
The data structure consists of the following components:
- A wavelet matrix (
vers_vecs::WaveletMatrix
) that stores the concatenated text pieces, separated by null characters (\0
). The space complexity is O(n log σ) bits. - A
Vec<usize>
of length O(σ), which stores the number of characters smaller than a given character. - A
Vec<usize>
of length O(d), where d is the number of text pieces. This array maps between the suffix array and the text piece IDs. - (Only in
FMIndexMultiPiecesWithLocate
) A sampled suffix array for the text pieces. Its length is O(n / 2^l), and it is used to determine the position of each pattern occurrence in the text.
Ferragina, P., & Manzini, G. (2000). Opportunistic data structures with applications. Proceedings 41st Annual Symposium on Foundations of Computer Science, 390–398. https://doi.org/10.1109/SFCS.2000.892127 ↩
Nong, G., Zhang, S., & Chan, W. H. (2010). Two efficient algorithms for linear time suffix array construction. IEEE transactions on computers, 60(10), 1471-1484. https://doi.org/10.1109/tc.2010.188 ↩
Mäkinen, V., & Navarro, G. (2005). Succinct suffix arrays based on run-length encoding. In Combinatorial Pattern Matching: 16th Annual Symposium, CPM 2005, Jeju Island, Korea, June 19-22, 2005. Proceedings 16 (pp. 45-56). Springer Berlin Heidelberg. https://doi.org/10.1007/11496656_5 ↩
Arroyuelo, A., Claude, F., Maneth, S., Mäkinen, V., Navarro, G., Nguyen, K., Siren, J., & Välimäki, N. (2011). Fast In-Memory XPath Search over Compressed Text and Tree Indexes (No. arXiv:0907.2089). arXiv. https://doi.org/10.48550/arXiv.0907.2089 ↩
Claude, F., & Navarro, G. (2012). The wavelet matrix. In International Symposium on String Processing and Information Retrieval (pp. 167-179). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-34109-0_18 ↩
Structs§
- FMIndex
- FMIndex, count only.
- FMIndex
Match - Match in the text for FMIndex.
- FMIndex
Match With Locate - Match in the text for FMIndex with locate support.
- FMIndex
Multi Pieces - MultiText index, count only.
- FMIndex
Multi Pieces Match - Match in the text for MultiText index.
- FMIndex
Multi Pieces Match With Locate - Match in the text for MultiText index with locate support.
- FMIndex
Multi Pieces Search - Search result for MultiText index, count only.
- FMIndex
Multi Pieces Search With Locate - Search result for MultiText index with locate support.
- FMIndex
Multi Pieces With Locate - MultiText index with locate support.
- FMIndex
Search - Search result for FMIndex, count only.
- FMIndex
Search With Locate - Search result for FMIndex with locate support.
- FMIndex
With Locate - FMIndex with locate support.
- PieceId
- A unique id identifying a single piece (text fragment) in a text.
- RLFM
Index - RLFMIndex, count only.
- RLFM
Index Match - Match in the text for RLFMIndex.
- RLFM
Index Match With Locate - Match in the text for RLFMIndex with locate support.
- RLFM
Index Search - Search result for RLFMIndex, count only.
- RLFM
Index Search With Locate - Search result for RLFMIndex with locate support.
- RLFM
Index With Locate - RLFMIndex with locate support.
- Text
- A text structure used as the target for pattern searching in the index.
Enums§
- Error
- An error that can occur when constructing a search index.
Traits§
- Character
- A character is a type that can be used to store data and to compose a search pattern over this data.
- Match
- A match in the text.
- Match
With Locate - A match in the text that contains its location on the text.
- Match
With Piece Id - A match in the text that contains the ID of the piece where the pattern is found.
- Search
- The result of a search.
- Search
Index - Trait for searching in an index.
- Search
Index With Multi Pieces - Trait for searching in an index that supports multiple texts.