fm_index/
lib.rs

1//! This crate provides implementations of *FM-Index* and its variants.
2//!
3//! FM-Index, originally proposed by Paolo Ferragina and Giovanni Manzini [^1],
4//! is a compressed full-text index which supports the following queries:
5//!
6//! - `count`: Given a pattern string, counts the number of its occurrences.
7//! - `locate`: Given a pattern string, lists all positions of its occurrences.
8//! - `extract`: Given an integer, gets the character of the text at that
9//!   position.
10//!
11//! `fm-index` crate does not support the third query (extracting a character
12//! from arbitrary position). Instead, it provides backward/forward iterators
13//! that return the text characters starting from a search result.
14//!
15//! # Implementations
16//!
17//! This section describes the implementations of FM-Index and its variants.
18//!
19//! Throughout this documentation, we use the following notations:
20//!
21//! - _n_: the size of a text string
22//! - _σ_: the number of bits needed to represent the characters in the text.
23//!   This can be controlled by [`Text::max_character`].
24//!
25//! ## FM-Index ([`FMIndex`], [`FMIndexWithLocate`])
26//!
27//! This is a standard FM-index suitable for single-text use cases. It offers
28//! a good balance between space efficiency and query performance, making it
29//! a strong default choice for most applications.
30//!
31//! The implementation is based on Ferragina and Manzini's original work [^1].
32//! The suffix array is constructed using the SA-IS algorithm [^3] in _O(n)_ time.
33//!
34//! The data structure consists of the following components:
35//!
36//! - A wavelet matrix ([`vers_vecs::WaveletMatrix`]) that stores the Burrows–Wheeler
37//!   Transform (BWT) of the text. Its space complexity is _O(n log σ)_ bits.
38//! - A `Vec<usize>` of length _O(σ)_, which stores the number of characters
39//!   smaller than a given character in the text.
40//! - (Only in [`FMIndexWithLocate`]) A sampled suffix array of length _O(n / 2^l)_,
41//!   used to determine the positions of pattern occurrences.
42//!
43//! ## Run-Length FM-Index ([`RLFMIndex`], [`RLFMIndexWithLocate`])
44//!
45//! This variant is optimized for highly repetitive texts. It offers better compression
46//! than the standard FM-index when the Burrows–Wheeler Transform contains long runs
47//! of repeated characters. However, it is generally slower in query performance.
48//!
49//! The implementation is based on run-length encoded FM-indexing [^2]. The suffix
50//! array is constructed with the SA-IS algorithm [^3] in _O(n)_ time.
51//!
52//! The data structure consists of the following components:
53//!
54//! - A wavelet matrix ([`vers_vecs::WaveletMatrix`]) that stores the run heads
55//!   of the BWT. The space complexity is _O(r log σ)_ bits, where _r_ is the number
56//!   of runs in the BWT.
57//! - A succinct bit vector that encodes the run lengths of the BWT.
58//! - A second succinct bit vector that encodes the run lengths of the BWT
59//!   when sorted by the lexicographic order of run heads.
60//! - A `Vec<usize>` of length _O(σ)_, which stores the number of characters
61//!   smaller than a given character in the run heads.
62//! - (Only in [`RLFMIndexWithLocate`]) A sampled suffix array of length _O(n / 2^l)_,
63//!   used to determine the positions of pattern occurrences.
64//!
65//! ## FM-Index for Multiple Texts ([`FMIndexMultiPieces`], [`FMIndexMultiPiecesWithLocate`])
66//!
67//! This index is designed for multiple texts (text pieces) separated by a null character (`\0`).
68//! The implementation is based on SXSI [^5].
69//!
70//! Each text piece is assigned a unique ID ([`PieceId`]).
71//! The index supports locating the text piece ID for each search result.
72//! It also supports searching for patterns that are prefixes or suffixes of
73//! individual text pieces.
74//!
75//! The data structure consists of the following components:
76//!
77//! - A wavelet matrix ([`vers_vecs::WaveletMatrix`]) that stores the concatenated
78//!   text pieces, separated by null characters (`\0`). The space complexity is
79//!   _O(n log σ)_ bits.
80//! - A `Vec<usize>` of length _O(σ)_, which stores the number of characters
81//!   smaller than a given character.
82//! - A `Vec<usize>` of length _O(d)_, where _d_ is the number of text
83//!   pieces. This array maps between the suffix array and the text piece IDs.
84//! - (Only in [`FMIndexMultiPiecesWithLocate`]) A sampled suffix array for the text pieces.
85//!   Its length is _O(n / 2^l)_, and it is used to determine the position
86//!   of each pattern occurrence in the text.
87//!
88//! [^1]: Ferragina, P., & Manzini, G. (2000). Opportunistic data structures
89//!     with applications. Proceedings 41st Annual Symposium on Foundations
90//!     of Computer Science, 390–398. <https://doi.org/10.1109/SFCS.2000.892127>
91//!
92//! [^2]: Mäkinen, V., & Navarro, G. (2005). Succinct suffix arrays based on
93//!     run-length encoding. In Combinatorial Pattern Matching: 16th Annual Symposium,
94//!     CPM 2005, Jeju Island, Korea, June 19-22, 2005. Proceedings 16 (pp. 45-56).
95//!     Springer Berlin Heidelberg. <https://doi.org/10.1007/11496656_5>
96//!
97//! [^3]: Nong, G., Zhang, S., & Chan, W. H. (2010). Two efficient algorithms
98//!     for linear time suffix array construction. IEEE transactions on computers,
99//!     60(10), 1471-1484. <https://doi.org/10.1109/tc.2010.188>
100//!
101//! [^4]: Claude F., Navarro G. (2012). The Wavelet Matrix. In:
102//!     Calderón-Benavides L., González-Caro C., Chávez E., Ziviani N. (eds)
103//!     String Processing and Information Retrieval. SPIRE 2012.
104//!
105//! [^4]: Claude, F., & Navarro, G. (2012). The wavelet matrix.
106//!     In International Symposium on String Processing and Information Retrieval (pp. 167-179).
107//!     Berlin, Heidelberg: Springer Berlin Heidelberg.
108//!     <https://doi.org/10.1007/978-3-642-34109-0_18>
109//!
110//! [^5]: Arroyuelo, A., Claude, F., Maneth, S., Mäkinen, V., Navarro, G., Nguyen,
111//!     K., Siren, J., & Välimäki, N. (2011). Fast In-Memory XPath Search over
112//!     Compressed Text and Tree Indexes (No. arXiv:0907.2089).
113//!     arXiv. <https://doi.org/10.48550/arXiv.0907.2089>
114
115#![allow(clippy::len_without_is_empty)]
116#![warn(missing_docs)]
117
118mod backend;
119mod character;
120mod error;
121mod fm_index;
122mod frontend;
123mod heap_size;
124mod multi_pieces;
125mod piece;
126mod rlfmi;
127mod suffix_array;
128#[cfg(test)]
129mod testutil;
130mod text;
131mod util;
132mod wrapper;
133
134pub use character::Character;
135pub use error::Error;
136pub use frontend::{
137    FMIndex, FMIndexMatch, FMIndexMatchWithLocate, FMIndexMultiPieces, FMIndexMultiPiecesMatch,
138    FMIndexMultiPiecesMatchWithLocate, FMIndexMultiPiecesSearch,
139    FMIndexMultiPiecesSearchWithLocate, FMIndexMultiPiecesWithLocate, FMIndexSearch,
140    FMIndexSearchWithLocate, FMIndexWithLocate, Match, MatchWithLocate, MatchWithPieceId,
141    RLFMIndex, RLFMIndexMatch, RLFMIndexMatchWithLocate, RLFMIndexSearch,
142    RLFMIndexSearchWithLocate, RLFMIndexWithLocate, Search, SearchIndex,
143    SearchIndexWithMultiPieces,
144};
145pub use piece::PieceId;
146pub use text::Text;
147
148#[doc = include_str!("../README.md")]
149#[cfg(doctest)]
150pub struct ReadmeDoctests;