Expand description
Burrows–Wheeler transform and FM-index with backward search.
References:
- Michael Burrows & David J. Wheeler, “A Block-sorting Lossless Data Compression Algorithm”, Digital SRC Research Report 124, 1994 — the BWT and its LF-mapping inverse.
- Paolo Ferragina & Giovanni Manzini, “Opportunistic Data Structures with
Applications”, FOCS 2000, pp. 390–398 — the FM-index (
Carray +Occrank) and backward search forcount/locate.
§Construction
A unique sentinel $ — modelled here as a rank smaller than every real byte
— is appended to the text. The BWT is then read off the suffix array (built
by crate::string::SuffixArray, reused verbatim): for each suffix position
sa[i], the BWT character is the byte preceding that suffix cyclically,
T[sa[i] − 1], with the position 0 mapping to the sentinel.
T = "banana", T$ = "banana$"
sorted rotations → BWT = "annb$aa" (last column)§FM-index
Two precomputed tables turn the BWT into a searchable index:
C[c]— the number of characters inT$that are strictly smaller thanc; equivalently the index of the first row beginning withc.Occ(c, i)— the number of occurrences ofcinBWT[0..i](a rank query). Stored here as a full|Σ| × (n+1)prefix-sum table forO(1)rank, which isO(n |Σ|)space — appropriate for the byte alphabet and exact, deterministic queries.
The LF-mapping LF(i) = C[BWT[i]] + Occ(BWT[i], i) sends row i to the
row obtained by rotating its first column to the last; iterating it from the
sentinel row reconstructs T right-to-left, which is exactly the inverse
BWT.
§Backward search
Matching a pattern p proceeds from its last character to its first,
maintaining the half-open SA range [lo, hi) of rows whose suffix is
prefixed by the processed pattern tail:
lo ← C[c] + Occ(c, lo)
hi ← C[c] + Occ(c, hi)When the range becomes empty the pattern is absent. FmIndex::count
returns hi − lo; FmIndex::locate maps each row in the final range back
to a text position through the stored suffix array.
Inputs are raw bytes (&[u8]).
Structs§
- FmIndex
- An FM-index over a byte string: BWT,
Carray, andOccrank table, with the suffix array retained forlocate.