Skip to main content

Module fm_index

Module fm_index 

Source
Expand description

Burrows–Wheeler transform and FM-index with backward search.

References:

  • Michael Burrows & David J. Wheeler, “A Block-sorting Lossless Data Compression Algorithm”, Digital SRC Research Report 124, 1994 — the BWT and its LF-mapping inverse.
  • Paolo Ferragina & Giovanni Manzini, “Opportunistic Data Structures with Applications”, FOCS 2000, pp. 390–398 — the FM-index (C array + Occ rank) and backward search for count/locate.

§Construction

A unique sentinel $ — modelled here as a rank smaller than every real byte — is appended to the text. The BWT is then read off the suffix array (built by crate::string::SuffixArray, reused verbatim): for each suffix position sa[i], the BWT character is the byte preceding that suffix cyclically, T[sa[i] − 1], with the position 0 mapping to the sentinel.

T = "banana", T$ = "banana$"
sorted rotations →  BWT = "annb$aa"   (last column)

§FM-index

Two precomputed tables turn the BWT into a searchable index:

  • C[c] — the number of characters in T$ that are strictly smaller than c; equivalently the index of the first row beginning with c.
  • Occ(c, i) — the number of occurrences of c in BWT[0..i] (a rank query). Stored here as a full |Σ| × (n+1) prefix-sum table for O(1) rank, which is O(n |Σ|) space — appropriate for the byte alphabet and exact, deterministic queries.

The LF-mapping LF(i) = C[BWT[i]] + Occ(BWT[i], i) sends row i to the row obtained by rotating its first column to the last; iterating it from the sentinel row reconstructs T right-to-left, which is exactly the inverse BWT.

Matching a pattern p proceeds from its last character to its first, maintaining the half-open SA range [lo, hi) of rows whose suffix is prefixed by the processed pattern tail:

lo ← C[c] + Occ(c, lo)
hi ← C[c] + Occ(c, hi)

When the range becomes empty the pattern is absent. FmIndex::count returns hi − lo; FmIndex::locate maps each row in the final range back to a text position through the stored suffix array.

Inputs are raw bytes (&[u8]).

Structs§

FmIndex
An FM-index over a byte string: BWT, C array, and Occ rank table, with the suffix array retained for locate.