Module z_algorithm

Expand description

Gusfield’s Z-algorithm: the Z-array in linear time, with exact matching.

Reference: Dan Gusfield, “Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology”, Cambridge University Press, 1997, §1.3–1.5. The Z-array (sometimes “Z-function”) and its use for linear-time exact pattern matching via the P $ T construction are due to that exposition; the underlying fundamental preprocessing predates it.

§The Z-array

For a string s of length n, the Z-array z is defined by

z[0] = 0                                    (by convention; see below)
z[i] = length of the longest substring starting at i
       that is also a prefix of s,          for 1 ≤ i < n.

Equivalently z[i] = max { k : s[0..k] == s[i..i+k] }. So if s = "aabxaabxcaabxaabxay" then z[4] = 4 because s[4..8] == "aabx" == s[0..4] but s[8] = 'c' != 'a' = s[4].

§The `z[0]` convention

Position 0 matches the whole prefix trivially (s[0..n] == s[0..n]), so a literal reading would give z[0] = n. That value carries no information and complicates the matching application, so — following Gusfield and the common competitive-programming convention — this implementation sets z[0] = 0. Callers that need the “real” value can use n directly. This choice is fixed and tested (z0_convention_is_zero).

§Linear time via the Z-box

The algorithm maintains the interval [l, r) (the Z-box) that is the right-most match of a prefix discovered so far: s[l..r] == s[0..r-l]. For a new index i:

if i < r, the box already tells us that s[i..r] == s[i-l..r-l], so z[i] ≥ min(z[i-l], r-i) for free; only the part beyond r needs explicit comparisons;
otherwise (i ≥ r) we compare from scratch starting at i.

Every explicit character comparison either fails (ending the current i) or advances r, and r only ever increases, so the total comparison count is O(n).

§Exact pattern matching

To find every occurrence of a pattern p (length m) in a text t (length n), form s = p ++ [sep] ++ t where sep is a byte occurring in neither p nor t, and run the Z-algorithm. An occurrence of p ends-aligns with each index i (in s) for which z[i] ≥ m; the corresponding text offset is i − (m + 1). Because no Z-value may reach across the separator (it equals nothing in p), z[i] is capped at m, so there are no false positives — a property exercised directly by tests.

When no separator byte is available (p and t between them use all 256 byte values) the construction would be unsafe, so z_search instead falls back to scanning a prefix-length array computed on the concatenation without a separator but bounded explicitly at m; this keeps the routine total and panic-free. See z_search.

Inputs are raw bytes (&[u8]).

Functions§

z_array: Compute the Z-array of s in O(n) time using the Z-box method.
z_search: Find every occurrence of pattern in text using the Z-algorithm on the pattern ++ sep ++ text construction, returning start offsets in ascending order.