Expand description
Gusfield’s Z-algorithm: the Z-array in linear time, with exact matching.
Reference: Dan Gusfield, “Algorithms on Strings, Trees, and Sequences:
Computer Science and Computational Biology”, Cambridge University Press,
1997, §1.3–1.5. The Z-array (sometimes “Z-function”) and its use for
linear-time exact pattern matching via the P $ T construction are due to
that exposition; the underlying fundamental preprocessing predates it.
§The Z-array
For a string s of length n, the Z-array z is defined by
z[0] = 0 (by convention; see below)
z[i] = length of the longest substring starting at i
that is also a prefix of s, for 1 ≤ i < n.Equivalently z[i] = max { k : s[0..k] == s[i..i+k] }. So if
s = "aabxaabxcaabxaabxay" then z[4] = 4 because s[4..8] == "aabx" == s[0..4] but s[8] = 'c' != 'a' = s[4].
§The z[0] convention
Position 0 matches the whole prefix trivially (s[0..n] == s[0..n]), so a
literal reading would give z[0] = n. That value carries no information and
complicates the matching application, so — following Gusfield and the common
competitive-programming convention — this implementation sets z[0] = 0.
Callers that need the “real” value can use n directly. This choice is fixed
and tested (z0_convention_is_zero).
§Linear time via the Z-box
The algorithm maintains the interval [l, r) (the Z-box) that is the
right-most match of a prefix discovered so far: s[l..r] == s[0..r-l]. For a
new index i:
- if
i < r, the box already tells us thats[i..r] == s[i-l..r-l], soz[i] ≥ min(z[i-l], r-i)for free; only the part beyondrneeds explicit comparisons; - otherwise (
i ≥ r) we compare from scratch starting ati.
Every explicit character comparison either fails (ending the current i) or
advances r, and r only ever increases, so the total comparison count is
O(n).
§Exact pattern matching
To find every occurrence of a pattern p (length m) in a text t
(length n), form s = p ++ [sep] ++ t where sep is a byte occurring in
neither p nor t, and run the Z-algorithm. An occurrence of p ends-aligns
with each index i (in s) for which z[i] ≥ m; the corresponding text
offset is i − (m + 1). Because no Z-value may reach across the separator
(it equals nothing in p), z[i] is capped at m, so there are no false
positives — a property exercised directly by tests.
When no separator byte is available (p and t between them use all 256
byte values) the construction would be unsafe, so z_search instead falls
back to scanning a prefix-length array computed on the concatenation
without a separator but bounded explicitly at m; this keeps the routine
total and panic-free. See z_search.
Inputs are raw bytes (&[u8]).