1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors
//! Flat `u8` transition table DFA for contains matching (`LIKE '%needle%'`).
//!
//! Uses an escape-sentinel strategy: the FSST escape code maps to a sentinel
//! state, and the next literal byte is looked up in a separate byte-level
//! transition table.
//! This is to support needles up to u8::MAX long.
//!
//! ## Construction (needle = `"aba"`, symbols = `[0:"ab", 1:"ba"]`)
//!
//! ### Step 1: KMP (Knuth–Morris–Pratt) byte-level transition table
//!
//! See: <https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm>
//!
//! Build a `(state × byte) → state` table using the KMP failure function.
//! States 0..2 track match progress, state 3 is accept (sticky).
//!
//! ```text
//! Input byte
//! State 'a' 'b' other
//! ───── ──── ──── ─────
//! 0 1 0 0 ← want 'a'
//! 1 1 2 0 ← matched "a", want 'b' (KMP: 'a'→stay at 1)
//! 2 3✓ 0 0 ← matched "ab", want 'a'
//! 3✓ 3✓ 3✓ 3✓ ← accept (sticky)
//! ```
//!
//! ### Step 2: Symbol-level transitions
//!
//! For each `(state, symbol)` pair, simulate feeding the symbol's bytes
//! through the byte table:
//!
//! ```text
//! Symbol 0 = "ab" (2 bytes):
//! state 0 + 'a' → 1, + 'b' → 2 ⟹ sym_trans[0][0] = 2
//! state 1 + 'a' → 1, + 'b' → 2 ⟹ sym_trans[1][0] = 2
//! state 2 + 'a' → 3✓ ⟹ sym_trans[2][0] = 3✓ (accept)
//!
//! Symbol 1 = "ba" (2 bytes):
//! state 0 + 'b' → 0, + 'a' → 1 ⟹ sym_trans[0][1] = 1
//! state 1 + 'b' → 2, + 'a' → 3✓ ⟹ sym_trans[1][1] = 3✓ (accept)
//! state 2 + 'b' → 0, + 'a' → 1 ⟹ sym_trans[2][1] = 1
//! ```
//!
//! ### Step 3: Fused 256-wide table with escape sentinel
//!
//! Merge symbol transitions into a 256-wide table. Code bytes 0–1 use symbol
//! transitions, code 255 (ESCAPE_CODE) maps to the sentinel (4), and
//! unused code bytes default to 0:
//!
//! ```text
//! Code byte
//! State 0("ab") 1("ba") 2..254 255(ESC)
//! ───── ─────── ─────── ────── ────────
//! 0 2 1 0 4(S)
//! 1 2 3✓ 0 4(S)
//! 2 3✓ 1 0 4(S)
//! 3✓ 3✓ 3✓ 3✓ 3✓
//! ```
//!
//! When the scanner sees sentinel (4), it reads the next byte and looks it
//! up in the byte-level escape table (from step 1).
//!
//! TODO(joe): for short needles (≤7 bytes), a branchless escape-folded DFA
//! with hierarchical 4-byte composition is ~2x faster. For needles ≤127 bytes,
//! an escape-folded flat DFA (2N+1 states) avoids the sentinel branch.
//! See commit 7faf9f36f for those implementations.
use Symbol;
use VortexExpect;
use VortexResult;
use vortex_bail;
use build_fused_table;
use build_symbol_transitions;
use kmp_byte_transitions;
/// Flat `u8` transition table DFA for contains matching.
///
/// The escape code maps to a sentinel state; the next literal byte is looked
/// up in a separate byte-level escape table.
pub