/// A state identifier specifically tailored for lazy DFAs.
///
/// A lazy state ID logically represents a pointer to a DFA state. In practice,
/// by limiting the number of DFA states it can address, it reserves some
/// bits of its representation to encode some additional information. That
/// additional information is called a "tag." That tag is used to record
/// whether the state it points to is an unknown, dead, quit, start or match
/// state.
///
/// When implementing a low level search routine with a lazy DFA, it is
/// necessary to query the type of the current state to know what to do:
///
/// * **Unknown** - The state has not yet been computed. The
/// parameters used to get this state ID must be re-passed to
/// [`DFA::next_state`](crate::hybrid::dfa::DFA::next_state), which will never
/// return an unknown state ID.
/// * **Dead** - A dead state only has transitions to itself. It indicates that
/// the search cannot do anything else and should stop with whatever result it
/// has.
/// * **Quit** - A quit state indicates that the automaton could not answer
/// whether a match exists or not. Correct search implementations must return a
/// [`MatchError::quit`](crate::MatchError::quit) when a DFA enters a quit
/// state.
/// * **Start** - A start state is a state in which a search can begin.
/// Lazy DFAs usually have more than one start state. Branching on
/// this isn't required for correctness, but a common optimization is
/// to run a prefilter when a search enters a start state. Note that
/// start states are *not* tagged automatically, and one must enable the
/// [`Config::specialize_start_states`](crate::hybrid::dfa::Config::specialize_start_states)
/// setting for start states to be tagged. The reason for this is
/// that a DFA search loop is usually written to execute a prefilter once it
/// enters a start state. But if there is no prefilter, this handling can be
/// quite disastrous as the DFA may ping-pong between the special handling code
/// and a possible optimized hot path for handling untagged states. When start
/// states aren't specialized, then they are untagged and remain in the hot
/// path.
/// * **Match** - A match state indicates that a match has been found.
/// Depending on the semantics of your search implementation, it may either
/// continue until the end of the haystack or a dead state, or it might quit
/// and return the match immediately.
///
/// As an optimization, the [`is_tagged`](LazyStateID::is_tagged) predicate
/// can be used to determine if a tag exists at all. This is useful to avoid
/// branching on all of the above types for every byte searched.
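///
/// The tagging scheme and the single "is any tag set?" check described above
/// can be sketched as follows. This is only an illustration under stated
/// assumptions: `TaggedId`, the mask constants and the choice of the top five
/// bits of a `u32` are hypothetical names and a hypothetical layout, not
/// necessarily what this crate uses internally.
///
/// ```rust
/// // Hypothetical layout: the five highest bits of a u32 are tag bits and
/// // the remaining low bits address a state.
/// const MASK_UNKNOWN: u32 = 1 << 31;
/// const MASK_DEAD: u32 = 1 << 30;
/// const MASK_QUIT: u32 = 1 << 29;
/// const MASK_START: u32 = 1 << 28;
/// const MASK_MATCH: u32 = 1 << 27;
/// // Largest state index that fits below the tag bits.
/// const MAX_INDEX: u32 = MASK_MATCH - 1;
///
/// #[derive(Clone, Copy, Debug, PartialEq, Eq)]
/// struct TaggedId(u32);
///
/// impl TaggedId {
///     // Tagging sets one bit and leaves the state index untouched.
///     fn to_unknown(self) -> TaggedId { TaggedId(self.0 | MASK_UNKNOWN) }
///     fn to_dead(self) -> TaggedId { TaggedId(self.0 | MASK_DEAD) }
///     fn to_quit(self) -> TaggedId { TaggedId(self.0 | MASK_QUIT) }
///     fn to_start(self) -> TaggedId { TaggedId(self.0 | MASK_START) }
///     fn to_match(self) -> TaggedId { TaggedId(self.0 | MASK_MATCH) }
///
///     // Any set tag bit pushes the raw value above MAX_INDEX, so one
///     // comparison covers all five tags at once.
///     fn is_tagged(self) -> bool { self.0 > MAX_INDEX }
///     fn is_unknown(self) -> bool { (self.0 & MASK_UNKNOWN) != 0 }
///     fn is_dead(self) -> bool { (self.0 & MASK_DEAD) != 0 }
///     fn is_quit(self) -> bool { (self.0 & MASK_QUIT) != 0 }
///     fn is_start(self) -> bool { (self.0 & MASK_START) != 0 }
///     fn is_match(self) -> bool { (self.0 & MASK_MATCH) != 0 }
///
///     // Strips the tag bits to recover the state index.
///     fn index(self) -> u32 { self.0 & MAX_INDEX }
/// }
/// ```
///
/// In this sketch a search loop only needs to evaluate `is_tagged` per byte
/// and can defer the individual predicates to the rare tagged case.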
///
/// # Example
///
/// This example shows how `LazyStateID` can be used to implement a correct
/// search routine with minimal branching. In particular, this search routine
/// implements "leftmost" matching, which means that it doesn't immediately
/// stop once a match is found. Instead, it continues until it reaches a dead
/// state.
///
/// Notice also how a correct search implementation deals with
/// [`CacheError`](crate::hybrid::CacheError)s returned by some of
/// the lazy DFA routines. When a `CacheError` occurs, it returns
/// [`MatchError::gave_up`](crate::MatchError::gave_up).
///
/// ```
/// use regex_automata::{
///     hybrid::dfa::{Cache, DFA},
///     HalfMatch, MatchError, Input,
/// };
///
/// fn find_leftmost_first(
///     dfa: &DFA,
///     cache: &mut Cache,
///     haystack: &[u8],
/// ) -> Result<Option<HalfMatch>, MatchError> {
///     // The start state is determined by inspecting the position and the
///     // initial bytes of the haystack. Note that start states can never
///     // be match states (since DFAs in this crate delay matches by 1
///     // byte), so we don't need to check if the start state is a match.
///     let mut sid = dfa.start_state_forward(
///         cache,
///         &Input::new(haystack),
///     )?;
///     let mut last_match = None;
///     // Walk all the bytes in the haystack. We can quit early if we see
///     // a dead or a quit state. The former means the automaton will
///     // never transition to any other state. The latter means that the
///     // automaton entered a condition in which its search failed.
///     for (i, &b) in haystack.iter().enumerate() {
///         sid = dfa
///             .next_state(cache, sid, b)
///             .map_err(|_| MatchError::gave_up(i))?;
///         if sid.is_tagged() {
///             if sid.is_match() {
///                 last_match = Some(HalfMatch::new(
///                     dfa.match_pattern(cache, sid, 0),
///                     i,
///                 ));
///             } else if sid.is_dead() {
///                 return Ok(last_match);
///             } else if sid.is_quit() {
///                 // It is possible to enter into a quit state after
///                 // observing a match has occurred. In that case, we
///                 // should return the match instead of an error.
///                 if last_match.is_some() {
///                     return Ok(last_match);
///                 }
///                 return Err(MatchError::quit(b, i));
///             }
///             // Implementors may also want to check for start states and
///             // handle them differently for performance reasons. But it is
///             // not necessary for correctness. Note that in order to check
///             // for start states, you'll need to enable the
///             // 'specialize_start_states' config knob, otherwise start
///             // states will not be tagged.
///         }
///     }
///     // Matches are always delayed by 1 byte, so we must explicitly walk
///     // the special "EOI" transition at the end of the search.
///     sid = dfa
///         .next_eoi_state(cache, sid)
///         .map_err(|_| MatchError::gave_up(haystack.len()))?;
///     if sid.is_match() {
///         last_match = Some(HalfMatch::new(
///             dfa.match_pattern(cache, sid, 0),
///             haystack.len(),
///         ));
///     }
///     Ok(last_match)
/// }
///
/// // We use a greedy '+' operator to show how the search doesn't just stop
/// // once a match is detected. It continues extending the match. Using
/// // '[a-z]+?' would also work as expected and stop the search early.
/// // Greediness is built into the automaton.
/// let dfa = DFA::new(r"[a-z]+")?;
/// let mut cache = dfa.create_cache();
/// let haystack = "123 foobar 4567".as_bytes();
/// let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
/// assert_eq!(mat.pattern().as_usize(), 0);
/// assert_eq!(mat.offset(), 10);
///
/// // Here's another example that tests our handling of the special
/// // EOI transition. This will fail to find a match if we don't call
/// // 'next_eoi_state' at the end of the search since the match isn't found
/// // until the final byte in the haystack.
/// let dfa = DFA::new(r"[0-9]{4}")?;
/// let mut cache = dfa.create_cache();
/// let haystack = "123 foobar 4567".as_bytes();
/// let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
/// assert_eq!(mat.pattern().as_usize(), 0);
/// assert_eq!(mat.offset(), 15);
///
/// // And note that our search implementation above automatically works
/// // with multi-DFAs. Namely, `dfa.match_pattern(cache, sid, 0)` selects
/// // the appropriate pattern ID for us.
/// let dfa = DFA::new_many(&[r"[a-z]+", r"[0-9]+"])?;
/// let mut cache = dfa.create_cache();
/// let haystack = "123 foobar 4567".as_bytes();
/// let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
/// assert_eq!(mat.pattern().as_usize(), 1);
/// assert_eq!(mat.offset(), 3);
/// let mat = find_leftmost_first(&dfa, &mut cache, &haystack[3..])?.unwrap();
/// assert_eq!(mat.pattern().as_usize(), 0);
/// assert_eq!(mat.offset(), 7);
/// let mat = find_leftmost_first(&dfa, &mut cache, &haystack[10..])?.unwrap();
/// assert_eq!(mat.pattern().as_usize(), 1);
/// assert_eq!(mat.offset(), 5);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
;
/// This error occurs when a lazy state ID could not be constructed.
///
/// This occurs when given an integer exceeding the maximum lazy state ID
/// value.
///
/// When the `std` feature is enabled, this implements the `Error` trait.
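///
/// As a rough illustration of the failure mode described above, the sketch
/// below shows a checked constructor that rejects integers above a maximum
/// and reports the rejected value through an error type implementing
/// `std::error::Error`. The names `StateIndex`, `StateIndexError` and
/// `MAX_INDEX`, as well as the limit itself, are hypothetical and chosen only
/// for this example.
///
/// ```rust
/// use std::fmt;
///
/// // Hypothetical limit; a real maximum depends on how many bits are
/// // reserved for tags.
/// const MAX_INDEX: u32 = (1 << 27) - 1;
///
/// #[derive(Clone, Copy, Debug, PartialEq, Eq)]
/// struct StateIndex(u32);
///
/// // Carries the integer that could not be represented.
/// #[derive(Clone, Debug, PartialEq, Eq)]
/// struct StateIndexError {
///     attempted: u64,
/// }
///
/// impl fmt::Display for StateIndexError {
///     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
///         write!(
///             f,
///             "state index {} exceeds the maximum of {}",
///             self.attempted, MAX_INDEX,
///         )
///     }
/// }
///
/// impl std::error::Error for StateIndexError {}
///
/// impl StateIndex {
///     // Returns an error instead of silently truncating the integer.
///     fn new(index: usize) -> Result<StateIndex, StateIndexError> {
///         if (index as u64) <= (MAX_INDEX as u64) {
///             Ok(StateIndex(index as u32))
///         } else {
///             Err(StateIndexError { attempted: index as u64 })
///         }
///     }
/// }
/// ```
///
/// Because the error type in this sketch implements `std::error::Error`, it
/// can be boxed or propagated with `?` like any other error.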
pub