ib_matcher/regex/mod.rs
/*!
This module provides routines for searching strings for matches of a [regular
expression] (aka "regex"). The regex syntax supported by this crate is similar
to other regex engines, but it lacks several features that are not known how to
implement efficiently. This includes, but is not limited to, look-around and
backreferences. In exchange, all regex searches in this crate have worst case
`O(m * n)` time complexity, where `m` is proportional to the size of the regex
and `n` is proportional to the size of the string being searched.

[regular expression]: https://en.wikipedia.org/wiki/Regular_expression

If you just want API documentation, then skip to the [`cp::Regex`] or [`lita::Regex`] type. See also [choosing a matcher](crate#choosing-a-matcher).

Most of the API is the same as [`regex-automata`](https://docs.rs/regex-automata/), the regex engine used by [`regex`](https://docs.rs/regex/).

# Syntax
Supported syntax:
- Traditional regex (same as the `regex` crate)

  See [`ib_matcher::syntax::regex`](crate::syntax::regex) for details.

  The following examples all use this syntax.
- glob: See [`ib_matcher::syntax::glob`](crate::syntax::glob).

# Usage
```sh
$ cargo add ib_matcher --features regex
```

```
use ib_matcher::regex::cp::Regex;

fn main() {
    let re = Regex::new(r"Hello (?<name>\w+)!").unwrap();
    let mut caps = re.create_captures();
    let hay = "Hello Murphy!";
    let Ok(()) = re.captures(hay, &mut caps) else {
        println!("no match!");
        return;
    };
    println!("The name is: {}", &hay[caps.get_group_by_name("name").unwrap()]);
}
```

# Examples

This section provides a few examples, in tutorial style, showing how to
search a haystack with a regex. There are more examples throughout the API
documentation.

Before starting though, it's worth defining a few terms:

* A **regex** is a Rust value whose type is `Regex`. We use `re` as a
variable name for a regex.
* A **pattern** is the string that is used to build a regex. We use `pat` as
a variable name for a pattern.
* A **haystack** is the string that is searched by a regex. We use `hay` as a
variable name for a haystack.

Sometimes the words "regex" and "pattern" are used interchangeably.

General use of regular expressions in this crate proceeds by compiling a
**pattern** into a **regex**, and then using that regex to search, split or
replace parts of a **haystack**.

### Validating a particular date format

This example shows how to confirm whether a haystack, in its entirety, matches
a particular date format:

```rust
use ib_matcher::regex::cp::Regex;

let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
assert!(re.is_match("2010-03-14"));
```

Notice the use of the `^` and `$` anchors. In this crate, every regex search is
run with an implicit `(?s:.)*?` at the beginning of its pattern, which allows
the regex to match anywhere in a haystack. Anchors, as above, can be used to
ensure that the full haystack matches a pattern.
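
The difference between matching the whole haystack and matching anywhere can be sketched without a regex engine. The helpers below are hypothetical and not part of this crate. They hand-code the ASCII-only shape `[0-9]{4}-[0-9]{2}-[0-9]{2}`: `date_at` stands in for one anchored match attempt, `is_full_date` behaves like the `^...$` pattern, and `contains_date` tries every start offset, which is roughly the effect of the implicit `(?s:.)*?` prefix.

```rust
/// Hypothetical helper: does `hay` contain the ASCII date shape
/// `dddd-dd-dd` starting exactly at byte offset `at`?
fn date_at(hay: &[u8], at: usize) -> bool {
    // Template: 'd' stands for an ASCII digit, '-' for a literal dash.
    const SHAPE: &[u8] = b"dddd-dd-dd";
    match hay.get(at..at + SHAPE.len()) {
        Some(window) => window.iter().zip(SHAPE).all(|(&b, &s)| match s {
            b'd' => b.is_ascii_digit(),
            _ => b == s,
        }),
        // Not enough bytes left for a date.
        None => false,
    }
}

/// Like `^...$`: the date must span the entire haystack.
fn is_full_date(hay: &str) -> bool {
    hay.len() == 10 && date_at(hay.as_bytes(), 0)
}

/// Like the unanchored pattern: try every start offset in turn.
fn contains_date(hay: &str) -> bool {
    (0..hay.len()).any(|at| date_at(hay.as_bytes(), at))
}
```

With these definitions, `is_full_date("on 2010-03-14")` is false while `contains_date("on 2010-03-14")` is true, mirroring the anchored and unanchored patterns.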

This crate is also Unicode aware by default, which means that `\d` might match
more than you might expect it to. For example:

```rust
use ib_matcher::regex::cp::Regex;

let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
assert!(re.is_match("𝟚𝟘𝟙𝟘-𝟘𝟛-𝟙𝟜"));
```

To only match an ASCII decimal digit, all of the following are equivalent:

* `[0-9]`
* `(?-u:\d)`
* `[[:digit:]]`
* `[\d&&\p{ascii}]`
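
The same ASCII-versus-Unicode split exists in Rust's standard library, which gives a quick sanity check that needs no regex at all. The analogy is loose: `char::is_numeric` accepts the Unicode categories Nd, Nl and No, whereas a Unicode-aware `\d` covers only Nd. The function name below is made up for illustration.

```rust
/// Hypothetical helper: classify a char as an ASCII digit, a non-ASCII
/// Unicode numeric character, or neither.
fn digit_kind(c: char) -> &'static str {
    if c.is_ascii_digit() {
        // What `[0-9]` or `(?-u:\d)` would accept.
        "ascii"
    } else if c.is_numeric() {
        // Numeric per Unicode (Nd, Nl or No), but outside ASCII.
        "unicode-only"
    } else {
        "not a digit"
    }
}
```

Here `digit_kind('2')` yields `"ascii"`, while `digit_kind('𝟚')` yields `"unicode-only"`.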

### Finding dates in a haystack

In the previous example, we showed how one might validate that a haystack,
in its entirety, corresponded to a particular date format. But what if we wanted
to extract all things that look like dates in a specific format from a haystack?
To do this, we can use an iterator API to find all matches (notice that we've
removed the anchors and switched to looking for ASCII-only digits):

```rust
use ib_matcher::regex::cp::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?";
// 'm' is a 'Match', and 'span()' returns the byte range of the match, which
// we use to slice the matching part out of the haystack.
let dates: Vec<&str> = re.find_iter(hay).map(|m| &hay[m.span()]).collect();
assert_eq!(dates, vec![
    "1865-04-14",
    "1881-07-02",
    "1901-09-06",
    "1963-11-22",
]);
```

### Finding a middle initial

This is a very simple example: a regex that looks for a specific
name but uses a wildcard to match a middle initial. Our pattern serves as
something like a template that will match a particular name with *any* middle
initial.

```rust
use ib_matcher::regex::cp::Regex;

// We use 'unwrap()' here because it would be a bug in our program if the
// pattern failed to compile to a regex. Panicking in the presence of a bug
// is okay.
let re = Regex::new(r"Homer (.)\. Simpson").unwrap();
let mut caps = re.create_captures();
let hay = "Homer J. Simpson";
let Ok(()) = re.captures(hay, &mut caps) else { return };
assert_eq!("J", &hay[caps.get_group(1).unwrap()]);
```

There are a few things worth noticing in this example:

* The `.` is a special pattern meta character that means "match any single
character except for new lines." (More precisely, in this crate, it means
"match any UTF-8 encoding of any Unicode scalar value other than `\n`.")
* We can match an actual `.` literally by escaping it, i.e., `\.`.
* We use Rust's [raw strings] to avoid needing to deal with escape sequences in
both the regex pattern syntax and in Rust's string literal syntax. If we didn't
use raw strings here, we would have had to use `\\.` to match a literal `.`
character. That is, `r"\."` and `"\\."` are equivalent patterns.
* We put our wildcard `.` instruction in parentheses. These parentheses have a
special meaning that says, "make whatever part of the haystack matches within
these parentheses available as a capturing group." After finding a match, we
access this capture group with `caps.get_group(1)`.

[raw strings]: https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals
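
The raw-string point can be checked with plain Rust, no regex engine involved. The helper names are hypothetical. The sketch only shows that both literals produce the same two-character pattern string.

```rust
/// The pattern for a literal dot, written as a raw string.
fn dot_pattern_raw() -> &'static str {
    r"\."
}

/// The same pattern as an ordinary string literal: the backslash has to be
/// escaped once more for Rust itself.
fn dot_pattern_escaped() -> &'static str {
    "\\."
}
```

Both functions return the same `&str`: a backslash followed by a dot.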

Otherwise, we execute a search using `re.captures(hay, &mut caps)` and return
from our function if no match occurred. We then reference the middle initial by
asking for the part of the haystack that matched the capture group indexed at `1`.
(The capture group at index 0 is implicit and always corresponds to the entire
match. In this case, that's `Homer J. Simpson`.)

### Named capture groups

Continuing from our middle initial example above, we can tweak the pattern
slightly to give a name to the group that matches the middle initial:

```rust
use ib_matcher::regex::cp::Regex;

// Note that (?P<middle>.) is a different way to spell the same thing.
let re = Regex::new(r"Homer (?<middle>.)\. Simpson").unwrap();
let mut caps = re.create_captures();
let hay = "Homer J. Simpson";
let Ok(()) = re.captures(hay, &mut caps) else { return };
assert_eq!("J", &hay[caps.get_group_by_name("middle").unwrap()]);
```

Giving a name to a group can be useful when there are multiple groups in
a pattern. It makes the code referring to those groups a bit easier to
understand.

### Anchored search

This example shows how to use [`Input::anchored`] to run an anchored
search, even when the regex pattern itself isn't anchored. An anchored
search guarantees that if a match is found, then the start offset of the
match corresponds to the offset at which the search was started.

```
use ib_matcher::regex::{cp::Regex, Anchored, Input, Match};

let re = Regex::new(r"\bfoo\b")?;
let input = Input::new("xx foo xx").range(3..).anchored(Anchored::Yes);
// The offsets are in terms of the original haystack.
assert_eq!(Some(Match::must(0, 3..6)), re.find(input));

// Notice that no match occurs here, because \b still takes the
// surrounding context into account, even if it means looking back
// before the start of your search.
let hay = "xxfoo xx";
let input = Input::new(hay).range(2..).anchored(Anchored::Yes);
assert_eq!(None, re.find(input));
// Indeed, you cannot achieve the above by simply slicing the
// haystack itself, since the regex engine can't see the
// surrounding context. This is why 'Input' permits setting
// the bounds of a search!
let input = Input::new(&hay[2..]).anchored(Anchored::Yes);
// WRONG!
assert_eq!(Some(Match::must(0, 0..3)), re.find(input));

# Ok::<(), Box<dyn std::error::Error>>(())
```
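
To see why slicing the haystack changes the result, here is a simplified, ASCII-only sketch of a `\b` check. It is not how this crate implements look-around assertions, and `is_word_byte` and `is_word_boundary` are hypothetical names. The point is only that the check reads the byte *before* the search position, which a sliced haystack no longer contains.

```rust
/// ASCII-only approximation of `\w`.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Simplified `\b`: true when exactly one side of offset `at` is a word byte.
/// Positions outside the haystack count as non-word.
/// Assumes `at <= hay.len()`.
fn is_word_boundary(hay: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(hay[at - 1]);
    let after = at < hay.len() && is_word_byte(hay[at]);
    before != after
}
```

For `"xxfoo xx"`, offset 2 sits between `x` and `f`, so `is_word_boundary` is false there. After slicing to `"foo xx"`, offset 0 has nothing before it and the same position now looks like a boundary.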

### Earliest search

This example shows how to use [`Input::earliest`] to run a search that
might stop before finding the typical leftmost match.

```ignore
use ib_matcher::regex::{cp::Regex, Anchored, Input, Match};

let re = Regex::new(r"[a-z]{3}|b")?;
let input = Input::new("abc").earliest(true);
assert_eq!(Some(Match::must(0, 1..2)), re.find(input));

// Note that "earliest" isn't really a match semantic unto itself.
// Instead, it is merely an instruction to whatever regex engine
// gets used internally to quit as soon as it can. For example,
// this regex uses a different search technique, and winds up
// producing a different (but valid) match!
let re = Regex::new(r"abc|b")?;
let input = Input::new("abc").earliest(true);
assert_eq!(Some(Match::must(0, 0..3)), re.find(input));

# Ok::<(), Box<dyn std::error::Error>>(())
```

### Changing the line terminator

This example shows how to enable multi-line mode by default and change
the line terminator to the NUL byte:

```
use ib_matcher::regex::{cp::Regex, util::{syntax, look::LookMatcher}, Match};

let mut lookm = LookMatcher::new();
lookm.set_line_terminator(b'\x00');
let re = Regex::builder()
    .syntax(syntax::Config::new().multi_line(true))
    .configure(Regex::config().look_matcher(lookm))
    .build(r"^foo$")?;
let hay = "\x00foo\x00";
assert_eq!(Some(Match::must(0, 1..4)), re.find(hay));

# Ok::<(), Box<dyn std::error::Error>>(())
```
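
What multi-line `^` and `$` check with a custom terminator can be sketched as two predicates. This is a simplified model with hypothetical names, not this crate's `LookMatcher` implementation, and it ignores CRLF handling.

```rust
/// Multi-line `^`: at the start of the haystack or right after a terminator.
/// Assumes `at <= hay.len()`.
fn is_line_start(hay: &[u8], at: usize, terminator: u8) -> bool {
    at == 0 || hay[at - 1] == terminator
}

/// Multi-line `$`: at the end of the haystack or right before a terminator.
/// Assumes `at <= hay.len()`.
fn is_line_end(hay: &[u8], at: usize, terminator: u8) -> bool {
    at == hay.len() || hay[at] == terminator
}
```

With the haystack `"\x00foo\x00"` and terminator `b'\x00'`, `is_line_start` holds at offset 1 and `is_line_end` holds at offset 4, which is consistent with the `1..4` match above. With the default `b'\n'` terminator, neither holds.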

### Multi-pattern searches with capture groups

One of the more frustrating limitations of `RegexSet` in the `regex` crate
(at the time of writing) is that it doesn't report match positions. With this
crate, multi-pattern support was intentionally designed in from the beginning,
which means it works in all regex engines and even for capture groups as well.

This example shows how to search for matches of multiple regexes, where each
regex uses the same capture group names to parse different key-value formats.

```
use ib_matcher::regex::{cp::Regex, PatternID};

let re = Regex::builder().build_many(&[
    r#"(?m)^(?<key>[[:word:]]+)=(?<val>[[:word:]]+)$"#,
    r#"(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$"#,
    r#"(?m)^(?<key>[[:word:]]+)='(?<val>[^']+)'$"#,
    r#"(?m)^(?<key>[[:word:]]+):\s*(?<val>[[:word:]]+)$"#,
])?;
let hay = r#"
best_album="Blow Your Face Out"
best_quote='"then as it was, then again it will be"'
best_year=1973
best_simpsons_episode: HOMR
"#;
let mut kvs = vec![];
for caps in re.captures_iter(hay) {
    // N.B. One could use capture indices '1' and '2' here
    // as well. Capture indices are local to each pattern.
    // (Just like names are.)
    let key = &hay[caps.get_group_by_name("key").unwrap()];
    let val = &hay[caps.get_group_by_name("val").unwrap()];
    kvs.push((key, val));
}
assert_eq!(kvs, vec![
    ("best_album", "Blow Your Face Out"),
    ("best_quote", "\"then as it was, then again it will be\""),
    ("best_year", "1973"),
    ("best_simpsons_episode", "HOMR"),
]);

# Ok::<(), Box<dyn std::error::Error>>(())
```
*/
#[cfg(feature = "regex-cp")]
pub mod cp;
#[cfg(feature = "regex-lita")]
pub mod lita;
#[cfg(feature = "regex-nfa")]
pub mod nfa;
#[cfg(feature = "regex-lita")]
pub use regex_automata::dfa;
pub mod util;

pub use regex_automata::{
    Anchored, HalfMatch, Input, Match, MatchError, MatchErrorKind, MatchKind,
    PatternID, Span,
};
#[cfg(feature = "alloc")]
pub use regex_automata::{PatternSet, PatternSetInsertError, PatternSetIter};