/*!
The syntax supported in this module is documented below. It is the same as the
syntax of the [`regex`](https://docs.rs/regex/) crate.

See [`ib_matcher::regex`](crate::regex) for regex engines.

Note that the regular expression parser and abstract syntax are exposed in
a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax).

### Matching one character

<pre class="rust">
.             any character except new line (includes new line with s flag)
[0-9]         any ASCII digit
\d            digit (\p{Nd})
\D            not digit
\pX           Unicode character class identified by a one-letter name
\p{Greek}     Unicode character class (general category or script)
\PX           negated Unicode character class identified by a one-letter name
\P{Greek}     negated Unicode character class (general category or script)
</pre>

### Character classes

<pre class="rust">
[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping character class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching a and h only)
[\[\]]        Escaping in character classes (matching [ or ])
[a&&b]        An empty character class matching nothing
</pre>

Any named character class may appear inside a bracketed `[...]` character
class. For example, `[\p{Greek}[:digit:]]` matches any ASCII digit or any
codepoint in the `Greek` script. `[\p{Greek}&&\pL]` matches Greek letters.

<div class="warning">

Escaping:
- `\` escapes the metacharacter that follows it (it cannot escape a normal character).
- `[]]` is valid and matches `]`, but `[[]` is invalid and causes an `unclosed character class` error (because classes are allowed to nest).
- `[-]`, `[a-]` and `[-a]` are valid and can match `-`.
- `[a^]` is valid and can match `^`, but `[^]` is not.
- All other metacharacters are matched literally in `[]`, including `.`, `*`, `|` and `()`.
</div>

Precedence in character classes, from most binding to least:

1. Ranges: `[a-cd]` == `[[a-c]d]`.
2. Union: `[ab&&bc]` == `[[ab]&&[bc]]`.
3. Intersection, difference, symmetric difference. All three have equivalent
precedence, and are evaluated in left-to-right order. For example,
`[\pL--\p{Greek}&&\p{Uppercase}]` == `[[\pL--\p{Greek}]&&\p{Uppercase}]`.
4. Negation: `[^a-z&&b]` == `[^[a-z&&b]]`.

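The set operators and their left-to-right evaluation can be modeled with
ordinary sets. The following is a minimal sketch that uses only the standard
library; `Class`, `Op`, `range` and `eval` are hypothetical names chosen for
illustration and are not part of this crate or of `regex-syntax`.

```rust
use std::collections::BTreeSet;

/// A character class modeled as a plain set of chars (illustration only).
type Class = BTreeSet<char>;

/// Builds the class for an inclusive range such as `a-g`.
fn range(lo: char, hi: char) -> Class {
    (lo..=hi).collect()
}

/// The three binary operators, which share one precedence level.
enum Op {
    Intersect, // &&
    Subtract,  // --
    SymDiff,   // ~~
}

/// Folds the operands from left to right, as the precedence rules describe.
fn eval(first: Class, rest: Vec<(Op, Class)>) -> Class {
    rest.into_iter().fold(first, |acc, (op, rhs)| match op {
        Op::Intersect => acc.intersection(&rhs).copied().collect(),
        Op::Subtract => acc.difference(&rhs).copied().collect(),
        Op::SymDiff => acc.symmetric_difference(&rhs).copied().collect(),
    })
}
```

Under this model, `[a-f--c-d&&d-f]` folds to the set `{e, f}`, whereas grouping
the right-hand pair first would give `{a, b, c, e, f}`.
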
### Composites

<pre class="rust">
xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)
</pre>

This example shows how an alternation works, and what it means to prefer a
branch in the alternation over subsequent branches.

```
use ib_matcher::regex::{cp::Regex, Match};

let haystack = "samwise";
// If 'samwise' comes first in our alternation, then it is
// preferred as a match, even if the regex engine could
// technically detect that 'sam' led to a match earlier.
let re = Regex::new(r"samwise|sam").unwrap();
assert_eq!(re.find(haystack).unwrap(), Match::must(0, 0..7)); // "samwise"
// But if 'sam' comes first, then it will match instead.
// In this case, it is impossible for 'samwise' to match
// because 'sam' is a prefix of it.
let re = Regex::new(r"sam|samwise").unwrap();
assert_eq!(re.find(haystack).unwrap(), Match::must(0, 0..3)); // "sam"
```

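The preference rule can also be stated as a tiny search loop. This is a sketch
for literal branches only, written against the standard library; the function
`leftmost_first` is a hypothetical helper and says nothing about how the regex
engines are actually implemented.

```rust
use std::ops::Range;

/// Leftmost-first search over literal branches: the earliest start position
/// wins, and at that position the first branch that matches wins.
fn leftmost_first(haystack: &str, branches: &[&str]) -> Option<Range<usize>> {
    // Only try offsets that are char boundaries, so slicing never panics.
    let starts = haystack
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(haystack.len()));
    for start in starts {
        for &branch in branches {
            if haystack[start..].starts_with(branch) {
                return Some(start..start + branch.len());
            }
        }
    }
    None
}
```

With the haystack `"samwise"`, the branches `["samwise", "sam"]` give `0..7`
and the branches `["sam", "samwise"]` give `0..3`, mirroring the example above.
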
### Repetitions

<pre class="rust">
x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x
</pre>

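The difference between greedy and lazy repetition can be sketched for the
simplest case: a single repeated character with nothing after it in the
pattern. The helper `repeat_len` below is hypothetical and uses only the
standard library; a real engine also has to account for whatever follows the
repetition.

```rust
/// Counts how many copies of `x` a bounded repetition `x{min,max}` consumes
/// at the start of `haystack`, assuming nothing follows the repetition in
/// the pattern. Returns `None` when fewer than `min` copies are available.
fn repeat_len(haystack: &str, x: char, min: usize, max: usize, greedy: bool) -> Option<usize> {
    let available = haystack.chars().take_while(|&c| c == x).count();
    if available < min {
        return None;
    }
    // Greedy takes as many as allowed; lazy stops as soon as `min` is met.
    Some(if greedy { available.min(max) } else { min })
}
```

On the haystack `"aaaaa"`, this models `a{2,4}` as consuming 4 characters and
`a{2,4}?` as consuming 2. With `min == max` both variants agree, which is why
`x{n}` and `x{n}?` are described identically in the table.
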
### Empty matches

<pre class="rust">
^               the beginning of a haystack (or start-of-line with multi-line mode)
$               the end of a haystack (or end-of-line with multi-line mode)
\A              only the beginning of a haystack (even with multi-line mode enabled)
\z              only the end of a haystack (even with multi-line mode enabled)
\b              a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B              not a Unicode word boundary
\b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
\b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right)
\b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
\b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)
</pre>

The empty regex is valid and matches the empty string. For example, the
empty regex matches `abc` at positions `0`, `1`, `2` and `3`. When using the
top-level [`cp::Regex`](crate::regex::cp::Regex) on `&str` haystacks, an empty
match that splits a codepoint is guaranteed to never be returned. For example:

```rust
use ib_matcher::regex;

let re = regex::cp::Regex::new(r"").unwrap();
let ranges: Vec<_> = re.find_iter("💩").map(|m| m.range()).collect();
assert_eq!(ranges, vec![0..0, 4..4]);
```

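Put differently, the offsets at which an empty match can be reported on a
`&str` are exactly its char boundaries. The sketch below computes them with the
standard library alone; `empty_match_offsets` is a hypothetical helper, not an
API of this crate. It yields `[0, 1, 2, 3]` for `"abc"` and `[0, 4]` for `"💩"`.

```rust
/// Offsets at which an empty match may be reported on a `&str` haystack:
/// every char boundary, including the end of the haystack.
fn empty_match_offsets(haystack: &str) -> Vec<usize> {
    haystack
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(haystack.len()))
        .collect()
}
```
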
Note that an empty regex is distinct from a regex that can never match.
For example, the regex `[a&&b]` is a character class that represents the
intersection of `a` and `b`. That intersection is empty, which means the
character class is empty. Since nothing is in the empty set, `[a&&b]` matches
nothing, not even the empty string.

### Grouping and flags

<pre class="rust">
(exp)          numbered capture group (indexed by opening parenthesis)
(?P&lt;name&gt;exp)  named (also numbered) capture group (names must be alpha-numeric)
(?&lt;name&gt;exp)   named (also numbered) capture group (names must be alpha-numeric)
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)
</pre>

Capture group names may contain only alpha-numeric Unicode codepoints and the
characters `.`, `_`, `[` and `]`. Names must start with either an `_` or
an alphabetic codepoint. Alphabetic codepoints correspond to the `Alphabetic`
Unicode property, while numeric codepoints correspond to the union of the
`Decimal_Number`, `Letter_Number` and `Other_Number` general categories.

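These naming rules can be approximated with the standard library, whose
`char::is_alphabetic` and `char::is_numeric` are documented in terms of the
same Unicode property and general categories. The function
`is_valid_group_name` is a hypothetical sketch; the parser ships its own
Unicode tables, so results may differ for codepoints added in newer Unicode
versions.

```rust
/// Checks a capture group name against the rules stated above, using the
/// standard library's Unicode tables as a stand-in for the parser's own.
fn is_valid_group_name(name: &str) -> bool {
    let mut chars = name.chars();
    // The first codepoint must be `_` or alphabetic (an empty name fails here).
    let first_ok = matches!(chars.next(), Some(c) if c == '_' || c.is_alphabetic());
    // The rest may also be numeric or one of `.`, `_`, `[`, `]`.
    first_ok
        && chars.all(|c| {
            c.is_alphabetic() || c.is_numeric() || matches!(c, '.' | '_' | '[' | ']')
        })
}
```
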
Flags are each a single character. For example, `(?x)` sets the flag `x`
and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at
the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets
the `x` flag and clears the `y` flag.

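The set/clear convention amounts to splitting the flag string at the `-`. The
following sketch shows that split with the standard library only; `split_flags`
is a hypothetical helper that does not check whether each letter is a real flag
and is not how the parser itself is written. For `x-y` it returns `x` as set
and `y` as cleared.

```rust
/// Splits the body of a flag group such as `x-y` into the flags it sets and
/// the flags it clears. Returns `None` if a second `-` is encountered.
fn split_flags(body: &str) -> Option<(Vec<char>, Vec<char>)> {
    let (mut set, mut cleared) = (Vec::new(), Vec::new());
    let mut negated = false;
    for c in body.chars() {
        match c {
            '-' if negated => return None,
            '-' => negated = true,
            _ if negated => cleared.push(c),
            _ => set.push(c),
        }
    }
    Some((set, cleared))
}
```
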
All flags are disabled by default unless stated otherwise. They are:

<pre class="rust">
i     case-insensitive: letters match both upper and lower case
m     multi-line mode: ^ and $ match begin/end of line
s     allow . to match \n
R     enables CRLF mode: when multi-line mode is enabled, \r\n is used
U     swap the meaning of x* and x*?
u     Unicode support (enabled by default)
x     verbose mode, ignores whitespace and allows line comments (starting with #)
</pre>

Note that in verbose mode, whitespace is ignored everywhere, including within
character classes. To insert whitespace, use its escaped form or a hex literal.
For example, `\ ` or `\x20` for an ASCII space.

Flags can be toggled within a pattern. Here's an example that matches
case-insensitively for the first part but case-sensitively for the second part:

```rust
use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
let m = re.find("AaAaAbbBBBb").unwrap();
assert_eq!(m, Match::must(0, 0..7)); // "AaAaAbb"
```

Notice that the `a+` matches either `a` or `A`, but the `b+` only matches
`b`.

Multi-line mode means `^` and `$` no longer match just at the beginning/end of
the input, but also at the beginning/end of lines:

```
use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?m)^line \d+").unwrap();
let m = re.find("line one\nline 2\n").unwrap();
assert_eq!(m, Match::must(0, 9..15)); // "line 2"
```

Note that `^` matches after new lines, even at the end of input:

```
use ib_matcher::regex::cp::Regex;

let re = Regex::new(r"(?m)^").unwrap();
let m = re.find_iter("test\n").last().unwrap();
assert_eq!((m.start(), m.end()), (5, 5));
```

When both CRLF mode and multi-line mode are enabled, `^` and `$` will
match at either `\r` or `\n`, but never in the middle of a `\r\n`:

```
use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?mR)^foo$").unwrap();
let m = re.find("\r\nfoo\r\n").unwrap();
assert_eq!(m, Match::must(0, 2..5)); // "foo"
```

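The positions where `^` can match in these modes follow a simple rule, sketched
below with the standard library only. The helper `line_starts` is hypothetical;
it models `(?m)^` and, with `crlf` set, `(?mR)^`. For `"test\n"` without CRLF
mode it yields `[0, 5]`, and for `"\r\nfoo\r\n"` with CRLF mode it yields
`[0, 2, 7]`.

```rust
/// Offsets where `(?m)^` can match. With `crlf` set, this models `(?mR)^`.
fn line_starts(haystack: &str, crlf: bool) -> Vec<usize> {
    let bytes = haystack.as_bytes();
    (0..=bytes.len())
        .filter(|&at| {
            if at == 0 {
                return true; // the start of the haystack always qualifies
            }
            match bytes[at - 1] {
                b'\n' => true,
                // A lone `\r` ends a line, but a `\r\n` pair is never split.
                b'\r' if crlf => bytes.get(at) != Some(&b'\n'),
                _ => false,
            }
        })
        .collect()
}
```
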
Unicode mode can also be selectively disabled, although only when the result
*would not* match invalid UTF-8. One good example of this is using an ASCII
word boundary instead of a Unicode word boundary, which might make some regex
searches run faster:

```rust
use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap();
let m = re.find("$$abc$$").unwrap();
assert_eq!(m, Match::must(0, 2..5)); // "abc"
```

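To see why the match above is `2..5`, it helps to list where an ASCII word
boundary holds. The sketch below does so with the standard library; the helpers
`is_ascii_word` and `ascii_word_boundaries` are hypothetical names and model
only the ASCII assertion `(?-u:\b)`.

```rust
/// True for the bytes matched by the ASCII-only `\w`, i.e. `[0-9A-Za-z_]`.
fn is_ascii_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Offsets where an ASCII word boundary holds: exactly one of the two
/// neighbouring bytes is a word byte (the haystack edges count as non-word).
fn ascii_word_boundaries(haystack: &str) -> Vec<usize> {
    let bytes = haystack.as_bytes();
    (0..=bytes.len())
        .filter(|&at| {
            let before = at > 0 && is_ascii_word(bytes[at - 1]);
            let after = at < bytes.len() && is_ascii_word(bytes[at]);
            before != after
        })
        .collect()
}
```

For `"$$abc$$"` the only boundaries are at offsets 2 and 5, so the pattern can
only start at 2 and end at 5.
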
### Escape sequences

Note that this includes all possible escape sequences, even ones that are
documented elsewhere.

<pre class="rust">
\*              literal *, applies to all ASCII except [0-9A-Za-z<>]
\a              bell (\x07)
\f              form feed (\x0C)
\t              horizontal tab
\n              new line
\r              carriage return
\v              vertical tab (\x0B)
\A              matches at the beginning of a haystack
\z              matches at the end of a haystack
\b              word boundary assertion
\B              negated word boundary assertion
\b{start}, \<   start-of-word boundary assertion
\b{end}, \>     end-of-word boundary assertion
\b{start-half}  half of a start-of-word boundary assertion
\b{end-half}    half of an end-of-word boundary assertion
\123            octal character code, up to three digits (when enabled)
\x7F            hex character code (exactly two digits)
\x{10FFFF}      any hex character code corresponding to a Unicode code point
\u007F          hex character code (exactly four digits)
\u{7F}          any hex character code corresponding to a Unicode code point
\U0000007F      hex character code (exactly eight digits)
\U{7F}          any hex character code corresponding to a Unicode code point
\p{Letter}      Unicode character class
\P{Letter}      negated Unicode character class
\d, \s, \w      Perl character class
\D, \S, \W      negated Perl character class
</pre>

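The hex forms all denote a codepoint by its hexadecimal value. As a rough
sketch, that mapping can be reproduced with the standard library; `decode_hex`
is a hypothetical helper, and it assumes that values a Rust `char` cannot hold
(surrogates and anything above `10FFFF`) are rejected.

```rust
/// Decodes the hex digits of an escape such as `\x7F` or `\u{10FFFF}` into
/// a char. Returns `None` if the digits do not parse as a hexadecimal number
/// or if the value is not a Unicode scalar value.
fn decode_hex(digits: &str) -> Option<char> {
    let value = u32::from_str_radix(digits, 16).ok()?;
    char::from_u32(value)
}
```

Here `decode_hex("7F")` is `Some('\x7F')`, while `decode_hex("D800")` and
`decode_hex("110000")` are both `None`.
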
### Perl character classes (Unicode friendly)

These classes are based on the definitions provided in
[UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties):

<pre class="rust">
\d     digit (\p{Nd})
\D     not digit
\s     whitespace (\p{White_Space})
\S     not whitespace
\w     word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W     not word character
</pre>

### ASCII character classes

These classes are based on the definitions provided in
[UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties):

<pre class="rust">
[[:alnum:]]    alphanumeric ([0-9A-Za-z])
[[:alpha:]]    alphabetic ([A-Za-z])
[[:ascii:]]    ASCII ([\x00-\x7F])
[[:blank:]]    blank ([\t ])
[[:cntrl:]]    control ([\x00-\x1F\x7F])
[[:digit:]]    digits ([0-9])
[[:graph:]]    graphical ([!-~])
[[:lower:]]    lower case ([a-z])
[[:print:]]    printable ([ -~])
[[:punct:]]    punctuation ([!-/:-@\[-`{-~])
[[:space:]]    whitespace ([\t\n\v\f\r ])
[[:upper:]]    upper case ([A-Z])
[[:word:]]     word characters ([0-9A-Za-z_])
[[:xdigit:]]   hex digit ([0-9A-Fa-f])
</pre>
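
Several of these classes line up with the `is_ascii_*` predicates of the
standard library, which makes the bracket expressions easy to cross-check. The
sketch below transcribes a few rows of the table; `in_ascii_class` is a
hypothetical helper, not an API of this crate.

```rust
/// Membership tests for a few of the ASCII classes listed above, written
/// directly from the bracket expressions in the table.
fn in_ascii_class(name: &str, c: char) -> Option<bool> {
    Some(match name {
        "blank" => matches!(c, '\t' | ' '),
        "cntrl" => matches!(c, '\x00'..='\x1F' | '\x7F'),
        "graph" => matches!(c, '!'..='~'),
        "print" => matches!(c, ' '..='~'),
        "punct" => matches!(c, '!'..='/' | ':'..='@' | '['..='`' | '{'..='~'),
        "space" => matches!(c, '\t' | '\n' | '\x0B' | '\x0C' | '\r' | ' '),
        "word" => c.is_ascii_alphanumeric() || c == '_',
        _ => return None, // class not covered by this sketch
    })
}
```

For ASCII input, `punct`, `graph` and `cntrl` agree with
`char::is_ascii_punctuation`, `char::is_ascii_graphic` and
`char::is_ascii_control`. `space` differs from `char::is_ascii_whitespace`,
which does not include the vertical tab (`\x0B`).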
*/
pub use regex_syntax::*;

pub mod hir;