Module regex

Available on (crate features syntax-glob or syntax-ev or syntax-regex) and crate feature syntax-regex only.

The syntax supported in this module is documented below. (Same as the regex crate.)

See ib_matcher::regex for regex engines.

Note that the regular expression parser and abstract syntax are exposed in a separate crate, regex-syntax.

§Matching one character

.             any character except new line (includes new line with s flag)
[0-9]         any ASCII digit
\d            digit (\p{Nd})
\D            not digit
\pX           Unicode character class identified by a one-letter name
\p{Greek}     Unicode character class (general category or script)
\PX           Negated Unicode character class identified by a one-letter name
\P{Greek}     negated Unicode character class (general category or script)

§Character classes

[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping character class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
[\[\]]        Escaping in character classes (matching [ or ])
[a&&b]        An empty character class matching nothing

Any named character class may appear inside a bracketed [...] character class. For example, [\p{Greek}[:digit:]] matches any ASCII digit or any codepoint in the Greek script. [\p{Greek}&&\pL] matches Greek letters.
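The set operations in the table above can be modelled with ordinary sets of characters. The following is a minimal sketch using only the standard library, not this crate's parser; the helper names (range, chars, intersection, difference, symmetric_difference) are hypothetical and exist only for illustration.

```rust
use std::collections::BTreeSet;

/// Characters in the inclusive range `lo..=hi`, like `[a-y]`.
fn range(lo: char, hi: char) -> BTreeSet<char> {
    (lo..=hi).collect()
}

/// Characters listed explicitly, like `[xyz]`.
fn chars(list: &str) -> BTreeSet<char> {
    list.chars().collect()
}

/// Models `[a&&b]`: characters present in both sets.
fn intersection(a: &BTreeSet<char>, b: &BTreeSet<char>) -> String {
    a.intersection(b).collect()
}

/// Models `[a--b]`: characters in `a` that are not in `b`.
fn difference(a: &BTreeSet<char>, b: &BTreeSet<char>) -> String {
    a.difference(b).collect()
}

/// Models `[a~~b]`: characters in exactly one of the two sets.
fn symmetric_difference(a: &BTreeSet<char>, b: &BTreeSet<char>) -> String {
    a.symmetric_difference(b).collect()
}
```

Under this model, intersection(&range('a', 'y'), &chars("xyz")) gives "xy", symmetric_difference(&range('a', 'g'), &range('b', 'h')) gives "ah", and intersection(&chars("a"), &chars("b")) gives the empty string, mirroring [a-y&&xyz], [a-g~~b-h] and [a&&b].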

Escaping:

  • \ can escape the following metacharacter (but cannot escape a normal character).
  • []] is valid and matches ], but [[] is invalid and causes an unclosed character class error (because classes are allowed to nest).
  • [-], [a-] and [-a] are valid and can match -.
  • [a^] is valid and can match ^, but [^] is not.
  • All other metacharacters are matched literally in [], including ., *, | and ().

Precedence in character classes, from most binding to least:

  1. Ranges: [a-cd] == [[a-c]d]
  2. Union: [ab&&bc] == [[ab]&&[bc]]
  3. Intersection, difference, symmetric difference. All three have equivalent precedence, and are evaluated in left-to-right order. For example, [\pL--\p{Greek}&&\p{Uppercase}] == [[\pL--\p{Greek}]&&\p{Uppercase}].
  4. Negation: [^a-z&&b] == [^[a-z&&b]].

§Composites

xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)

This example shows how an alternation works, and what it means to prefer a branch in the alternation over subsequent branches.

use ib_matcher::regex::{cp::Regex, Match};

let haystack = "samwise";
// If 'samwise' comes first in our alternation, then it is
// preferred as a match, even if the regex engine could
// technically detect that 'sam' led to a match earlier.
let re = Regex::new(r"samwise|sam").unwrap();
assert_eq!(re.find(haystack).unwrap(), Match::must(0, 0..7)); // "samwise"
// But if 'sam' comes first, then it will match instead.
// In this case, it is impossible for 'samwise' to match
// because 'sam' is a prefix of it.
let re = Regex::new(r"sam|samwise").unwrap();
assert_eq!(re.find(haystack).unwrap(), Match::must(0, 0..3)); // "sam"

§Repetitions

x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x
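The difference between greedy and lazy repetition can be illustrated with a small model. This sketch assumes the repetition x{min,max} is the entire pattern, that x is a single literal character, and that the match is attempted at the start of the haystack; the function repetitions_consumed is hypothetical and uses only the standard library, not a regex engine.

```rust
/// Number of copies of `x` consumed by `x{min,max}` (greedy) or
/// `x{min,max}?` (lazy) at the start of `haystack`, or `None` when fewer
/// than `min` copies are available.
/// Assumption: the repetition is the whole pattern, so nothing after it
/// forces the engine to take more or fewer copies.
fn repetitions_consumed(
    haystack: &str,
    x: char,
    min: usize,
    max: usize,
    greedy: bool,
) -> Option<usize> {
    // Length of the leading run of `x`, capped at `max`.
    let available = haystack.chars().take_while(|&c| c == x).take(max).count();
    if available < min {
        return None;
    }
    // Greedy takes as many as it can; lazy stops at the minimum.
    Some(if greedy { available } else { min })
}
```

For the haystack "aaaaa", the model of a{2,4} consumes 4 copies while a{2,4}? consumes 2. In a larger pattern, whatever follows the repetition can change how many copies end up in the overall match.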

§Empty matches

^               the beginning of a haystack (or start-of-line with multi-line mode)
$               the end of a haystack (or end-of-line with multi-line mode)
\A              only the beginning of a haystack (even with multi-line mode enabled)
\z              only the end of a haystack (even with multi-line mode enabled)
\b              a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B              not a Unicode word boundary
\b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
\b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right)
\b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
\b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)

The empty regex is valid and matches the empty string. For example, the empty regex matches abc at positions 0, 1, 2 and 3. When using the top-level cp::Regex on &str haystacks, an empty match that splits a codepoint is guaranteed to never be returned. For example:

use ib_matcher::regex;

let re = regex::cp::Regex::new(r"").unwrap();
let ranges: Vec<_> = re.find_iter("💩").map(|m| m.range()).collect();
assert_eq!(ranges, vec![0..0, 4..4]);
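The positions reported above coincide with the char boundaries of the haystack. As a standard-library-only sketch of that idea (the helper empty_match_positions is hypothetical and does not call the regex engine):

```rust
/// Byte offsets at which an empty match could be reported without
/// splitting a codepoint: exactly the char boundaries of `haystack`,
/// including offset 0 and `haystack.len()`.
fn empty_match_positions(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .collect()
}
```

For "💩", which is four bytes long in UTF-8, this yields offsets 0 and 4; for "abc" it yields 0, 1, 2 and 3.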

Note that an empty regex is distinct from a regex that can never match. For example, the regex [a&&b] is a character class that represents the intersection of a and b. That intersection is empty, which means the character class is empty. Since nothing is in the empty set, [a&&b] matches nothing, not even the empty string.

§Grouping and flags

(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (names must be alpha-numeric)
(?<name>exp)   named (also numbered) capture group (names must be alpha-numeric)
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)

Capture group names may consist of any sequence of alphanumeric Unicode codepoints, as well as ., _, [ and ]. Names must start with either an _ or an alphabetic codepoint. Alphabetic codepoints correspond to the Alphabetic Unicode property, while numeric codepoints correspond to the union of the Decimal_Number, Letter_Number and Other_Number general categories.
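The naming rule can be approximated with the standard library, since char::is_alphabetic follows the Alphabetic property and char::is_numeric covers the Nd, Nl and No general categories. The function is_valid_capture_name below is a hypothetical sketch of the rule as described here, not the parser's own check, and the Unicode version used by the standard library may differ from the one used by the parser.

```rust
/// Returns true if `name` follows the capture group naming rule described
/// above: it starts with `_` or an alphabetic codepoint, and every later
/// codepoint is alphabetic, numeric, or one of `.`, `_`, `[`, `]`.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    // The first codepoint has a stricter rule; an empty name is rejected.
    match chars.next() {
        Some(first) if first == '_' || first.is_alphabetic() => {}
        _ => return false,
    }
    chars.all(|c| c.is_alphabetic() || c.is_numeric() || matches!(c, '_' | '.' | '[' | ']'))
}
```

Under this sketch, names such as year, _tmp1 and items[0].id are accepted, while 1st, a-b and the empty string are rejected.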

Flags are each a single character. For example, (?x) sets the flag x and (?-x) clears the flag x. Multiple flags can be set or cleared at the same time: (?xy) sets both the x and y flags and (?x-y) sets the x flag and clears the y flag.
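To make the set/clear notation concrete, here is a small sketch that splits the text between (? and ) into the flags being set and the flags being cleared. The function split_flags is hypothetical; it performs no validation and is not how the parser itself is implemented.

```rust
/// Splits a flag string such as "x-y" into (flags set, flags cleared).
/// Everything before the first `-` is set; everything after it is cleared.
/// Assumption: the input is already well formed, so no errors are reported.
fn split_flags(flags: &str) -> (String, String) {
    let mut set = String::new();
    let mut cleared = String::new();
    let mut clearing = false;
    for c in flags.chars() {
        match c {
            '-' => clearing = true,
            _ if clearing => cleared.push(c),
            _ => set.push(c),
        }
    }
    (set, cleared)
}
```

With this helper, "xy" sets x and y and clears nothing, while "x-y" sets x and clears y.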

All flags are by default disabled unless stated otherwise. They are:

i     case-insensitive: letters match both upper and lower case
m     multi-line mode: ^ and $ match begin/end of line
s     allow . to match \n
R     CRLF mode: when multi-line mode is enabled, \r\n is treated as a line terminator
U     swap the meaning of x* and x*?
u     Unicode support (enabled by default)
x     verbose mode: ignores whitespace and allows line comments (starting with `#`)

Note that in verbose mode, whitespace is ignored everywhere, including within character classes. To insert whitespace, use its escaped form or a hex literal. For example, `\ ` (a backslash followed by a space) or \x20 for an ASCII space.

Flags can be toggled within a pattern. Here’s an example that matches case-insensitively for the first part but case-sensitively for the second part:

use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
let m = re.find("AaAaAbbBBBb").unwrap();
assert_eq!(m, Match::must(0, 0..7)); // "AaAaAbb"

Notice that the a+ matches either a or A, but the b+ only matches b.

Multi-line mode means ^ and $ no longer match just at the beginning/end of the input, but also at the beginning/end of lines:

use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?m)^line \d+").unwrap();
let m = re.find("line one\nline 2\n").unwrap();
assert_eq!(m, Match::must(0, 9..15)); // "line 2"

Note that ^ matches after new lines, even at the end of input:

use ib_matcher::regex::cp::Regex;

let re = Regex::new(r"(?m)^").unwrap();
let m = re.find_iter("test\n").last().unwrap();
assert_eq!((m.start(), m.end()), (5, 5));
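The offsets where (?m)^ can match (without CRLF mode) are offset 0 plus every offset that directly follows a \n. A minimal standard-library sketch of that rule, with the hypothetical helper line_starts:

```rust
/// Byte offsets at which a multi-line `^` can match when only `\n`
/// terminates lines: the start of the haystack and the position after
/// every `\n`, including a trailing one.
fn line_starts(haystack: &str) -> Vec<usize> {
    let mut starts = vec![0];
    starts.extend(haystack.match_indices('\n').map(|(i, _)| i + 1));
    starts
}
```

For "test\n" this gives offsets 0 and 5, which is why the last match in the example above is the empty span at 5.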

When both CRLF mode and multi-line mode are enabled, ^ and $ treat both \r and \n as line terminators, but never match in the middle of a \r\n:

use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?mR)^foo$").unwrap();
let m = re.find("\r\nfoo\r\n").unwrap();
assert_eq!(m, Match::must(0, 2..5)); // "foo"
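One way to read the CRLF rule is as a pair of position predicates. The sketch below is an assumption about how the rule plays out on bytes, written with the standard library only; crlf_line_starts and crlf_line_ends are hypothetical helpers, not functions of this crate.

```rust
/// Offsets where `^` can match with both `m` and `R` enabled: the start of
/// the haystack, after any `\n`, or after a `\r` that is not followed by `\n`.
fn crlf_line_starts(haystack: &str) -> Vec<usize> {
    let b = haystack.as_bytes();
    (0..=b.len())
        .filter(|&at| {
            at == 0
                || b[at - 1] == b'\n'
                || (b[at - 1] == b'\r' && (at == b.len() || b[at] != b'\n'))
        })
        .collect()
}

/// Offsets where `$` can match with both `m` and `R` enabled: the end of
/// the haystack, before any `\r`, or before a `\n` that is not preceded by `\r`.
fn crlf_line_ends(haystack: &str) -> Vec<usize> {
    let b = haystack.as_bytes();
    (0..=b.len())
        .filter(|&at| {
            at == b.len()
                || b[at] == b'\r'
                || (b[at] == b'\n' && (at == 0 || b[at - 1] != b'\r'))
        })
        .collect()
}
```

For "\r\nfoo\r\n" the start offsets are 0, 2 and 7 and the end offsets are 0, 5 and 7. Offsets 1 and 6, which sit between \r and \n, appear in neither list, consistent with foo matching at 2..5.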

Unicode mode can also be selectively disabled, although only when the result would not match invalid UTF-8. One good example of this is using an ASCII word boundary instead of a Unicode word boundary, which might make some regex searches run faster:

use ib_matcher::regex::{cp::Regex, Match};

let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap();
let m = re.find("$$abc$$").unwrap();
assert_eq!(m, Match::must(0, 2..5)); // "abc"
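An ASCII word boundary only needs to inspect the bytes on either side of a position, which is part of why it can be cheaper than the Unicode version. The following sketch computes ASCII word boundary offsets with the standard library; is_ascii_word_byte and ascii_word_boundaries are hypothetical helpers based on the [0-9A-Za-z_] definition of an ASCII word character.

```rust
/// True for bytes in `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Offsets where exactly one neighbouring byte is an ASCII word byte.
/// The start and end of the haystack count as "not a word byte".
fn ascii_word_boundaries(haystack: &str) -> Vec<usize> {
    let b = haystack.as_bytes();
    (0..=b.len())
        .filter(|&at| {
            let before = at > 0 && is_ascii_word_byte(b[at - 1]);
            let after = at < b.len() && is_ascii_word_byte(b[at]);
            before != after
        })
        .collect()
}
```

For "$$abc$$" the boundaries are at offsets 2 and 5, which is why the pattern above matches 2..5.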

§Escape sequences

Note that this includes all possible escape sequences, even ones that are documented elsewhere.

\*              literal *, applies to all ASCII except [0-9A-Za-z<>]
\a              bell (\x07)
\f              form feed (\x0C)
\t              horizontal tab
\n              new line
\r              carriage return
\v              vertical tab (\x0B)
\A              matches at the beginning of a haystack
\z              matches at the end of a haystack
\b              word boundary assertion
\B              negated word boundary assertion
\b{start}, \<   start-of-word boundary assertion
\b{end}, \>     end-of-word boundary assertion
\b{start-half}  half of a start-of-word boundary assertion
\b{end-half}    half of an end-of-word boundary assertion
\123            octal character code, up to three digits (when enabled)
\x7F            hex character code (exactly two digits)
\x{10FFFF}      any hex character code corresponding to a Unicode code point
\u007F          hex character code (exactly four digits)
\u{7F}          any hex character code corresponding to a Unicode code point
\U0000007F      hex character code (exactly eight digits)
\U{7F}          any hex character code corresponding to a Unicode code point
\p{Letter}      Unicode character class
\P{Letter}      negated Unicode character class
\d, \s, \w      Perl character class
\D, \S, \W      negated Perl character class
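The hex escape forms differ in how many digits they take, but each names a single code point. As an illustration of what "corresponding to a Unicode code point" means in Rust terms, the hypothetical helper decode_hex_escape below converts the digit portion of an escape to a char using only the standard library. It assumes the same validity rule as Rust's char (values above 10FFFF and surrogates are rejected); how the parser reports such cases is not covered here.

```rust
/// Converts the hex digits of an escape (for example "7F" from `\x7F` or
/// "10FFFF" from `\x{10FFFF}`) to the character they name.
/// Returns `None` for non-hex input or for values that are not valid
/// Rust `char`s (above 0x10FFFF, or in the surrogate range).
fn decode_hex_escape(digits: &str) -> Option<char> {
    let value = u32::from_str_radix(digits, 16).ok()?;
    char::from_u32(value)
}
```

Here "7F" decodes to the DEL character and "10FFFF" to the highest code point, while "110000" and "D800" produce None.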

§Perl character classes (Unicode friendly)

These classes are based on the definitions provided in UTS#18:

\d     digit (\p{Nd})
\D     not digit
\s     whitespace (\p{White_Space})
\S     not whitespace
\w     word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W     not word character

§ASCII character classes

These classes are based on the definitions provided in UTS#18:

[[:alnum:]]    alphanumeric ([0-9A-Za-z])
[[:alpha:]]    alphabetic ([A-Za-z])
[[:ascii:]]    ASCII ([\x00-\x7F])
[[:blank:]]    blank ([\t ])
[[:cntrl:]]    control ([\x00-\x1F\x7F])
[[:digit:]]    digits ([0-9])
[[:graph:]]    graphical ([!-~])
[[:lower:]]    lower case ([a-z])
[[:print:]]    printable ([ -~])
[[:punct:]]    punctuation ([!-/:-@\[-`{-~])
[[:space:]]    whitespace ([\t\n\v\f\r ])
[[:upper:]]    upper case ([A-Z])
[[:word:]]     word characters ([0-9A-Za-z_])
[[:xdigit:]]   hex digit ([0-9A-Fa-f])
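Most of these classes line up with ASCII predicates on Rust's char. The sketch below maps each class name to an equivalent standard-library check; in_ascii_class is a hypothetical helper and not part of this module. Note that char::is_ascii_whitespace does not include vertical tab (\x0B), so [[:space:]] is spelled out by hand.

```rust
/// Tests `c` against the ASCII class called `name` (for example "alpha"
/// for `[[:alpha:]]`). Returns `None` for an unknown class name.
fn in_ascii_class(name: &str, c: char) -> Option<bool> {
    Some(match name {
        "alnum" => c.is_ascii_alphanumeric(),
        "alpha" => c.is_ascii_alphabetic(),
        "ascii" => c.is_ascii(),
        "blank" => c == '\t' || c == ' ',
        "cntrl" => c.is_ascii_control(),
        "digit" => c.is_ascii_digit(),
        "graph" => c.is_ascii_graphic(),
        "lower" => c.is_ascii_lowercase(),
        "print" => (' '..='~').contains(&c),
        "punct" => c.is_ascii_punctuation(),
        // Spelled out because is_ascii_whitespace omits \x0B (vertical tab).
        "space" => matches!(c, '\t' | '\n' | '\x0B' | '\x0C' | '\r' | ' '),
        "upper" => c.is_ascii_uppercase(),
        "word" => c.is_ascii_alphanumeric() || c == '_',
        "xdigit" => c.is_ascii_hexdigit(),
        _ => return None,
    })
}
```

For instance, a space is in "print" but not in "graph", and an underscore is in both "punct" and "word".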

Modules§

ast
Defines an abstract syntax for regular expressions.
hir
utf8
Converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes.

Structs§

Parser
A convenience parser for regular expressions.
ParserBuilder
A builder for a regular expression parser.
UnicodeWordError
An error that occurs when the Unicode-aware \w class is unavailable.

Enums§

Error
This error type encompasses any error that can be returned by this crate.

Functions§

escape
Escapes all regular expression meta characters in text.
escape_into
Escapes all meta characters in text and writes the result into buf.
is_escapeable_character
Returns true if the given character can be escaped in a regex.
is_meta_character
Returns true if the given character has significance in a regex.
is_word_byte
Returns true if and only if the given character is an ASCII word character.
is_word_character
Returns true if and only if the given character is a Unicode word character.
parse
A convenience routine for parsing a regex using default options.
try_is_word_character
Returns true if and only if the given character is a Unicode word character.