Match regular expressions on arbitrary bytes.
This module provides a nearly identical API to the one found in the top-level of this crate. There are two important differences:
- Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>` is used where `String` would have been used.
- Unicode support can be disabled even when disabling it would result in matching invalid UTF-8 bytes.
This shows how to find all null-terminated strings in a slice of bytes:
    use regex::bytes::Regex;

    let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
    let text = b"foo\x00bar\x00baz\x00";

    // Extract all of the strings without the null terminator from each match.
    // The unwrap is OK here since a match requires the `cstr` capture to match.
    let cstrs: Vec<&[u8]> = re
        .captures_iter(text)
        .map(|c| c.name("cstr").unwrap().as_bytes())
        .collect();
    assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded string (e.g., to extract a title from a Matroska file):
    use regex::bytes::Regex;
    use std::str;

    let re = Regex::new(
        r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
    ).unwrap();
    let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
    let caps = re.captures(text).unwrap();

    // Notice that despite the `.*` at the end, it will only match valid UTF-8
    // because Unicode mode was enabled with the `u` flag. Without the `u` flag,
    // the `.*` would match the rest of the bytes.
    let mat = caps.get(1).unwrap();
    assert_eq!((7, 10), (mat.start(), mat.end()));

    // If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
    // Index the first capture group; `caps` itself is not a byte slice.
    let title = str::from_utf8(&caps[1]).unwrap();
    assert_eq!("☃", title);
In general, if the Unicode flag is enabled in a capture group and that capture is part of the overall match, then the capture is guaranteed to be valid UTF-8.
The supported syntax is pretty much the same as the syntax for Unicode regular expressions with a few changes that make sense for matching arbitrary bytes:
- The `u` flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in "ASCII compatible" mode.
- In ASCII compatible mode, neither Unicode scalar values nor Unicode character classes are allowed.
- In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`) revert to their typical ASCII definition.
- In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to determine whether a byte is a word byte or not.
- Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the literal byte `\xFF`, while in Unicode mode, `\xFF` is a Unicode codepoint that matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation when enabled.
- In ASCII compatible mode, `.` matches any byte except for `\n`. When the `s` flag is additionally enabled, `.` matches any byte.
In general, one should expect performance on `&[u8]` to be roughly similar to performance on `&str`.
CaptureLocations is a low level representation of the raw offsets of each submatch.
An iterator that yields all non-overlapping capture groups matching a particular regular expression.
An iterator over the names of all possible captures.
Captures represents a group of captured byte strings for a single match.
Match represents a single match of a regex in a haystack.
An iterator over all non-overlapping matches for a particular string.
A compiled regular expression for matching arbitrary bytes.
A configurable builder for a regular expression.
Match multiple (possibly overlapping) regular expressions in a single scan.
A configurable builder for a set of regular expressions.
By-reference adaptor for a `Replacer`.
A set of matches returned by a regex set.
An owned iterator over the set of matches from a regex set.
A borrowed iterator over the set of matches from a regex set.
Yields all substrings delimited by a regular expression match.
Yields at most `N` substrings delimited by a regular expression match.
An iterator that yields all capturing matches in the order in which they appear in the regex.
Replacer describes types that can be used to replace matches in a byte string.