Module regex::bytes

Match regular expressions on arbitrary bytes.

This module provides a nearly identical API to the one found at the top level of this crate. There are two important differences:

  1. Matching is done on &[u8] instead of &str. Additionally, Vec<u8> is used where String would have been used.
  2. Regular expressions are compiled with Unicode support disabled by default. This means that while Unicode regular expressions can only match valid UTF-8, regular expressions in this module can match arbitrary bytes. Unicode support can be selectively enabled via the u flag in regular expressions provided by this sub-module.

Example: match null-terminated strings

This shows how to find all null-terminated strings in a slice of bytes:

use regex::bytes::Regex;

let re = Regex::new(r"(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap())
      .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
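As a cross-check on what the pattern above selects, here is a minimal sketch that uses only the standard library and no regex at all. The function name `null_terminated` is hypothetical and is not part of this crate. It assumes the same rule the pattern expresses: a non-empty run of non-NUL bytes counts only if a NUL byte immediately follows it.

```rust
/// Hypothetical helper (not part of the regex crate): returns every
/// non-empty run of non-NUL bytes that is immediately followed by a NUL.
fn null_terminated(text: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    // Index where the current run of non-NUL bytes begins.
    let mut start = 0;
    for (i, &b) in text.iter().enumerate() {
        if b == 0 {
            // Only keep the run if it contains at least one byte.
            if i > start {
                out.push(&text[start..i]);
            }
            start = i + 1;
        }
    }
    // Any trailing bytes without a terminator are dropped, just as the
    // pattern would not match them.
    out
}
```

Calling `null_terminated(b"foo\x00bar\x00baz\x00")` yields the same three slices as the regex example, while input with no NUL byte yields an empty vector.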

Example: selectively enable Unicode support

This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded string (e.g., to extract a title from a Matroska file):

use std::str;
use regex::bytes::Regex;

let re = Regex::new(r"\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))").unwrap();
let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
let caps = re.captures(text).unwrap();

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes.
assert_eq!((7, 10), caps.pos(1).unwrap());

// If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
let title = str::from_utf8(caps.at(1).unwrap()).unwrap();
assert_eq!("☃", title);

In general, if the Unicode flag is enabled in a capture group and that capture is part of the overall match, then the capture is guaranteed to be valid UTF-8.
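The guarantee can be checked against the bytes from the example above using only the standard library. This is a small sketch, not part of the crate: `decode_span` is a hypothetical helper, and the offsets 7 and 10 are the capture positions asserted in the example.

```rust
use std::str;

// Bytes copied from the example above. Offsets 7..10 hold the UTF-8
// encoding of U+2603 (snowman); the byte at offset 10 (0x80) is a lone
// continuation byte and therefore not valid UTF-8 on its own.
const TEXT: &[u8] =
    b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";

/// Hypothetical helper: decode `text[start..end]` as UTF-8, returning
/// `None` if the span is not valid UTF-8.
fn decode_span(text: &[u8], start: usize, end: usize) -> Option<&str> {
    str::from_utf8(&text[start..end]).ok()
}
```

Here `decode_span(TEXT, 7, 10)` succeeds because the span is exactly what the Unicode-enabled group captured, whereas `decode_span(TEXT, 7, TEXT.len())` returns `None`, which is what a byte-oriented `.*` running to the end of the input would have produced.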

Syntax

The supported syntax is nearly the same as the syntax for Unicode regular expressions, with a few changes that make sense for matching arbitrary bytes:

  1. The u flag is disabled by default, but can be selectively enabled. (The opposite is true for the main Regex type.) Disabling the u flag is said to invoke "ASCII compatible" mode.
  2. In ASCII compatible mode, neither Unicode codepoints nor Unicode character classes are allowed.
  3. In ASCII compatible mode, Perl character classes (\w, \d and \s) revert to their typical ASCII definition. \w maps to [[:word:]], \d maps to [[:digit:]] and \s maps to [[:space:]].
  4. In ASCII compatible mode, word boundaries use the ASCII compatible \w to determine whether a byte is a word byte or not.
  5. Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is a Unicode codepoint that matches its UTF-8 encoding, \xC3\xBF. The same applies to octal notation.
  6. . matches any byte except for \n, rather than any codepoint except for \n. When the s flag is enabled, . matches any byte.
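The ASCII definitions in items 3 to 5 can be spelled out with the standard library alone. The sketch below is illustrative only: all function names are hypothetical, and none of this is how the crate implements matching. It assumes the usual POSIX class contents, [0-9A-Za-z_] for [[:word:]] and [\t\n\v\f\r ] for [[:space:]].

```rust
/// ASCII `\w`, i.e. [[:word:]] = [0-9A-Za-z_].
fn is_ascii_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// ASCII `\d`, i.e. [[:digit:]] = [0-9].
fn is_ascii_digit(b: u8) -> bool {
    b.is_ascii_digit()
}

/// ASCII `\s`, i.e. [[:space:]] = [\t\n\v\f\r ].
/// Note: `u8::is_ascii_whitespace` omits vertical tab (0x0B), so the
/// set is written out explicitly here.
fn is_ascii_space(b: u8) -> bool {
    matches!(b, b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ')
}

/// ASCII word boundary at position `i`: exactly one of the bytes on
/// either side of `i` is a word byte. Positions outside the slice
/// count as non-word.
fn is_ascii_word_boundary(text: &[u8], i: usize) -> bool {
    let before = i > 0 && is_ascii_word(text[i - 1]);
    let after = i < text.len() && is_ascii_word(text[i]);
    before != after
}

/// UTF-8 encoding of a codepoint, which is what a hexadecimal escape
/// stands for in Unicode mode.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For instance, `utf8_bytes('\u{FF}')` returns the two bytes 0xC3 0xBF, whereas in ASCII compatible mode \xFF stands for the single byte 0xFF. Likewise, `is_ascii_word(0xE9)` is false, so a byte above 0x7F never counts as a word byte when deciding ASCII word boundaries.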

Performance

In general, one should expect performance on &[u8] to be roughly similar to performance on &str.

Structs

CaptureNames

An iterator over the names of all possible captures.

Captures

Captures represents a group of captured byte strings for a single match.

FindCaptures

An iterator that yields all non-overlapping capture groups matching a particular regular expression.

FindMatches

An iterator over all non-overlapping matches for a particular string.

NoExpand

NoExpand indicates literal byte string replacement.

Regex

A compiled regular expression for matching arbitrary bytes.

RegexBuilder

A configurable builder for a regular expression.

RegexSet

Match multiple (possibly overlapping) regular expressions in a single scan.

SetMatches

A set of matches returned by a regex set.

SetMatchesIntoIter

An owned iterator over the set of matches from a regex set.

SetMatchesIter

A borrowed iterator over the set of matches from a regex set.

Splits

Yields all substrings delimited by a regular expression match.

SplitsN

Yields at most N substrings delimited by a regular expression match.

SubCaptures

An iterator over capture groups for a particular match of a regular expression.

SubCapturesNamed

An iterator over named capture groups as a tuple with the group name and the value.

SubCapturesPos

An iterator over capture group positions for a particular match of a regular expression.

Traits

Replacer

Replacer describes types that can be used to replace matches in a byte string.