Crate regress


§regress - REGex in Rust with EcmaScript Syntax

This crate provides a regular expression engine which targets EcmaScript (aka JavaScript) regular expression syntax.

§Example: test if a string contains a match

use regress::Regex;
let re = Regex::new(r"\d{4}").unwrap();
let matched = re.find("2020-20-05").is_some();
assert!(matched);

§Example: iterating over matches

Here we use a backreference to find doubled characters:

use regress::Regex;
let re = Regex::new(r"(\w)\1").unwrap();
let text = "Frankly, Miss Piggy, I don't give a hoot!";
for m in re.find_iter(text) {
    println!("{}", &text[m.range()])
}
// Output: ss
// Output: gg
// Output: oo

§Example: using capture groups

Capture groups are available in the Match object produced by a successful match. A capture group is a range of byte indexes into the original string.

use regress::Regex;
let re = Regex::new(r"(\d{4})").unwrap();
let text = "Today is 2020-20-05";
let m = re.find(text).unwrap();
let group = m.group(1).unwrap();
println!("Year: {}", &text[group]);
// Output: Year: 2020
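Because group ranges are byte indexes, they can differ from character positions when the text contains multi-byte UTF-8 characters. The following is a small std-only sketch of that distinction; `byte_to_char_range` is a hypothetical helper written for this illustration and is not part of regress.

```rust
use std::ops::Range;

/// Hypothetical helper: converts a byte range (such as one returned for a
/// capture group) into the equivalent range of character positions.
/// Panics if either end of `bytes` is not on a UTF-8 character boundary.
fn byte_to_char_range(text: &str, bytes: Range<usize>) -> Range<usize> {
    let start = text[..bytes.start].chars().count();
    let end = text[..bytes.end].chars().count();
    start..end
}

/// Slices the original string with a byte range, as the examples above do.
fn slice_by_bytes(text: &str, bytes: Range<usize>) -> &str {
    &text[bytes]
}
```

For "Año 2020", the year occupies bytes 5..9 but characters 4..8, since "ñ" takes two bytes in UTF-8. Slicing the original string needs the byte range.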

§Example: using with Pattern trait (nightly only)

When the pattern feature is enabled on nightly Rust, Regex can be used with standard string methods:

#![feature(pattern)]
use regress::Regex;
let re = Regex::new(r"\d+").unwrap();
let text = "abc123def456";

// Use with str methods
assert_eq!(text.find(&re), Some(3));
assert!(text.contains(&re));
let parts: Vec<&str> = text.split(&re).collect();
assert_eq!(parts, vec!["abc", "def", ""]);

§Example: escaping strings for literal matching

Use the escape function to escape special regex characters in a string:

use regress::{escape, Regex};
let user_input = "How much $ do you have? (in dollars)";
let escaped = escape(user_input);
let re = Regex::new(&escaped).unwrap();
assert!(re.find(user_input).is_some());
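To make the idea of escaping concrete, here is a rough std-only sketch. It assumes the characters that need escaping are the ECMAScript syntax characters `^ $ \ . * + ? ( ) [ ] { } |`. The name `escape_sketch` is hypothetical, and the real escape function may handle a different set of characters.

```rust
/// Hypothetical illustration of regex escaping; not the regress implementation.
/// Prefixes each assumed syntax character with a backslash so that it is
/// treated as a literal when the result is compiled as a pattern.
fn escape_sketch(input: &str) -> String {
    // Assumed set: the ECMAScript "SyntaxCharacter" production.
    const SYNTAX_CHARS: &str = r"^$\.*+?()[]{}|";
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if SYNTAX_CHARS.contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```

Characters outside the assumed set pass through unchanged, so plain text is returned as-is.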

§Supported Syntax

regress targets ES 2018 syntax. You can refer to the many resources about JavaScript regex syntax.

There are some features which have yet to be implemented:

Named character classes like [[:alpha:]]
Unicode property escapes like \p{Sc}

Note that the parser assumes the u (Unicode) flag, as the non-Unicode path is tied to JS’s UCS-2 string encoding and the semantics cannot be usefully expressed in Rust.

§Unicode remarks

regress supports Unicode case folding. For example:

use regress::Regex;
let re = Regex::with_flags("\u{00B5}", "i").unwrap();
assert!(re.find("\u{03BC}").is_some());

Here the U+00B5 (micro sign) was case-insensitively matched against U+03BC (small letter mu).

regress does NOT perform normalization. For example, e-with-acute-accent can be precomposed or decomposed, and these are treated as not equivalent:

use regress::Regex;
let re = Regex::new("\u{00E9}").unwrap();
assert!(re.find("\u{0065}\u{0301}").is_none());

This agrees with JavaScript semantics. Perform any required normalization before regex matching.
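One way to see why the two forms do not match is to list their code points. The sketch below uses only the standard library, and `code_points` is a hypothetical helper for illustration. Actual normalization would need a separate Unicode normalization library, which is not shown here.

```rust
/// Hypothetical helper: lists the Unicode scalar values of `s` as "U+XXXX".
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// The precomposed and decomposed spellings of e-with-acute-accent.
fn precomposed() -> &'static str {
    "\u{00E9}"
}

fn decomposed() -> &'static str {
    "\u{0065}\u{0301}"
}
```

The precomposed form is the single code point U+00E9, while the decomposed form is U+0065 followed by U+0301, so a pattern written with one cannot match text written with the other unless both are normalized first.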

§Ascii matching

regress has an “ASCII mode” which treats each 8-bit quantity as a separate character. This may provide improved performance if you do not need Unicode semantics, because it can avoid decoding UTF-8 and has simpler (ASCII-only) case-folding.

Example:

use regress::Regex;
let re = Regex::with_flags("BC", "i").unwrap();
assert!(re.find_ascii("abcd").is_some());
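The following std-only sketch illustrates what byte-wise matching with ASCII-only case folding means in practice. It is a simplified literal search, not the regress engine, and `find_ascii_ci` is a hypothetical name chosen for this example.

```rust
/// Hypothetical illustration: case-insensitive literal search that treats
/// every byte as a separate character and folds case for ASCII letters only.
/// Returns the byte offset of the first match, if any.
fn find_ascii_ci(haystack: &str, needle: &str) -> Option<usize> {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.is_empty() {
        // An empty needle matches at the start; `windows(0)` would panic.
        return Some(0);
    }
    h.windows(n.len()).position(|w| w.eq_ignore_ascii_case(n))
}
```

Under this scheme "BC" is found in "abcd" at byte offset 1, but the micro sign U+00B5 does not match the small letter mu U+03BC, because no Unicode case folding is applied to their UTF-8 bytes.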

§Comparison to regex crate

regress supports features that regex does not, in particular backreferences and zero-width lookaround assertions. However, the regex crate provides linear-time matching guarantees, while regress does not. This difference is due to the architecture: regex uses finite automata while regress uses “classical backtracking.”

§Comparison to fancy-regex crate

fancy-regex wraps the regex crate and extends it with PCRE-style syntactic features. regress has more complete support for these features: backreferences may be case-insensitive, and lookbehinds may be arbitrary-width.

§Architecture

regress has a parser, intermediate representation, optimizer which acts on the IR, bytecode emitter, and two bytecode interpreters, referred to as “backends”.

The main interpreter is the “classical backtracking” backend, which uses an explicit backtracking stack, similar to JS implementations. There is also the “PikeVM” pseudo-toy backend, which is mainly used for testing and verification.
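As a rough illustration of backtracking over bytecode with an explicit stack, here is a toy interpreter. The instruction set (`Char`, `Split`, `Jmp`, `Match`) and all names are assumptions made up for this sketch and do not reflect regress's actual bytecode or backends.

```rust
/// Toy instruction set, invented for this illustration.
#[derive(Clone, Copy)]
enum Inst {
    Char(char),          // consume one specific character
    Split(usize, usize), // try the first target; on failure resume at the second
    Jmp(usize),          // unconditional jump
    Match,               // report success
}

/// Returns true if `prog` matches a prefix of `input`.
/// This toy does not guard against loops that consume no input.
fn run(prog: &[Inst], input: &str) -> bool {
    let chars: Vec<char> = input.chars().collect();
    // Each entry is a saved (program counter, input position) to resume from.
    let mut stack = vec![(0usize, 0usize)];
    while let Some((mut pc, mut pos)) = stack.pop() {
        loop {
            match prog[pc] {
                Inst::Char(c) => {
                    if pos < chars.len() && chars[pos] == c {
                        pc += 1;
                        pos += 1;
                    } else {
                        break; // this path failed: backtrack
                    }
                }
                Inst::Split(first, second) => {
                    stack.push((second, pos)); // remember the alternative
                    pc = first;
                }
                Inst::Jmp(target) => pc = target,
                Inst::Match => return true,
            }
        }
    }
    false
}

/// Hand-assembled program for `a+b`.
fn a_plus_b() -> Vec<Inst> {
    vec![Inst::Char('a'), Inst::Split(0, 2), Inst::Char('b'), Inst::Match]
}

/// Hand-assembled program for `a*b`.
fn a_star_b() -> Vec<Inst> {
    vec![
        Inst::Split(1, 3),
        Inst::Char('a'),
        Inst::Jmp(0),
        Inst::Char('b'),
        Inst::Match,
    ]
}
```

Each `Split` pushes the untried alternative onto the stack, and a failed path pops the most recent one. Because the same input position can be revisited along many paths, this style of matcher has no linear-time guarantee.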

§Crate features

utf16. When enabled, additional APIs are made available that allow matching text formatted in UTF-16 and UCS-2 (&[u16]) without going through a conversion to and from UTF-8 (&str) first. This is particularly useful when interacting with and/or (re)implementing existing systems that use those encodings, such as JavaScript, Windows, and the JVM.
pattern. When enabled (nightly only), implements the std::str::pattern::Pattern trait for Regex, allowing it to be used with standard string methods like str::find, str::contains, str::split, etc.

Structs§

Error: Represents an error encountered during regex compilation.
Flags: Flags used to control regex parsing. The default flags are case-sensitive, not-multiline, and optimizing.
Groups: An iterator over the capture groups of a Match.
Match: A Match represents a portion of a string which was found to match a Regex.
NamedGroups: An iterator over the named capture groups of a Match.
Regex: A Regex is the compiled version of a pattern.

Functions§

escape: Escapes all special regex characters in a string to make it a literal match.

Type Aliases§

AsciiMatches: An iterator type which yields Matches found in a string, supporting ASCII only.
Matches: An iterator type which yields Matches found in a string.
Range: Range is used to express the extent of a match, as indexes into the input string.
