Crate fancy_regex

An implementation of regexes, supporting a relatively rich set of features, including backreferences and lookaround. Aims to be compatible with Oniguruma syntax when the relevant flag is set.

It builds on top of the excellent regex crate. If you are not familiar with it, read its documentation first; you may find that you don’t need fancy-regex at all.

If your regex, or parts of it, do not use any special features, matching is delegated to the regex crate, which runs in linear time. If you use “fancy” features such as backreferences or look-around, a backtracking engine has to be used. In that case matching can be slow and, in the worst case, take exponential time because of what is called “catastrophic backtracking”. Whether this happens depends on both the regex and the input.
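To get a feel for why backtracking can blow up, the following is a minimal std-only sketch. It is not the crate's engine. It assumes an illustrative nested repetition shaped like `(a+)+` (a pattern chosen here, not taken from this documentation) and counts the ways a run of `n` characters can be divided between the iterations of the outer repetition. A naive backtracker that must try every division before giving up does work proportional to this count. The function name `count_splits` is hypothetical.

```rust
/// Counts the ways to split a run of `n` characters into one or more
/// non-empty consecutive chunks. Each chunk stands for one iteration of the
/// outer repetition in a pattern shaped like `(a+)+` (illustrative only).
fn count_splits(n: u32) -> u64 {
    if n == 0 {
        // Nothing left to split: exactly one way (stop here).
        return 1;
    }
    // Choose the length of the first chunk, then split the remainder.
    (1..=n).map(|first| count_splits(n - first)).sum()
}
```

The count doubles with every extra character: `count_splits(4)` is 8 and `count_splits(10)` is 512. Even modest inputs can therefore become expensive once an engine is forced to explore every alternative.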

§Usage

The API should feel very similar to the regex crate, and involves compiling a regex and then using it to find matches in text.

§Example: Matching text

An example with backreferences to check if a text consists of two identical words:

use fancy_regex::Regex;

let re = Regex::new(r"^(\w+) (\1)$").unwrap();
let result = re.is_match("foo foo");

assert!(result.is_ok());
let did_match = result.unwrap();
assert!(did_match);

Note that, as in the regex crate, the regex needs anchors such as ^ and $ to match against the entire input text.

§Example: Finding the position of matches

use fancy_regex::Regex;

let re = Regex::new(r"(\d)\1").unwrap();
let result = re.find("foo 22");

assert!(result.is_ok(), "execution was successful");
let match_option = result.unwrap();

assert!(match_option.is_some(), "found a match");
let m = match_option.unwrap();

assert_eq!(m.start(), 4);
assert_eq!(m.end(), 6);
assert_eq!(m.as_str(), "22");

§Example: Capturing groups

use fancy_regex::Regex;

let re = Regex::new(r"(?<!AU)\$(\d+)").unwrap();
let result = re.captures("AU$10, $20");

let captures = result.expect("Error running regex").expect("No match found");
let group = captures.get(1).expect("No group");
assert_eq!(group.as_str(), "20");

§Example: Splitting text

use fancy_regex::Regex;

let re = Regex::new(r"[ \t]+").unwrap();
let target = "a b \t  c\td    e";
let fields: Vec<&str> = re.split(target).map(|x| x.unwrap()).collect();
assert_eq!(fields, vec!["a", "b", "c", "d", "e"]);

let fields: Vec<&str> = re.splitn(target, 3).map(|x| x.unwrap()).collect();
assert_eq!(fields, vec!["a", "b", "c\td    e"]);

§Features

This crate supports several optional features that can be enabled or disabled:

std (enabled by default): Enables standard library support. Disable for no_std environments.
unicode (enabled by default): Enables Unicode support for character classes and word boundaries.
perf (enabled by default): Enables performance optimizations in the underlying regex engine.
variable-lookbehinds (enabled by default): Enables support for variable-length lookbehind assertions (e.g., (?<=a+)). Without this feature, only constant-length lookbehinds are supported. This feature uses reverse DFA matching from the regex-automata crate to efficiently handle variable-length patterns that don’t use backreferences or other fancy features.

§Syntax

The regex syntax is based on the regex crate’s and on Oniguruma, with some additional supported syntax. Where the two conflict, there is a flag to prefer Oniguruma parsing rules. (By default regex crate compatible parsing is used.)

Escapes:

\h : hex digit ([0-9A-Fa-f])
\H : not hex digit ([^0-9A-Fa-f])
\e : escape control character (\x1B)
\K : keep text matched so far out of the overall match (docs)
\G : anchor to where the previous match ended (docs)
\Z : anchor to the end of the text before any trailing newlines
\O : any character including newline
\N : any character except newline
\R : general newline - matches all common line break characters: \n, \v, \f, \r, treating \r\n as an atomic unit

Backreferences:

\1 : match the exact string that the first capture group matched
\2 : backref to the second capture group, etc.
\k<name> : match the exact string that the capture group named name matched
(?P=name) : same as \k<name> for compatibility with Python, etc.
\g<name> : call the subroutine defined in capture group named name
\g<1> : call the subroutine defined in capture group 1. Subroutines can be recursive up to 20 levels deep.

Named capture groups:

(?<name>exp) : match exp, creating capture group named name
(?P<name>exp) : same as (?<name>exp) for compatibility with Python, etc.

Look-around assertions for matching without changing the current position:

(?=exp) : look-ahead, succeeds if exp matches to the right of the current position
(?!exp) : negative look-ahead, succeeds if exp doesn’t match to the right
(?<=exp) : look-behind, succeeds if exp matches to the left of the current position
(?<!exp) : negative look-behind, succeeds if exp doesn’t match to the left

Note: Look-behind assertions with variable length (e.g., (?<=a+)) are supported with the variable-lookbehinds feature (enabled by default). Without this feature, only constant-length look-behinds are supported. Variable-length look-behinds can include word boundaries and other zero-width assertions (e.g., (?<=\ba+)) as long as the rest of the pattern doesn’t use backreferences or other “fancy” features that require backtracking within the lookbehind.

Atomic groups using (?>exp) to prevent backtracking within exp, e.g.:

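use fancy_regex::Regex;

let re = Regex::new(r"^a(?>bc|b)c$").unwrap();
assert!(re.is_match("abcc").unwrap());
// Doesn't match: once `bc` has matched inside the atomic group, the engine
// cannot backtrack into the group to try `|b`, so the final `c` finds nothing.
assert!(!re.is_match("abc").unwrap());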

Conditionals - if/then/else:

(?(1)) : continue only if first capture group matched
(?(<name>)) or (?('name')) : continue only if capture group named name matched
(?(1)true_branch|false_branch) : if the first capture group matched then execute the true_branch regex expression, else execute false_branch (docs)
(?(condition)true_branch|false_branch) : if the condition matches then execute the true_branch regex expression, else execute false_branch from the point just before the condition was evaluated
(?(DEFINE)(capture group)(?<named_group>another)) : define capture groups for later use in subroutine calls

Backtracking control verbs:

(*FAIL) : fail the current backtracking branch

Absent repeater:

(?~abc) : match any text up to the point where abc would match, or up to the end of the haystack if abc never matches

§Subroutines: reusable patterns with stable meaning

§What is a subroutine

Subroutines in fancy-regex are compiled pattern definitions that can be invoked safely and predictably. Any capture group can become a subroutine - it just needs to be “called”.

(?<num>\d*\.\d+|\d+) x \g<num>

In the above example, a capture group called num is defined, to match numbers with or without decimal places. \g<num> executes the capture group again, without the author having to re-type the pattern inside it. The above pattern would match text like 5.2 x 6 for instance.

Think of a subroutine as:

defined by a capture group
executed exactly the way the capture group was originally defined
reusable from multiple places

§Side effects

A subroutine call has one side-effect - it updates the capture group position, which affects backref matching etc.

§Example

Let’s imagine a pattern which will match a digit and capture it into group 1. Then it will call that capture group as a subroutine. Then it will do a backref to group 1.

This will match three consecutive digits. The 2nd and 3rd digits must be identical.

use fancy_regex::{Error, Regex};

let re = Regex::new(r"(\d)\g<1>\1")?;
let result = re.captures("foo 711")?;

let captures = result.unwrap();

let m = captures.get(0).unwrap();

assert_eq!(m.start(), 4);
assert_eq!(m.end(), 7);
assert_eq!(m.as_str(), "711");

let group = captures.get(1).unwrap();
assert_eq!(group.as_str(), "1");

assert!(!re.is_match("foo 717")?);

In the above example, 7 was first stored in capture group 1. The subroutine call then replaced it with 1, so the backreference to group 1 can only match the literal 1.
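The same sequence of events can be traced with a small hand-written model. This is a std-only sketch of the pattern `(\d)\g<1>\1` only, not of how fancy-regex is implemented. The function names `match_at` and `find` and the tuple layout are made up for illustration.

```rust
/// Models `(\d)\g<1>\1` starting at byte offset `pos`.
/// Returns the end of the overall match and the final range of group 1.
fn match_at(input: &str, pos: usize) -> Option<(usize, (usize, usize))> {
    let bytes = input.as_bytes();
    let digit_at = |i: usize| bytes.get(i).filter(|b| b.is_ascii_digit()).copied();

    // `(\d)`: group 1 is set to the first digit.
    digit_at(pos)?;
    let mut group1 = (pos, pos + 1);

    // `\g<1>`: the group body runs again and overwrites group 1.
    digit_at(group1.1)?;
    group1 = (group1.1, group1.1 + 1);

    // `\1`: must repeat whatever group 1 holds *now*.
    let expected = bytes[group1.0];
    if bytes.get(group1.1) != Some(&expected) {
        return None;
    }
    Some((group1.1 + 1, group1))
}

/// Tries every start offset, like an unanchored search.
/// Returns (match start, match end, group 1 range).
fn find(input: &str) -> Option<(usize, usize, (usize, usize))> {
    (0..input.len()).find_map(|start| match_at(input, start).map(|(end, g1)| (start, end, g1)))
}
```

For "foo 711" the model reports a match from 4 to 7 with group 1 covering 5 to 6 (the first `1`), and for "foo 717" it reports no match, in line with the assertions above.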

§Side effect edge cases

Inside a lookbehind, a subroutine call does not update the capture group position if the position currently stored for that capture group lies further to the right in the haystack. In other words, right-most captures take precedence.

§Flags and the common misconception

When subroutines are first introduced, a very common assumption is:

“If I apply flags at the call site, they should affect the subroutine.”

This assumption is reasonable — many regex engines either behave this way or do not specify the behavior clearly. fancy-regex does not do this, and the reason is central to its design.

Let’s look at a concrete example.

§The pattern

\A(?<word>[a-z]+)\s+(?i:\g<word>)

§The input

hello Mr

At first glance, this pattern appears to say:

At the beginning of the input string
Match a word of lowercase letters and capture it as word
Match some whitespace
Call the subroutine word, but case-insensitively

Many users therefore expect this pattern to match the input above.

It does not.

§What actually happens

The key to understanding this behavior is that flags belong to the subroutine definition, not the call site.

Let’s walk through the execution step by step.

§Execution trace

§Step 1: Assert position at the beginning of the input string

Pattern: \A
Active flags: none
Input position: 0 (start of "hello Mr")

§Step 2: Enter subroutine definition `word`

Pattern: [a-z]+
Active flags: none
Input position: 0 (start of "hello Mr")

The engine greedily matches [a-z]+ against "hello". The range 0 - 5 is stored in capture group 1, whose name is word.

§Step 3: Exit subroutine definition `word` and continue matching

Pattern: \s+
Active flags: none
Input position: 5 (at the space after hello)

The engine matches \s+ against " ".

§Step 4: Call subroutine `word`

Pattern: \g<word> -> (?<word>[a-z]+)
Active flags: none - because the capture group definition had no flags active
Input position: 6 (after " " at the 'M' of "Mr")

This fails immediately, because:

The pattern is case-sensitive
The first character is 'M', which does not fall in the range a-z.

§Step 5: No alternatives available

There are no alternations inside word
There are no backtracking points before the failure
The anchor prevents us from trying other starting positions in the input string

The match fails.

§Why the `i` flag did not apply

The i flag appears only at the call site:

(?i:\g<word>)

However, calling a subroutine does not re-evaluate or modify its definition.

The subroutine word was compiled once, with these properties:

Pattern: [a-z]+
Flags: none
Capture group number: 1

When the subroutine is called, the engine:

Enters the already-compiled definition
Executes it exactly as defined
Ignores any flags applied at the call site
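The rule can be pictured with a small std-only model. This is a sketch under simplifying assumptions, not the crate's internals. `WordSubroutine` and `pattern_matches` are hypothetical names, the model handles ASCII only, and it does no backtracking (none is needed for this pattern shape).

```rust
/// Hypothetical model of the compiled subroutine for `[a-z]+`.
/// The flag is fixed when the definition is compiled.
struct WordSubroutine {
    case_insensitive: bool,
}

impl WordSubroutine {
    /// Runs `[a-z]+` at `pos` and returns the end of the match.
    /// The call-site flag is accepted only to show that it is ignored.
    fn call(&self, input: &str, pos: usize, _call_site_case_insensitive: bool) -> Option<usize> {
        let len = input[pos..]
            .bytes()
            .take_while(|b| {
                b.is_ascii_lowercase() || (self.case_insensitive && b.is_ascii_uppercase())
            })
            .count();
        if len == 0 { None } else { Some(pos + len) }
    }
}

/// Models `\A(?<word>...)\s+(?i:\g<word>)` for a given compiled definition.
fn pattern_matches(word: &WordSubroutine, input: &str) -> bool {
    // Steps 1-2: anchored at position 0, run the definition itself.
    let Some(after_word) = word.call(input, 0, false) else {
        return false;
    };
    // Step 3: `\s+` (ASCII whitespace only in this sketch).
    let ws = input[after_word..]
        .bytes()
        .take_while(|b| b.is_ascii_whitespace())
        .count();
    if ws == 0 {
        return false;
    }
    // Step 4: the call site carries `i`, but the definition's flag decides.
    word.call(input, after_word + ws, true).is_some()
}
```

With `case_insensitive: false` the model rejects "hello Mr" and accepts "hello mr". Only changing the flag on the definition makes "hello Mr" match.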

§This is a feature, not a limitation

fancy-regex deliberately enforces this rule to guarantee that:

A subroutine behaves the same everywhere it is used
Flags cannot silently change the meaning of a reused pattern
There is no “action at a distance” from call sites

If call-site flags were allowed to affect subroutines, the same subroutine could behave differently depending on where it was called - making patterns harder to reason about and easier to misuse.

§Expressing the intended behavior

If the intent is for word to be matched case-insensitively, the flag must be applied at the definition:

(?i:(?<word>[a-z]+))\s+\g<word>

or

(?<word>(?i:[a-z]+))\s+\g<word>

Now the subroutine is compiled with the i flag, and every call to it behaves consistently.

§Key takeaway

Subroutines in fancy-regex are compiled once, with fixed flags. Call sites cannot change their behavior.

This rule enables safe reuse, predictable execution, and clear reasoning, especially in larger and more complex patterns. It also matches Oniguruma’s behavior, so fancy-regex can serve as a memory-safe alternative to it.

§Compile-time rejection of left-recursive patterns

fancy-regex’s support for subroutines unlocks powerful features such as recursion. With that power comes the risk of defining patterns that can recurse forever.

To guarantee termination and predictable behavior, fancy-regex rejects left-recursive patterns at compile time.

This check is conservative: if a pattern could recurse without consuming input, it is rejected - even if a particular input would not trigger that behavior.

§What is left recursion?

A pattern is left-recursive if it can re-enter itself without consuming any input.

In other words, the engine can make recursive calls while staying at the same input position.

A simplified example looks like this:

(?<expr>\g<expr>a|a)

Here, the subroutine expr can immediately call itself before matching anything. No matter what the input is, this definition allows infinite recursion.

§Why fancy-regex rejects these patterns

Left recursion is problematic because:

It can cause infinite recursion or unbounded backtracking
It cannot be made safe by input inspection alone

Even if a specific input would not trigger the recursion, the pattern itself is unsafe.

Rather than attempting to detect or recover from such cases at runtime, fancy-regex enforces a stronger rule:

Every recursive call must consume input before it can recurse again.

This guarantees that evaluation always makes progress.

§Conservative by design

The left-recursion check is intentionally conservative.

Consider the following pattern:

(?<expr>ab|\g<expr>a)

For the input “ab”, this pattern would terminate successfully. However, fancy-regex still rejects it.

Why?

Because the second alternative allows recursion before any input is consumed. The engine cannot rely on runtime input to guarantee termination.

This is a deliberate design choice:

fancy-regex validates the structure of the pattern, not the behavior of a particular input.
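A structural check of this kind can be sketched in a few lines. The following is an illustration only, using a hypothetical, much simplified pattern tree (`Node`). It is not the crate's actual analysis or its `Expr` type. It answers one question: can the body of a group reach a call to itself before anything has been consumed?

```rust
/// Hypothetical, simplified tree for the body of one recursive group.
enum Node {
    Literal(char),       // consumes exactly one character
    SelfCall,            // `\g<expr>` inside the definition of `expr`
    Optional(Box<Node>), // `x?`
    Concat(Vec<Node>),   // `xy`
    Alt(Vec<Node>),      // `x|y`
}

/// True if the node might match without consuming any input.
fn can_be_empty(node: &Node) -> bool {
    match node {
        Node::Literal(_) => false,
        Node::SelfCall => true, // conservative assumption for the sketch
        Node::Optional(_) => true,
        Node::Concat(items) => items.iter().all(can_be_empty),
        Node::Alt(items) => items.iter().any(can_be_empty),
    }
}

/// True if a self-call is reachable before any input has been consumed.
fn is_left_recursive(node: &Node) -> bool {
    match node {
        Node::Literal(_) => false,
        Node::SelfCall => true,
        Node::Optional(inner) => is_left_recursive(inner),
        Node::Alt(items) => items.iter().any(is_left_recursive),
        Node::Concat(items) => {
            for item in items {
                if is_left_recursive(item) {
                    return true;
                }
                if !can_be_empty(item) {
                    // Input is consumed here; later calls are safe.
                    return false;
                }
            }
            false
        }
    }
}
```

Under this model the body `ab|\g<expr>a` is flagged because of its second alternative alone, regardless of any input, while a body such as `a\g<expr>?` passes because the literal `a` always comes first.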

§What is allowed

Recursive patterns are allowed as long as they consume input before recursing.

For example:

(?<paren>\((?:[^()]*|\g<paren>)\))

Here:

Each recursive call to paren is preceded by a literal ‘(’
Input is always consumed before recursion
Termination is guaranteed

This kind of recursion is safe and fully supported.

§How to restructure left-recursive patterns

Left-recursive definitions can often be rewritten in a right-recursive or iterative form.

For example, instead of:

(?<expr>\g<expr>a|a)

You can write:

a+

Or, when recursion is genuinely required:

(?<expr>a\g<expr>?)

In this version, input is consumed before the recursive call, satisfying fancy-regex’s safety rules.

§Recursion

Recursion is when a subroutine calls itself, directly or indirectly.

§Depth limit

fancy-regex supports recursion up to 20 levels deep.

Let’s look at a simple example to demonstrate this:

(a\g<1>?)

This pattern defines capture group 1 as the literal a followed by an optional, greedy call to itself.

With 22 a characters as input, only 20 are matched:

use fancy_regex::{Error, Regex};

let pattern = r"(a\g<1>?)";
let re = Regex::new(pattern)?;

let haystack = "aaaaaaaaaaaaaaaaaaaaaa"; // 22 a's
let result = re.find(haystack)?;

let found = result.unwrap();
// match is limited to 20 characters due to recursion depth limit
assert_eq!(found.as_str().len(), 20);

§Unbounded recursion

fancy-regex will return a compile error for patterns which recurse indefinitely.

Let’s look at a simple example to prove this:

(a\g<1>)

Here, capture group 1 consumes the literal a, then calls itself unconditionally. Once recursion level 20 is reached, no path remains that could produce a match.
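Both cases can be reproduced with a tiny std-only model of the group body. This is a sketch, not the engine: `run_group` is a hypothetical helper, and counting the outermost entry into the group as level 1 is an assumption chosen so that the model agrees with the 20-character result described for `(a\g<1>?)`.

```rust
/// Depth limit stated in the text.
const MAX_DEPTH: usize = 20;

/// Models group 1 of `(a\g<1>?)` when `optional_call` is true,
/// or of `(a\g<1>)` when it is false.
/// Returns the end offset of the match, if any.
fn run_group(input: &[u8], pos: usize, depth: usize, optional_call: bool) -> Option<usize> {
    if depth > MAX_DEPTH {
        return None; // depth limit reached: this call fails
    }
    if input.get(pos) != Some(&b'a') {
        return None; // the literal `a` does not match here
    }
    match run_group(input, pos + 1, depth + 1, optional_call) {
        Some(end) => Some(end),
        // `\g<1>?`: the call may be skipped when it fails.
        None if optional_call => Some(pos + 1),
        // `\g<1>`: the call is mandatory, so the failure propagates.
        None => None,
    }
}
```

Called as `run_group(input, 0, 1, true)` on 22 `a` characters, the model stops after 20. With `optional_call` set to false every path ends in a failed call, whatever the input, which is why such a pattern can be rejected up front.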

§Side effects

Earlier, the side effect of a subroutine call was described as updating the capture group. More precisely, the capture group is updated for non-recursive subroutine calls only.

Why?

Imagine a pattern like:

(?<foo>a|\(\g<foo>\))

It will match the literal a, or a wrapped in any number of balanced parentheses. If the recursive subroutine call were to update the capture group start position, the opening parenthesis would not be included in the capture group. If it were to update the capture group end position as well, you’d get the innermost subroutine call’s start position and the outermost subroutine call’s end position, which would then be overridden anyway when the capture group at the root level is exited. This would produce exceptionally odd and confusing behavior.

§Backreferences

fancy-regex does not yet support relative recursion level backreferences, and attempting to backreference a capture group which is currently being recursed is at present a compile error.

Example (adapting the previous pattern):

(?<foo>a|\(\g<foo>\)\k<foo>?)

With an input like:

(((a)(a)))

Oniguruma would give you two matches: the two (a)s. If the compile error were removed and nothing else changed, fancy-regex would give a single match of (a)(a, which is clearly not what anyone would expect. fancy-regex prefers correctness and rejects such patterns rather than returning surprising results.

§Absent Operators

The absent operators are worth talking about because they are quite uncommon.

§Absent Repeater

An absent repeater node is defined by the syntax (?~inner_pattern), and it will match any text where the inner pattern does not match (i.e. is absent), including across newlines. This does not add any new abilities to the engine; it makes the intent clearer and lets the engine optimize the pattern more easily under the hood. fancy-regex mainly implements this for Oniguruma compatibility.

It works best, or is at least easiest to understand, when the inner pattern is a literal.

§Example

Let’s imagine you have some Markdown, containing some code fences.

It might look something like this:

# Some Heading

Given a todo list like this:

**Input:**
```json
{
  "todos": [
    {
      "content": "Create `some_helper_func` helper in some_file.rs that takes a closure to check the error",
      "status": "complete",
      "priority": "high"
    },
    {
      "content": "Update error-asserting tests in some_file.rs to use `some_helper_func`",
      "status": "complete",
      "priority": "high"
    },
    {
      "content": "Run `cargo fmt` and `cargo test`",
      "status": "pending",
      "priority": "medium"
    }
  ]
}
```

You might expect this output:

**Output:**
```text
High priority tasks have now been completed.
```

Some more text.

Let’s say you want to match all input and output codeblocks.

Typically you could do it like:

[*]{2}(?:In|Out)put:[*]{2}\n```(?:[^`]+|`(?!``))+```

This would match everything inside the codeblock which is not a backtick, or backticks which are not followed by another 2 backticks, until it reaches the 3 backticks marking the end of the codeblock. Constructs like this can be hard to follow, and it is hard to be sure that they won’t suffer from catastrophic backtracking.

With the absent repeater, the intention becomes a lot easier to understand - match anything that isn’t 3 backticks, followed by 3 backticks.

[*]{2}(?:In|Out)put:[*]{2}\n```(?~```)```

Where it really shines is in more complicated expressions, such as matching a variable number of backticks at the code fence boundaries. There it is a lot easier to read than an expanded alternative written to avoid catastrophic backtracking.

use fancy_regex::Regex;

// Match a code fence: opening backticks (3+), content (absent the same backticks), closing backticks
let re = Regex::new(r"(?<!`)(`{3,}(?!`))\w*\n(?~\1)\n(\1)")?;

// A code fence with 4 backticks, where the inner code contains 3 backticks
let input = "````text\nsome code with ``` backticks\n````";
let captures = re.captures(input)?.expect("should match");

// The overall match spans the entire input
let m = captures.get(0).unwrap();
assert_eq!(m.start(), 0);
assert_eq!(m.end(), input.len());
assert_eq!(m.as_str(), input);

// Group 1: the opening 4 backticks
let open = captures.get(1).unwrap();
assert_eq!(open.as_str(), "````");
assert_eq!(open.start(), 0);
assert_eq!(open.end(), 4);

// Group 2: the closing 4 backticks
let close = captures.get(2).unwrap();
assert_eq!(close.as_str(), "````");
assert_eq!(close.start(), input.len() - 4);
assert_eq!(close.end(), input.len());

Stating the intent this way also allows the engine to optimize the pattern accordingly.

§Other ways of looking at it

The absent repeater can be considered shorthand for this:

(?((?!absent))\O|)*

This is essentially a conditional: where the absent expression doesn’t match at the current position, match a single character (newlines included); where it does match, match nothing. The whole construct is repeated greedily.
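Read as a loop, the shorthand looks like the following std-only sketch. It assumes the absent expression is a plain literal and models only the first, greedy attempt; a real engine may still backtrack into the repetition when the rest of the pattern fails. The name `absent_repeater_end` is made up for illustration.

```rust
/// Returns the offset where a greedy `(?~absent)` would stop when started at
/// `start`, for a literal `absent`, following the conditional-loop reading.
fn absent_repeater_end(haystack: &str, start: usize, absent: &str) -> usize {
    let mut pos = start;
    // `(?!absent)` holds: the absent text does not start here.
    while pos < haystack.len() && !haystack[pos..].starts_with(absent) {
        // `\O`: consume one character, newlines included.
        let ch = haystack[pos..]
            .chars()
            .next()
            .expect("pos < len, so a character exists");
        pos += ch.len_utf8();
    }
    pos
}
```

For a haystack such as "let x = 1;\n```\nrest" and the absent text "```", the loop stops right before the fence. If the absent text never occurs, it runs to the end of the haystack.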

Structs§

CaptureMatches: An iterator that yields all non-overlapping capture groups matching a particular regular expression.
CaptureNames: An iterator over capture names in a Regex. The iterator returns the name of each group, or None if the group has no name. Because capture group 0 cannot have a name, the first item returned is always None.
Captures: A set of capture groups found for a regex.
Expander: A set of options for expanding a template string using the contents of capture groups.
Match: A single match of a regex or group in an input text
Matches: An iterator over all non-overlapping matches for a particular string.
NoExpand: NoExpand indicates literal string replacement.
Regex: A compiled regular expression.
RegexBuilder: A builder for a Regex to allow configuring options.
RegexOptionsBuilder: A builder for a Regex to allow configuring options.
ReplacerRef: By-reference adaptor for a Replacer
Split: An iterator over all substrings delimited by a regex.
SplitN: An iterator over at most N substrings delimited by a regex.
SubCaptureMatches: Iterator for captured groups in order in which they appear in the regex.

Enums§

Absent: Type of absent operator as used for Oniguruma’s absent functionality.
Assertion: Type of assertions
AstNode: Abstract Syntax Tree node - will be resolved into an Expr before analysis
BacktrackingControlVerb: Type of backtracking control verb which affects how backtracking will behave. See https://www.regular-expressions.info/verb.html
CaptureGroupTarget: Target of a backreference or subroutine call
CompileError: An error as the result of compiling a regex.
Error: An error as the result of parsing, compiling or running a regex.
Expr: Regular expression AST. This is public for now but may change.
ExprChildrenIter: An iterator over the immediate children of an Expr.
ExprChildrenIterMut: An iterator over the immediate children of an Expr for mutable access.
LookAround: Type of look-around assertion as used for a look-around expression.
ParseError: An error for the result of parsing a regex pattern.
RuntimeError: An error as the result of executing a regex.

Traits§

Replacer: Replacer describes types that can be used to replace matches in a string.

Functions§

escape: Escapes special characters in text with ‘\’. Returns a string which, when interpreted as a regex, matches exactly text.

Type Aliases§

Result: Result type for this crate with specific error enum.