Struct Regex

pub struct Regex<'a> { /* private fields */ }

Available on crate features regex-automata and regex-cp only.

A compiled regular expression for searching Unicode haystacks.

A Regex can be used to search haystacks, split haystacks into substrings, or replace substrings in a haystack with a different substring. All searching is done with an implicit (?s:.)*? at the beginning and end of a pattern. To force an expression to match the whole string (or a prefix or a suffix), you can use an anchored search or an anchor like ^ or $ (or \A and \z).
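
One way to picture the implicit (?s:.)*? prefix: an unanchored search behaves like an anchored match attempted at every start offset, from left to right, keeping the first success. The sketch below illustrates that idea with a toy literal matcher built on std only. The names match_at and find_unanchored are hypothetical and are not part of this crate.

```rust
use std::ops::Range;

/// Toy "anchored" matcher (hypothetical): succeeds only if `literal`
/// occurs exactly at offset `at`, and returns the end offset of the match.
fn match_at(literal: &[u8], hay: &[u8], at: usize) -> Option<usize> {
    let end = at.checked_add(literal.len())?;
    (hay.get(at..end)? == literal).then_some(end)
}

/// Toy unanchored search (hypothetical): tries every start offset from
/// left to right and returns the first anchored success.
fn find_unanchored(literal: &[u8], hay: &[u8]) -> Option<Range<usize>> {
    (0..=hay.len()).find_map(|start| match_at(literal, hay, start).map(|end| start..end))
}
```

With the haystack "2010-03-14", the literal "03" does not match when anchored at offset 0, but the unanchored search finds it at bytes 5..7.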

§Overview

The most important methods are as follows:

Regex::new compiles a regex using the default configuration. A Builder permits setting a non-default configuration. (For example, case insensitive matching, verbose mode and others.)
Regex::is_match reports whether a match exists in a particular haystack.
Regex::find reports the byte offsets of a match in a haystack, if one exists. Regex::find_iter returns an iterator over all such matches.
Regex::captures writes into a caller-provided Captures (see create_captures) both the byte offsets of a match in a haystack and the byte offsets of each matching capture group from the regex in the haystack. Regex::captures_iter returns an iterator over all such matches.

§Example

use ib_matcher::regex::cp::Regex;

let re = Regex::new(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$")?;
assert!(re.is_match("2010-03-14"));

With IbMatcher’s Chinese pinyin and Japanese romaji matching:

// cargo add ib-matcher --features regex,pinyin,romaji
use ib_matcher::{
    matcher::{MatchConfig, PinyinMatchConfig, RomajiMatchConfig},
    regex::{cp::Regex, Match},
};

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .build();

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("raki.suta")
    .unwrap();
assert_eq!(re.find("「らき☆すた」"), Some(Match::must(0, 3..18)));

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("pysou.*?(any|every)thing")
    .unwrap();
assert_eq!(re.find("拼音搜索Everything"), Some(Match::must(0, 0..22)));

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .mix_lang(true)
    .build();
let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("(?x)^zangsounofuri-?ren # Mixing pinyin and romaji")
    .unwrap();
assert_eq!(re.find("葬送のフリーレン"), Some(Match::must(0, 0..24)));

For more examples and the syntax, see crate::regex.
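
The spans in the examples above are byte offsets into the UTF-8 haystack, not character indices. Each of the CJK and kana characters used there occupies three bytes, which is why a five-character match can span 3..18. The following std-only sketch checks that arithmetic. The helper byte_span is hypothetical and only uses str::find, so it says nothing about how the crate itself searches.

```rust
use std::ops::Range;

/// Hypothetical helper: locates `needle` in `hay` with a plain substring
/// search and returns the byte range it occupies.
fn byte_span(hay: &str, needle: &str) -> Option<Range<usize>> {
    hay.find(needle).map(|start| start..start + needle.len())
}
```

For instance, byte_span("「らき☆すた」", "らき☆すた") yields 3..18, because the opening bracket takes bytes 0..3 and the five matched characters take 15 bytes.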

§Case insensitivity

To enable case insensitivity:

use ib_matcher::{matcher::{PinyinMatchConfig, PlainMatchConfig, MatchConfig}, regex::cp::Regex};

let re = Regex::builder().ib(MatchConfig::default()).build("foo").unwrap();
assert!(re.is_match("FOO"));

// Alternatively, with `case_insensitive()`:
let re = Regex::builder()
    .ib(MatchConfig::builder()
        .case_insensitive(true)
        .pinyin(PinyinMatchConfig::default())
        .build())
    .build("pyss")
    .unwrap();
assert!(re.is_match("PY搜索"));

Note that enabling syntax.case_insensitive currently breaks ib matching (i.e. pinyin and romaji matching). You should only set MatchConfigBuilder::case_insensitive (PlainMatchConfigBuilder::case_insensitive).

If you need case-insensitive character classes, write (?i:[a-z]) instead for now.

§Custom matching callbacks

Custom matching callbacks can be used to implement ad hoc look-around, backreferences, balancing groups/recursion/subroutines, combining domain-specific parsers, etc.

Basic usage:

// cargo add ib-matcher --features regex,regex-callback
use ib_matcher::regex::cp::Regex;

let re = Regex::builder()
    .callback("ascii", |input, at, push| {
        let haystack = &input.haystack()[at..];
        if haystack.len() > 0 && haystack[0].is_ascii() {
            push(1);
        }
    })
    .build(r"(ascii)+\d(ascii)+")
    .unwrap();
let hay = "that4Ｕ this4me";
assert_eq!(&hay[re.find(hay).unwrap().span()], " this4me");
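
To see why the match is " this4me" rather than something starting at "that4", note that Ｕ is a fullwidth letter encoded as three non-ASCII bytes, so the ASCII run after the first digit is empty. The std-only sketch below models this. It assumes, from reading the example, that push(1) reports a match consuming one byte at `at`. Here that is modeled as returning Some(1). The names ascii_at and find_ascii_digit_ascii are hypothetical, and the hand-written search only mimics (ascii)+\d(ascii)+ for this illustration.

```rust
use std::ops::Range;

/// Stand-in for the "ascii" callback: Some(1) means "matched, consumed one
/// byte"; None means "no match at this offset".
fn ascii_at(haystack: &[u8], at: usize) -> Option<usize> {
    haystack.get(at).filter(|b| b.is_ascii()).map(|_| 1)
}

/// Hand-written, leftmost and greedy imitation of `(ascii)+\d(ascii)+`.
fn find_ascii_digit_ascii(hay: &str) -> Option<Range<usize>> {
    let bytes = hay.as_bytes();
    for start in 0..bytes.len() {
        // Extend the run for as long as the callback keeps matching.
        let mut end = start;
        while let Some(n) = ascii_at(bytes, end) {
            end += n;
        }
        // A digit must sit strictly inside the run so that both `(ascii)+`
        // parts stay non-empty.
        let inner = start + 1..end.saturating_sub(1);
        if inner.into_iter().any(|d| bytes[d].is_ascii_digit()) {
            return Some(start..end);
        }
    }
    None
}
```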

§Look-around

use ib_matcher::regex::cp::Regex;

let re = Regex::builder()
    .callback("lookahead_is_ascii", |input, at, push| {
        let haystack = &input.haystack()[at..];
        if haystack.len() > 0 && haystack[0].is_ascii() {
            push(0);
        }
    })
    .build(r"[\x00-\x7f]+?\d(lookahead_is_ascii)")
    .unwrap();
let hay = "that4Ｕ,this4me1plz";
assert_eq!(
    re.find_iter(hay).map(|m| &hay[m.span()]).collect::<Vec<_>>(),
    vec![",this4", "me1"]
);
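
The look-around callback differs from the basic one only in that it reports a zero-width match (push(0)), so the checked byte is not consumed and stays available to the next match. A std-only sketch of that behaviour follows. It assumes that push(0) means "matched, consumed nothing", and the names lookahead_ascii, match_from and find_all are hypothetical. The hand-written loop only imitates [\x00-\x7f]+?\d(lookahead_is_ascii).

```rust
/// Stand-in for the look-ahead callback: Some(0) means "matched without
/// consuming input"; None means the assertion failed.
fn lookahead_ascii(haystack: &[u8], at: usize) -> Option<usize> {
    haystack.get(at).filter(|b| b.is_ascii()).map(|_| 0)
}

/// Tries to match at `start`; returns the end offset on success.
fn match_from(bytes: &[u8], start: usize) -> Option<usize> {
    let mut i = start;
    // `[\x00-\x7f]+?`: at least one ASCII byte, as few as possible.
    while bytes.get(i).map_or(false, |b| b.is_ascii()) {
        i += 1;
        // `\d` followed by the zero-width check on the next byte.
        let is_digit = bytes.get(i).map_or(false, |b| b.is_ascii_digit());
        if is_digit && lookahead_ascii(bytes, i + 1).is_some() {
            return Some(i + 1);
        }
    }
    None
}

/// Collects all non-overlapping matches, leftmost first.
fn find_all(hay: &str) -> Vec<&str> {
    let bytes = hay.as_bytes();
    let mut out = Vec::new();
    let mut start = 0;
    while start < bytes.len() {
        match match_from(bytes, start) {
            Some(end) => {
                out.push(&hay[start..end]);
                start = end;
            }
            None => start += 1,
        }
    }
    out
}
```

Because the look-ahead consumes nothing, the "m" that validates ",this4" is also the first byte of the following match "me1".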

§Balancing groups

use std::{cell::RefCell, rc::Rc};
use ib_matcher::regex::cp::Regex;

let count = Rc::new(RefCell::new(0));
let re = Regex::builder()
    .callback("open_quote", {
        let count = count.clone();
        move |input, at, push| {
            if at < 2 || input.haystack()[at - 2] != b'\\' {
                let mut count = count.borrow_mut();
                *count += 1;
                push(0);
            }
        }
    })
    .callback("close_quote", move |input, at, push| {
        if at < 2 || input.haystack()[at - 2] != b'\\' {
            let mut count = count.borrow_mut();
            if *count > 0 {
                push(0);
            }
            *count -= 1;
        }
    })
    .build(r"'(open_quote).*?'(close_quote)")
    .unwrap();
let hay = r"'one' 'two\'three' 'four'";
assert_eq!(
    re.find_iter(hay).map(|m| &hay[m.span()]).collect::<Vec<_>>(),
    vec!["'one'", r"'two\'three'", "'four'"]
);

(In this simple example, just using '([^'\\]+?|\\')*' is actually enough, but there are more complex cases where balancing groups (or recursion/subroutines) are necessary.)
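
Nested parentheses are one such case: matching them requires counting, which a plain regular expression cannot do. The std-only sketch below shows the counting idea that a pair of open/close callbacks would implement. It does not use this crate's callback API, and the name first_balanced is hypothetical.

```rust
use std::ops::Range;

/// Returns the byte range of the leftmost balanced `( ... )` group.
fn first_balanced(hay: &str) -> Option<Range<usize>> {
    let bytes = hay.as_bytes();
    for start in 0..bytes.len() {
        if bytes[start] != b'(' {
            continue;
        }
        // `depth` plays the role of the shared counter in the example above.
        let mut depth = 0usize;
        for (i, &b) in bytes[start..].iter().enumerate() {
            match b {
                b'(' => depth += 1,
                b')' => {
                    // Cannot underflow: the scan starts on '(' and returns
                    // as soon as the depth drops back to zero.
                    depth -= 1;
                    if depth == 0 {
                        return Some(start..start + i + 1);
                    }
                }
                _ => {}
            }
        }
    }
    None
}
```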

§Synchronization and cloning

In order to make the Regex API convenient, most of the routines hide the fact that a Cache is needed at all. To achieve this, a memory pool is used internally to retrieve Cache values in a thread safe way that also permits reuse. This in turn implies that every such search call requires some form of synchronization. Usually this synchronization is fast enough to not notice, but in some cases, it can be a bottleneck. This typically occurs when all of the following are true:

The same Regex is shared across multiple threads simultaneously, usually via a util::lazy::Lazy or something similar from the once_cell or lazy_static crates.
The primary unit of work in each thread is a regex search.
Searches are run on very short haystacks.

This particular case can lead to high contention on the pool used by a Regex internally, which can in turn noticeably increase latency. This cost can be mitigated in one of the following ways:

Use a distinct copy of a Regex in each thread, usually by cloning it. Cloning a Regex does not do a deep copy of its read-only component. But it does lead to each Regex having its own memory pool, which in turn eliminates the problem of contention. In general, this technique should not result in any additional memory usage when compared to sharing the same Regex across multiple threads simultaneously.
Use lower level APIs, like Regex::try_find, which permit passing a Cache explicitly. In this case, it is up to you to determine how best to provide a Cache. For example, you might put a Cache in thread-local storage if your use case allows for it.

Overall, this is an issue that happens rarely in practice, but it can happen.
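
The two mitigations can be sketched with toy types built on std only. ToyRegex, Program, Cache, count_per_thread and count_with_tls below are all hypothetical stand-ins, not this crate's types. The auto trait list on this page marks Regex as !Send and !Sync, so the sketch illustrates the pooling idea rather than code that would compile against Regex itself.

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy stand-in for the mutable scratch space a search needs.
#[derive(Default)]
struct Cache {
    positions: Vec<usize>,
}

/// Toy stand-in for the read-only compiled program.
struct Program {
    needle: u8,
}

/// Toy stand-in for a regex: a shared program plus a pool of caches.
struct ToyRegex {
    program: Arc<Program>,
    pool: Mutex<Vec<Cache>>,
}

impl ToyRegex {
    fn new(needle: u8) -> Self {
        ToyRegex { program: Arc::new(Program { needle }), pool: Mutex::new(Vec::new()) }
    }

    /// Low-level search: the caller supplies the cache, so nothing is locked.
    fn try_count(&self, cache: &mut Cache, hay: &[u8]) -> usize {
        cache.positions.clear();
        for (i, &b) in hay.iter().enumerate() {
            if b == self.program.needle {
                cache.positions.push(i);
            }
        }
        cache.positions.len()
    }

    /// High-level search: borrows a cache from the pool, locking it twice.
    fn count(&self, hay: &[u8]) -> usize {
        let mut cache = self.pool.lock().unwrap().pop().unwrap_or_default();
        let n = self.try_count(&mut cache, hay);
        self.pool.lock().unwrap().push(cache);
        n
    }
}

impl Clone for ToyRegex {
    /// Shares the read-only program but starts with a fresh, private pool.
    fn clone(&self) -> Self {
        ToyRegex { program: Arc::clone(&self.program), pool: Mutex::new(Vec::new()) }
    }
}

/// Mitigation 1: every thread searches with its own clone.
fn count_per_thread(re: &ToyRegex, hays: Vec<Vec<u8>>) -> Vec<usize> {
    let handles: Vec<_> = hays
        .into_iter()
        .map(|hay| {
            let re = re.clone();
            thread::spawn(move || re.count(&hay))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

thread_local! {
    static CACHE: RefCell<Cache> = RefCell::new(Cache::default());
}

/// Mitigation 2: keep a cache in thread-local storage and pass it explicitly.
fn count_with_tls(re: &ToyRegex, hay: &[u8]) -> usize {
    CACHE.with(|cache| re.try_count(&mut cache.borrow_mut(), hay))
}
```

In both variants the read-only program is shared, so the only per-thread cost is the cache itself.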

§Warning: spin-locks may be used in alloc-only mode

When this crate is built without the std feature and the high level APIs on a Regex are used, then a spin-lock will be used to synchronize access to an internal pool of Cache values. This may be undesirable because a spin-lock is effectively impossible to implement correctly in user space. That is, more concretely, the spin-lock could result in a deadlock.

To avoid spin-locks when the std feature is disabled, you must use APIs that accept a Cache value explicitly, for example Regex::try_find.
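
For readers unfamiliar with the term, a spin-lock is a lock whose waiters loop on an atomic flag instead of asking the operating system to put them to sleep. The minimal sketch below is an illustration only and is not the pool implementation used by this crate. The type SpinLock and its methods are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal illustrative spin-lock (hypothetical, not this crate's code).
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Attempts to take the lock once; returns true on success.
    pub fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Busy-waits until the lock is free. If the current holder never gets
    /// to run again, this loop never terminates.
    pub fn lock(&self) {
        while !self.try_lock() {
            std::hint::spin_loop();
        }
    }

    /// Releases the lock.
    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

The loop in lock makes progress only if the holder is eventually scheduled, which user-space code cannot guarantee. Passing a Cache explicitly avoids the lock altogether.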

Implementations§

impl<'a> Regex<'a>

pub fn new(pattern: &str) -> Result<Self, BuildError>

pub fn config() -> Config

pub fn builder<'f1>() -> Builder<'a, 'f1>

§Example: change the line terminator

impl<'a> Regex<'a>

pub fn is_match<'h, I: Into<Input<'h>>>(&self, input: I) -> bool

§Example

§Example: consistency with search APIs

pub fn find<'h, I: Into<Input<'h>>>(&self, input: I) -> Option<Match>

§Example

pub fn captures<'h, I: Into<Input<'h>>>(&self, input: I, caps: &mut Captures) -> Result<(), MatchError>

§Example

pub fn find_iter<'h, I: Into<Input<'h>>>(&'h self, input: I) -> impl Iterator<Item = Match> + 'h

§Example

pub fn captures_iter<'h, I: Into<Input<'h>>>(&'h self, input: I) -> impl Iterator<Item = Captures> + 'h

§Example

Methods from Deref<Target = BoundedBacktracker>§

pub fn create_cache(&self) -> Cache

pub fn create_captures(&self) -> Captures

pub fn reset_cache(&self, cache: &mut Cache)

§Example

pub fn pattern_len(&self) -> usize

§Example

pub fn get_config(&self) -> &Config

pub fn get_nfa(&self) -> &NFA

pub fn max_haystack_len(&self) -> usize

§Example

pub fn try_is_match<'h, I: Into<Input<'h>>>(&self, cache: &mut Cache, input: I) -> Result<bool, MatchError>

§Errors

§Example

§Example: consistency with search APIs

pub fn try_find<'h, I: Into<Input<'h>>>(&self, cache: &mut Cache, input: I) -> Result<Option<Match>, MatchError>

§Errors

§Example

pub fn try_captures<'h, I: Into<Input<'h>>>(&self, cache: &mut Cache, input: I, caps: &mut Captures) -> Result<(), MatchError>

§Errors

§Example

pub fn try_find_iter<'r, 'c, 'h, I: Into<Input<'h>>>(&'r self, cache: &'c mut Cache, input: I) -> TryFindMatches<'r, 'c, 'h>

§Example

pub fn try_captures_iter<'r, 'c, 'h, I: Into<Input<'h>>>(&'r self, cache: &'c mut Cache, input: I) -> TryCapturesMatches<'r, 'c, 'h>

§Example

pub fn try_search(&self, cache: &mut Cache, input: &Input<'_>, caps: &mut Captures) -> Result<(), MatchError>

§Errors

§Example: specific pattern search

§Example: specifying the bounds of a search

pub fn try_search_slots(&self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>]) -> Result<Option<PatternID>, MatchError>

§Errors

§Example

Trait Implementations§

impl Clone for Regex<'_>

fn clone(&self) -> Self

fn clone_from(&mut self, source: &Self)

impl Deref for Regex<'_>

type Target = BoundedBacktracker

fn deref(&self) -> &Self::Target

Auto Trait Implementations§

impl<'a> Freeze for Regex<'a>

impl<'a> !RefUnwindSafe for Regex<'a>

impl<'a> !Send for Regex<'a>

impl<'a> !Sync for Regex<'a>

impl<'a> Unpin for Regex<'a>

impl<'a> !UnwindSafe for Regex<'a>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T
where T: Clone,

impl<T, U> Into<U> for T
where U: From<T>,

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<P, T> Receiver for P
where P: Deref<Target = T> + ?Sized, T: ?Sized,

impl<T> ToOwned for T
where T: Clone,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,