Struct Regex 

pub struct Regex<'a> { /* private fields */ }
Available on crate features regex-automata and regex-lita only.

A compiled regular expression for searching Unicode haystacks.

A Regex can be used to search haystacks, split haystacks into substrings, or replace substrings in a haystack with a different substring. All searching is done with an implicit (?s:.)*? at the beginning and end of a pattern. To force an expression to match the whole string (or a prefix or a suffix), you can use an anchored search or an anchor like ^ or $ (or \A and \z).
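
The effect of the implicit (?s:.)*? prefix can be pictured as "try an anchored match at every position, leftmost first". The following is a minimal std-only sketch of that idea, not how the crate implements it: a literal needle stands in for the compiled pattern, and the names matches_at and find_unanchored are hypothetical.

```rust
use std::ops::Range;

/// Anchored check: does `needle` occur exactly at byte offset `at`?
/// `at` must lie on a char boundary of `hay`.
fn matches_at(hay: &str, needle: &str, at: usize) -> bool {
    hay[at..].starts_with(needle)
}

/// Unanchored search: try the anchored check at every char boundary,
/// leftmost first, and report the byte range of the first hit.
fn find_unanchored(hay: &str, needle: &str) -> Option<Range<usize>> {
    hay.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(hay.len()))
        .find(|&i| matches_at(hay, needle, i))
        .map(|i| i..i + needle.len())
}
```

Under this sketch, find_unanchored("xxfooxx", "foo") reports the range 2..5 even though matches_at("xxfooxx", "foo", 0) is false, which is the difference between an unanchored search and one anchored with ^ or \A.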

§Overview

The most important methods are as follows:

  • Regex::new compiles a regex using the default configuration. A Builder permits setting a non-default configuration. (For example, case insensitive matching, verbose mode and others.)
  • Regex::is_match reports whether a match exists in a particular haystack.
  • Regex::find reports the byte offsets of a match in a haystack, if one exists. Regex::find_iter returns an iterator over all such matches.
  • Regex::captures returns a Captures, which reports both the byte offsets of a match in a haystack and the byte offsets of each matching capture group from the regex in the haystack. Regex::captures_iter returns an iterator over all such matches.

§Example

use ib_matcher::regex::lita::Regex;

let re = Regex::new(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}$")?;
assert!(re.is_match("2010-03-14"));

With IbMatcher’s Chinese pinyin and Japanese romaji matching:

// cargo add ib-matcher --features regex,pinyin,romaji
use ib_matcher::{
    matcher::{MatchConfig, PinyinMatchConfig, RomajiMatchConfig},
    regex::{lita::Regex, Match},
};

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .build();

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("raki.suta")
    .unwrap();
assert_eq!(re.find("「らき☆すた」"), Some(Match::must(0, 3..18)));

let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("pysou.*?(any|every)thing")
    .unwrap();
assert_eq!(re.find("拼音搜索Everything"), Some(Match::must(0, 0..22)));

let config = MatchConfig::builder()
    .pinyin(PinyinMatchConfig::default())
    .romaji(RomajiMatchConfig::default())
    .mix_lang(true)
    .build();
let re = Regex::builder()
    .ib(config.shallow_clone())
    .build("(?x)^zangsounofuri-?ren # Mixing pinyin and romaji")
    .unwrap();
assert_eq!(re.find("葬送のフリーレン"), Some(Match::must(0, 0..24)));
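
The ranges in the assertions above are byte offsets into the UTF-8 haystack, not character counts. As a small std-only sketch (the helper name slice_match is hypothetical and not part of the crate), slicing the haystack with the reported range recovers the matched text:

```rust
use std::ops::Range;

/// Return the text covered by a reported byte range, or `None` if the
/// range is out of bounds or does not fall on UTF-8 char boundaries.
fn slice_match(hay: &str, span: Range<usize>) -> Option<&str> {
    hay.get(span)
}
```

Each kana, kanji and bracket in these haystacks occupies 3 bytes in UTF-8, so slice_match("「らき☆すた」", 3..18) yields the five characters between the brackets, and 0..24 covers all eight characters of "葬送のフリーレン".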

For more examples and the syntax, see crate::regex.

§Case insensitivity

To enable case insensitivity:

use ib_matcher::{matcher::{PinyinMatchConfig, PlainMatchConfig, MatchConfig}, regex::lita::Regex};

let re = Regex::builder().ib(MatchConfig::default()).build("foo").unwrap();
assert!(re.is_match("FOO"));

// Alternatively, with `case_insensitive()`:
let re = Regex::builder()
    .ib(MatchConfig::builder()
        .case_insensitive(true)
        .pinyin(PinyinMatchConfig::default())
        .build())
    .build("pyss")
    .unwrap();
assert!(re.is_match("PY搜索"));

Note that enabling syntax.case_insensitive currently prevents ib (i.e. pinyin and romaji matching) from working. Set only MatchConfigBuilder::case_insensitive (or PlainMatchConfigBuilder::case_insensitive) instead.

If you need case insensitive character classes, for now write (?i:[a-z]) instead.

§Synchronization and cloning

In order to make the Regex API convenient, most of the routines hide the fact that a Cache is needed at all. To achieve this, a memory pool is used internally to retrieve Cache values in a thread safe way that also permits reuse. This in turn implies that every such search call requires some form of synchronization. Usually this synchronization is fast enough to not notice, but in some cases, it can be a bottleneck. This typically occurs when all of the following are true:

  • The same Regex is shared across multiple threads simultaneously, usually via a util::lazy::Lazy or something similar from the once_cell or lazy_static crates.
  • The primary unit of work in each thread is a regex search.
  • Searches are run on very short haystacks.

This particular case can lead to high contention on the pool used by a Regex internally, which can in turn noticeably increase latency. This cost can be mitigated in one of the following ways:

  • Use a distinct copy of a Regex in each thread, usually by cloning it. Cloning a Regex does not do a deep copy of its read-only component. But it does lead to each Regex having its own memory pool, which in turn eliminates the problem of contention. In general, this technique should not result in any additional memory usage when compared to sharing the same Regex across multiple threads simultaneously.
  • Use lower level APIs, like Regex::try_find, which permit passing a Cache explicitly. In this case, it is up to you to determine how best to provide a Cache. For example, you might put a Cache in thread-local storage if your use case allows for it.

Overall, this is an issue that happens rarely in practice, but it can happen.
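
The thread-local option mentioned above can be sketched with the standard library alone. This is an illustration of the pattern under stated assumptions, not the crate's API: Scratch is a hypothetical stand-in for a Cache, and find_all_with stands in for a lower level search routine that takes its scratch space explicitly.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for a `Cache`: mutable scratch space that a
/// search reuses between calls instead of reallocating.
#[derive(Default)]
struct Scratch {
    offsets: Vec<usize>,
}

/// Stand-in for a search routine that takes its scratch space explicitly.
/// It records the byte offset of every occurrence of `needle` and returns
/// how many were found.
fn find_all_with(scratch: &mut Scratch, hay: &str, needle: char) -> usize {
    scratch.offsets.clear();
    scratch
        .offsets
        .extend(hay.char_indices().filter(|&(_, c)| c == needle).map(|(i, _)| i));
    scratch.offsets.len()
}

thread_local! {
    // One scratch value per thread, so no lock or shared pool is involved.
    static SCRATCH: RefCell<Scratch> = RefCell::new(Scratch::default());
}

/// Convenience wrapper that hides the scratch space behind thread-local storage.
fn count_matches(hay: &str, needle: char) -> usize {
    SCRATCH.with(|cell| find_all_with(&mut cell.borrow_mut(), hay, needle))
}
```

Because each thread owns its own Scratch, concurrent calls to count_matches never contend with each other. The trade-off is one scratch allocation per thread rather than one shared pool.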

§Warning: spin-locks may be used in alloc-only mode

When this crate is built without the std feature and the high level APIs on a Regex are used, then a spin-lock will be used to synchronize access to an internal pool of Cache values. This may be undesirable because a spin-lock is effectively impossible to implement correctly in user space. That is, more concretely, the spin-lock could result in a deadlock.

If you want to avoid spin-locks when the std feature is disabled, you must use APIs that accept a Cache value explicitly, such as Regex::try_find.
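
For readers unfamiliar with the term, the following is a minimal sketch of what a spin-lock looks like, written with std atomics. It is not the crate's internal lock, and SpinFlag is a hypothetical name. A waiting thread keeps retrying in a loop instead of asking the operating system to put it to sleep.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A bare spin-lock flag. It protects no data by itself; it only shows
/// the acquire and release protocol.
struct SpinFlag {
    locked: AtomicBool,
}

impl SpinFlag {
    const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    /// Try to take the lock once. Returns true on success.
    fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Busy-wait until the lock is taken. The thread burns CPU while waiting.
    fn lock(&self) {
        while !self.try_lock() {
            std::hint::spin_loop();
        }
    }

    /// Release the lock so that another thread's `try_lock` can succeed.
    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

The loop in lock has no way to tell the scheduler who holds the lock. If the holder is not running, for example because a higher priority waiter keeps the only core busy, the waiter can spin indefinitely, which is the kind of failure the warning above refers to.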

Implementations§

Source§

impl<'a> Regex<'a>

Source

pub fn new(pattern: &str) -> Result<Self, BuildError>

Source

pub fn config() -> Config

Source

pub fn create_captures(&self) -> Captures

Create a new empty set of capturing groups that is guaranteed to be valid for the search APIs on this Regex.

A Captures value created for a specific Regex cannot be used with any other Regex.

This is a convenience function for Captures::all. See the Captures documentation for an explanation of its alternative constructors that permit the Regex to do less work during a search, and thus might make it faster.

Source

pub fn builder<'f1>() -> Builder<'a, 'f1>

Return a builder for configuring the construction of a Regex.

This is a convenience routine to avoid needing to import the Builder type in common cases.

§Example: change the line terminator

This example shows how to enable multi-line mode by default and change the line terminator to the NUL byte:

use ib_matcher::regex::{lita::Regex, util::{syntax, look::LookMatcher}, Match};

let mut lookm = LookMatcher::new();
lookm.set_line_terminator(b'\x00');
let re = Regex::builder()
    .syntax(syntax::Config::new().multi_line(true))
    .thompson(Regex::config().look_matcher(lookm))
    .build(r"^foo$")?;
let hay = "\x00foo\x00";
assert_eq!(Some(Match::must(0, 1..4)), re.find(hay));
Source§

impl<'a> Regex<'a>

High level convenience routines for using a regex to search a haystack.

Source

pub fn is_match<'h, I: Into<Input<'h>>>(&self, input: I) -> bool

Returns true if and only if this regex matches the given haystack.

This routine may short circuit if it knows that scanning future input will never lead to a different result. (Consider how this might make a difference given the regex a+ on the haystack aaaaaaaaaaaaaaa. This routine may stop after it sees the first a, but routines like find need to continue searching because + is greedy by default.)
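
To make the short-circuit concrete, here is a std-only sketch with two hand-written routines standing in for is_match and find on the pattern a+. Both are hypothetical helpers, not the crate's search code, and each also reports how many bytes it had to inspect.

```rust
use std::ops::Range;

/// Stand-in for `is_match` with `a+`: stops at the first `a`.
/// Returns (matched, bytes inspected).
fn is_match_a_plus(hay: &[u8]) -> (bool, usize) {
    match hay.iter().position(|&b| b == b'a') {
        Some(i) => (true, i + 1),
        None => (false, hay.len()),
    }
}

/// Stand-in for `find` with greedy `a+`: must scan to the end of the run
/// of `a`s to know where the match ends.
/// Returns (match range, bytes inspected).
fn find_a_plus(hay: &[u8]) -> (Option<Range<usize>>, usize) {
    let start = match hay.iter().position(|&b| b == b'a') {
        Some(start) => start,
        None => return (None, hay.len()),
    };
    let run = hay[start..].iter().take_while(|&&b| b == b'a').count();
    let end = start + run;
    // The scan also looks at the first byte after the run, if there is one.
    let inspected = (end + 1).min(hay.len());
    (Some(start..end), inspected)
}
```

On a haystack of fifteen a bytes, the first routine answers after inspecting one byte, while the second has to read all fifteen before it can report the range 0..15.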

§Example
use ib_matcher::regex::lita::Regex;

let re = Regex::new("foo[0-9]+bar")?;

assert!(re.is_match("foo12345bar"));
assert!(!re.is_match("foobar"));
§Example: consistency with search APIs

is_match is guaranteed to return true whenever find returns a match. This includes searches that are executed entirely within a codepoint:

use ib_matcher::regex::{lita::Regex, Input};

let re = Regex::new("a*")?;

// This doesn't match because the default configuration bans empty
// matches from splitting a codepoint.
assert!(!re.is_match(Input::new("☃").span(1..2)));
assert_eq!(None, re.find(Input::new("☃").span(1..2)));

Notice that when UTF-8 mode is disabled, then the above reports a match because the restriction against zero-width matches that split a codepoint has been lifted:

use ib_matcher::regex::{lita::Regex, Input, Match};

let re = Regex::builder()
    .thompson(Regex::config().utf8(false))
    .build("a*")?;

assert!(re.is_match(Input::new("☃").span(1..2)));
assert_eq!(
    Some(Match::must(0, 1..1)),
    re.find(Input::new("☃").span(1..2)),
);
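
The rule behind both results can be checked with the standard library: "☃" is three bytes long, so byte offset 1 falls inside the codepoint. The helper below is a hypothetical sketch of the decision described above, not the crate's implementation.

```rust
/// Sketch of the rule: in UTF-8 mode an empty match is only allowed on a
/// char boundary, while with UTF-8 mode disabled any byte offset up to the
/// end of the haystack is acceptable.
fn empty_match_allowed(hay: &str, at: usize, utf8_mode: bool) -> bool {
    if utf8_mode {
        hay.is_char_boundary(at)
    } else {
        at <= hay.len()
    }
}
```

With utf8_mode set to true, offsets 0 and 3 of "☃" are allowed and offsets 1 and 2 are rejected. With it set to false, all four offsets are allowed, which matches the 1..1 match reported in the second example.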

A similar idea applies when using line anchors with CRLF mode enabled, which prevents them from matching between a \r and a \n.

use ib_matcher::regex::{lita::Regex, Input, Match};

let re = Regex::new(r"(?Rm:$)")?;
assert!(!re.is_match(Input::new("\r\n").span(1..1)));
// A regular line anchor, which only considers \n as a
// line terminator, will match.
let re = Regex::new(r"(?m:$)")?;
assert!(re.is_match(Input::new("\r\n").span(1..1)));
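
The two end-of-line assertions can be written as plain predicates over bytes. The functions below are a hand-written sketch of the behavior described above, with hypothetical names, and are not the crate's code. `at` is a byte offset no larger than the haystack length.

```rust
/// `(?m:$)` with `\n` as the only line terminator: matches at the end of
/// the haystack or immediately before a `\n`.
fn is_end_lf(hay: &[u8], at: usize) -> bool {
    at == hay.len() || hay[at] == b'\n'
}

/// `(?Rm:$)` in CRLF mode: matches at the end of the haystack, before a
/// `\r`, or before a `\n` that is not preceded by `\r`. It therefore never
/// matches between the `\r` and `\n` of a `\r\n` pair.
fn is_end_crlf(hay: &[u8], at: usize) -> bool {
    at == hay.len()
        || hay[at] == b'\r'
        || (hay[at] == b'\n' && (at == 0 || hay[at - 1] != b'\r'))
}
```

For the haystack "\r\n", offset 1 sits between the two bytes: is_end_lf accepts it because a \n follows, while is_end_crlf rejects it and accepts only offsets 0 and 2.
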
Source

pub fn find<'h, I: Into<Input<'h>>>(&self, input: I) -> Option<Match>

Executes a leftmost search and returns the first match that is found, if one exists.

§Example
use ib_matcher::regex::{lita::Regex, Match};

let re = Regex::new("foo[0-9]+")?;
assert_eq!(Some(Match::must(0, 0..8)), re.find("foo12345"));
Source

pub fn captures<'h, I: Into<Input<'h>>>( &self, input: I, caps: &mut Captures, ) -> Result<(), MatchError>

Executes a leftmost forward search and writes the spans of capturing groups that participated in a match into the provided Captures value. If no match was found, then Captures::is_match is guaranteed to return false.

§Example
use ib_matcher::regex::{lita::Regex, Span};

let re = Regex::new(r"^([0-9]{4})-([0-9]{2})-([0-9]{2})$")?;
let mut caps = re.create_captures();

// `captures` returns a Result, so propagate any search error.
re.captures("2010-03-14", &mut caps)?;
assert!(caps.is_match());
assert_eq!(Some(Span::from(0..4)), caps.get_group(1));
assert_eq!(Some(Span::from(5..7)), caps.get_group(2));
assert_eq!(Some(Span::from(8..10)), caps.get_group(3));

Trait Implementations§

Source§

impl<'a> Clone for Regex<'a>

Source§

fn clone(&self) -> Regex<'a>

Returns a duplicate of the value.
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

Auto Trait Implementations§

§

impl<'a> Freeze for Regex<'a>

§

impl<'a> !RefUnwindSafe for Regex<'a>

§

impl<'a> !Send for Regex<'a>

§

impl<'a> !Sync for Regex<'a>

§

impl<'a> Unpin for Regex<'a>

§

impl<'a> !UnwindSafe for Regex<'a>
