Struct regex_automata::Input

pub struct Input<'h> { /* private fields */ }

The parameters for a regex search including the haystack to search.

Regex searches have a few parameters, and those parameters have defaults that work in the vast majority of cases. The Input type exists to make that common case seamless while also providing an avenue for changing the parameters of a search. In particular, it does so without a combinatorial explosion of different methods or superfluous parameters in the common case.

An Input permits configuring the following things:

  • Searching only a substring of a haystack, while taking the broader context into account for resolving look-around assertions.
  • Whether to search for all patterns in a regex, or only for one pattern in particular.
  • Whether to perform an anchored or unanchored search.
  • Whether to report a match as early as possible.

All of these parameters, except for the haystack, have sensible default values. This means that the minimal search configuration is simply a call to Input::new with your haystack. Setting any other parameter is optional.

Moreover, for any H that implements AsRef<[u8]>, there exists a From<&H> for Input implementation. This is useful because many of the search APIs in this crate accept an Into<Input>. This means you can provide strings or byte strings to these routines directly, and they’ll automatically get converted into an Input for you.
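To illustrate why this conversion is convenient, here is a minimal std-only sketch of the same pattern. MiniInput and haystack_len are hypothetical names invented for this illustration and are not part of this crate; only the shape of the From implementation mirrors the one described above.

```rust
/// Hypothetical stand-in for `Input`; it only stores the haystack bytes.
#[derive(Clone, Debug)]
pub struct MiniInput<'h> {
    haystack: &'h [u8],
}

// Anything that can be viewed as bytes can be turned into a `MiniInput`.
impl<'h, H: ?Sized + AsRef<[u8]>> From<&'h H> for MiniInput<'h> {
    fn from(haystack: &'h H) -> MiniInput<'h> {
        MiniInput { haystack: haystack.as_ref() }
    }
}

/// Hypothetical search-like routine. It accepts anything convertible into
/// a `MiniInput` and simply reports the haystack length.
pub fn haystack_len<'h, I: Into<MiniInput<'h>>>(input: I) -> usize {
    let input: MiniInput<'h> = input.into();
    input.haystack.len()
}
```

With this shape, a single signature accepts a &str, a byte string, a &String, or an already constructed MiniInput.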

The lifetime parameter 'h refers to the lifetime of the haystack.

Organization

The API of Input is split into a few different parts:
  • A builder-like API that transforms an Input by value. Examples: Input::span and Input::anchored.
  • A setter API that permits mutating parameters in place. Examples: Input::set_span and Input::set_anchored.
  • A getter API that permits retrieving any of the search parameters. Examples: Input::get_span and Input::get_anchored.
  • A few convenience getter routines that don’t conform to the above naming pattern. Examples: Input::haystack, Input::start and Input::end.
  • Miscellaneous predicates and other helper routines. Examples: Input::is_done and Input::is_char_boundary.

An Input exposes so much because it is meant to be used by both callers of regex engines and implementors of regex engines. A constraining factor is that regex engines should accept a &Input as their lowest level API, which means that implementors should only use the “getter” APIs of an Input.

Valid bounds and search termination

An Input permits setting the bounds of a search via either Input::span or Input::range. The bounds set must be valid, or else a panic will occur. Bounds are valid if and only if:

  • The bounds represent a valid range into the input’s haystack.
  • or the end bound is a valid ending bound for the haystack and the start bound is exactly one greater than the end bound.

In the latter case, Input::is_done will return true, which indicates that any search receiving such an input should immediately return with no match.
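The two rules above can be sketched as a pair of predicates. This is a std-only model with hypothetical function names, not the crate’s implementation; it assumes offsets are plain usize values.

```rust
/// Returns true if `start..end` would be accepted as bounds for a haystack
/// of length `haystack_len`, following the two rules listed above.
pub fn is_valid_span(haystack_len: usize, start: usize, end: usize) -> bool {
    // Either an ordinary range (start <= end), or start is exactly one past
    // end. When `start <= end` is false, start is at least 1, so `start - 1`
    // cannot underflow.
    end <= haystack_len && (start <= end || start - 1 == end)
}

/// Models the "done" state of a forward search: start has moved past end.
pub fn is_done(start: usize, end: usize) -> bool {
    start > end
}
```

For the haystack foobar (length 6), the bounds 0..6 and 6..6 are ordinary valid ranges, a start of 7 with an end of 6 is the valid “done” state, and a start of 8 is rejected.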

Note that while Input is used for reverse searches in this crate, the Input::is_done predicate assumes a forward search. Because unsigned offsets are used internally, there is no way to tell from only the offsets whether a reverse search is done or not.

Regex engine support

Any regex engine accepting an Input must support at least the following things:

  • Searching a &[u8] for matches.
  • Searching a substring of &[u8] for a match, such that any match reported must appear entirely within that substring.
  • For a forwards search, a match should never be reported when Input::is_done returns true. (For reverse searches, termination should be handled outside of Input.)

Supporting other aspects of an Input is optional, but regex engines should handle aspects they don’t support gracefully. How this is done is generally up to the regex engine. For example, this crate generally reports unsupported anchored modes as an error, but for simplicity, in the meta regex engine, trying to search with an invalid pattern ID just results in no match being reported.
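As a rough sketch of the three minimum requirements, consider a toy “engine” that only searches for a literal byte string. The function name and signature are hypothetical and use only std; a real engine would take a &Input instead of separate start and end arguments.

```rust
/// Hypothetical toy engine: finds the leftmost occurrence of `needle` that
/// lies entirely within `start..end` of `haystack`. Offsets are reported
/// relative to the full haystack, not the searched substring.
pub fn find_in_span(
    haystack: &[u8],
    needle: &[u8],
    start: usize,
    end: usize,
) -> Option<(usize, usize)> {
    // A forward search whose start has passed its end is done: no match.
    // (This sketch also returns None for an out-of-range end, whereas a
    // real Input could never be constructed with such bounds.)
    if start > end || end > haystack.len() {
        return None;
    }
    (start..=end)
        // A match must end at or before the end of the span.
        .filter(|&i| i + needle.len() <= end)
        .find(|&i| &haystack[i..i + needle.len()] == needle)
        .map(|i| (i, i + needle.len()))
}
```

Searching foobar for bar over 0..6 reports (3, 6), while restricting the span to 0..5 or 4..6 reports nothing because the match would not fit entirely inside the span.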

Implementations

impl<'h> Input<'h>

pub fn new<H: ?Sized + AsRef<[u8]>>(haystack: &'h H) -> Input<'h>

Create a new search configuration for the given haystack.

pub fn span<S: Into<Span>>(self, span: S) -> Input<'h>

Set the span for this search.

The span given must correspond to valid bounds for this search’s haystack (see the Panics section below). An invalid span is rejected when it is set, rather than when the search is executed.

This routine is generic over how a span is provided. While a Span may be given directly, one may also provide a std::ops::Range<usize>. To provide anything supported by range syntax, use the Input::range method.

The default span is the entire haystack.

Note that Input::range overrides this method and vice versa.

Panics

This panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.

Example

This example shows how the span of the search can impact whether a match is reported or not. This is particularly relevant for look-around operators, which might take things outside of the span into account when determining whether they match.

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Match, Input,
};

// Look for 'at', but as a distinct word.
let re = PikeVM::new(r"\bat\b")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();

// Our haystack contains 'at', but not as a distinct word.
let haystack = "batter";

// A standard search finds nothing, as expected.
let input = Input::new(haystack);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());

// But if we wanted to search starting at position '1', we might
// slice the haystack. If we do this, it's impossible for the \b
// anchors to take the surrounding context into account! And thus,
// a match is produced.
let input = Input::new(&haystack[1..3]);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..2)), caps.get_match());

// But if we specify the span of the search instead of slicing the
// haystack, then the regex engine can "see" outside of the span
// and resolve the anchors correctly.
let input = Input::new(haystack).span(1..3);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());
// The `?` above assumes an enclosing function that returns a `Result`.
Ok::<(), Box<dyn std::error::Error>>(())

This may seem a little ham-fisted, but this scenario tends to come up if some other regex engine found the match span and now you need to re-process that span to look for capturing groups. (For example, run a faster DFA first, find a match, then run the PikeVM on just the match span to resolve capturing groups.) In order to implement that sort of logic correctly, you need to set the span on the search instead of slicing the haystack directly.

The other advantage of using this routine to specify the bounds of the search is that the match offsets are still reported in terms of the original haystack. For example, the second search in the example above reported a match at position 0, even though at starts at offset 1 in the original haystack, because we sliced the haystack.
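To see why slicing changes the outcome, here is a std-only sketch of a word boundary check. It is an ASCII-only simplification with hypothetical function names (the \b used in the example above is Unicode-aware by default), but it shows how the answer depends on the bytes surrounding an offset.

```rust
/// Returns true if `b` is an ASCII word byte (`[0-9A-Za-z_]`).
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// ASCII-only model of `\b` at offset `at`: true when exactly one of the
/// bytes on either side of `at` is a word byte. Positions outside the
/// haystack count as non-word.
pub fn is_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before != after
}
```

In the full haystack batter, offsets 1 and 3 sit between two word bytes, so neither is a boundary. In the slice at, the same positions become offsets 0 and 2, the neighbors are gone, and both are boundaries.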

pub fn range<R: RangeBounds<usize>>(self, range: R) -> Input<'h>

Like Input::span, but accepts any range instead.

The range given must correspond to valid bounds for this search’s haystack (see the Panics section below). An invalid range is rejected when it is set, rather than when the search is executed.

The default range is the entire haystack.

Note that Input::span overrides this method and vice versa.

Panics

This routine will panic if the given range could not be converted to a valid Range. For example, this would panic when given 0..=usize::MAX since it cannot be represented using a half-open interval in terms of usize.

This also panics if the given range does not correspond to valid bounds in the haystack or the termination of a search.
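The first panic condition comes from converting an arbitrary range into a half-open one. The sketch below shows one way such a conversion could be written with std only; to_half_open is a hypothetical helper, and it returns None where the real method panics.

```rust
use std::ops::{Bound, Range, RangeBounds};

/// Converts any range over `usize` into a half-open `Range<usize>`, using
/// `len` as the end when the range has no upper bound. Returns `None` when
/// the result cannot be represented as a half-open interval.
pub fn to_half_open<R: RangeBounds<usize>>(range: R, len: usize) -> Option<Range<usize>> {
    let start = match range.start_bound() {
        Bound::Included(&i) => i,
        Bound::Excluded(&i) => i.checked_add(1)?,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        // An inclusive end of usize::MAX has no exclusive equivalent.
        Bound::Included(&i) => i.checked_add(1)?,
        Bound::Excluded(&i) => i,
        Bound::Unbounded => len,
    };
    Some(start..end)
}
```

Under this model, 2..=4 becomes 2..5, an unbounded range over a haystack of length 6 becomes 0..6, and 0..=usize::MAX cannot be converted.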

Example
use regex_automata::Input;

let input = Input::new("foobar");
assert_eq!(0..6, input.get_range());

let input = Input::new("foobar").range(2..=4);
assert_eq!(2..5, input.get_range());

pub fn anchored(self, mode: Anchored) -> Input<'h>

Sets the anchor mode of a search.

When a search is anchored (so that’s Anchored::Yes or Anchored::Pattern), a match must begin at the start of a search. When a search is not anchored (that’s Anchored::No), regex engines will behave as if the pattern started with a (?s-u:.)*?. This prefix permits a match to appear anywhere.

By default, the anchored mode is Anchored::No.

WARNING: this is subtly different than using a ^ at the start of your regex. A ^ forces a regex to match exclusively at the start of a haystack, regardless of where you begin your search. In contrast, anchoring a search will allow your regex to match anywhere in your haystack, but the match must start at the beginning of a search.

For example, consider the haystack aba and the following searches:

  1. The regex ^a is run with Anchored::No and searches aba starting at position 2. Since ^ requires the match to start at the beginning of the haystack and 2 > 0, no match is found.
  2. The regex a is run with Anchored::Yes and searches aba starting at position 2. This reports a match at [2, 3] since the match starts where the search started. Since there is no ^, there is no requirement for the match to start at the beginning of the haystack.
  3. The regex a is run with Anchored::Yes and searches aba starting at position 1. Since b corresponds to position 1 and since the search is anchored, it finds no match. While the regex matches at other positions, configuring the search to be anchored requires that it only report a match that begins at the same offset as the beginning of the search.
  4. The regex a is run with Anchored::No and searches aba starting at position 1. Since the search is not anchored and the regex does not start with ^, the search executes as if there is a (?s-u:.)*? prefix that permits it to match anywhere. Thus, it reports a match at [2, 3].

Note that the Anchored::Pattern mode is like Anchored::Yes, except it only reports matches for a particular pattern.
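The four cases above can be modeled with a toy search for the single byte a. This is a std-only sketch with a hypothetical function, where the caret flag stands in for a leading ^ in the pattern and the anchored flag stands in for Anchored::Yes; it is not how any engine in this crate is implemented.

```rust
/// Toy model of searching for the byte `a` in `haystack`, beginning at
/// `start`. Returns the match as (start, end) offsets into the haystack.
pub fn search_a(
    haystack: &[u8],
    start: usize,
    anchored: bool,
    caret: bool,
) -> Option<(usize, usize)> {
    // An anchored search may only try the position where the search begins.
    // An unanchored search may try every later position as well.
    let last = if anchored { start } else { haystack.len() };
    (start..=last)
        // `^` only matches at offset 0 of the haystack, no matter where
        // the search itself starts.
        .filter(|&i| !caret || i == 0)
        .find(|&i| haystack.get(i) == Some(&b'a'))
        .map(|i| (i, i + 1))
}
```

Running the four cases against aba gives no match, (2, 3), no match, and (2, 3), in that order.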

Example

This demonstrates the differences between an anchored search and a pattern that begins with ^ (as described in the above warning message).

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Anchored, Match, Input,
};

let haystack = "aba";

let re = PikeVM::new(r"^a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(2..3).anchored(Anchored::No);
re.search(&mut cache, &input, &mut caps);
// No match is found because 2 is not the beginning of the haystack,
// which is what ^ requires.
assert_eq!(None, caps.get_match());

let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(2..3).anchored(Anchored::Yes);
re.search(&mut cache, &input, &mut caps);
// An anchored search can still match anywhere in the haystack, it just
// must begin at the start of the search which is '2' in this case.
assert_eq!(Some(Match::must(0, 2..3)), caps.get_match());

let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(1..3).anchored(Anchored::Yes);
re.search(&mut cache, &input, &mut caps);
// No match is found since we start searching at offset 1 which
// corresponds to 'b'. Since there is no '(?s:.)*?' prefix, no match
// is found.
assert_eq!(None, caps.get_match());

let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(1..3).anchored(Anchored::No);
re.search(&mut cache, &input, &mut caps);
// Since anchored=no, an implicit '(?s:.)*?' prefix was added to the
// pattern. Even though the search starts at 'b', the 'match anything'
// prefix allows the search to match 'a'.
let expected = Some(Match::must(0, 2..3));
assert_eq!(expected, caps.get_match());
// The `?` above assumes an enclosing function that returns a `Result`.
Ok::<(), Box<dyn std::error::Error>>(())

pub fn earliest(self, yes: bool) -> Input<'h>

Whether to execute an “earliest” search or not.

When running a non-overlapping search, an “earliest” search will return the match location as early as possible. For example, given a pattern of foo[0-9]+ and a haystack of foo12345, a normal leftmost search will return foo12345 as a match. But an “earliest” search, for regex engines that support “earliest” semantics, will return foo1 as a match, since as soon as the first digit following foo is seen, the engine knows it has found a match.

Note that “earliest” semantics generally depend on the regex engine. Different regex engines may determine there is a match at different points. So there is no guarantee that “earliest” matches will always return the same offsets for all regex engines. The “earliest” notion is really about when the particular regex engine determines there is a match rather than a consistent semantic unto itself. This is often useful for implementing “did a match occur or not” predicates, but sometimes the offset is useful as well.
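The difference can be modeled with a hand-written matcher for foo[0-9]+. This is a std-only sketch with a hypothetical function name. It assumes an engine that stops at the first position where a match is known, which, as noted above, is not guaranteed to be the same for every engine.

```rust
/// Toy matcher for `foo[0-9]+` at the start of `haystack`. Returns the end
/// offset of the match, if there is one.
pub fn match_foo_digits(haystack: &[u8], earliest: bool) -> Option<usize> {
    if !haystack.starts_with(b"foo") {
        return None;
    }
    // Count the run of ASCII digits that follows `foo`.
    let digits = haystack[3..].iter().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        return None;
    }
    // In "earliest" mode, stop as soon as one digit confirms a match.
    // Otherwise consume every digit, as greedy matching would.
    Some(3 + if earliest { 1 } else { digits })
}
```

For the haystack foo12345, this model reports an end offset of 8 for a normal search and 4 for an “earliest” search.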

This is disabled by default.

Example

This example shows the difference between “earliest” searching and normal searching.

use regex_automata::{nfa::thompson::pikevm::PikeVM, Match, Input};

let re = PikeVM::new(r"foo[0-9]+")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();

// A normal search implements greediness like you expect.
let input = Input::new("foo12345");
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..8)), caps.get_match());

// When 'earliest' is enabled and the regex engine supports
// it, the search will bail once it knows a match has been
// found.
let input = Input::new("foo12345").earliest(true);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..4)), caps.get_match());
// The `?` above assumes an enclosing function that returns a `Result`.
Ok::<(), Box<dyn std::error::Error>>(())

pub fn set_span<S: Into<Span>>(&mut self, span: S)

Set the span for this search configuration.

This is like the Input::span method, except this mutates the span in place.

This routine is generic over how a span is provided. While a Span may be given directly, one may also provide a std::ops::Range<usize>.

Panics

This panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.

Example
use regex_automata::Input;

let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_span(2..4);
assert_eq!(2..4, input.get_range());

pub fn set_range<R: RangeBounds<usize>>(&mut self, range: R)

Set the span for this search configuration given any range.

This is like the Input::range method, except this mutates the span in place.

The range given must correspond to valid bounds for this search’s haystack (see the Panics section below). An invalid range is rejected when it is set, rather than when the search is executed.

Panics

This routine will panic if the given range could not be converted to a valid Range. For example, this would panic when given 0..=usize::MAX since it cannot be represented using a half-open interval in terms of usize.

This also panics if the given range does not correspond to valid bounds in the haystack or the termination of a search.

Example
use regex_automata::Input;

let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_range(2..=4);
assert_eq!(2..5, input.get_range());

pub fn set_start(&mut self, start: usize)

Set the starting offset for the span for this search configuration.

This is a convenience routine for only mutating the start of a span without having to set the entire span.

Panics

This panics if the span resulting from the new start position does not correspond to valid bounds in the haystack or the termination of a search.

Example
use regex_automata::Input;

let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_start(5);
assert_eq!(5..6, input.get_range());

pub fn set_end(&mut self, end: usize)

Set the ending offset for the span for this search configuration.

This is a convenience routine for only mutating the end of a span without having to set the entire span.

Panics

This panics if the span resulting from the new end position does not correspond to valid bounds in the haystack or the termination of a search.

Example
use regex_automata::Input;

let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_end(5);
assert_eq!(0..5, input.get_range());

pub fn set_anchored(&mut self, mode: Anchored)

Set the anchor mode of a search.

This is like Input::anchored, except it mutates the search configuration in place.

Example
use regex_automata::{Anchored, Input, PatternID};

let mut input = Input::new("foobar");
assert_eq!(Anchored::No, input.get_anchored());

let pid = PatternID::must(5);
input.set_anchored(Anchored::Pattern(pid));
assert_eq!(Anchored::Pattern(pid), input.get_anchored());

pub fn set_earliest(&mut self, yes: bool)

Set whether the search should execute in “earliest” mode or not.

This is like Input::earliest, except it mutates the search configuration in place.

Example
use regex_automata::Input;

let mut input = Input::new("foobar");
assert!(!input.get_earliest());
input.set_earliest(true);
assert!(input.get_earliest());

pub fn haystack(&self) -> &[u8]

Return a borrow of the underlying haystack as a slice of bytes.

Example
use regex_automata::Input;

let input = Input::new("foobar");
assert_eq!(b"foobar", input.haystack());

pub fn start(&self) -> usize

Return the start position of this search.

This is a convenience routine for input.get_span().start.

When Input::is_done is false, this is guaranteed to return an offset that is less than or equal to Input::end. Otherwise, the offset is one greater than Input::end.

Example
use regex_automata::Input;

let input = Input::new("foobar");
assert_eq!(0, input.start());

let input = Input::new("foobar").span(2..4);
assert_eq!(2, input.start());

pub fn end(&self) -> usize

Return the end position of this search.

This is a convenience routine for input.get_span().end.

This is guaranteed to return an offset that is a valid exclusive end bound for this input’s haystack.

Example
use regex_automata::Input;

let input = Input::new("foobar");
assert_eq!(6, input.end());

let input = Input::new("foobar").span(2..4);
assert_eq!(4, input.end());

pub fn get_span(&self) -> Span

Return the span for this search configuration.

If one was not explicitly set, then the span corresponds to the entire range of the haystack.

When Input::is_done is false, the span returned is guaranteed to correspond to valid bounds for this input’s haystack.

Example
use regex_automata::{Input, Span};

let input = Input::new("foobar");
assert_eq!(Span { start: 0, end: 6 }, input.get_span());

pub fn get_range(&self) -> Range<usize>

Return the span as a range for this search configuration.

If one was not explicitly set, then the span corresponds to the entire range of the haystack.

When Input::is_done is false, the range returned is guaranteed to correspond to valid bounds for this input’s haystack.

Example
use regex_automata::Input;

let input = Input::new("foobar");
assert_eq!(0..6, input.get_range());

pub fn get_anchored(&self) -> Anchored

Return the anchored mode for this search configuration.

If no anchored mode was set, then it defaults to Anchored::No.

Example
use regex_automata::{Anchored, Input, PatternID};

let mut input = Input::new("foobar");
assert_eq!(Anchored::No, input.get_anchored());

let pid = PatternID::must(5);
input.set_anchored(Anchored::Pattern(pid));
assert_eq!(Anchored::Pattern(pid), input.get_anchored());

pub fn get_earliest(&self) -> bool

Return whether this search should execute in “earliest” mode.

Example
use regex_automata::Input;

let input = Input::new("foobar");
assert!(!input.get_earliest());

pub fn is_done(&self) -> bool

Return true if and only if this search can never return any other matches.

This occurs when the start position of this search is greater than the end position of the search.

Example
use regex_automata::Input;

let mut input = Input::new("foobar");
assert!(!input.is_done());
input.set_start(6);
assert!(!input.is_done());
input.set_start(7);
assert!(input.is_done());
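
One way to read the “start is one greater than end” state is as the natural stopping point of a loop that steps past empty matches. The sketch below is a std-only model under that assumption, with a hypothetical function name; it is not taken from this crate’s iterators.

```rust
/// Collects the offsets of every empty match in a haystack of length `len`,
/// modeling a loop that keeps searching until the search is done.
pub fn empty_match_offsets(len: usize) -> Vec<usize> {
    let (mut start, end) = (0usize, len);
    let mut offsets = Vec::new();
    // The loop condition is the negation of "done" (start > end).
    while start <= end {
        // An empty pattern matches at the current start position.
        offsets.push(start);
        // Step past the empty match so the next search makes progress.
        // After the match at `end`, start becomes end + 1, which is the
        // one out-of-range start that valid bounds permit.
        start += 1;
    }
    offsets
}
```

A haystack of length 3 yields empty matches at offsets 0, 1, 2 and 3, after which the start position is 4 and the search is done.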

pub fn is_char_boundary(&self, offset: usize) -> bool

Returns true if and only if the given offset in this search’s haystack falls on a valid UTF-8 encoded codepoint boundary.

If the haystack is not valid UTF-8, then the behavior of this routine is unspecified.

Example

This shows where codepoint boundaries do and don’t exist in valid UTF-8.

use regex_automata::Input;

let input = Input::new("☃");
assert!(input.is_char_boundary(0));
assert!(!input.is_char_boundary(1));
assert!(!input.is_char_boundary(2));
assert!(input.is_char_boundary(3));
assert!(!input.is_char_boundary(4));
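
For readers curious how such a check can work at the byte level, here is a std-only sketch. It is one plausible approach written as a free function with a hypothetical signature, not necessarily what this method does internally, and it assumes the haystack is valid UTF-8.

```rust
/// Byte-level check for whether `offset` is a codepoint boundary in
/// `haystack`, assuming `haystack` is valid UTF-8.
pub fn is_char_boundary(haystack: &[u8], offset: usize) -> bool {
    match haystack.get(offset) {
        // Past the last byte: only the haystack length itself is a boundary.
        None => offset == haystack.len(),
        // Continuation bytes look like 0b10xx_xxxx and never start a
        // codepoint. Every other byte does.
        Some(&b) => (b & 0b1100_0000) != 0b1000_0000,
    }
}
```

For the three-byte snowman above, this agrees with the example: offsets 0 and 3 are boundaries, while 1, 2 and 4 are not.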

Trait Implementations

impl<'h> Clone for Input<'h>

fn clone(&self) -> Input<'h>

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl<'h> Debug for Input<'h>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl<'h, H: ?Sized + AsRef<[u8]>> From<&'h H> for Input<'h>

fn from(haystack: &'h H) -> Input<'h>

Converts to this type from the input type.

Auto Trait Implementations

impl<'h> RefUnwindSafe for Input<'h>

impl<'h> Send for Input<'h>

impl<'h> Sync for Input<'h>

impl<'h> Unpin for Input<'h>

impl<'h> UnwindSafe for Input<'h>

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.