Crate regex_test
A crate for defining tests in a TOML format and applying them to regex engine implementations.
Generally speaking, if you aren’t writing your own regex engine and looking to test it, then this crate is probably not for you. Moreover, this crate does not come with any actual tests. It merely defines the test format and provides some convenient routines for executing tests within the context of Rust unit tests.
§Format
The entire test corpus is derived from zero or more TOML files. Each TOML file contains zero or more tests, where each test is defined as a table via [[test]].
Each test has the following fields:
name
- A name for the test. It must be unique within its file. A test’s RegexTest::full_name is derived either via {group_name}/{name} or {group_name}/{name}/{additional_name}, with the latter being used only when TestRunner::expand is used. The group_name is derived from the file stem (the file name without the .toml suffix).
regex
- The regex to test. This is either a string or a (possibly empty) list of regex patterns. When using a list, the underlying regex engine is expected to support multiple patterns, where each pattern is identified by an ID starting from 0 and incrementing by 1 for each subsequent pattern.
haystack
- The text to search.
bounds
- An optional field whose value is a table with start and end fields, whose values must be valid for the given haystack. When set, the search will only execute within these bounds. When absent, the bounds correspond to start = 0 and end = haystack.len().
matches
- Zero or more match values. Each match value can be in one of four formats:
  - A simple span, i.e., [5, 12], corresponding to the start and end of the match, in byte offsets. The start is inclusive and the end is exclusive. The pattern ID for the match is assumed to be 0.
  - A table corresponding to the matching pattern ID and the span of the match. For example, { id = 5, span = [20, 21] }.
  - A list of capture group spans, with the first corresponding to the overall match and the pattern ID assumed to be 0. For example, [[5, 10], [6, 8], [], [9, 10]], where [] corresponds to a group present in the regex but one that did not participate in a match.
  - A table corresponding to the matching pattern ID and a list of spans corresponding to the capture groups. For example, { id = 5, spans = [[5, 10], [6, 8], [], [9, 10]] }. This is the most general, but also most verbose, syntax.
match-limit
- An optional field that specifies a limit on the number of matches. When absent, no limit is enforced and all matches should be reported by the regex engine. This can be useful, for example, when one only cares about the first match.
compiles
- An optional field indicating whether the regex is expected to compile. It defaults to true when absent. When true, if the regex does not compile, then the test fails. Conversely, when false, if the regex does compile, then the test fails.
anchored
- Whether to execute an anchored search or not. Note that this is not the same as adding a ^ to the beginning of your regex pattern. ^ always requires the regex to match at position 0, but an anchored search simply requires that the regex match at the starting position of the search. (The starting position of the search can be configured via the optional bounds field.)
case-insensitive
- Whether to match the regex case insensitively. This is disabled by default. There is no real difference between using this field and adding a (?i) to the beginning of your regex. (Some regex engines may not support (?i).)
unescape
- When enabled, the haystack is unescaped. Sequences like \x00 are turned into their corresponding byte values. This permits one to write haystacks that contain invalid UTF-8 without embedding actual invalid UTF-8 into a TOML file (which is not allowed). There is generally no other reason to enable unescape.
unicode
- When enabled, the regex pattern should be compiled with its corresponding Unicode mode enabled. For example, [^a] matches any UTF-8 encoding of any codepoint other than a. Case insensitivity should be Unicode aware. Unicode classes like \pL are available. The Perl classes \w, \s and \d should be Unicode aware. And so on. This is an optional field and is enabled by default.
utf8
- When this is enabled, all regex match substrings should be entirely valid UTF-8. While parts of the haystack the regex searches through may not be valid UTF-8, only the portions that are valid UTF-8 may be reported in match spans. Importantly, this includes zero-width matches. Zero-width matches must never split the UTF-8 encoding of a single codepoint when this is enabled. This is an optional field and is enabled by default.
line-terminator
- This sets the line terminator used by the multi-line assertions (?m:^) and (?m:$). It defaults to \n. It must be exactly one byte. This field is automatically unescaped in order to permit a non-ASCII byte.
match-kind
- May be one of all, leftmost-first or leftmost-longest. See MatchKind for more details. This is an optional field and defaults to leftmost-first.
search-kind
- May be one of earliest, leftmost or overlapping. See SearchKind for more details. This is an optional field and defaults to leftmost.
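To make the unescape and utf8 fields more concrete, here is a minimal sketch using only the standard library. The helpers unescape and empty_match_offsets are hypothetical names introduced for illustration and are not part of this crate's API. The sketch assumes only the \xNN escape form is handled, and the crate's actual unescaping may recognize more sequences.

```rust
/// Hypothetical helper (not this crate's API): decode `\xNN` escapes into
/// raw bytes. Everything else is copied through unchanged.
fn unescape(s: &str) -> Vec<u8> {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // Look for the four-byte form `\xNN`, where NN are two hex digits.
        if i + 4 <= bytes.len() && bytes[i] == b'\\' && bytes[i + 1] == b'x' {
            let hi = (bytes[i + 2] as char).to_digit(16);
            let lo = (bytes[i + 3] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 4;
                continue;
            }
        }
        // Not a recognized escape: keep the byte as-is.
        out.push(bytes[i]);
        i += 1;
    }
    out
}

/// Hypothetical helper: the offsets at which a zero-width match could be
/// reported in a valid UTF-8 haystack without splitting a codepoint.
fn empty_match_offsets(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .collect()
}

fn main() {
    // The TOML file stays valid UTF-8, but the haystack that is actually
    // searched contains the invalid byte 0xFF.
    let raw = unescape(r"a\xFFb");
    println!("{:?}", raw); // [97, 255, 98]
    println!("{}", std::str::from_utf8(&raw).is_err()); // true

    // "☃" occupies bytes 1..4, so offsets 2 and 3 fall inside its encoding.
    println!("{:?}", empty_match_offsets("a☃")); // [0, 1, 4]
}
```

With utf8 enabled, an engine under test would be expected to report empty matches only at offsets like the ones listed by the second helper.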
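The difference between the anchored field and a leading ^ can be illustrated with a toy matcher. The search function below is a hypothetical stand-in for a regex engine, introduced only for this sketch: it understands nothing but a literal with an optional leading ^, and it is not how this crate or any particular engine is implemented.

```rust
/// Toy stand-in for a regex engine (an assumption for illustration only):
/// finds a literal, optionally prefixed with `^`, in `haystack[start..end]`.
/// Returns the match as a (start, end) pair of byte offsets, end exclusive.
fn search(
    pattern: &str,
    haystack: &str,
    start: usize,
    end: usize,
    anchored: bool,
) -> Option<(usize, usize)> {
    // The bounds must be valid for the given haystack.
    assert!(start <= end && end <= haystack.len());
    let (caret, needle) = match pattern.strip_prefix('^') {
        Some(rest) => (true, rest),
        None => (false, pattern),
    };
    let (hay, needle) = (haystack.as_bytes(), needle.as_bytes());
    for at in start..=end {
        let stop = at + needle.len();
        if stop > end {
            break;
        }
        // `^` only ever matches at offset 0 of the whole haystack.
        let caret_ok = !caret || at == 0;
        if caret_ok && &hay[at..stop] == needle {
            return Some((at, stop));
        }
        // An anchored search may not advance past its starting position.
        if anchored {
            break;
        }
    }
    None
}

fn main() {
    // Anchored search with bounds starting at 3: the match must begin at 3.
    println!("{:?}", search("bar", "foobar", 3, 6, true)); // Some((3, 6))
    // `^` demands offset 0, which lies outside the bounds, so nothing matches.
    println!("{:?}", search("^bar", "foobar", 3, 6, false)); // None
    // Anchored at offset 0, where "bar" does not occur.
    println!("{:?}", search("bar", "foobar", 0, 6, true)); // None
    // An unanchored search is free to scan forward.
    println!("{:?}", search("bar", "foobar", 0, 6, false)); // Some((3, 6))
}
```

The first two calls show the distinction: an anchored search follows the configured start bound, while ^ stays tied to position 0 of the haystack.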
Re-exports§
pub extern crate anyhow;
Structs§
- Captures represents a single group of captured matches from a regex search.
- The result of compiling a regex.
- A single match, consisting of the pattern that matched and its span.
- A regex test describes the inputs and expected outputs of a regex match.
- A collection of regex tests.
- An iterator over regex tests.
- A span of contiguous bytes, from start to end, represented via byte offsets.
- The result of executing a regex search.
- A runner for executing regex tests.
Enums§
- The match semantics to use for a search.
- Represents the type of search to perform.