Reggy
A friendly regular expression dialect for text analytics. Typical regex features are removed/adjusted to make natural language queries easier. Unicode-aware and able to search a stream with several patterns at once.
cargo add reggy
API Usage
Use the high-level Pattern struct for simple search.
let mut p = new.unwrap;
assert_eq!;
Use the Ast struct to transpile to normal regex syntax.[^1]
let ast = parse.unwrap;
assert_eq!;
Search a Stream
Use the Search struct to search a stream with several patterns at once.
let mut search = compile.unwrap;
Call Search::next to begin searching. It will yield any matches deemed definitely-complete immediately.
let jane_match = new;
assert_eq!;
Call Search::next again to continue with the same search state.
Note that "John Doe" matched across the next boundary, and spans are relative to the start of the stream.
let john_match = new;
let money_match_1 = new;
let money_match_2 = new;
assert_eq!;
Call Search::finish to collect any not-definitely-complete matches once the stream is closed.
assert_eq!;
See more in the API docs.
Pattern Language
Reggy is case-insensitive by default. Spaces match any amount of whitespace (i.e. \s+). All the reserved characters mentioned below (\, (, ), ?, |, #, and !) may be escaped with a backslash for a literal match. Patterns are surrounded by implicit Unicode word boundaries (i.e. \b). Empty patterns and empty subpatterns are not permitted.
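When building patterns from user-supplied text, it can help to escape the reserved characters up front. The helper below is a minimal sketch of that idea; `escape_literal` and `RESERVED` are hypothetical names for illustration and are not part of the reggy API.

```rust
/// The reserved characters listed above: \ ( ) ? | # !
const RESERVED: [char; 7] = ['\\', '(', ')', '?', '|', '#', '!'];

/// Hypothetical helper (not provided by reggy): prefix every reserved
/// character with a backslash so the text is matched literally.
fn escape_literal(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if RESERVED.contains(&ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}
```

Under these assumptions, `escape_literal("C# (really)?")` produces `C\# \(really\)\?`, while text without reserved characters is returned unchanged.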
Examples
Make a character optional with ?
dogs? matches dog and dogs
Create two or more alternatives with |
dog|cat matches dog and cat
Create a sub-pattern with (...)
the qualit(y|ies) required matches the quality required and the qualities required
the only( one)? around matches the only around and the only one around
Create a case-sensitive sub-pattern with (!...)
United States of America|(!USA) matches USA, not usa
Match digits with #
#.## matches 3.14
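To make the `#` placeholder concrete, here is a small standalone sketch of how a digit template such as `#.##` could be checked against a candidate string. It is not how reggy matches internally: `matches_digit_template` is a hypothetical function, it assumes `#` stands for exactly one ASCII digit, and it ignores word boundaries, escapes, and every other pattern feature.

```rust
/// Hypothetical illustration: `#` matches one ASCII digit (an assumption),
/// every other character must match case-insensitively.
fn matches_digit_template(template: &str, text: &str) -> bool {
    let mut t = template.chars();
    let mut x = text.chars();
    loop {
        match (t.next(), x.next()) {
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // `#` consumes exactly one digit.
            (Some('#'), Some(c)) if c.is_ascii_digit() => {}
            (Some('#'), _) => return false,
            // Any other character is compared ignoring case.
            (Some(p), Some(c)) if p.to_lowercase().eq(c.to_lowercase()) => {}
            // Length mismatch or differing characters.
            _ => return false,
        }
    }
}
```

With this sketch, `matches_digit_template("#.##", "3.14")` holds, while `"3.1"` and `"a.14"` are rejected.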
Definitely-Complete Matches
Reggy follows greedy matching semantics. A pattern may match after one step of a stream, yet may match a longer form depending on the next step. For example, ab|abb will match s.next("ab"), but a subsequent call to s.next("b") would create a longer match, "abb", which should supersede the match "ab".
Search only yields matches once they are definitely complete and cannot be superseded by future next calls. Each pattern has a maximum byte length L (this is why unbounded quantifiers are absent from reggy). Once reggy has streamed L bytes past the start of a match (counting contiguous whitespace as 1 byte) without superseding it, that match will be yielded. Matches may be yielded earlier if the DFA reaches a dead state.
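The bound L can be pictured as a simple recursion over the pattern's structure. The sketch below uses a toy `Node` type invented for this illustration (it is not reggy's `Ast`), assumes each digit is one byte, and ignores case folding, which could change byte lengths for some Unicode text. The `is_definitely_complete` check reflects one reading of the rule above: a match can be released once L bytes past its start have been streamed.

```rust
/// Toy pattern tree for illustration only; not reggy's Ast.
enum Node {
    Literal(String),
    Digit,               // `#`, assumed to be one byte
    Space,               // contiguous whitespace, counted as 1 byte
    Optional(Box<Node>), // `x?`
    Or(Vec<Node>),       // `a|b`
    Seq(Vec<Node>),      // concatenation
}

/// Maximum byte length L of any string the toy pattern can match.
fn max_len(node: &Node) -> usize {
    match node {
        Node::Literal(s) => s.len(),
        Node::Digit | Node::Space => 1,
        Node::Optional(inner) => max_len(inner),
        Node::Or(alts) => alts.iter().map(max_len).max().unwrap_or(0),
        Node::Seq(parts) => parts.iter().map(max_len).sum(),
    }
}

/// True once `l` bytes past `match_start` have been streamed.
/// `streamed` is assumed to already count whitespace runs as 1 byte.
fn is_definitely_complete(match_start: usize, streamed: usize, l: usize) -> bool {
    streamed.saturating_sub(match_start) >= l
}
```

For `ab|abb` this gives L = 3, so a match starting at offset 0 is held back after 2 streamed bytes and released after 3.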
As a consequence, the Matches returned by a given Search are the same regardless of how a given haystack is segmented. Search::next returns Matches as soon as it practically can while respecting this invariant.
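The segmentation invariant can be demonstrated with a toy streaming searcher over plain literal alternatives. This is only a sketch of the idea, not reggy's implementation: `ToySearch` is a hypothetical type, it ignores case-insensitivity and word boundaries, it buffers the whole stream, and it is more conservative than the description above because it never releases a match early on a dead state.

```rust
/// Toy streaming searcher over literal alternatives (illustration only).
struct ToySearch {
    alternatives: Vec<&'static str>,
    max_len: usize, // L: byte length of the longest alternative
    buffer: String, // everything streamed so far
    cursor: usize,  // byte offset where scanning resumes
}

impl ToySearch {
    fn new(alternatives: Vec<&'static str>) -> Self {
        let max_len = alternatives.iter().map(|a| a.len()).max().unwrap_or(0);
        ToySearch { alternatives, max_len, buffer: String::new(), cursor: 0 }
    }

    /// Length of the longest non-empty alternative matching at `at`.
    fn longest_at(&self, at: usize) -> Option<usize> {
        let rest = &self.buffer.as_bytes()[at..];
        self.alternatives
            .iter()
            .filter(|a| !a.is_empty() && rest.starts_with(a.as_bytes()))
            .map(|a| a.len())
            .max()
    }

    fn scan(&mut self, closed: bool) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        while self.cursor < self.buffer.len() {
            // While the stream is open, wait until L bytes past the cursor
            // are available, so no longer match can appear later.
            if !closed && self.buffer.len() - self.cursor < self.max_len {
                break;
            }
            match self.longest_at(self.cursor) {
                Some(len) => {
                    out.push((self.cursor, self.cursor + len));
                    self.cursor += len;
                }
                None => self.cursor += 1,
            }
        }
        out
    }

    /// Feed the next chunk; returns spans that are definitely complete.
    fn next(&mut self, chunk: &str) -> Vec<(usize, usize)> {
        self.buffer.push_str(chunk);
        self.scan(false)
    }

    /// Close the stream and return the remaining spans.
    fn finish(&mut self) -> Vec<(usize, usize)> {
        self.scan(true)
    }
}
```

With the alternatives `ab` and `abb`, feeding "ab" and then "b" yields nothing on the first call and the span (0, 3) on the second, which is the same overall result as feeding "abb" in one piece.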
Implementation
The pattern language is parsed with lalrpop (grammar).
The search routines use a regex_automata::dense::DFA. Compared to other regex engines, the dense DFA is memory-intensive and slow to construct, but searches are fast. Unicode word boundaries are handled by the unicode_segmentation crate.
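The trade-off of a dense DFA can be seen in a hand-rolled miniature: each state stores a full row of 256 next-state entries, which costs memory, but searching is a single table lookup per input byte. The sketch below hard-codes a table for `ab|abb` and is purely illustrative; it does not use or mirror the regex_automata API, and `DenseDfa`, `build_ab_abb`, and `longest_match` are names made up for this example.

```rust
const DEAD: usize = 0;
const START: usize = 1;

/// Miniature dense DFA: one 256-entry row per state.
struct DenseDfa {
    table: Vec<[usize; 256]>,
    accepting: Vec<bool>,
}

/// Hand-built automaton for `ab|abb`.
/// States: 0 dead, 1 start, 2 after "a", 3 after "ab", 4 after "abb".
fn build_ab_abb() -> DenseDfa {
    let mut table = vec![[DEAD; 256]; 5];
    table[START][b'a' as usize] = 2;
    table[2][b'b' as usize] = 3;
    table[3][b'b' as usize] = 4;
    DenseDfa { table, accepting: vec![false, false, false, true, true] }
}

/// Length of the longest match anchored at the start of `input`.
/// Stops as soon as the dead state is reached, since no further
/// input can extend the match from there.
fn longest_match(dfa: &DenseDfa, input: &[u8]) -> Option<usize> {
    let mut state = START;
    let mut last = None;
    for (i, &byte) in input.iter().enumerate() {
        state = dfa.table[state][byte as usize];
        if state == DEAD {
            break;
        }
        if dfa.accepting[state] {
            last = Some(i + 1);
        }
    }
    last
}
```

On input "abc" the automaton accepts after "ab" and hits the dead state on "c", so the match of length 2 is known to be final without waiting for more input.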
[^1]: The resulting patterns are equivalent, except that reggy parses any continuous substring of spaces in the pattern as \s+, which is transpiled as , and surrounds patterns with implicit word boundaries, which are not transpiled.