pidgin
build non-recursive grammars
Pidgin is a Rust port of list_matcher. It can be used to generate simple, non-recursive grammars which allow one to obtain structured matches: parse trees.
Pidgin's grammars are implemented underlyingly as regular expressions with named matching groups. Rust's regular expression engine is very fast but it has a limitation which makes parsing hierarchically structured patterns with repeating components difficult: you cannot use a particular match name more than once in a pattern. This makes a simple grammar such as
foo -> bar baz | baz bar
bar -> '1'
baz -> '2'
problematic. The foo
rule uses both bar
and baz
twice.
Pidgin allows you to work around this restriction by managing group renaming.
Another disadvantage of regular expressions is that the more expressive they are the more difficult they are to read. Pidgin allows you to construct expressive regular expressions programmatically such that the intention of the pattern doesn't become obscured.
Example
use Pidgin;
use Error;
The final println!
in the example above produces a BNF-esque description of the
grammar:
TOP := (?i) \b{monthday}(?-i:\s+){month}(?-i:\s+){year}|{month}(?-i:\s+){monthday}(:?,(?-i:\s+){year})?|{weekday}(:?,(?-i:\s+){month}(?-i:\s+){monthday},(?-i:\s+){year})?|{numeric_date}\b
monthday := (?i) \b[4-9]|30?|1[0-9]?|2[0-9]?\b
month := (?i) \bMa(?:y|r(:?ch)?)|Oct(:?ober)?|Dec(:?ember)?|Feb(:?ruary)?|Nov(:?ember)?|Sep(:?tember)?|A(?:pr(:?il)?|ug(:?ust)?)|J(?:u(?:ly?|ne?)|an(:?uary)?)\b
year := [12][0-9]{3}|[0-9]{2}
weekday := (?i) (?:\b(?:Fr(:?i(:?day)?)?|Mo(:?n(:?day)?)?|We(:?d(:?nesday)?)?|S(?:u(:?n(:?day)?)?|a(:?t(:?urday)?)?)|T(?:u(:?e(:?sday)?)?|h(:?u(:?rsday)?)?))\b)|(?-i:\b[FMR-UW]\b)
numeric_date := (?i) \b{year}(?-i:\s*)(?:\-(?-i:\s*){numeric_months}(?-i:\s*)\-|/(?-i:\s*){numeric_months}(?-i:\s*)/)(?-i:\s*){numeric_days}|{numeric_days}(?-i:\s*)(?:\-(?-i:\s*){numeric_months}(?-i:\s*)\-|/(?-i:\s*){numeric_months}(?-i:\s*)/)(?-i:\s*){year}|{numeric_months}(?-i:\s*)(?:\-(?-i:\s*){numeric_days}(?-i:\s*)\-|/(?-i:\s*){numeric_days}(?-i:\s*)/)(?-i:\s*){year}\b
numeric_days := (?i) \b[4-9]|30?|0[1-9]|1[0-9]?|2[0-9]?\b
numeric_months := (?i) \b[2-9]|1[01]?|0[1-9]\b
Limitations
Recursion
Because you must define sub-grammars before you use them, and because Rust's regular expression engine and formalism cannot represent it in any case, you cannot produce a recursive grammar in Pidgin. Something like this is impossible:
XP -> XP conj XP
Unbounded Repetition
Currently you can produce something like this
foo -> bar{1,3}
but not
foo -> bar+
I am planning to introduce Pidgin macros which will allow this, but I am not sure when those will arrive.
Order of Definition
Typically a grammar is defined top down like so:
TOP -> foo | bar
foo -> baz plugh | xyzzy
bar -> xyzzy baz plugh
baz -> '1' | '2'
plugh -> 'cat' | 'dog'
xyzzy -> 'green' | 'blue' | 'grue'
Because in Pidgin sub-grammars need to be defned before they can be used in the definition of other rules, construction has to go bottom up.
I am planning to write Pidgin macros which will allow one to define a grammar top-down, but this is currently still in the pie-in-the-sky stage.
Gotchas
Named Captures in Foreign Rules
Pidgin does not generate unbounded repetitions and it only attempts to match the examples you present it. Sometimes this is inconvenient: you want to match all phone numbers but you don't want to provide all phone numbers as patterns, say. To work around this you can supply a "foreign rule". This is just a regular expression you have acquired outside of pidgin.
pidgin.foreign_rule.unwrap;
The unwrap
is necessary in this case because your expression might not
compile, so foreign_rule
returns a Result<(),regex::Error>
so you can
handle this exception.
This is all fine, but suppose you provide a foreign rule with a named capture:
pidgin.foreign_rule.unwrap;
This will compile when you define the rule, but if you use foo
in more than
one place in your grammar, this will cause a regex::Error
when you attempt
to generate a pidgin::Matcher
from your grammar.
Manual Rule Ordering
Pidgin attempts alternations easiest to hardest. In effect it always tries to
pick low-hanging fruit. This means if told to match "elephant" and "cat" it
will generate the pattern cat|elephant
. Generally this is what you want, but
it may happen that you want to prioritize rules differently. Pidgin provides
a mechanism for this: if you at the same rule more than once, this makes the
rule an alternation with the alternate order you specify:
pidgin.rule;
pidgin.rule;
This is all fine and good, but it means you cannot replace one rule with another simply by redefining it. If you wish to do this, you must first remove the rule:
pidgin.remove_rule;
Phrase List Clearing on Compilation
Because the general pattern of use for Pidgin is that you compile a bunch of
sub-grammars and then use them in other grammars, and because you generall don't
want the alternates in one sub-grammar to be included in the next you compile,
the compile
method clears the phrase list. It does not clear the rule list.
You usually don't want to clear the rule list, but there is a clear
method
which clears both.
The clear
method does not clear any flags you have set. If you want that as
well, you're better off just building a fresh Pidgin.
Effect on Matching Efficiency
The principle motivation behind Pidgin is simply to produce something like an
abstract syntax tree one can use to better understand matched text, but the
regular expressions it generates generally match as well as or better than
naively constructed regular expressions represented as alternations of
expressions. There is a simple benchmark suite included in the benches/
directory demonstrating this for certain cases. The suite compares matching
speed for "naive" regular expressions of the form foo|bar|baz
to those
generated by Pidgin, which involve no backtracking and, in this case, would
be something like foo|ba[rz]
. The benchmark suite compares non-matching to
matching and bounded to unbounded patterns, where a bounded pattern must match
from the first character to the last and an unbounded pattern must match
somewhere in the string searched. The mean match/non-match times for these 8
cases were
pidgin | match | bounded | time |
---|---|---|---|
✓ | ✓ | 5.1287 ms | |
✓ | 27.151 ms | ||
✓ | 15.757 ms | ||
168.24 ms | |||
✓ | ✓ | ✓ | 3.0683 ms |
✓ | ✓ | 4.0253 ms | |
✓ | ✓ | 5.0138 ms | |
✓ | 1.5124 ms |
The full report can be generated by running