Crate gallop [−] [src]
Gallop - General LL(1) parser
Example
let mut grammar: Grammar = BTreeMap::new(); grammar.insert("START", vec![ vec![RuleElement::NonTerminal("a+")], ]); grammar.insert("a+", vec![ vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")], ]); grammar.insert("a*", vec![ vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")], vec![RuleElement::Empty], ]); let mut parser = Parser::new(&mut grammar).unwrap(); assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal { symbol: "START", children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')], });
Tutorial
We'll use the above code to explain how to use Gallop
, but we'll do this by stepping
backwards from the bottom and work our way up...
A sucessful parsing results in a ParseTree
, which has the following type:
pub enum ParseTree<'a> { Terminal(char), NonTerminal { symbol: &'a str, children: Vec<ParseTree<'a>>, } }
ParseTree::Terminal
represents a single captured Unicode scalar value from the input.ParseTree::NonTerminal
is a recursive data structure to represent the current part of the parse tree
To parse an input string, use the parse()
function from a Parser
:
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal { symbol: "START", children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')], });
To create a Parser
, pass it a reference to a Grammar
:
let mut parser = Parser::new(&mut grammar).unwrap();
A Grammar
has the following type:
type Grammar<'a> = BTreeMap<NonTerminal<'a>, Vec<Rule<'a>>>;
The bare NonTerminal<'a>
represents the left-hand side of a production rule, whilst
each element in Vec<Rule<'a>>
contains the OR
-seperated right-hand side of a
production rule i.e:
1. a+ -> 'a' a* 2. a* -> 'a' a* | ε
is broken down into:
1. a+ -> 'a' a* 2. a* -> 'a' a* 3. a* -> ε
and is represented in code by:
grammar.insert("a+", vec![ vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")], ]); grammar.insert("a*", vec![ vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")], vec![RuleElement::Empty], ]);
As you can see in that last snippet of code, Rule
s are Vec<RuleElement>
s,
which there are three types:
RuleElement::Terminal(char)
- a single Unicode scalar value to scan againstRuleElement::NonTerminal(&str)
- an intermediary syntactic variableRuleElement::Empty
- the empty string
The last bit of code remaining is about the START
symbol, explained in the
following section.
Reserved Non-Terminals
The START Symbol
As Gallop
is a top-down parser, it needs to know which of it's NonTerminal
s
is the "start" symbol. To keep things consistent, Gallop
requires this to be START
:
let mut grammar: Grammar = BTreeMap::new(); grammar.insert("START", vec![ ... ]);
Convenience Non-Terminals
Instead of reinventing the wheel each time for commonly used tasks, there are a number of convenience non-terminals already predefined. These work just like every other non-terminal e.g:
let mut country_code: Grammar = BTreeMap::new(); country_code.insert("START", vec![ vec!["two-ascii-characters"], ]); country_code.insert("two-ascii-characters", vec![ vec!["ASCII-UPPERCASE", "ASCII-UPPERCASE"], ]);
Note: Using one of the following on the left-hand side of a Grammar
rule results
in a ReservedNonTerminal
GrammarError
:
ALPHABETIC
((0x00..0x10FFF).map(|c as char| c.is_lowercase() || c.is_uppercase())ALPHANUMERIC
(ALPHABETIC
,NUMERIC
)ASCII
(0x00..0x7F)ASCII-ALPHABETIC
(ASCII-LOWERCASE
,ASCII-UPPERCASE
)ASCII-ALPHANUMERIC
(ASCII-ALPHABETIC
,ASCII-DIGIT
)ASCII-CONTROL
(0x00..0x1F, 0x7F)ASCII-DIGIT
('0'..'9')ASCII-HEXDIGIT
(ASCII-DIGIT
, 'a'..'f', 'A'..'F')ASCII-HEXDIGIT-LOWERCASE
(ASCII-DIGIT
, 'a'..'f')ASCII-HEXDIGIT-UPPERCASE
(ASCII-DIGIT
, 'A'..'F')ASCII-LOWERCASE
('a'..'z')ASCII-UPPERCASE
('A'..'Z')ASCII-WHITESPACE
(0x0020, 0x0009, 0x000A, 0x000C, 0x000D)CONTROL
(0x00..0x10FFF).map(|c as char| c.is_control())LOWERCASE
(0x00..0x10FFF).map(|c as char| c.is_lowercase())NUMERIC
(0x00..0x10FFF).map(|c as char| c.is_numeric())UPPERCASE
(0x00..0x10FFF).map(|c as char| c.is_uppercase())WHITESPACE
(0x00..0x10FFF).map(|c as char| c.is_whitespace())
Rolling Up the Parse Tree
When dealing with complex inputs, it's sometimes more digestible to break the grammar
into lots of smaller, more composable non-terminals. The drawback in doing this
though, is the resulting ParseTree
can be really awkward to work with.
In the following example, we'll try to parse a string containing one or more 'a' characters:
let mut grammar: Grammar = BTreeMap::new(); grammar.insert("START", vec![ vec![ RuleElement::NonTerminal("one-or-more-a") ], ]); grammar.insert("one-or-more-a", vec![ vec![ RuleElement::Terminal('a'), RuleElement::NonTerminal("zero-or-more-a"), ], ]); grammar.insert("zero-or-more-a", vec![ vec![ RuleElement::Terminal('a'), RuleElement::NonTerminal("zero-or-more-a"), ], vec![ RuleElement::Empty, ], ]); let mut parser = Parser::new(&mut grammar).unwrap(); assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal { symbol: "START", children: vec![ ParseTree::NonTerminal { symbol: "one-or-more-a", children: vec![ ParseTree::Terminal('a'), ParseTree::NonTerminal { symbol: "zero-or-more-a", children: vec![ ParseTree::Terminal('a'), ParseTree::NonTerminal { symbol: "zero-or-more-a", children: vec![ ParseTree::Terminal('a'), ParseTree::NonTerminal { symbol: "zero-or-more-a", children: vec![], }, ], }, ], }, ], }, ], });
Yuk! And the problem gets worse as the number of 'a' characters increases within the input string.
It would be nice if we could hide some of these intermediate non-terminals... this is
what rollup()
does:
- Specify the
NonTerminal
s to you want to be rolled up - When a rolled up non-terminal is found:
- It's childen are promoted to siblings
- The rolled up non-terminal is removed
Here's the same example from above, but now we rollup one-or-more-a
and zero-or-more-a
:
let mut grammar: Grammar = BTreeMap::new(); grammar.insert("START", vec![ vec![ RuleElement::NonTerminal("one-or-more-a") ], ]); grammar.insert("one-or-more-a", vec![ vec![ RuleElement::Terminal('a'), RuleElement::NonTerminal("zero-or-more-a"), ], ]); grammar.insert("zero-or-more-a", vec![ vec![ RuleElement::Terminal('a'), RuleElement::NonTerminal("zero-or-more-a"), ], vec![ RuleElement::Empty, ], ]); let mut parser = Parser::new(&mut grammar).unwrap(); parser.rollup(vec![ "one-or-more-a", "zero-or-more-a", ]); assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal { symbol: "START", children: vec![ ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a'), ], });
Much nicer!
NOTE: As a convenience, non-terminals ending with '*'
and '+'
are rolled up automatically:
let mut grammar: Grammar = BTreeMap::new(); grammar.insert("START", vec![ vec![ RuleElement::NonTerminal("a+") ], ]); grammar.insert("a+", vec![ vec![ RuleElement::Terminal('a'), RuleElement::NonTerminal("a*"), ], ]); grammar.insert("a*", vec![ vec![ RuleElement::Terminal('a'), RuleElement::NonTerminal("a*"), ], vec![ RuleElement::Empty, ], ]); let mut parser = Parser::new(&mut grammar).unwrap(); assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal { symbol: "START", children: vec![ ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a'), ], });
Structs
Parser |
Holds the state of the parser for the provided |
Enums
GrammarError |
Possible errors that could happen when creating a |
ParseError |
Possible errors that could happen during parsing |
ParseTree |
The results from a parsing of the input text |
RuleElement |
Elements that make up each |
Type Definitions
Grammar |
The left and right hand side of the production rules |
NonTerminal |
An intermediary syntactic variable |
Rule |
The right-hand side of a |
Terminal |
A single Unicode scalar value |