Expand description
Gallop - General LL(1) parser
Example
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![RuleElement::NonTerminal("a+")],
]);
grammar.insert("a+", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);
grammar.insert("a*", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
vec![RuleElement::Empty],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});
Tutorial
We’ll use the above code to explain how to use Gallop
, but we’ll do this by stepping
backwards from the bottom and work our way up…
A sucessful parsing results in a ParseTree
, which has the following type:
pub enum ParseTree<'a> {
Terminal(char),
NonTerminal {
symbol: &'a str,
children: Vec<ParseTree<'a>>,
}
}
ParseTree::Terminal
represents a single captured Unicode scalar value from the input.ParseTree::NonTerminal
is a recursive data structure to represent the current part of the parse tree
To parse an input string, use the parse()
function from a Parser
:
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});
To create a Parser
, pass it a reference to a Grammar
:
let mut parser = Parser::new(&mut grammar).unwrap();
A Grammar
has the following type:
type Grammar<'a> = BTreeMap<NonTerminal<'a>, Vec<Rule<'a>>>;
The bare NonTerminal<'a>
represents the left-hand side of a production rule, whilst
each element in Vec<Rule<'a>>
contains the OR
-seperated right-hand side of a
production rule i.e:
1. a+ -> 'a' a*
2. a* -> 'a' a* | ε
is broken down into:
1. a+ -> 'a' a*
2. a* -> 'a' a*
3. a* -> ε
and is represented in code by:
grammar.insert("a+", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);
grammar.insert("a*", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
vec![RuleElement::Empty],
]);
As you can see in that last snippet of code, Rule
s are Vec<RuleElement>
s,
which there are three types:
RuleElement::Terminal(char)
- a single Unicode scalar value to scan againstRuleElement::NonTerminal(&str)
- an intermediary syntactic variableRuleElement::Empty
- the empty string
The last bit of code remaining is about the START
symbol, explained in the
following section.
Reserved Non-Terminals
The START Symbol
As Gallop
is a top-down parser, it needs to know which of it’s NonTerminal
s
is the “start” symbol. To keep things consistent, Gallop
requires this to be START
:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
...
]);
Convenience Non-Terminals
Instead of reinventing the wheel each time for commonly used tasks, there are a number of convenience non-terminals already predefined. These work just like every other non-terminal e.g:
let mut country_code: Grammar = BTreeMap::new();
country_code.insert("START", vec![
vec!["two-ascii-characters"],
]);
country_code.insert("two-ascii-characters", vec![
vec!["ASCII-UPPERCASE", "ASCII-UPPERCASE"],
]);
Note: Using one of the following on the left-hand side of a Grammar
rule results
in a ReservedNonTerminal
GrammarError
:
ALPHABETIC
((0x00..0x10FFF).map(|c as char| c.is_lowercase() || c.is_uppercase())ALPHANUMERIC
(ALPHABETIC
,NUMERIC
)ASCII
(0x00..0x7F)ASCII-ALPHABETIC
(ASCII-LOWERCASE
,ASCII-UPPERCASE
)ASCII-ALPHANUMERIC
(ASCII-ALPHABETIC
,ASCII-DIGIT
)ASCII-CONTROL
(0x00..0x1F, 0x7F)ASCII-DIGIT
(‘0’..‘9’)ASCII-HEXDIGIT
(ASCII-DIGIT
, ‘a’..‘f’, ‘A’..‘F’)ASCII-HEXDIGIT-LOWERCASE
(ASCII-DIGIT
, ‘a’..‘f’)ASCII-HEXDIGIT-UPPERCASE
(ASCII-DIGIT
, ‘A’..‘F’)ASCII-LOWERCASE
(‘a’..‘z’)ASCII-UPPERCASE
(‘A’..‘Z’)ASCII-WHITESPACE
(0x0020, 0x0009, 0x000A, 0x000C, 0x000D)CONTROL
(0x00..0x10FFF).map(|c as char| c.is_control())LOWERCASE
(0x00..0x10FFF).map(|c as char| c.is_lowercase())NUMERIC
(0x00..0x10FFF).map(|c as char| c.is_numeric())UPPERCASE
(0x00..0x10FFF).map(|c as char| c.is_uppercase())WHITESPACE
(0x00..0x10FFF).map(|c as char| c.is_whitespace())
Rolling Up the Parse Tree
When dealing with complex inputs, it’s sometimes more digestible to break the grammar
into lots of smaller, more composable non-terminals. The drawback in doing this
though, is the resulting ParseTree
can be really awkward to work with.
In the following example, we’ll try to parse a string containing one or more ‘a’ characters:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![
RuleElement::NonTerminal("one-or-more-a")
],
]);
grammar.insert("one-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
]);
grammar.insert("zero-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
vec![
RuleElement::Empty,
],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![
ParseTree::NonTerminal {
symbol: "one-or-more-a",
children: vec![
ParseTree::Terminal('a'),
ParseTree::NonTerminal {
symbol: "zero-or-more-a",
children: vec![
ParseTree::Terminal('a'),
ParseTree::NonTerminal {
symbol: "zero-or-more-a",
children: vec![
ParseTree::Terminal('a'),
ParseTree::NonTerminal {
symbol: "zero-or-more-a",
children: vec![],
},
],
},
],
},
],
},
],
});
Yuk! And the problem gets worse as the number of ‘a’ characters increases within the input string.
It would be nice if we could hide some of these intermediate non-terminals… this is
what rollup()
does:
- Specify the
NonTerminal
s to you want to be rolled up - When a rolled up non-terminal is found:
- It’s childen are promoted to siblings
- The rolled up non-terminal is removed
Here’s the same example from above, but now we rollup one-or-more-a
and zero-or-more-a
:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![
RuleElement::NonTerminal("one-or-more-a")
],
]);
grammar.insert("one-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
]);
grammar.insert("zero-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
vec![
RuleElement::Empty,
],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
parser.rollup(vec![
"one-or-more-a",
"zero-or-more-a",
]);
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
],
});
Much nicer!
NOTE: As a convenience, non-terminals ending with '*'
and '+'
are rolled up automatically:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![
RuleElement::NonTerminal("a+")
],
]);
grammar.insert("a+", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("a*"),
],
]);
grammar.insert("a*", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("a*"),
],
vec![
RuleElement::Empty,
],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
],
});
Structs
Holds the state of the parser for the provided Grammar
Enums
Possible errors that could happen when creating a Parser
Possible errors that could happen during parsing
The results from a parsing of the input text
Elements that make up each Rule
Type Definitions
The left and right hand side of the production rules
An intermediary syntactic variable
The right-hand side of a Grammar
production rule
A single Unicode scalar value