Expand description
Gallop - General LL(1) parser
§Example
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![RuleElement::NonTerminal("a+")],
]);
grammar.insert("a+", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);
grammar.insert("a*", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
vec![RuleElement::Empty],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});§Tutorial
We’ll use the above code to explain how to use Gallop, but we’ll do this by stepping
backwards from the bottom and work our way up…
A sucessful parsing results in a ParseTree, which has the following type:
pub enum ParseTree<'a> {
Terminal(char),
NonTerminal {
symbol: &'a str,
children: Vec<ParseTree<'a>>,
}
}ParseTree::Terminalrepresents a single captured Unicode scalar value from the input.ParseTree::NonTerminalis a recursive data structure to represent the current part of the parse tree
To parse an input string, use the parse() function from a Parser:
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});To create a Parser, pass it a reference to a Grammar:
let mut parser = Parser::new(&mut grammar).unwrap();A Grammar has the following type:
type Grammar<'a> = BTreeMap<NonTerminal<'a>, Vec<Rule<'a>>>;The bare NonTerminal<'a> represents the left-hand side of a production rule, whilst
each element in Vec<Rule<'a>> contains the OR-seperated right-hand side of a
production rule i.e:
1. a+ -> 'a' a*
2. a* -> 'a' a* | εis broken down into:
1. a+ -> 'a' a*
2. a* -> 'a' a*
3. a* -> εand is represented in code by:
grammar.insert("a+", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);
grammar.insert("a*", vec![
vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
vec![RuleElement::Empty],
]);
As you can see in that last snippet of code, Rules are Vec<RuleElement>s,
which there are three types:
RuleElement::Terminal(char)- a single Unicode scalar value to scan againstRuleElement::NonTerminal(&str)- an intermediary syntactic variableRuleElement::Empty- the empty string
The last bit of code remaining is about the START symbol, explained in the
following section.
§Reserved Non-Terminals
The START Symbol
As Gallop is a top-down parser, it needs to know which of it’s NonTerminals
is the “start” symbol. To keep things consistent, Gallop requires this to be START:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
...
]);
Convenience Non-Terminals
Instead of reinventing the wheel each time for commonly used tasks, there are a number of convenience non-terminals already predefined. These work just like every other non-terminal e.g:
let mut country_code: Grammar = BTreeMap::new();
country_code.insert("START", vec![
vec!["two-ascii-characters"],
]);
country_code.insert("two-ascii-characters", vec![
vec!["ASCII-UPPERCASE", "ASCII-UPPERCASE"],
]);Note: Using one of the following on the left-hand side of a Grammar rule results
in a ReservedNonTerminal GrammarError:
ALPHABETIC((0x00..0x10FFF).map(|c as char| c.is_lowercase() || c.is_uppercase())ALPHANUMERIC(ALPHABETIC,NUMERIC)ASCII(0x00..0x7F)ASCII-ALPHABETIC(ASCII-LOWERCASE,ASCII-UPPERCASE)ASCII-ALPHANUMERIC(ASCII-ALPHABETIC,ASCII-DIGIT)ASCII-CONTROL(0x00..0x1F, 0x7F)ASCII-DIGIT(‘0’..‘9’)ASCII-HEXDIGIT(ASCII-DIGIT, ‘a’..‘f’, ‘A’..‘F’)ASCII-HEXDIGIT-LOWERCASE(ASCII-DIGIT, ‘a’..‘f’)ASCII-HEXDIGIT-UPPERCASE(ASCII-DIGIT, ‘A’..‘F’)ASCII-LOWERCASE(‘a’..‘z’)ASCII-UPPERCASE(‘A’..‘Z’)ASCII-WHITESPACE(0x0020, 0x0009, 0x000A, 0x000C, 0x000D)CONTROL(0x00..0x10FFF).map(|c as char| c.is_control())LOWERCASE(0x00..0x10FFF).map(|c as char| c.is_lowercase())NUMERIC(0x00..0x10FFF).map(|c as char| c.is_numeric())UPPERCASE(0x00..0x10FFF).map(|c as char| c.is_uppercase())WHITESPACE(0x00..0x10FFF).map(|c as char| c.is_whitespace())
§Rolling Up the Parse Tree
When dealing with complex inputs, it’s sometimes more digestible to break the grammar
into lots of smaller, more composable non-terminals. The drawback in doing this
though, is the resulting ParseTree can be really awkward to work with.
In the following example, we’ll try to parse a string containing one or more ‘a’ characters:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![
RuleElement::NonTerminal("one-or-more-a")
],
]);
grammar.insert("one-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
]);
grammar.insert("zero-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
vec![
RuleElement::Empty,
],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![
ParseTree::NonTerminal {
symbol: "one-or-more-a",
children: vec![
ParseTree::Terminal('a'),
ParseTree::NonTerminal {
symbol: "zero-or-more-a",
children: vec![
ParseTree::Terminal('a'),
ParseTree::NonTerminal {
symbol: "zero-or-more-a",
children: vec![
ParseTree::Terminal('a'),
ParseTree::NonTerminal {
symbol: "zero-or-more-a",
children: vec![],
},
],
},
],
},
],
},
],
});
Yuk! And the problem gets worse as the number of ‘a’ characters increases within the input string.
It would be nice if we could hide some of these intermediate non-terminals… this is
what rollup() does:
- Specify the
NonTerminals to you want to be rolled up - When a rolled up non-terminal is found:
- It’s childen are promoted to siblings
- The rolled up non-terminal is removed
Here’s the same example from above, but now we rollup one-or-more-a and zero-or-more-a:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![
RuleElement::NonTerminal("one-or-more-a")
],
]);
grammar.insert("one-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
]);
grammar.insert("zero-or-more-a", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("zero-or-more-a"),
],
vec![
RuleElement::Empty,
],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
parser.rollup(vec![
"one-or-more-a",
"zero-or-more-a",
]);
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
],
});
Much nicer!
NOTE: As a convenience, non-terminals ending with '*' and '+' are rolled up automatically:
let mut grammar: Grammar = BTreeMap::new();
grammar.insert("START", vec![
vec![
RuleElement::NonTerminal("a+")
],
]);
grammar.insert("a+", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("a*"),
],
]);
grammar.insert("a*", vec![
vec![
RuleElement::Terminal('a'),
RuleElement::NonTerminal("a*"),
],
vec![
RuleElement::Empty,
],
]);
let mut parser = Parser::new(&mut grammar).unwrap();
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
symbol: "START",
children: vec![
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
ParseTree::Terminal('a'),
],
});Structs§
- Parser
- Holds the state of the parser for the provided
Grammar
Enums§
- Grammar
Error - Possible errors that could happen when creating a
Parser - Parse
Error - Possible errors that could happen during parsing
- Parse
Tree - The results from a parsing of the input text
- Rule
Element - Elements that make up each
Rule
Type Aliases§
- Grammar
- The left and right hand side of the production rules
- NonTerminal
- An intermediary syntactic variable
- Rule
- The right-hand side of a
Grammarproduction rule - Terminal
- A single Unicode scalar value