Expand description

Gallop - General LL(1) parser

Example

let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![RuleElement::NonTerminal("a+")],
]);

grammar.insert("a+", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);

grammar.insert("a*", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
  vec![RuleElement::Empty],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});

Tutorial

We’ll use the above code to explain how to use Gallop, but we’ll do this by stepping backwards from the bottom and work our way up…

A sucessful parsing results in a ParseTree, which has the following type:

pub enum ParseTree<'a> {
  Terminal(char),
  NonTerminal {
    symbol: &'a str,
    children: Vec<ParseTree<'a>>,
  }
}
  • ParseTree::Terminal represents a single captured Unicode scalar value from the input.
  • ParseTree::NonTerminal is a recursive data structure to represent the current part of the parse tree

To parse an input string, use the parse() function from a Parser:

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});

To create a Parser, pass it a reference to a Grammar:

let mut parser = Parser::new(&mut grammar).unwrap();

A Grammar has the following type:

type Grammar<'a> = BTreeMap<NonTerminal<'a>, Vec<Rule<'a>>>;

The bare NonTerminal<'a> represents the left-hand side of a production rule, whilst each element in Vec<Rule<'a>> contains the OR-seperated right-hand side of a production rule i.e:

1. a+ -> 'a' a*
2. a* -> 'a' a* | ε

is broken down into:

1. a+ -> 'a' a*
2. a* -> 'a' a*
3. a* -> ε

and is represented in code by:

grammar.insert("a+", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);

grammar.insert("a*", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
  vec![RuleElement::Empty],
]);

As you can see in that last snippet of code, Rules are Vec<RuleElement>s, which there are three types:

  1. RuleElement::Terminal(char) - a single Unicode scalar value to scan against
  2. RuleElement::NonTerminal(&str) - an intermediary syntactic variable
  3. RuleElement::Empty- the empty string

The last bit of code remaining is about the START symbol, explained in the following section.

Reserved Non-Terminals

The START Symbol

As Gallop is a top-down parser, it needs to know which of it’s NonTerminals is the “start” symbol. To keep things consistent, Gallop requires this to be START:

let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  ...
]);

Convenience Non-Terminals

Instead of reinventing the wheel each time for commonly used tasks, there are a number of convenience non-terminals already predefined. These work just like every other non-terminal e.g:

let mut country_code: Grammar = BTreeMap::new();

country_code.insert("START", vec![
  vec!["two-ascii-characters"],
]);

country_code.insert("two-ascii-characters", vec![
  vec!["ASCII-UPPERCASE", "ASCII-UPPERCASE"],
]);

Note: Using one of the following on the left-hand side of a Grammar rule results in a ReservedNonTerminal GrammarError:

  • ALPHABETIC ((0x00..0x10FFF).map(|c as char| c.is_lowercase() || c.is_uppercase())
  • ALPHANUMERIC (ALPHABETIC, NUMERIC)
  • ASCII (0x00..0x7F)
  • ASCII-ALPHABETIC (ASCII-LOWERCASE, ASCII-UPPERCASE)
  • ASCII-ALPHANUMERIC (ASCII-ALPHABETIC, ASCII-DIGIT)
  • ASCII-CONTROL (0x00..0x1F, 0x7F)
  • ASCII-DIGIT (‘0’..‘9’)
  • ASCII-HEXDIGIT (ASCII-DIGIT, ‘a’..‘f’, ‘A’..‘F’)
  • ASCII-HEXDIGIT-LOWERCASE (ASCII-DIGIT, ‘a’..‘f’)
  • ASCII-HEXDIGIT-UPPERCASE (ASCII-DIGIT, ‘A’..‘F’)
  • ASCII-LOWERCASE (‘a’..‘z’)
  • ASCII-UPPERCASE (‘A’..‘Z’)
  • ASCII-WHITESPACE (0x0020, 0x0009, 0x000A, 0x000C, 0x000D)
  • CONTROL (0x00..0x10FFF).map(|c as char| c.is_control())
  • LOWERCASE (0x00..0x10FFF).map(|c as char| c.is_lowercase())
  • NUMERIC (0x00..0x10FFF).map(|c as char| c.is_numeric())
  • UPPERCASE (0x00..0x10FFF).map(|c as char| c.is_uppercase())
  • WHITESPACE (0x00..0x10FFF).map(|c as char| c.is_whitespace())

Rolling Up the Parse Tree

When dealing with complex inputs, it’s sometimes more digestible to break the grammar into lots of smaller, more composable non-terminals. The drawback in doing this though, is the resulting ParseTree can be really awkward to work with.

In the following example, we’ll try to parse a string containing one or more ‘a’ characters:

let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![
    RuleElement::NonTerminal("one-or-more-a")
  ],
]);

grammar.insert("one-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
]);

grammar.insert("zero-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
  vec![
    RuleElement::Empty,
  ],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![
        ParseTree::NonTerminal {
            symbol:   "one-or-more-a",
            children: vec![
                ParseTree::Terminal('a'),
                ParseTree::NonTerminal {
                    symbol:   "zero-or-more-a",
                    children: vec![
                        ParseTree::Terminal('a'),
                        ParseTree::NonTerminal {
                            symbol:   "zero-or-more-a",
                            children: vec![
                                ParseTree::Terminal('a'),
                                ParseTree::NonTerminal {
                                    symbol:   "zero-or-more-a",
                                    children: vec![],
                                },
                            ],
                        },
                    ],
                },
            ],
        },
    ],
});

Yuk! And the problem gets worse as the number of ‘a’ characters increases within the input string. It would be nice if we could hide some of these intermediate non-terminals… this is what rollup() does:

  • Specify the NonTerminals to you want to be rolled up
  • When a rolled up non-terminal is found:
    • It’s childen are promoted to siblings
    • The rolled up non-terminal is removed

Here’s the same example from above, but now we rollup one-or-more-a and zero-or-more-a:

let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![
    RuleElement::NonTerminal("one-or-more-a")
  ],
]);

grammar.insert("one-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
]);

grammar.insert("zero-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
  vec![
    RuleElement::Empty,
  ],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

parser.rollup(vec![
    "one-or-more-a",
    "zero-or-more-a",
]);

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
    ],
});

Much nicer!

NOTE: As a convenience, non-terminals ending with '*' and '+' are rolled up automatically:

let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![
    RuleElement::NonTerminal("a+")
  ],
]);

grammar.insert("a+", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("a*"),
  ],
]);

grammar.insert("a*", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("a*"),
  ],
  vec![
    RuleElement::Empty,
  ],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
    ],
});

Structs

Holds the state of the parser for the provided Grammar

Enums

Possible errors that could happen when creating a Parser

Possible errors that could happen during parsing

The results from a parsing of the input text

Elements that make up each Rule

Type Definitions

The left and right hand side of the production rules

An intermediary syntactic variable

The right-hand side of a Grammar production rule

A single Unicode scalar value