Crate gallop[][src]

Gallop - General LL(1) parser

Example

This example is not tested
let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![RuleElement::NonTerminal("a+")],
]);

grammar.insert("a+", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);

grammar.insert("a*", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
  vec![RuleElement::Empty],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});

Tutorial

We'll use the above code to explain how to use Gallop, but we'll do this by stepping backwards from the bottom and work our way up...

A sucessful parsing results in a ParseTree, which has the following type:

This example is not tested
pub enum ParseTree<'a> {
  Terminal(char),
  NonTerminal {
    symbol: &'a str,
    children: Vec<ParseTree<'a>>,
  }
}
  • ParseTree::Terminal represents a single captured Unicode scalar value from the input.
  • ParseTree::NonTerminal is a recursive data structure to represent the current part of the parse tree

To parse an input string, use the parse() function from a Parser:

This example is not tested
assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![ParseTree::Terminal('a'), ParseTree::Terminal('a'), ParseTree::Terminal('a')],
});

To create a Parser, pass it a reference to a Grammar:

This example is not tested
let mut parser = Parser::new(&mut grammar).unwrap();

A Grammar has the following type:

This example is not tested
type Grammar<'a> = BTreeMap<NonTerminal<'a>, Vec<Rule<'a>>>;

The bare NonTerminal<'a> represents the left-hand side of a production rule, whilst each element in Vec<Rule<'a>> contains the OR-seperated right-hand side of a production rule i.e:

This example is not tested
1. a+ -> 'a' a*
2. a* -> 'a' a* | ε

is broken down into:

This example is not tested
1. a+ -> 'a' a*
2. a* -> 'a' a*
3. a* -> ε

and is represented in code by:

This example is not tested
grammar.insert("a+", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
]);

grammar.insert("a*", vec![
  vec![RuleElement::Terminal('a'), RuleElement::NonTerminal("a*")],
  vec![RuleElement::Empty],
]);

As you can see in that last snippet of code, Rules are Vec<RuleElement>s, which there are three types:

  1. RuleElement::Terminal(char) - a single Unicode scalar value to scan against
  2. RuleElement::NonTerminal(&str) - an intermediary syntactic variable
  3. RuleElement::Empty- the empty string

The last bit of code remaining is about the START symbol, explained in the following section.

Reserved Non-Terminals

The START Symbol

As Gallop is a top-down parser, it needs to know which of it's NonTerminals is the "start" symbol. To keep things consistent, Gallop requires this to be START:

This example is not tested
let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  ...
]);

Convenience Non-Terminals

Instead of reinventing the wheel each time for commonly used tasks, there are a number of convenience non-terminals already predefined. These work just like every other non-terminal e.g:

This example is not tested
let mut country_code: Grammar = BTreeMap::new();

country_code.insert("START", vec![
  vec!["two-ascii-characters"],
]);

country_code.insert("two-ascii-characters", vec![
  vec!["ASCII-UPPERCASE", "ASCII-UPPERCASE"],
]);

Note: Using one of the following on the left-hand side of a Grammar rule results in a ReservedNonTerminal GrammarError:

  • ALPHABETIC ((0x00..0x10FFF).map(|c as char| c.is_lowercase() || c.is_uppercase())
  • ALPHANUMERIC (ALPHABETIC, NUMERIC)
  • ASCII (0x00..0x7F)
  • ASCII-ALPHABETIC (ASCII-LOWERCASE, ASCII-UPPERCASE)
  • ASCII-ALPHANUMERIC (ASCII-ALPHABETIC, ASCII-DIGIT)
  • ASCII-CONTROL (0x00..0x1F, 0x7F)
  • ASCII-DIGIT ('0'..'9')
  • ASCII-HEXDIGIT (ASCII-DIGIT, 'a'..'f', 'A'..'F')
  • ASCII-HEXDIGIT-LOWERCASE (ASCII-DIGIT, 'a'..'f')
  • ASCII-HEXDIGIT-UPPERCASE (ASCII-DIGIT, 'A'..'F')
  • ASCII-LOWERCASE ('a'..'z')
  • ASCII-UPPERCASE ('A'..'Z')
  • ASCII-WHITESPACE (0x0020, 0x0009, 0x000A, 0x000C, 0x000D)
  • CONTROL (0x00..0x10FFF).map(|c as char| c.is_control())
  • LOWERCASE (0x00..0x10FFF).map(|c as char| c.is_lowercase())
  • NUMERIC (0x00..0x10FFF).map(|c as char| c.is_numeric())
  • UPPERCASE (0x00..0x10FFF).map(|c as char| c.is_uppercase())
  • WHITESPACE (0x00..0x10FFF).map(|c as char| c.is_whitespace())

Rolling Up the Parse Tree

When dealing with complex inputs, it's sometimes more digestible to break the grammar into lots of smaller, more composable non-terminals. The drawback in doing this though, is the resulting ParseTree can be really awkward to work with.

In the following example, we'll try to parse a string containing one or more 'a' characters:

This example is not tested
let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![
    RuleElement::NonTerminal("one-or-more-a")
  ],
]);

grammar.insert("one-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
]);

grammar.insert("zero-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
  vec![
    RuleElement::Empty,
  ],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![
        ParseTree::NonTerminal {
            symbol:   "one-or-more-a",
            children: vec![
                ParseTree::Terminal('a'),
                ParseTree::NonTerminal {
                    symbol:   "zero-or-more-a",
                    children: vec![
                        ParseTree::Terminal('a'),
                        ParseTree::NonTerminal {
                            symbol:   "zero-or-more-a",
                            children: vec![
                                ParseTree::Terminal('a'),
                                ParseTree::NonTerminal {
                                    symbol:   "zero-or-more-a",
                                    children: vec![],
                                },
                            ],
                        },
                    ],
                },
            ],
        },
    ],
});

Yuk! And the problem gets worse as the number of 'a' characters increases within the input string. It would be nice if we could hide some of these intermediate non-terminals... this is what rollup() does:

  • Specify the NonTerminals to you want to be rolled up
  • When a rolled up non-terminal is found:
    • It's childen are promoted to siblings
    • The rolled up non-terminal is removed

Here's the same example from above, but now we rollup one-or-more-a and zero-or-more-a:

This example is not tested
let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![
    RuleElement::NonTerminal("one-or-more-a")
  ],
]);

grammar.insert("one-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
]);

grammar.insert("zero-or-more-a", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("zero-or-more-a"),
  ],
  vec![
    RuleElement::Empty,
  ],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

parser.rollup(vec![
    "one-or-more-a",
    "zero-or-more-a",
]);

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
    ],
});

Much nicer!

NOTE: As a convenience, non-terminals ending with '*' and '+' are rolled up automatically:

This example is not tested
let mut grammar: Grammar = BTreeMap::new();

grammar.insert("START", vec![
  vec![
    RuleElement::NonTerminal("a+")
  ],
]);

grammar.insert("a+", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("a*"),
  ],
]);

grammar.insert("a*", vec![
  vec![
    RuleElement::Terminal('a'),
    RuleElement::NonTerminal("a*"),
  ],
  vec![
    RuleElement::Empty,
  ],
]);

let mut parser = Parser::new(&mut grammar).unwrap();

assert!(parser.parse("aaa").unwrap() == ParseTree::NonTerminal {
    symbol:   "START",
    children: vec![
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
      ParseTree::Terminal('a'),
    ],
});

Structs

Parser

Holds the state of the parser for the provided Grammar

Enums

GrammarError

Possible errors that could happen when creating a Parser

ParseError

Possible errors that could happen during parsing

ParseTree

The results from a parsing of the input text

RuleElement

Elements that make up each Rule

Type Definitions

Grammar

The left and right hand side of the production rules

NonTerminal

An intermediary syntactic variable

Rule

The right-hand side of a Grammar production rule

Terminal

A single Unicode scalar value